nemo-evaluator
Use when evaluating LLMs, running benchmarks like MMLU/HumanEval/GSM8K, setting up evaluation pipelines, or asking about "NeMo Evaluator", "LLM benchmarking", "model evaluation", "MMLU", "HumanEval", "GSM8K", "benchmark harnesses"
This skill should be used when the user asks about "Hugging Face", "HF Hub", "transformers", "model hub", or needs guidance on which Hugging Face capability to use. Acts as an entry-point that routes to specialized HF skills (cli, jobs, datasets, evaluation, model-trainer, paper-publisher, trackio, tool-builder) based on the task. Use for authentication setup, quick operations, and choosing the right specialized skill.
Build complex AI systems with declarative programming, optimize prompts automatically, create modular RAG systems and agents with DSPy - Stanford NLP's framework for systematic LM programming. Use when you need to build complex AI systems, program LMs declaratively, optimize prompts automatically, create modular AI pipelines, or build RAG systems and agents.
Use when the user mentions "HuggingFace Transformers", "pre-trained models", or "pipeline API", or asks about "text generation", "text classification", "question answering", "NER", "fine-tuning transformers", "AutoModel", or "Trainer API"
Bayesian survival analysis models including exponential, Weibull, log-normal, and piecewise exponential hazard models with censoring support.
Model fine-tuning with PyTorch and HuggingFace Trainer. Covers dataset preparation, tokenization, training loops, TrainingArguments, SFTTrainer for instruction tuning, evaluation, and checkpoint management. Includes Unsloth recommendations.
Embedding model configurations and cost calculators
Use when the user mentions "experiment tracking", "MLflow", "Weights & Biases", "wandb", "model registry", "hyperparameter logging", "ML experiments", or "training metrics"
Production-grade data science specialist with TensorFlow 2.20.0, PyTorch 2.9.0, Scikit-learn 1.7.2 expertise. Master data processing, ML pipeline development, model deployment, and statistical analysis. Build end-to-end data science solutions with comprehensive experimentation and visualization.
This skill should be used when working with R tidymodels packages, including when the user asks to "create a tidymodels workflow", "build a recipe", "tune a model", "use parsnip", "set up resampling", "create a workflow_set", "compare models", "stack models", or mentions tidymodels packages like recipes, parsnip, workflows, workflowsets, tune, rsample, yardstick, or stacks. Provides ecosystem context before package-specific skills.
Instrument evaluation metrics, quality scores, and feedback loops
Foundational knowledge for writing BUGS/JAGS models including precision parameterization, declarative syntax, distributions, and R integration. Use when creating or reviewing BUGS/JAGS models.
Cross-validation configuration and fold management for this competition
Guide for designing Instance resources in OptAIC. Use when creating DatasetInstance, SignalInstance, ExperimentInstance, ModelInstance, PortfolioOptimizerInstance, or BacktestInstance. Covers definition references, config patterns, composition, flow execution pairing, and scheduling.
Performing full fine-tuning (FFT) in Unsloth with 100% exact weight updates and optimized gradient checkpointing. Triggers include fft, full fine-tuning, full_finetuning, exact fine-tuning, and weight updates.
Enterprise Machine Learning specialist with TensorFlow 2.20.0, PyTorch 2.9.0, Scikit-learn 1.7.2 expertise. Master AutoML, neural architecture search, MLOps automation, and production ML deployment. Build scalable ML pipelines with comprehensive monitoring and experiment tracking.
Use when the user mentions "scikit-learn", "sklearn", "machine learning", "classification", "regression", or "clustering", or asks about "train test split", "cross validation", "hyperparameter tuning", "ML pipeline", "random forest", "SVM", or "preprocessing"
Train and deploy neural networks in distributed E2B sandboxes with Flow Nexus
Rigorous RL evaluation - statistical protocols, train/test discipline, metrics, generalization
Classifies skin conditions including Melanoma and Basal Cell Carcinoma using TF.js MobileNetV3
Create, validate, and debug YAML training configurations for axolotl-rs fine-tuning
See the main Model Explainability skill for comprehensive coverage of confidence scoring and calibration.