llm-evaluation
Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Reinforcement learning training for CTF-AI. Use when training DQN agents, adjusting hyperparameters, debugging training issues, analyzing rewards, or improving AI performance.
Production deployment and operationalization of AI agents on Databricks. Use when deploying agents to Model Serving, setting up MLflow logging and tracing for agents, implementing Agent Evaluation frameworks, monitoring agent performance in production, managing agent versions and rollbacks, optimizing agent costs and latency, or establishing CI/CD pipelines for agents. Covers MLflow integration patterns, evaluation best practices, Model Serving configuration, and production monitoring strategies.
This skill should be used when the user asks to "run local LLMs", "use LM Studio", "configure local AI server", "estimate VRAM requirements", "load a model locally", or needs guidance on OpenAI-compatible local API usage, model quantization selection, GPU offload configuration, MCP server integration, or headless LLM server management. Covers local AI inference, CLI automation, SDK integration, and hardware optimization.
Train LangChain/LangGraph agents using Microsoft Agent-Lightning APO (Automatic Prompt Optimization). Use when user mentions APO training, prompt optimization, agent-lightning, training multi-agent systems, training single agents, or optimizing agent prompts.
Expert in building scalable ML systems, from data pipelines and model training to production deployment and monitoring.
AutoML expert. Use for automated machine learning, hyperparameter tuning, and model selection.
Comprehensive LLM model evaluation and ranking system. Use when users ask to compare language models, find the best model for a specific task, understand model capabilities, get pricing information, or need help selecting between GPT-4, Claude, Gemini, Llama, or other LLMs. Provides benchmark-based rankings, cost analysis, and use-case-specific recommendations across reasoning, code generation, long context, multimodal, and other capabilities.
Choose the best Codex model (gpt-4 family, gpt-4o-mini, or legacy davinci) based on the workload described; use when the user asks for a model suggestion, wants to optimize quality vs cost/latency, or says “change model” for the current answer.
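Coaching-style guided workflow for LLM fine-tuning, v2. Core features: proactively explores the user's pain points, guides toward clear goals, multi-task management, data source tracking, and complete version lineage. Supports: LoRA/QLoRA/DoRA fine-tuning, SFT/ORPO/DPO alignment, data preparation, benchmark evaluation, and HuggingFace deployment. Highlights: coaching-style guidance, reproducible data pipelines, and multi-task version tracking. Trigger phrases: "train a model" (訓練模型), "fine-tune", "fine-tuning" (微調), "LoRA", "create a new task" (建立新任務), "improve the model" (改善模型), "optimize accuracy" (優化準確率), "data pipeline" (資料管線), "task management" (任務管理).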
Automatically applies when evaluating LLM performance. Ensures proper eval datasets, metrics computation, A/B testing, LLM-as-judge patterns, and experiment tracking.
Expert in designing, optimizing, and evaluating prompts for Large Language Models. Specializes in Chain-of-Thought, ReAct, few-shot learning, and production prompt management. Use when crafting prompts, optimizing LLM outputs, or building prompt systems. Triggers include "prompt engineering", "prompt optimization", "chain of thought", "few-shot", "prompt template", "LLM prompting".
Embedding/vector caching for AI cost optimization
NVIDIA NIM (NVIDIA Inference Microservices) for deploying and managing AI models. Use for NIM microservices, model inference, API integration, and building AI applications with NVIDIA's inference infrastructure.
Create and train AI learning plugins with AgentDB's 9 reinforcement learning algorithms. Includes Decision Transformer, Q-Learning, SARSA, Actor-Critic, and more. Use when building self-learning agents, implementing RL, or optimizing agent behavior through experience.
World-class ML engineering skill for productionizing ML models, MLOps, and building scalable ML systems. Expertise in PyTorch, TensorFlow, model deployment, feature stores, model monitoring, and ML infrastructure. Includes LLM integration, fine-tuning, RAG systems, and agentic AI. Use when deploying ML models, building ML platforms, implementing MLOps, or integrating LLMs into production systems.
xAI Grok model selection and capabilities guide. Use when choosing the right Grok model for your task, comparing model features, or optimizing costs.
Expert in Machine Learning Operations bridging data science and DevOps. Use when building ML pipelines, model versioning, feature stores, or production ML serving. Triggers include "MLOps", "ML pipeline", "model deployment", "feature store", "model versioning", "ML monitoring", "Kubeflow", "MLflow".
Select and optimize embedding models for semantic search and RAG applications. Use when choosing embedding models, implementing chunking strategies, or optimizing embedding quality for specific domains.
Capture task outcomes, score performance, and derive rules as token priors for continual learning without model weight changes. Use for post-task feedback, experience capture, pattern extraction, and learning from mistakes. Achieves continual learning for $18 per 100 samples vs $10k fine-tune cost. Triggers on "learn from experience", "capture patterns", "post-task analysis", "continual learning", "experience extraction".
ML Engineer role: LLM APIs (OpenAI, Claude, Gemini), embeddings, RAG pipelines, fine-tuning, LangChain, LlamaIndex, vector databases (Pinecone, Chroma, Weaviate), prompt engineering, model evaluation, cost optimization, Agentic RAG, AI Agents, MCP, LLM observability. 30 methodologies.
Specialized skill for ML training workflows on cloud GPUs. Fine-tune LLMs with LoRA/QLoRA, train image LoRAs, build classifiers, and run custom training jobs. Generates production-ready training pipelines with checkpointing, logging, and optimal GPU selection.