model-serving-inference
Comprehensive guide to deploying and serving LLMs, including optimization, batching, caching, and production infrastructure
This skill should be used when the user asks to "train a model", "fine-tune", "build NLP model", "create training task", "optimize model performance", "improve accuracy", "what model should I use", or expresses vague training needs like "I want to do sentiment analysis" or "help me with NER". Provides coaching-style guidance to clarify goals, diagnose pain points, and recommend optimal training approaches.
Runs ML models in the browser and Node.js with Transformers.js and Hugging Face Inference API. Use when adding local inference, embeddings, or calling hosted models without GPU servers.
Apple Foundation Models framework for on-device AI, @Generable macro, guided generation, tool calling, and streaming. Use when user asks about on-device AI, Apple Intelligence, Foundation Models, @Generable, LLM, or local machine learning.
Multi-Model Orchestration - Guide for orchestrating multi-model agents
Apply modern AI/LLM development best practices: staying current on models, prompt/context engineering, architecture patterns, stack decisions, evaluation, and production deployment. Use when building AI features, selecting models, writing prompts, reviewing LLM code, or discussing AI architecture.
Templates and patterns for common ML training scenarios including text classification, text generation, fine-tuning, and PEFT/LoRA. Provides ready-to-use training configurations, dataset preparation scripts, and complete training pipelines. Use when building ML training pipelines, fine-tuning models, implementing classification or generation tasks, setting up PEFT/LoRA training, or when user mentions model training, fine-tuning, classification, generation, or parameter-efficient tuning.
Expert in managing the "Memory" of AI systems. Specializes in Vector Databases (RAG), Short/Long-term memory architectures, and Context Window optimization. Use when designing AI memory systems, optimizing context usage, or implementing conversation history management.
Fine-tune LLMs using the Tinker API. Covers supervised fine-tuning, reinforcement learning, LoRA training, vision-language models, and both high-level Cookbook patterns and low-level API usage.
Enterprise LLM Fine-Tuning with LoRA, QLoRA, and PEFT techniques
NVIDIA NeMo framework for building and training conversational AI models. Use for NeMo Retriever models, RAG (Retrieval-Augmented Generation), embedding models, enterprise search, and multilingual retrieval systems.
Direct Preference Optimization (DPO) for aligning models with preference data without separate reward models. Triggers: dpo, preference optimization, rlhf, ref_model=none, patchdpotrainer, dpotrainer.
Automatically applies when choosing LLMs and providers. Ensures proper model comparison, provider selection, cost optimization, fallback patterns, and multi-model strategies.
Expert AI Engineer role (10+ Years Exp). Focuses on production-grade GenAI, Agentic Systems, Advanced RAG, and rigorous Evaluation.
ML API expert. Use for model serving, inference endpoints, FastAPI, and ML deployment.
Estimate and optimize AI/ML costs including token usage, context window management, batch processing, and caching strategies.
Design, optimize, and refactor AI agent systems based on Anthropic best practices and the latest research. Guides you through architectural decisions with an interactive questionnaire, loads current documentation, and launches a specialized agent-architect for detailed analysis.
Use this skill when running, managing, or analyzing yanex experiments. Includes executing experiments via CLI, parameter sweeps, dependencies, querying experiment history, comparing results, and maintaining experiment logs. Invoke when users mention yanex, experiments, training runs, parameter sweeps, or need to track ML experiments.
Use this skill when developing deep learning projects.
Fine-tune LLMs with Unsloth using GRPO or SFT. Supports FP8, vision models, mobile deployment, Docker, packing, GGUF export. Use when: train with GRPO, fine-tune, reward functions, SFT training, FP8 training, vision fine-tuning, phone deployment, docker training, packing, export to GGUF.
NVIDIA API documentation for integrating NVIDIA services. Use for NVIDIA NIM (NVIDIA Inference Microservices), LLM APIs, visual models, multimodal APIs, retrieval APIs, healthcare APIs, and CUDA-X microservices integration.
Evaluate skills by executing them across sonnet, opus, and haiku models using sub-agents. Use when testing if a skill works correctly, comparing model performance, or finding the cheapest compatible model. Returns numeric scores (0-100) to differentiate model capabilities.
Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.