llm-ai, data-ai

ai-multimodal

Process and generate multimedia content using the Google Gemini API. Capabilities: analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis, up to 9.5 hours), understanding images (captioning, object detection, OCR, visual Q&A, segmentation), processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extracting from documents (PDF tables, forms, charts, diagrams, multi-page), generating images (text-to-image with Imagen 4, editing, composition, refinement), and generating videos (text-to-video with Veo 3, 8-second clips with native audio). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images/videos from text prompts, or implementing multimodal AI features. Supports Gemini 3/2.5, Imagen 4, and Veo 3 models, with context windows up to 2M tokens.

View llm-ai source code

maintainer

nb150301

Updated 11/25/2025

quick start

Installation and usage

Installation

$ install --globalskills.sh

Usage

After installation, you can use this skill by running the following command in the terminal:

skills use ai-multimodal