smarthome-video-anomaly-benchmark
VLM evaluation suite for video anomaly detection in smart home camera footage
Use this skill when the task involves resizing, scaling, or compressing image files. Suitable for tasks like "resize these photos to 800px wide", "compress images to reduce file size", or "batch scale all JPEGs in a folder". Only relevant for image processing tasks — do NOT use for data files, text, or non-image tasks.
Generate/edit images with Nano Banana Pro (Gemini 3 Pro Image). Use for image create/modify requests incl. edits. Supports text-to-image + image-to-image; 1K/2K/4K; use --input-image.
Download videos, audio, subtitles, and clean paragraph-style transcripts from YouTube and any other yt-dlp supported site. Use when asked to “download this video”, “save this clip”, “rip audio”, “get subtitles”, “get transcript”, or to troubleshoot yt-dlp/ffmpeg and formats/playlists.
Generate captions (descriptions) for images, videos, and documents using ZhiPu GLM-V multimodal model series. Use this skill whenever the user wants to describe, caption, summarize, or interpret the content of images, videos, or files. Supports single/multiple inputs, URLs, local paths, and base64 (images only).
Merge, concatenate, sort, intersect, and subset VCF files using bcftools. Use when combining variant files, comparing call sets, or restructuring VCF data.
Process and generate multimedia content using Google Gemini API. Capabilities include analyzing audio files (transcription with timestamps, summarization, speech understanding, music/sound analysis up to 9.5 hours), understanding images (captioning, object detection, OCR, visual Q&A, segmentation), processing videos (scene detection, Q&A, temporal analysis, YouTube URLs, up to 6 hours), extracting from documents (PDF tables, forms, charts, diagrams, multi-page), and generating images (text-to-image, editing, composition, refinement). Use when working with audio/video files, analyzing images or screenshots, processing PDF documents, extracting structured data from media, creating images from text prompts, or implementing multimodal AI features. Supports multiple models (Gemini 2.5/2.0) with context windows up to 2M tokens.
Process multimedia files with FFmpeg (video/audio encoding, conversion, streaming, filtering, hardware acceleration) and ImageMagick (image manipulation, format conversion, batch processing, effects, composition). Use when converting media formats, encoding videos with specific codecs (H.264, H.265, VP9), resizing/cropping images, extracting audio from video, applying filters and effects, optimizing file sizes, creating streaming manifests (HLS/DASH), generating thumbnails, batch processing images, creating composite images, or implementing media processing pipelines. Supports 100+ formats, hardware acceleration (NVENC, QSV), and complex filtergraphs.
Optimize Granola transcription accuracy, note quality, and processing speed. Use when improving transcription quality, reducing processing time, optimizing templates for better AI output, or tuning audio setup. Trigger: "granola performance", "granola accuracy", "granola quality", "improve granola", "granola transcription better".
Animate static images into video using Kling AI. Use when converting images to video, adding motion to stills, or building I2V pipelines. Trigger with phrases like 'klingai image to video', 'kling ai animate image', 'klingai img2vid', 'animate picture klingai'.
Optimize Deepgram API performance for faster transcription and lower latency. Use when improving transcription speed, reducing latency, or optimizing audio processing pipelines. Trigger: "deepgram performance", "speed up deepgram", "optimize transcription", "deepgram latency", "deepgram faster", "deepgram throughput".
Process images using object detection, classification, and segmentation. Use when requesting "analyze image", "object detection", "image classification", or "computer vision", or similar phrases related to image analysis.
Optimize TwinMind transcription accuracy and speed with Ear-3 model configuration, audio quality tuning, and caching strategies. Use when implementing performance tuning or managing TwinMind meeting AI operations. Trigger with phrases like "twinmind performance tuning".
Generate videos via LTX-2.3 API (ltx.video). Supports text-to-video, image-to-video, audio-to-video (lip-sync from audio + image), extend, and retake. Use when: generating AI video from text/image/audio, animating a portrait, creating lip-sync video from an existing image + audio recording.
[WORKFLOW SKILL] Produce a rough cut of a narration (talking-head) video based on the audio information in the input video.
Optimize 8-bit animations for smooth performance. Apply when creating animated pixel art, game UI effects, or any retro-styled animations.
Execute end-to-end Bilibili downloads with yutto. Use this whenever the user wants you to actually download Bilibili user-uploaded videos, bangumi series, courses, favorites folders, watch-later lists, collections, lists, or audio for them, or wants you to install/configure yutto and complete the download instead of merely explaining commands. This skill should verify installation and FFmpeg, check auth status, collect missing required inputs such as the link and download directory, then run the download.
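Video post-processing and compositing. Use when the user says "add background music", "merge videos", or "add an intro/outro", wants to add BGM to a finished video, or needs to concatenate multiple episodes.
Generate video clips for script scenes. Use when the user says "generate video" or "turn the storyboard images into video", wants to regenerate the video for a particular scene, or needs to resume after video generation was interrupted. Supports whole-episode batch, single-scene, and resume-from-checkpoint modes.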
Analyze images using GPT-4 Vision for detailed description, OCR text extraction, object recognition, and visual Q&A. Use when the user needs to understand image content, extract text from screenshots, identify objects in photos, or ask questions about images via OpenAI GPT-4 Vision API.
Get the local file path of an image sent by the user. When the user sends an image, the system downloads it automatically. Use when you need to process the user's image or analyze its content.
Get the local file path of a voice message sent by the user. When the user sends a voice message, the system downloads it automatically. Use when you need to process the user's voice message or transcribe it to text.
Download YouTube videos with customizable quality and format options. Use this skill when the user asks to download, save, or grab YouTube videos. Supports various quality settings (best, 1080p, 720p, 480p, 360p), multiple formats (mp4, webm, mkv), and audio-only downloads as MP3.