Data Engineering

ETL pipelines and big data infrastructure.

1541 skills, sorted by stars

data-engineering

1.8K

csv-wave-pipeline

Pipeline from requirement planning to wave-based CSV execution. Decomposes a requirement into dependency-sorted CSV tasks, computes execution waves, and runs them wave by wave via spawn_agents_on_csv with cross-wave context propagation.

catlog22

data-ai

open
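
The wave computation mentioned in the csv-wave-pipeline description can be illustrated with a small sketch. This is not the skill's actual implementation: the CSV column names (`id`, `deps`), the semicolon separator, and the `compute_waves` helper are all hypothetical, chosen only to show how dependency-sorted tasks might be grouped into waves where each wave depends only on earlier ones.

```python
import csv
import io


def compute_waves(deps):
    """Group tasks into waves; every task in a wave depends only on earlier waves.

    `deps` maps a task id to an iterable of task ids it depends on.
    (Hypothetical helper, not part of the csv-wave-pipeline skill.)
    """
    remaining = {task: set(d) for task, d in deps.items()}
    done = set()
    waves = []
    while remaining:
        # A task is ready once all of its dependencies have completed.
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            # Nothing can run: a cycle or a dependency on an unknown task.
            raise ValueError("unresolvable dependencies: %s" % sorted(remaining))
        waves.append(ready)
        done.update(ready)
        for t in ready:
            del remaining[t]
    return waves


# Assumed CSV layout: one task per row, dependencies separated by ';'.
TASKS_CSV = """id,deps
schema,
extract,schema
transform,schema
load,extract;transform
"""

rows = csv.DictReader(io.StringIO(TASKS_CSV))
task_deps = {r["id"]: [d for d in r["deps"].split(";") if d] for r in rows}
waves = compute_waves(task_deps)
for number, wave in enumerate(waves, start=1):
    print("wave", number, wave)
```

Tasks within one wave have no dependencies on each other, so in principle they could run in parallel, while results from earlier waves are available as context for later ones.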

data-engineering

1.8K

team-arch-opt

Unified team skill for architecture optimization. Uses team-worker agent architecture with role directories for domain logic. Coordinator orchestrates pipeline, workers are team-worker agents. Triggers on "team arch-opt".

catlog22

data-ai

open

data-engineering

1.8K

team-issue

Unified team skill for issue resolution. Uses team-worker agent architecture with role directories for domain logic. Coordinator orchestrates pipeline, workers are team-worker agents. Triggers on "team issue".

catlog22

data-ai

open

data-engineering

1.8K

team-tech-debt

Unified team skill for tech debt identification and remediation. Scans codebase for tech debt, assesses severity, plans and executes fixes with validation. Uses team-worker agent architecture with roles/ for domain logic. Coordinator orchestrates pipeline, workers are team-worker agents. Triggers on "team tech debt".

catlog22

data-ai

open

data-engineering

1.8K

team-ultra-analyze

Deep collaborative analysis team skill. All roles route via this SKILL.md. Beat model is coordinator-only (monitor.md). Structure is roles/ + specs/. Triggers on "team ultra-analyze", "team analyze".

catlog22

data-ai

open

data-engineering

1.7K

data-processor

Process and validate data inputs

cisco-ai-defense

data-ai

open

data-engineering

1.7K

add-migration

Creates a Flyway Java-based migration for schema changes. Handles table creation, column additions, tenant isolation, and ES reindex. Use when asked to modify the database schema.

OpenAEV-Platform

data-ai

open

data-engineering

1.6K

data-designer

Use when the user wants to create a dataset, generate synthetic data, or build a data generation pipeline.

NVIDIA-NeMo

data-ai

open

data-engineering

1.4K

go-to-production

Production readiness checklist for durable streams. Switch from dev server to Caddy binary, configure CDN caching with offset-based URLs, Cache-Control and ETag headers, Stream-Cursor for cache collision prevention, TTL and Stream-Expires-At for stream lifecycle, HTTPS requirement, request collapsing for fan-out, CORS configuration. Load before deploying durable streams to production.

durable-streams

data-ai

open

data-engineering

1.4K

writing-data

Writing data to durable streams. DurableStream.create() with contentType, DurableStream.append() for simple writes, IdempotentProducer for high-throughput exactly-once delivery with autoClaim, fire-and-forget append(), flush(), close(), StaleEpochError handling, JSON mode vs byte stream mode, stream closure. Load when writing, producing, or appending data to a durable stream.

durable-streams

data-ai

open

data-engineering

1.4K

state-schema

Defining typed state schemas for @durable-streams/state. createStateSchema() with CollectionDefinition (schema, type, primaryKey), Standard Schema validators (Zod, Valibot, ArkType), event helpers insert/update/delete/upsert, ChangeEvent and ControlEvent types, State Protocol operations, transaction IDs (txid) for write confirmation. Load when defining entity types, choosing a schema validator, or creating typed change events.

durable-streams

data-ai

open

data-engineering

1.4K

stream-db

Stream-backed reactive database with @durable-streams/state. createStreamDB() with schema and stream options, db.preload() lazy initialization, db.collections for TanStack DB collections, optimistic actions with onMutate and mutationFn, db.utils.awaitTxId() for transaction confirmation, control events (snapshot-start, snapshot-end, reset), db.close() cleanup, re-exported TanStack DB operators (eq, gt, and, or, count, sum, avg, min, max).

durable-streams

data-ai

open

data-engineering

1.4K

yjs-sync

Yjs CRDT sync over durable streams with @durable-streams/y-durable-streams. DurableStreamsProvider setup, document stream and awareness stream config, transport modes (SSE vs long-poll), provider lifecycle (connect, disconnect, destroy), synced/status/error events, lib0 VarUint8Array framing, awareness heartbeat. Requires yjs, y-protocols, lib0 peer dependencies. Load when integrating Yjs collaborative editing with durable streams.

durable-streams

data-ai

open

data-engineering

1.3K

data-exploration-visualization

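Automated data exploration and visualization tool that provides a complete EDA solution from data loading to professional report generation. Supports multiple chart types, intelligent data diagnostics, modeling evaluation, and HTML report generation. Suitable for data analysis projects in healthcare, finance, e-commerce, and other fields.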

foryourhealth111-pixel

data-ai

open

data-engineering

1.3K

data-quality-checker

Data Quality Checker - Auto-activating skill for Data Pipelines. Triggers on: data quality checker. Part of the Data Pipelines skill category.

foryourhealth111-pixel

data-ai

open

data-engineering

1.3K

polars

Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100GB datasets, ETL pipelines, faster pandas replacement. For larger-than-RAM data use dask or vaex.

foryourhealth111-pixel

data-ai

open

data-engineering

1.3K

vaex

Use this skill for processing and analyzing large tabular datasets (billions of rows) that exceed available RAM. Vaex excels at out-of-core DataFrame operations, lazy evaluation, fast aggregations, efficient visualization of big data, and machine learning on large datasets. Apply when users need to work with large CSV/HDF5/Arrow/Parquet files, perform fast statistics on massive datasets, create visualizations of big data, or build ML pipelines that do not fit in memory.

foryourhealth111-pixel

data-ai

open

data-engineering

1.3K

xan

High-performance CSV processing with xan CLI for large tabular datasets, streaming transformations, and low-memory pipelines.

foryourhealth111-pixel

data-ai

open

data-engineering

1.3K

zarr-python

Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.

foryourhealth111-pixel

data-ai

open

data-engineering

1.3K

dust-temporal

Step-by-step guide for creating Temporal workflows in Dust. Use when adding background jobs, async processing, durable workflows, or task queues.

dust-tt

data-ai

open

data-engineering

1.3K

atmos-packer

Packer orchestration: init/build/validate/inspect/output, machine image building, template management, source management

cloudposse

data-ai

open

data-engineering

1.2K

databricks-aibi-dashboards

Create Databricks AI/BI dashboards. Use when creating, updating, or deploying Lakeview dashboards. CRITICAL: You MUST test ALL SQL queries via execute_sql BEFORE deploying. Follow guidelines strictly.

databricks-solutions

data-ai

open

data-engineering

1.2K

databricks-config

Manage Databricks workspace connections: check current workspace, switch profiles, list available workspaces, or authenticate to a new workspace. Use when the user mentions "switch workspace", "which workspace", "current profile", "databrickscfg", "connect to workspace", or "databricks auth".

databricks-solutions

data-ai

open

data-engineering

1.2K

databricks-dbsql

Databricks SQL (DBSQL) advanced features and SQL warehouse capabilities. This skill MUST be invoked when the user mentions: "DBSQL", "Databricks SQL", "SQL warehouse", "SQL scripting", "stored procedure", "CALL procedure", "materialized view", "CREATE MATERIALIZED VIEW", "pipe syntax", "|>", "geospatial", "H3", "ST_", "spatial SQL", "collation", "COLLATE", "ai_query", "ai_classify", "ai_extract", "ai_gen", "AI function", "http_request", "remote_query", "read_files", "Lakehouse Federation", "recursive CTE", "WITH RECURSIVE", "multi-statement transaction", "temp table", "temporary view", "pipe operator". SHOULD also invoke when the user asks about SQL best practices, data modeling patterns, or advanced SQL features on Databricks.

databricks-solutions

data-ai

open
