Data Engineering

ETL pipelines and big data infrastructure.

1541 skills

data-engineering

dask

Use when the user mentions "Dask", "parallel computing", "distributed computing", or "larger than memory", or asks about "parallel pandas", "parallel numpy", "out-of-core", "multi-file processing", "cluster computing", or "lazy evaluation dataframe".

eyadsibai

data-ai

data-engineering

polars

High-performance DataFrame library usage. Covers Lazy API, Wrangling, Aggregation.

yonesuke

data-ai

data-engineering

polars

Fast DataFrame library written in Rust for high-performance data manipulation and analysis. Use when the user wants fast data transformations, is working with large datasets, is building lazy evaluation pipelines, or needs better performance than pandas. Suited to ETL, data wrangling, aggregations, joins, and reading/writing CSV, Parquet, and JSON files.

silvainfm

data-ai

data-engineering

data-quality-audit

Comprehensive data quality assessment against defined business rules and constraints. Use when validating data against expected schemas, checking referential integrity across tables, or auditing data pipeline outputs before production use.

nimrodfisher

data-ai

data-engineering

wcdb

Use when working with wcdb

MemoryReload

data-ai

data-engineering

python-data-transform

Transform, clean, and reshape data using pandas and numpy for ETL and data preprocessing. WHEN: Manipulating DataFrames, cleaning datasets, reshaping data (pivot, melt), merging/joining tables, data normalization, CSV/Excel processing. WHEN NOT: Creating Excel files with formatting (use python-xlsx), building APIs (use python-backend), statistical modeling.

LounisBou

data-ai

data-engineering

convex-migration

Guidance on how to perform data migrations correctly in Convex.

ianwatts22

data-ai

data-engineering

lineage-and-provenance

See the main Data Lineage skill for comprehensive coverage of data lineage tracking and provenance.

AmnadTaowsoam

data-ai

data-engineering

polaris-catalog

ALWAYS USE when configuring Polaris catalog, managing namespaces, or setting up credentials in floe-platform. Use IMMEDIATELY when integrating DuckDB via dbt-duckdb plugin, configuring PyIceberg REST catalog, or debugging access control issues. Provides research steps for REST API, OAuth2 authentication, and multi-engine coordination with DuckDB, dbt, and Dagster.

Obsidian-Owl

data-ai

data-engineering

agentdb-state-manager

Persistent state management using AgentDB (DuckDB) for workflow analytics and checkpoints. Provides read-only analytics cache synchronized from TODO_*.md files, enabling:
- Complex dependency graph queries
- Historical workflow metrics
- Context checkpoint storage/recovery
- State transition analysis
Use when: Data gathering and analysis for workflow state tracking
Triggers: "analyze workflow", "query state", "checkpoint", "workflow metrics"

stharrold

data-ai

data-engineering

altinity-expert-clickhouse-replication

Diagnose ClickHouse replication health, Keeper connectivity, replica lag, and queue issues. Use for replication lag and readonly replica problems.

Altinity

data-ai

data-engineering

dbt-patterns

Comprehensive guide to dbt (data build tool) patterns, modeling best practices, testing strategies, and production workflows for modern data transformation.

AmnadTaowsoam

data-ai

data-engineering

hive-scheduler

How to create scheduled jobs in the Hive framework.

paralect

data-ai

data-engineering

data-governance-and-quality

Data governance strategy, quality validation rules, and data dictionary management for a vehicle insurance platform. Use when defining data quality standards, implementing validation rules, managing field mappings, resolving data conflicts, or establishing data governance processes. Covers data cleaning standards, quality metrics, and mapping management.

alongor666

data-ai

data-engineering

execplans

Write and maintain self-contained ExecPlans (execution plans) that a novice can follow end-to-end; use when planning or implementing non-trivial repo changes.

leynos

data-ai

data-engineering

analyzing-objectstar

Skill for understanding, editing, analyzing, and migrating TIBCO Objectstar (Object Service Broker) code used in mainframe OTP and batch applications. Activate when the user is working with Objectstar rules, or asks about mainframe modernization or legacy 4GL code involving GET, FORALL, or EXCEPTION blocks.

JohnnyVicious

data-ai

data-engineering

executive-cdo

Executive CDO Agent. Responsible for data strategy, data governance, and AI/ML strategy.

shaul1991

data-ai

data-engineering

test-data-generation-validation

Generate real Cassandra 5.0 test data using Docker containers, export SSTables with proper directory structure, validate parsing against sstabledump, and manage test datasets. Use when working with test data generation, dataset creation, SSTable export, validation, fixture management, or sstabledump comparison.

pmcfadin

data-ai

data-engineering

jpa-entity-creator

Creates JPA entities following best practices.

sivaprasadreddy

data-ai

data-engineering

hive-handler

How to create event handlers in the Hive framework.

paralect

data-ai

data-engineering

data-quality-checks-and-validation

Implementing comprehensive data quality checks across the data pipeline to ensure accuracy, completeness, and reliability.

AmnadTaowsoam

data-ai

data-engineering

csv-validator

Validates and fixes BOM CSV files for ECIR tool compatibility. Use when users need to check CSV files before running ECIR comparisons, fix CSV formatting issues, ensure required columns exist, or diagnose why ECIR tool fails to process a CSV file.