Data Engineering

ETL pipelines and big data infrastructure.

1541 skills

data-engineering

apache-spark-data-processing

Complete guide for Apache Spark data processing including RDDs, DataFrames, Spark SQL, streaming, MLlib, and production deployment

manutej

data-ai

data-engineering

Guidance for writing SPARQL queries against RDF/Turtle datasets, particularly for university or academic data. This skill should be used when tasks involve querying RDF data with SPARQL, filtering entities based on multiple criteria, aggregating results, or working with Turtle (.ttl) files.

letta-ai

data-ai

data-engineering

sparql-university

Guidance for writing and verifying SPARQL queries against RDF datasets, particularly university/academic ontologies. This skill should be used when tasks involve querying RDF data with SPARQL, working with academic datasets (students, professors, departments, courses), or performing complex graph pattern matching with filters and aggregations.

letta-ai

data-ai

data-engineering

kafka-stream-processing

Complete guide for Apache Kafka stream processing including producers, consumers, Kafka Streams, connectors, schema registry, and production deployment

manutej

data-ai

data-engineering

multi-source-data-merger

This skill provides guidance for merging data from multiple heterogeneous sources (JSON, CSV, Parquet, XML, etc.) into a unified dataset. Use this skill when tasks involve combining records from different file formats, applying field mappings, resolving conflicts based on priority rules, or generating merged outputs with conflict reports. Applicable to ETL pipelines, data consolidation, and record deduplication scenarios.

letta-ai

data-ai

data-engineering

reshard-c4-data

Guidance for data resharding tasks that involve reorganizing files across directory structures with constraints on file sizes and directory contents. This skill applies when redistributing datasets, splitting large files, or reorganizing data into shards while maintaining constraints like maximum files per directory or maximum file sizes. Use when tasks involve resharding, data partitioning, or directory-constrained file reorganization.

letta-ai

data-ai

data-engineering

dbt-data-transformation

Complete guide for dbt data transformation including models, tests, documentation, incremental builds, macros, packages, and production workflows

manutej

data-ai

data-engineering

manage-seeders

Manages Database Seeders with advanced support for JSON data sources, idempotency checks, and relationship mapping.

iurygdeoliveira

data-ai

data-engineering

tanstack-query-advanced

Advanced TanStack Query v5 patterns for infinite queries, optimistic updates, prefetching, gcTime, and queryOptions

yonatangross

data-ai

data-engineering

golden-dataset-management

Use when backing up, restoring, or validating golden datasets. Prevents data loss and ensures test data integrity for AI/ML evaluation systems.

yonatangross

data-ai

data-engineering

golden-dataset-validation

Use when validating golden dataset quality. Runs schema checks, duplicate detection, and coverage analysis to ensure dataset integrity for AI evaluation.

yonatangross

data-ai

data-engineering

gcs-data-catalog

Activates when querying Danish agricultural data from GCS. Use this skill for: data discovery, finding datasets, understanding schemas, querying parquet files, joining datasets on CVR/CHR/BFE identifiers. Keywords: data, catalog, datasets, GCS, parquet, schema, query, DuckDB, pyarrow

Klimabevaegelsen

data-ai

data-engineering

streamlit-development

Developing, testing, and deploying Streamlit data applications on Snowflake. Use this skill when you're building interactive data apps, setting up local development environments, testing with pytest or Playwright, or deploying apps to Snowflake using Streamlit in Snowflake.

sfc-gh-dflippo

data-ai

data-engineering

dbt-core

Managing dbt-core locally - installation, configuration, project setup, package management, troubleshooting, and development workflow. Use this skill for all aspects of local dbt-core development including non-interactive scripts for environment setup with conda or venv, and comprehensive configuration templates for profiles.yml and dbt_project.yml.

sfc-gh-dflippo

data-ai

data-engineering

snowflake-connections

Configuring Snowflake connections using connections.toml (for Snowflake CLI, Streamlit, Snowpark) or profiles.yml (for dbt) with multiple authentication methods (SSO, key pair, username/password, OAuth), managing multiple environments, and overriding settings with environment variables. Use this skill when setting up Snowflake CLI, Streamlit apps, dbt, or any tool requiring Snowflake authentication and connection management.

sfc-gh-dflippo

data-ai

data-engineering

exploratory-data-analysis

EDA toolkit. Analyze CSV/Excel/JSON/Parquet files, statistical summaries, distributions, correlations, outliers, missing data, visualizations, markdown reports, for data profiling and insights.

lifangda

data-ai

data-engineering

file-processing

Process and analyze CSV, JSON, and text files with data transformation, cleaning, analysis, and visualization capabilities

aws-samples

data-ai

data-engineering

polars

Fast DataFrame library (Apache Arrow). Select, filter, group_by, joins, lazy evaluation, CSV/Parquet I/O, expression API, for high-performance data analysis workflows.

lifangda

data-ai

data-engineering

data-lake-platform

Universal data lake and lakehouse patterns covering ingestion (dlt, Airbyte), transformation (SQLMesh, dbt), storage formats (Iceberg, Delta, Hudi, Parquet), query engines (ClickHouse, DuckDB, Doris, StarRocks), streaming (Kafka, Flink), orchestration (Dagster, Airflow, Prefect), and visualization (Metabase, Superset, Grafana). Self-hosted and cloud options.

vasilyu1983

data-ai

data-engineering

data-engineer

Expert data engineer specializing in building scalable data pipelines, ETL/ELT processes, and data infrastructure. Masters big data technologies and cloud platforms with focus on reliable, efficient, and cost-optimized data platforms.

zenobi-us

data-ai

data-engineering

ai-ml-data-science

End-to-end data science patterns (modern best practices): problem framing -> data -> EDA -> feature engineering (with feature stores) -> modelling -> evaluation -> reporting, plus SQL transformation (SQLMesh). Emphasizes MLOps integration, drift monitoring, and production-ready workflows.

vasilyu1983

data-ai

data-engineering

execplan

When writing complex features or significant refactors, or when the user asks explicitly, use an ExecPlan from design to implementation.

tiann

data-ai

data-engineering

zarr-python

Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.

lifangda

data-ai

data-engineering

managing-bd-tasks

Use for advanced bd operations - splitting tasks mid-flight, merging duplicates, changing dependencies, archiving epics, querying metrics, cross-epic dependencies

withzombies

data-ai
