Data Engineering

ETL pipelines and big data infrastructure.

1541 skills
data-engineering
1K

supabase-data-handling

Implement Supabase PII handling, data retention, and GDPR/CCPA compliance patterns. Use when handling sensitive data, implementing data redaction, configuring retention policies, or ensuring compliance with privacy regulations for Supabase integrations. Trigger with phrases like "supabase data", "supabase PII", "supabase GDPR", "supabase data retention", "supabase privacy", "supabase CCPA".

jeremylongshore
data-ai
open
data-engineering
1K

sql-transform-helper

SQL Transform Helper: auto-activating skill for data pipelines. Triggers on: "sql transform helper". Part of the Data Pipelines skill category.

jeremylongshore
data-ai
open
data-engineering
987

typescript-bun-drizzle-quality

Build or review Bun fullstack TypeScript code with Drizzle-backed SQL. Use for backend or cross-layer changes touching API/domain logic, schema or query design, migrations, runtime/type debugging, and boundary validation between contracts, business rules, and persistence.

databuddy-analytics
data-ai
open
data-engineering
972

analyze-spec

Socratic deep-interview analysis of a spec file to ensure zero ambiguity before implementation.

a16z
data-ai
open
data-engineering
971

batch

Research and plan a large-scale change, then execute it in parallel across 5-30 isolated worktree agents that each open a PR. Use when the user wants to make a sweeping, mechanical change across many files (migrations, refactors, bulk renames) that can be decomposed into independent parallel units.

remorses
data-ai
open
data-engineering
953

intelligence-network-espionage

Use when building covert informant networks to gather intelligence on rival states. Covers agent placement, secure communication channels, and intelligence verification for strategic advantage.

baojie
data-ai
open
data-engineering
950

dnanexus-integration

DNAnexus cloud genomics platform. Build apps and applets, manage data (upload/download), use the dxpy Python SDK, run workflows, and work with FASTQ/BAM/VCF files for genomics pipeline development and execution.

wu-yc
data-ai
open
data-engineering
950

lamindb

This skill should be used when working with LaminDB, an open-source data framework for biology that makes data queryable, traceable, reproducible, and FAIR. Use when managing biological datasets (scRNA-seq, spatial, flow cytometry, etc.), tracking computational workflows, curating and validating data with biological ontologies, building data lakehouses, or ensuring data lineage and reproducibility in biological research. Covers data management, annotation, ontologies (genes, cell types, diseases, tissues), schema validation, integrations with workflow managers (Nextflow, Snakemake) and MLOps platforms (W&B, MLflow), and deployment strategies.

wu-yc
data-ai
open
data-engineering
950

dask

Distributed computing for larger-than-RAM pandas/NumPy workflows. Use when you need to scale existing pandas/NumPy code beyond memory or across clusters. Best for parallel file processing, distributed ML, and integration with existing pandas code. For out-of-core analytics on a single machine, use vaex; for in-memory speed, use polars.

wu-yc
data-ai
open
data-engineering
950

export-experiment-data-to-excel

Exports any structured experimental data (JSON, tables, time series) to well-formatted Excel (.xlsx) files. Auto-names sheets (Raw Data, Growth Curves, Cell Counts, etc.), adds unit headers and annotation rows, applies consistent styling, and produces lab-ready spreadsheets for sharing, archival, or downstream analysis in R, pandas, or Excel.

wu-yc
data-ai
open
data-engineering
950

polars

Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but the data still fits in memory. Offers lazy evaluation, parallel execution, and an Apache Arrow backend. Best for 1-100GB datasets, ETL pipelines, and as a faster pandas replacement. For larger-than-RAM data, use dask or vaex.

wu-yc
data-ai
open
data-engineering
950

vaex

Use this skill for processing and analyzing large tabular datasets (billions of rows) that exceed available RAM. Vaex excels at out-of-core DataFrame operations, lazy evaluation, fast aggregations, efficient visualization of big data, and machine learning on large datasets. Apply when users need to work with large CSV/HDF5/Arrow/Parquet files, perform fast statistics on massive datasets, create visualizations of big data, or build ML pipelines that do not fit in memory.

wu-yc
data-ai
open
data-engineering
950

zarr-python

Chunked N-D arrays for cloud storage. Provides compressed arrays, parallel I/O, and S3/GCS integration; compatible with NumPy, Dask, and Xarray. For large-scale scientific computing pipelines.

wu-yc
data-ai
open
data-engineering
950

opentargets-database

Query the Open Targets Platform for target-disease associations, drug target discovery, tractability/safety data, genetics/omics evidence, and known drugs, in support of therapeutic target identification.

wu-yc
data-ai
open
data-engineering
946

jax-skills

High-performance numerical computing and machine learning workflows using JAX. Supports array operations, automatic differentiation, JIT compilation, RNN-style scans, map/reduce operations, and gradient computations. Ideal for scientific computing, ML models, and dynamic array transformations.

benchflow-ai
data-ai
open
data-engineering
946

erlang-otp-behaviors

Use when working with OTP behaviors, including gen_server for stateful processes, gen_statem for state machines, supervisors for fault tolerance, and gen_event for event handling, and when building robust, production-ready Erlang applications with proven patterns.

benchflow-ai
data-ai
open
data-engineering
946

erlang-distribution

Use when working with Erlang distributed systems, including node connectivity, distributed processes, global name registration, distributed supervision, and network partitions, and when building fault-tolerant multi-node applications on the BEAM VM.

benchflow-ai
data-ai
open
data-engineering
946

usgs-data-download

Download water level data from USGS using the dataretrieval package. Use when accessing real-time or historical streamflow data, downloading gage height or discharge measurements, or working with USGS station IDs.

benchflow-ai
data-ai
open
data-engineering
946

senior-data-engineer

World-class data engineering skill for building scalable data pipelines, ETL/ELT systems, real-time streaming, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, Flink, Kinesis, and modern data stack. Includes data modeling, pipeline orchestration, data quality, streaming quality monitoring, and DataOps. Use when designing data architectures, building batch or streaming data pipelines, optimizing data workflows, or implementing data governance.

benchflow-ai
data-ai
open
data-engineering
946

parallel-processing

Parallel processing with joblib for grid search and batch computations. Use when speeding up computationally intensive tasks across multiple CPU cores.

benchflow-ai
data-ai
open
data-engineering
946

workload-balancing

Optimize workload distribution across workers, processes, or nodes for efficient parallel execution. Use when asked to balance work distribution, improve parallel efficiency, reduce stragglers, implement load balancing, or optimize task scheduling. Covers static/dynamic partitioning, work stealing, and adaptive load balancing strategies.

benchflow-ai
data-ai
open
data-engineering
946

data-cleaning

Clean messy tabular datasets with deduplication, missing value imputation, outlier handling, and text processing. Use when dealing with dirty data that has duplicates, nulls, or inconsistent formatting.

benchflow-ai
data-ai
open
data-engineering
923

memory

Persist important outcomes from this step to long-term storage.

tsinghua-fib-lab
data-ai
open
data-engineering
917

dataset-manager

Use this skill to generate benchmark datasets (TPC-H, TPC-DS, etc.). Trigger when the user needs test data at a specific scale factor for benchmarking or testing. Supports parquet and duckdb output formats.

sirius-db
data-ai
open