Data Eng.

ETL pipelines and big data infrastructure.

1541 skills

Sorted by stars.

data-engineering

166

data-scientist

Data science methodology for Python research: EDA, validation, causal inference (IV, DiD, RD, synthetic control), clustering/PCA/UMAP, supervised ML, geospatial, visualization. Method selection guidance. For syntax, load tool-specific skills.

DAAF-Contribution-Community

data-ai

open

data-engineering

166

polars

Polars DataFrame library for high-performance data manipulation. Lazy/eager execution, expressions, I/O (CSV, Parquet, JSON), aggregations, joins, string/datetime ops, pandas interop. Use for Polars DataFrames or reading/writing Parquet files.

DAAF-Contribution-Community

data-ai

open

data-engineering

165

data-cleaning-pipeline

Build robust processes for data cleaning, missing-value imputation, outlier handling, and data transformation. Use for data preprocessing, data quality, and data pipeline automation.

aj-geddes

data-ai

open

data-engineering

165

event-sourcing

Implement event sourcing and CQRS patterns using event stores, aggregates, and projections. Use when building audit trails, temporal queries, or systems requiring full history.

aj-geddes

data-ai

open

data-engineering

165

ml-pipeline-automation

Build end-to-end ML pipelines with automated data processing, training, validation, and deployment using Airflow, Kubeflow, and Jenkins.

aj-geddes

data-ai

open

data-engineering

162

integrations

External data sources, connectors, and custom data streams

alsk1992

data-ai

open

data-engineering

161

spring-ai-rag-media-pgvector

Build RAG pipelines for media-asset knowledge bases using Spring AI and PostgreSQL pgvector. Use when Codex needs to design database schema, ingestion/chunking/embedding workflow, and retrieval logic that prioritizes internal media knowledge before falling back to general model knowledge.

microwind

data-ai

open

data-engineering

160

commandkit-cache

Implement deterministic caching with @commandkit/cache. Use for 'use cache' directives, cacheTag/cacheLife strategy, revalidateTag invalidation, and provider setup for memory or Redis deployments.

neplextech

data-ai

open

data-engineering

158

connecting-streamlit-to-snowflake

Connecting Streamlit apps to Snowflake. Use when setting up database connections, managing secrets, or querying Snowflake from a Streamlit app.

streamlit

data-ai

open

data-engineering

157

ml-pipeline-workflow

Build end-to-end MLOps pipelines from data preparation through model training, validation, and production deployment. Use when creating ML pipelines, implementing MLOps practices, or automating model training and deployment workflows.

Microck

data-ai

open

data-engineering

157

dmir-compiler-analysis

Analyze DTVM's dMIR intermediate representation and compilation pipeline. Translates EVM bytecode sequences into dMIR pseudocode, then into x86 pseudocode, and evaluates performance cost at each stage. Use when the user asks about dMIR instructions, EVM-to-dMIR conversion, dMIR-to-x86 lowering, JIT compilation cost analysis, EVM opcode performance evaluation, or EVM->dMIR performance optimization.

DTVMStack

data-ai

open

data-engineering

157

molecular-docking-pipeline

Molecular Docking Pipeline: complete docking workflow that retrieves a protein structure, predicts binding pockets, prepares the receptor, and docks a ligand. Use this skill for structural biology tasks that involve retrieving protein data by PDB code, running fpocket, converting PDB to PDBQT, and quick molecule docking. Combines 4 tools from 2 SCP servers.

InternScience

data-ai

open

data-engineering

157

agentdb-advanced-features

Master advanced AgentDB features including QUIC synchronization, multi-database management, custom distance metrics, hybrid search, and distributed systems integration. Use when building distributed AI systems, multi-agent coordination, or advanced vector search applications.

Microck

data-ai

open

data-engineering

157

agentdb-performance-optimization

Optimize AgentDB performance with quantization (4-32x memory reduction), HNSW indexing (150x faster search), caching, and batch operations. Use when optimizing memory usage, improving search speed, or scaling to millions of vectors.

Microck

data-ai

open

data-engineering

157

bioservices

Primary Python tool for 40+ bioinformatics services. Preferred for multi-database workflows: UniProt, KEGG, ChEMBL, PubChem, Reactome, QuickGO. Unified API for queries, ID mapping, pathway analysis. For direct REST control, use individual database skills (uniprot-database, kegg-database).

Microck

data-ai

open

data-engineering

157

dnanexus-integration

DNAnexus cloud genomics platform. Build apps/applets, manage data (upload/download), dxpy Python SDK, run workflows, FASTQ/BAM/VCF, for genomics pipeline development and execution.

Microck

data-ai

open

data-engineering

157

data-engineering

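Data engineering. Airflow, Dagster, Kafka Streams, Flink, dbt, data pipelines, stream processing, data quality. Route here when the user mentions data pipelines, ETL, stream processing, or data quality.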

telagod

data-ai

open

data-engineering

156

dask

Distributed computing for larger-than-RAM pandas/NumPy workflows. Use when you need to scale existing pandas/NumPy code beyond memory or across clusters. Best for parallel file processing, distributed ML, integration with existing pandas code. For out-of-core analytics on single machine use vaex; for in-memory speed use polars.

lamm-mit

data-ai

open

data-engineering

156

dnanexus-integration

DNAnexus cloud genomics platform. Build apps/applets, manage data (upload/download), dxpy Python SDK, run workflows, FASTQ/BAM/VCF, for genomics pipeline development and execution.

lamm-mit

data-ai

open

data-engineering

156

drug-repurposing

ToolUniverse workflow — Drug Repurposing

lamm-mit

data-ai

open

data-engineering

156

drug-target-validation

ToolUniverse workflow — Drug Target Validation

lamm-mit

data-ai

open

data-engineering

156

lamindb

This skill should be used when working with LaminDB, an open-source data framework for biology that makes data queryable, traceable, reproducible, and FAIR. Use when managing biological datasets (scRNA-seq, spatial, flow cytometry, etc.), tracking computational workflows, curating and validating data with biological ontologies, building data lakehouses, or ensuring data lineage and reproducibility in biological research. Covers data management, annotation, ontologies (genes, cell types, diseases, tissues), schema validation, integrations with workflow managers (Nextflow, Snakemake) and MLOps platforms (W&B, MLflow), and deployment strategies.

lamm-mit

data-ai

open

data-engineering

156

opentargets-database

Query Open Targets Platform for target-disease associations, drug target discovery, tractability/safety data, genetics/omics evidence, known drugs, for therapeutic target identification.

lamm-mit

data-ai

open

data-engineering

156

polars

Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100GB datasets, ETL pipelines, faster pandas replacement. For larger-than-RAM data use dask or vaex.

lamm-mit

data-ai

open
