Data Eng.

ETL pipelines and big data infrastructure.

1541 skills, sorted by stars
data-engineering
166

data-scientist

Data science methodology for Python research: EDA, validation, causal inference (IV, DiD, RD, synthetic control), clustering/PCA/UMAP, supervised ML, geospatial, visualization. Method selection guidance. For syntax, load tool-specific skills.

DAAF-Contribution-Community
data-ai
data-engineering
166

polars

Polars DataFrame library for high-performance data manipulation. Lazy/eager execution, expressions, I/O (CSV, Parquet, JSON), aggregations, joins, string/datetime ops, pandas interop. Use for Polars DataFrames or reading/writing Parquet files.

DAAF-Contribution-Community
data-ai
data-engineering
165

data-cleaning-pipeline

Build robust processes for data cleaning, missing-value imputation, outlier handling, and data transformation. Use for data preprocessing, data quality, and data pipeline automation.

aj-geddes
data-ai
data-engineering
165

event-sourcing

Implement event sourcing and CQRS patterns using event stores, aggregates, and projections. Use when building audit trails, temporal queries, or systems requiring full history.

aj-geddes
data-ai
data-engineering
165

ml-pipeline-automation

Build end-to-end ML pipelines with automated data processing, training, validation, and deployment using Airflow, Kubeflow, and Jenkins

aj-geddes
data-ai
data-engineering
162

integrations

External data sources, connectors, and custom data streams

alsk1992
data-ai
data-engineering
161

spring-ai-rag-media-pgvector

Build RAG pipelines for media-asset knowledge bases using Spring AI and PostgreSQL pgvector. Use when Codex needs to design database schema, ingestion/chunking/embedding workflow, and retrieval logic that prioritizes internal media knowledge before falling back to general model knowledge.

microwind
data-ai
data-engineering
160

commandkit-cache

Implement deterministic caching with @commandkit/cache. Use for 'use cache' directives, cacheTag/cacheLife strategy, revalidateTag invalidation, and provider setup for memory or Redis deployments.

neplextech
data-ai
data-engineering
158

connecting-streamlit-to-snowflake

Connect Streamlit apps to Snowflake. Use when setting up database connections, managing secrets, or querying Snowflake from a Streamlit app.

streamlit
data-ai
data-engineering
157

ml-pipeline-workflow

Build end-to-end MLOps pipelines from data preparation through model training, validation, and production deployment. Use when creating ML pipelines, implementing MLOps practices, or automating model training and deployment workflows.

Microck
data-ai
data-engineering
157

dmir-compiler-analysis

Analyze DTVM's dMIR intermediate representation and compilation pipeline. Translates EVM bytecode sequences into dMIR pseudocode, then into x86 pseudocode, and evaluates performance cost at each stage. Use when the user asks about dMIR instructions, EVM-to-dMIR conversion, dMIR-to-x86 lowering, JIT compilation cost analysis, EVM opcode performance evaluation, or EVM->dMIR performance optimization.

DTVMStack
data-ai
data-engineering
157

molecular-docking-pipeline

Molecular Docking Pipeline. Complete docking workflow: retrieve protein structure, predict binding pockets, prepare receptor, and dock ligand. Use this skill for structural biology tasks involving: retrieve protein data by pdbcode, run fpocket, convert pdb to pdbqt dock, quick molecule docking. Combines 4 tools from 2 SCP server(s).

InternScience
data-ai
data-engineering
157

agentdb-advanced-features

Master advanced AgentDB features including QUIC synchronization, multi-database management, custom distance metrics, hybrid search, and distributed systems integration. Use when building distributed AI systems, multi-agent coordination, or advanced vector search applications.

Microck
data-ai
data-engineering
157

agentdb-performance-optimization

Optimize AgentDB performance with quantization (4-32x memory reduction), HNSW indexing (150x faster search), caching, and batch operations. Use when optimizing memory usage, improving search speed, or scaling to millions of vectors.

Microck
data-ai
data-engineering
157

bioservices

Primary Python tool for 40+ bioinformatics services. Preferred for multi-database workflows: UniProt, KEGG, ChEMBL, PubChem, Reactome, QuickGO. Unified API for queries, ID mapping, pathway analysis. For direct REST control, use individual database skills (uniprot-database, kegg-database).

Microck
data-ai
data-engineering
157

dnanexus-integration

DNAnexus cloud genomics platform. Build apps/applets, manage data (upload/download), dxpy Python SDK, run workflows, FASTQ/BAM/VCF, for genomics pipeline development and execution.

Microck
data-ai
data-engineering
157

data-engineering

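Data engineering. Airflow, Dagster, Kafka Streams, Flink, dbt, data pipelines, stream processing, data quality. Route here when the user mentions data pipelines, ETL, stream processing, or data quality.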

telagod
data-ai
data-engineering
156

dask

Distributed computing for larger-than-RAM pandas/NumPy workflows. Use when you need to scale existing pandas/NumPy code beyond memory or across clusters. Best for parallel file processing, distributed ML, and integration with existing pandas code. For out-of-core analytics on a single machine, use vaex; for in-memory speed, use polars.

lamm-mit
data-ai
data-engineering
156

dnanexus-integration

DNAnexus cloud genomics platform. Build apps/applets, manage data (upload/download), dxpy Python SDK, run workflows, FASTQ/BAM/VCF, for genomics pipeline development and execution.

lamm-mit
data-ai
data-engineering
156

lamindb

This skill should be used when working with LaminDB, an open-source data framework for biology that makes data queryable, traceable, reproducible, and FAIR. Use when managing biological datasets (scRNA-seq, spatial, flow cytometry, etc.), tracking computational workflows, curating and validating data with biological ontologies, building data lakehouses, or ensuring data lineage and reproducibility in biological research. Covers data management, annotation, ontologies (genes, cell types, diseases, tissues), schema validation, integrations with workflow managers (Nextflow, Snakemake) and MLOps platforms (W&B, MLflow), and deployment strategies.

lamm-mit
data-ai
data-engineering
156

opentargets-database

Query the Open Targets Platform for target-disease associations, drug target discovery, tractability/safety data, genetics/omics evidence, and known drugs. Use for therapeutic target identification.

lamm-mit
data-ai
data-engineering
156

polars

Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100GB datasets, ETL pipelines, faster pandas replacement. For larger-than-RAM data use dask or vaex.

lamm-mit
data-ai