Data Engineering

ETL pipelines and big data infrastructure.

1541 skills, sorted by stars
data-engineering
18.1K

dask

Distributed computing for larger-than-RAM pandas/NumPy workflows. Use when you need to scale existing pandas/NumPy code beyond memory or across clusters. Best for parallel file processing, distributed ML, and integration with existing pandas code. For out-of-core analytics on a single machine, use vaex; for in-memory speed, use polars.

K-Dense-AI
data-ai
data-engineering
18.1K

dnanexus-integration

DNAnexus cloud genomics platform. Build apps and applets, manage data (upload/download), use the dxpy Python SDK, run workflows, and handle FASTQ/BAM/VCF files for genomics pipeline development and execution.

K-Dense-AI
data-ai
data-engineering
18.1K

lamindb

This skill should be used when working with LaminDB, an open-source data framework for biology that makes data queryable, traceable, reproducible, and FAIR. Use when managing biological datasets (scRNA-seq, spatial, flow cytometry, etc.), tracking computational workflows, curating and validating data with biological ontologies, building data lakehouses, or ensuring data lineage and reproducibility in biological research. Covers data management, annotation, ontologies (genes, cell types, diseases, tissues), schema validation, integrations with workflow managers (Nextflow, Snakemake) and MLOps platforms (W&B, MLflow), and deployment strategies.

K-Dense-AI
data-ai
data-engineering
18.1K

polars-bio

High-performance genomic interval operations and bioinformatics file I/O on Polars DataFrames. Overlap, nearest, merge, coverage, complement, subtract for BED/VCF/BAM/GFF intervals. Streaming, cloud-native, faster bioframe alternative.

K-Dense-AI
data-ai
data-engineering
18.1K

polars

Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100GB datasets, ETL pipelines, faster pandas replacement. For larger-than-RAM data use dask or vaex.

K-Dense-AI
data-ai
data-engineering
18.1K

vaex

Use this skill for processing and analyzing large tabular datasets (billions of rows) that exceed available RAM. Vaex excels at out-of-core DataFrame operations, lazy evaluation, fast aggregations, efficient visualization of big data, and machine learning on large datasets. Apply when users need to work with large CSV/HDF5/Arrow/Parquet files, perform fast statistics on massive datasets, create visualizations of big data, or build ML pipelines that do not fit in memory.

K-Dense-AI
data-ai
data-engineering
18.1K

zarr-python

Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.

K-Dense-AI
data-ai
data-engineering
17.6K

vaex

Use this skill for processing and analyzing large tabular datasets (billions of rows) that exceed available RAM. Vaex excels at out-of-core DataFrame operations, lazy evaluation, fast aggregations, efficient visualization of big data, and machine learning on large datasets. Apply when users need to work with large CSV/HDF5/Arrow/Parquet files, perform fast statistics on massive datasets, create visualizations of big data, or build ML pipelines that don't fit in memory.

davila7
data-ai
data-engineering
17.6K

polars

Fast DataFrame library (Apache Arrow). Select, filter, group_by, joins, lazy evaluation, CSV/Parquet I/O, expression API, for high-performance data analysis workflows.

davila7
data-ai
data-engineering
17.6K

senior-data-engineer

World-class data engineering skill for building scalable data pipelines, ETL/ELT systems, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, and the modern data stack. Includes data modeling, pipeline orchestration, data quality, and DataOps. Use when designing data architectures, building data pipelines, optimizing data workflows, or implementing data governance.

davila7
data-ai
data-engineering
17.6K

zarr-python

Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.

davila7
data-ai
data-engineering
16.5K

data-pipeline

Data pipeline expert for ETL, Apache Spark, Airflow, dbt, and data quality.

RightNow-AI
data-ai
data-engineering
16.5K

docker

Docker expert for containers, Compose, Dockerfiles, and debugging.

RightNow-AI
data-ai
data-engineering
16.1K

major-task

Work through heavyweight framework or library tasks with planning-first research, selective deep analysis, and a rigorous handoff.

udecode
data-ai
data-engineering
14.6K

cosmos-provider

Implementation details for the EF Core Azure Cosmos DB provider. Use when changing Cosmos-specific code.

dotnet
data-ai
data-engineering
14.2K

abp-ef-core

ABP Entity Framework Core: DbContext, entity configuration, EfCoreRepository implementation, migrations (dotnet ef migrations add), and data seeding. Use when working in EntityFrameworkCore projects, adding migrations, or implementing EF Core repositories.

abpframework
data-ai
data-engineering
10.9K

data-loading

Optimize the data loading pipeline to prevent GPU starvation. Use when setting up a DataLoader or data preprocessing.

aiming-lab
data-ai
data-engineering
10.4K

status

Show DAG state, agent progress, and branch status for an AgentHub session.

alirezarezvani
data-ai
data-engineering
10.4K

database-designer

Use when the user asks to design database schemas, plan data migrations, optimize queries, choose between SQL and NoSQL, or model data relationships.

alirezarezvani
data-ai
data-engineering
10.4K

senior-data-engineer

Data engineering skill for building scalable data pipelines, ETL/ELT systems, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, and the modern data stack. Includes data modeling, pipeline orchestration, data quality, and DataOps. Use when designing data architectures, building data pipelines, optimizing data workflows, implementing data governance, or troubleshooting data issues.

alirezarezvani
data-ai
data-engineering
10.4K

snowflake-development

Use when writing Snowflake SQL, building data pipelines with Dynamic Tables or Streams/Tasks, using Cortex AI functions, creating Cortex Agents, writing Snowpark Python, configuring dbt for Snowflake, or troubleshooting Snowflake errors.

alirezarezvani
data-ai