skills.homes capability registry


category focus

Data Eng.

ETL pipelines and big data infrastructure.

1541 skills

data-engineering

10.1K

electric-new-feature

End-to-end guide for adding a new synced feature with Electric and TanStack DB. Covers the full journey: design Postgres schema, set REPLICA IDENTITY FULL, define shape, create proxy route, set up TanStack DB collection with electricCollectionOptions, implement optimistic mutations with txid handshake (pg_current_xact_id, awaitTxId), and build live queries with useLiveQuery. Also covers migration from old ElectricSQL (electrify/db pattern does not exist), current API patterns (table as query param not path, handle not shape_id). Load when building a new feature from scratch.

electric-sql

data-ai

data-engineering

10.1K

electric-orm

Use Electric with Drizzle ORM or Prisma for the write path. Covers getting pg_current_xact_id() from ORM transactions using Drizzle tx.execute(sql) and Prisma $queryRaw, running migrations that preserve REPLICA IDENTITY FULL, and schema management patterns compatible with Electric shapes. Load when using Drizzle or Prisma alongside Electric for writes.

electric-sql

data-ai

data-engineering

9.6K

local-environment

Local development environment management for Polar using Docker

polarsource

data-ai

data-engineering

8.6K

bloblang-authoring

This skill should be used when users need to create or debug Bloblang transformation scripts. Trigger when users ask about transforming data, mapping fields, parsing JSON/CSV/XML, converting timestamps, filtering arrays, or mention "bloblang", "blobl", "mapping processor", or describe any data transformation need like "convert this to that" or "transform my JSON".

redpanda-data

data-ai

data-engineering

8.6K

pipeline-assistant

This skill should be used when users need to create or fix Redpanda Connect pipeline configurations. Trigger when users mention "config", "pipeline", "YAML", "create a config", "fix my config", "validate my pipeline", or describe a streaming pipeline need like "read from Kafka and write to S3".

redpanda-data

data-ai

data-engineering

8.5K

beam-concepts

Explains core Apache Beam programming model concepts including PCollections, PTransforms, Pipelines, and Runners. Use when learning Beam fundamentals or explaining pipeline concepts.

apache

data-ai

data-engineering

8.5K

io-connectors

Guides development and usage of I/O connectors in Apache Beam. Use when working with I/O connectors, creating new connectors, or debugging data source/sink issues.

apache

data-ai

data-engineering

8.5K

python-development

Guides Python SDK development in Apache Beam, including environment setup, testing, building, and running pipelines. Use when working with Python code in sdks/python/.

apache

data-ai

data-engineering

8.5K

runners

Guides understanding and working with Apache Beam runners (Direct, Dataflow, Flink, Spark, etc.). Use when configuring pipelines for different execution environments or debugging runner-specific issues.

apache

data-ai

data-engineering

8.2K

smart-search

Construct optimized search URLs for major platforms and navigate to results with the browser. Replaces the built-in web_search tool for targeted, platform-specific searches.

TeamWiseFlow

data-ai

data-engineering

8.1K

spark-engineer

Use when writing Spark jobs, debugging performance issues, or configuring cluster settings for Apache Spark applications, distributed data processing pipelines, or big data workloads. Invoke to write DataFrame transformations, optimize Spark SQL queries, implement RDD pipelines, tune shuffle operations, configure executor memory, process .parquet files, handle data partitioning, or build structured streaming analytics.

Jeffallan

data-ai

data-engineering

8.1K

pandas-pro

Performs pandas DataFrame operations for data analysis, manipulation, and transformation. Use when working with pandas DataFrames, data cleaning, aggregation, merging, or time series analysis. Invoke for data manipulation tasks such as joining DataFrames on multiple keys, pivoting tables, resampling time series, handling NaN values with interpolation or forward-fill, groupby aggregations, type conversion, or performance optimization of large datasets.

Jeffallan

data-ai

data-engineering

7.5K

image-preprocessing-pipeline

image preprocessing pipeline

kreuzberg-dev

data-ai

data-engineering

7.5K

ocr-caching-strategy

ocr caching strategy

kreuzberg-dev

data-ai

data-engineering

7.5K

priority-selection-system

priority selection system

kreuzberg-dev

data-ai

data-engineering

7.5K

extraction-pipeline-patterns

extraction pipeline patterns

kreuzberg-dev

data-ai

data-engineering

6.6K

polars

Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100GB datasets, ETL pipelines, faster pandas replacement. For larger-than-RAM data use dask or vaex.

K-Dense-AI

data-ai

data-engineering

6.6K

vaex

Use this skill for processing and analyzing large tabular datasets (billions of rows) that exceed available RAM. Vaex excels at out-of-core DataFrame operations, lazy evaluation, fast aggregations, efficient visualization of big data, and machine learning on large datasets. Apply when users need to work with large CSV/HDF5/Arrow/Parquet files, perform fast statistics on massive datasets, create visualizations of big data, or build ML pipelines that do not fit in memory.

K-Dense-AI

data-ai

data-engineering

6.6K

zarr-python

Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.

K-Dense-AI

data-ai

data-engineering

6.6K

deepspeed

Expert guidance for distributed training with DeepSpeed: ZeRO optimization stages, pipeline parallelism, FP16/BF16/FP8, 1-bit Adam, and sparse attention.

Orchestra-Research

data-ai

data-engineering

6.6K

llamaindex

Data framework for building LLM applications with RAG. Specializes in document ingestion (300+ connectors), indexing, and querying. Features vector indices, query engines, agents, and multi-modal support. Use for document Q&A, chatbots, knowledge retrieval, or building RAG pipelines. Best for data-centric LLM applications.

Orchestra-Research

data-ai

data-engineering

6.6K

pinecone

Managed vector database for production AI applications. Fully managed, auto-scaling, with hybrid search (dense + sparse), metadata filtering, and namespaces. Low latency (<100ms p95). Use for production RAG, recommendation systems, or semantic search at scale. Best for serverless, managed infrastructure.

Orchestra-Research

data-ai

data-engineering

6.6K

stable-diffusion-image-generation

State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers. Use when generating images from text prompts, performing image-to-image translation, inpainting, or building custom diffusion pipelines.

Orchestra-Research

data-ai

data-engineering

6.1K

dse-loop

Autonomous design space exploration loop for computer architecture and EDA. Runs a program, analyzes results, tunes parameters, and iterates until the objective is met or a timeout is reached. Use when the user says "DSE", "design space exploration", "sweep parameters", "optimize", "find best config", or wants iterative parameter tuning.

wanshuiyin

data-ai
