Data Engineering

ETL pipelines and big data infrastructure.

1541 skills
data-engineering
10.1K

electric-new-feature

End-to-end guide for adding a new synced feature with Electric and TanStack DB. Covers the full journey: design Postgres schema, set REPLICA IDENTITY FULL, define shape, create proxy route, set up TanStack DB collection with electricCollectionOptions, implement optimistic mutations with txid handshake (pg_current_xact_id, awaitTxId), and build live queries with useLiveQuery. Also covers migration from old ElectricSQL (electrify/db pattern does not exist), current API patterns (table as query param not path, handle not shape_id). Load when building a new feature from scratch.

electric-sql
data-ai
data-engineering
10.1K

electric-orm

Use Electric with Drizzle ORM or Prisma for the write path. Covers getting pg_current_xact_id() from ORM transactions using Drizzle tx.execute(sql) and Prisma $queryRaw, running migrations that preserve REPLICA IDENTITY FULL, and schema management patterns compatible with Electric shapes. Load when using Drizzle or Prisma alongside Electric for writes.

electric-sql
data-ai
data-engineering
9.6K

local-environment

Local development environment management for Polar using Docker.

polarsource
data-ai
data-engineering
8.6K

bloblang-authoring

This skill should be used when users need to create or debug Bloblang transformation scripts. Trigger when users ask about transforming data, mapping fields, parsing JSON/CSV/XML, converting timestamps, filtering arrays, or mention "bloblang", "blobl", "mapping processor", or describe any data transformation need like "convert this to that" or "transform my JSON".

redpanda-data
data-ai
data-engineering
8.6K

pipeline-assistant

This skill should be used when users need to create or fix Redpanda Connect pipeline configurations. Trigger when users mention "config", "pipeline", "YAML", "create a config", "fix my config", "validate my pipeline", or describe a streaming pipeline need like "read from Kafka and write to S3".

redpanda-data
data-ai
data-engineering
8.5K

beam-concepts

Explains core Apache Beam programming model concepts including PCollections, PTransforms, Pipelines, and Runners. Use when learning Beam fundamentals or explaining pipeline concepts.

apache
data-ai
data-engineering
8.5K

io-connectors

Guides development and usage of I/O connectors in Apache Beam. Use when working with I/O connectors, creating new connectors, or debugging data source/sink issues.

apache
data-ai
data-engineering
8.5K

python-development

Guides Python SDK development in Apache Beam, including environment setup, testing, building, and running pipelines. Use when working with Python code in sdks/python/.

apache
data-ai
data-engineering
8.5K

runners

Guides understanding and working with Apache Beam runners (Direct, Dataflow, Flink, Spark, etc.). Use when configuring pipelines for different execution environments or debugging runner-specific issues.

apache
data-ai
data-engineering
8.2K

smart-search

Construct optimized search URLs for major platforms and navigate to results with the browser. Replaces the built-in web_search tool for targeted, platform-specific searches.

TeamWiseFlow
data-ai
data-engineering
8.1K

spark-engineer

Use when writing Spark jobs, debugging performance issues, or configuring cluster settings for Apache Spark applications, distributed data processing pipelines, or big data workloads. Invoke to write DataFrame transformations, optimize Spark SQL queries, implement RDD pipelines, tune shuffle operations, configure executor memory, process .parquet files, handle data partitioning, or build structured streaming analytics.

Jeffallan
data-ai
data-engineering
8.1K

pandas-pro

Performs pandas DataFrame operations for data analysis, manipulation, and transformation. Use when working with pandas DataFrames, data cleaning, aggregation, merging, or time series analysis. Invoke for data manipulation tasks such as joining DataFrames on multiple keys, pivoting tables, resampling time series, handling NaN values with interpolation or forward-fill, groupby aggregations, type conversion, or performance optimization of large datasets.

Jeffallan
data-ai
data-engineering
6.6K

polars

Fast in-memory DataFrame library for datasets that fit in RAM. Use when pandas is too slow but data still fits in memory. Lazy evaluation, parallel execution, Apache Arrow backend. Best for 1-100GB datasets, ETL pipelines, faster pandas replacement. For larger-than-RAM data use dask or vaex.

K-Dense-AI
data-ai
data-engineering
6.6K

vaex

Use this skill for processing and analyzing large tabular datasets (billions of rows) that exceed available RAM. Vaex excels at out-of-core DataFrame operations, lazy evaluation, fast aggregations, efficient visualization of big data, and machine learning on large datasets. Apply when users need to work with large CSV/HDF5/Arrow/Parquet files, perform fast statistics on massive datasets, create visualizations of big data, or build ML pipelines that do not fit in memory.

K-Dense-AI
data-ai
data-engineering
6.6K

zarr-python

Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines.

K-Dense-AI
data-ai
data-engineering
6.6K

deepspeed

Expert guidance for distributed training with DeepSpeed: ZeRO optimization stages, pipeline parallelism, FP16/BF16/FP8, 1-bit Adam, and sparse attention.

Orchestra-Research
data-ai
data-engineering
6.6K

llamaindex

Data framework for building LLM applications with RAG. Specializes in document ingestion (300+ connectors), indexing, and querying. Features vector indices, query engines, agents, and multi-modal support. Use for document Q&A, chatbots, knowledge retrieval, or building RAG pipelines. Best for data-centric LLM applications.

Orchestra-Research
data-ai
data-engineering
6.6K

pinecone

Managed vector database for production AI applications. Fully managed, auto-scaling, with hybrid search (dense + sparse), metadata filtering, and namespaces. Low latency (<100ms p95). Use for production RAG, recommendation systems, or semantic search at scale. Best for serverless, managed infrastructure.

Orchestra-Research
data-ai
data-engineering
6.6K

stable-diffusion-image-generation

State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers. Use when generating images from text prompts, performing image-to-image translation, inpainting, or building custom diffusion pipelines.

Orchestra-Research
data-ai
data-engineering
6.1K

dse-loop

Autonomous design space exploration loop for computer architecture and EDA. Runs a program, analyzes results, tunes parameters, and iterates until the objective is met or a timeout is reached. Use when the user says "DSE", "design space exploration", "sweep parameters", "optimize", "find best config", or wants iterative parameter tuning.

wanshuiyin
data-ai