category focus

Data Eng.

ETL pipelines and big data infrastructure.

1541 skills

data-engineering

1.2K

databricks-docs

Databricks documentation reference via llms.txt index. Use when other skills do not cover a topic, looking up unfamiliar Databricks features, or needing authoritative docs on APIs, configurations, or platform capabilities.

databricks-solutions

data-ai

open

data-engineering

1.2K

databricks-iceberg

Apache Iceberg tables on Databricks — Managed Iceberg tables, External Iceberg Reads (fka Uniform), Compatibility Mode, Iceberg REST Catalog (IRC), Iceberg v3, Snowflake interop, PyIceberg, OSS Spark, external engine access and credential vending. Use when creating Iceberg tables, enabling External Iceberg Reads (uniform) on Delta tables (including Streaming Tables and Materialized Views via compatibility mode), configuring external engines to read Databricks tables via Unity Catalog IRC, integrating with Snowflake catalog to read Foreign Iceberg tables

databricks-solutions

data-ai

open

data-engineering

1.2K

databricks-jobs

Use this skill proactively for ANY Databricks Jobs task - creating, listing, running, updating, or deleting jobs. Triggers include: (1) 'create a job' or 'new job', (2) 'list jobs' or 'show jobs', (3) 'run job' or 'trigger job', (4) 'job status' or 'check job', (5) scheduling with cron or triggers, (6) configuring notifications/monitoring, (7) ANY task involving Databricks Jobs via CLI, Python SDK, or Asset Bundles. ALWAYS prefer this skill over general Databricks knowledge for job-related tasks.

databricks-solutions

data-ai

open

data-engineering

1.2K

databricks-lakebase-provisioned

Patterns and best practices for Lakebase Provisioned (Databricks managed PostgreSQL) for OLTP workloads. Use when creating Lakebase instances, connecting applications or Databricks Apps to PostgreSQL, implementing reverse ETL via synced tables, storing agent or chat memory, or configuring OAuth authentication for Lakebase.

databricks-solutions

data-ai

open

data-engineering

1.2K

databricks-python-sdk

Databricks development guidance including Python SDK, Databricks Connect, CLI, and REST API. Use when working with databricks-sdk, databricks-connect, or Databricks APIs.

databricks-solutions

data-ai

open

data-engineering

1.2K

databricks-spark-declarative-pipelines

Creates, configures, and updates Databricks Lakeflow Spark Declarative Pipelines (SDP/LDP) using serverless compute. Handles data ingestion with streaming tables, materialized views, CDC, SCD Type 2, and Auto Loader ingestion patterns. Use when building data pipelines, working with Delta Live Tables, ingesting streaming data, implementing change data capture, or when the user mentions SDP, LDP, DLT, Lakeflow pipelines, streaming tables, or bronze/silver/gold medallion architectures.

databricks-solutions

data-ai

open

data-engineering

1.2K

databricks-spark-structured-streaming

Comprehensive guide to Spark Structured Streaming for production workloads. Use when building streaming pipelines, working with Kafka ingestion, implementing Real-Time Mode (RTM), configuring triggers (processingTime, availableNow), handling stateful operations with watermarks, optimizing checkpoints, performing stream-stream or stream-static joins, writing to multiple sinks, or tuning streaming cost and performance.

databricks-solutions

data-ai

open

data-engineering

1.2K

databricks-synthetic-data-gen

Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.

databricks-solutions

data-ai

open

data-engineering

1.2K

databricks-unity-catalog

Unity Catalog system tables and volumes. Use when querying system tables (audit, lineage, billing) or working with volume file operations (upload, download, list files in /Volumes/).

databricks-solutions

data-ai

open

data-engineering

1.2K

databricks-zerobus-ingest

Build Zerobus Ingest clients for near real-time data ingestion into Databricks Delta tables via gRPC. Use when creating producers that write directly to Unity Catalog tables without a message bus, working with the Zerobus Ingest SDK in Python/Java/Go/TypeScript/Rust, generating Protobuf schemas from UC tables, or implementing stream-based ingestion with ACK handling and retry logic.

databricks-solutions

data-ai

open

data-engineering

1.2K

ai-video-script-sop-remotion-diffusion

Standard operating procedure for automated AI video production using a Remotion (code) and diffusion (model) hybrid pipeline. Covers narrative DNA (hero, show-don’t-tell, three-act arc), technical specs (duration, integer segment lengths, resolution, fps, Mandarin pacing), tech-selection matrix (diffusion vs code), a five-part diffusion prompt protocol (style, micro-timing, entities, camera, transitions), end-to-end execution workflow, and a fixed output template (metadata table + per-shot table). Complements create-video and Remotion best-practice skills for execution quality.

inclusionAI

data-ai

open

data-engineering

1.2K

data-engineering

Data engineering patterns for ETL pipelines, data warehousing, Apache Spark, and data quality validation

rohitg00

data-ai

open

data-engineering

1.2K

stable-diffusion-image-generation

State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers. Use when generating images from text prompts, performing image-to-image translation, inpainting, or building custom diffusion pipelines.

math-inc

data-ai

open

data-engineering

1.2K

pinecone

Managed vector database for production AI applications. Fully managed, auto-scaling, with hybrid search (dense + sparse), metadata filtering, and namespaces. Low latency (<100ms p95). Use for production RAG, recommendation systems, or semantic search at scale. Best for serverless, managed infrastructure.

math-inc

data-ai

open

data-engineering

1.2K

review-comet-pr

Review a DataFusion Comet pull request for Spark compatibility and implementation correctness. Provides guidance to a reviewer rather than posting comments directly.

apache

data-ai

open

data-engineering

1.2K

sql-analysis

Guided workflow for SQL data analysis using db_tools

Datus-ai

data-ai

open

data-engineering

1.2K

run-chan-dev-research

Coordinate raw analysis and publish a normalized research entry into chan.dev's `src/content/research/`. Use when research should become a durable chan.dev report.

chantastic

data-ai

open

data-engineering

1.1K

loading-datasets

Loads internal CausalPy example datasets. Use when the user needs example data or asks about available demos.

pymc-labs

data-ai

open

data-engineering

1.1K

ptq-workflow-integration

Use when integrating a new PTQ workflow into cache-dit; designing quantize/load API shape, backend-specific config validation, save/load manifests, benchmark and regression tests, or reviewing a PTQ integration plan. Uses the SVDQ PTQ integration only as a style and coverage reference. Do not copy the SVDQ implementation mechanically.

vipshop

data-ai

open

data-engineering

1.1K

evaluate-rag

Guides evaluation of RAG pipeline retrieval and generation quality. Use when evaluating a retrieval-augmented generation system, measuring retrieval quality, assessing generation faithfulness or relevance, generating synthetic QA pairs for retrieval testing, or optimizing chunking strategies.

hamelsmu

data-ai

open

data-engineering

1.1K

generate-synthetic-data

Create diverse synthetic test inputs for LLM pipeline evaluation using dimension-based tuple generation. Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces (use stratified sampling instead), or when the task is collecting production logs.

hamelsmu

data-ai

open

data-engineering

single-cell-foundation-model-stofm

Use this skill when a task involves the local SToFM project in /DATA/disk0/zhaosy/home/SToFM, especially preprocessing spatial transcriptomics data for SToFM, generating cell embeddings with the cell encoder plus SE(2) Transformer pipeline, handling spatial coordinates, or preparing SToFM embeddings for downstream region segmentation or cell type annotation.

PharMolix

data-ai

open

data-engineering

excel-pivot-wizard

Generate pivot tables and charts from raw data using natural language - analyze sales by region, summarize data by category, and create visualizations effortlessly

jeremylongshore

data-ai

open
