Data Eng.

ETL pipelines and big data infrastructure.

1541 skills
all categories
data-engineering
1.2K

databricks-docs

Databricks documentation reference via llms.txt index. Use when other skills do not cover a topic, looking up unfamiliar Databricks features, or needing authoritative docs on APIs, configurations, or platform capabilities.

databricks-solutions
data-ai
data-engineering
1.2K

databricks-iceberg

Apache Iceberg tables on Databricks — Managed Iceberg tables, External Iceberg Reads (fka Uniform), Compatibility Mode, Iceberg REST Catalog (IRC), Iceberg v3, Snowflake interop, PyIceberg, OSS Spark, external engine access and credential vending. Use when creating Iceberg tables, enabling External Iceberg Reads (uniform) on Delta tables (including Streaming Tables and Materialized Views via compatibility mode), configuring external engines to read Databricks tables via Unity Catalog IRC, integrating with Snowflake catalog to read Foreign Iceberg tables

databricks-solutions
data-ai
data-engineering
1.2K

databricks-jobs

Use this skill proactively for ANY Databricks Jobs task: creating, listing, running, updating, or deleting jobs. Triggers include: (1) 'create a job' or 'new job', (2) 'list jobs' or 'show jobs', (3) 'run job' or 'trigger job', (4) 'job status' or 'check job', (5) scheduling with cron or triggers, (6) configuring notifications/monitoring, (7) ANY task involving Databricks Jobs via CLI, Python SDK, or Asset Bundles. ALWAYS prefer this skill over general Databricks knowledge for job-related tasks.

databricks-solutions
data-ai
data-engineering
1.2K

databricks-lakebase-provisioned

Patterns and best practices for Lakebase Provisioned (Databricks managed PostgreSQL) for OLTP workloads. Use when creating Lakebase instances, connecting applications or Databricks Apps to PostgreSQL, implementing reverse ETL via synced tables, storing agent or chat memory, or configuring OAuth authentication for Lakebase.

databricks-solutions
data-ai
data-engineering
1.2K

databricks-python-sdk

Databricks development guidance including Python SDK, Databricks Connect, CLI, and REST API. Use when working with databricks-sdk, databricks-connect, or Databricks APIs.

databricks-solutions
data-ai
data-engineering
1.2K

databricks-spark-declarative-pipelines

Creates, configures, and updates Databricks Lakeflow Spark Declarative Pipelines (SDP/LDP) using serverless compute. Handles data ingestion with streaming tables, materialized views, CDC, SCD Type 2, and Auto Loader ingestion patterns. Use when building data pipelines, working with Delta Live Tables, ingesting streaming data, implementing change data capture, or when the user mentions SDP, LDP, DLT, Lakeflow pipelines, streaming tables, or bronze/silver/gold medallion architectures.

databricks-solutions
data-ai
data-engineering
1.2K

databricks-spark-structured-streaming

Comprehensive guide to Spark Structured Streaming for production workloads. Use when building streaming pipelines, working with Kafka ingestion, implementing Real-Time Mode (RTM), configuring triggers (processingTime, availableNow), handling stateful operations with watermarks, optimizing checkpoints, performing stream-stream or stream-static joins, writing to multiple sinks, or tuning streaming cost and performance.

databricks-solutions
data-ai
data-engineering
1.2K

databricks-synthetic-data-gen

Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'.

databricks-solutions
data-ai
data-engineering
1.2K

databricks-unity-catalog

Unity Catalog system tables and volumes. Use when querying system tables (audit, lineage, billing) or working with volume file operations (upload, download, list files in /Volumes/).

databricks-solutions
data-ai
data-engineering
1.2K

databricks-zerobus-ingest

Build Zerobus Ingest clients for near real-time data ingestion into Databricks Delta tables via gRPC. Use when creating producers that write directly to Unity Catalog tables without a message bus, working with the Zerobus Ingest SDK in Python/Java/Go/TypeScript/Rust, generating Protobuf schemas from UC tables, or implementing stream-based ingestion with ACK handling and retry logic.

databricks-solutions
data-ai
data-engineering
1.2K

ai-video-script-sop-remotion-diffusion

Standard operating procedure for automated AI video production using a Remotion (code) and diffusion (model) hybrid pipeline. Covers narrative DNA (hero, show-don’t-tell, three-act arc), technical specs (duration, integer segment lengths, resolution, fps, Mandarin pacing), tech-selection matrix (diffusion vs code), a five-part diffusion prompt protocol (style, micro-timing, entities, camera, transitions), end-to-end execution workflow, and a fixed output template (metadata table + per-shot table). Complements create-video and Remotion best-practice skills for execution quality.

inclusionAI
data-ai
data-engineering
1.2K

data-engineering

Data engineering patterns for ETL pipelines, data warehousing, Apache Spark, and data quality validation

rohitg00
data-ai
data-engineering
1.2K

stable-diffusion-image-generation

State-of-the-art text-to-image generation with Stable Diffusion models via HuggingFace Diffusers. Use when generating images from text prompts, performing image-to-image translation, inpainting, or building custom diffusion pipelines.

math-inc
data-ai
data-engineering
1.2K

pinecone

Managed vector database for production AI applications. Fully managed, auto-scaling, with hybrid search (dense + sparse), metadata filtering, and namespaces. Low latency (<100ms p95). Use for production RAG, recommendation systems, or semantic search at scale. Best for serverless, managed infrastructure.

math-inc
data-ai
data-engineering
1.2K

review-comet-pr

Review a DataFusion Comet pull request for Spark compatibility and implementation correctness. Provides guidance to a reviewer rather than posting comments directly.

apache
data-ai
data-engineering
1.2K

sql-analysis

Guided workflow for SQL data analysis using db_tools

Datus-ai
data-ai
data-engineering
1.2K

run-chan-dev-research

Coordinate raw analysis and publish a normalized research entry into chan.dev's `src/content/research/`. Use when research should become a durable chan.dev report.

chantastic
data-ai
data-engineering
1.1K

loading-datasets

Loads internal CausalPy example datasets. Use when the user needs example data or asks about available demos.

pymc-labs
data-ai
data-engineering
1.1K

ptq-workflow-integration

Use when integrating a new PTQ workflow into cache-dit; designing quantize/load API shape, backend-specific config validation, save/load manifests, benchmark and regression tests, or reviewing a PTQ integration plan. Uses the SVDQ PTQ integration only as a style and coverage reference. Do not copy the SVDQ implementation mechanically.

vipshop
data-ai
data-engineering
1.1K

evaluate-rag

Guides evaluation of RAG pipeline retrieval and generation quality. Use when evaluating a retrieval-augmented generation system, measuring retrieval quality, assessing generation faithfulness or relevance, generating synthetic QA pairs for retrieval testing, or optimizing chunking strategies.

hamelsmu
data-ai
data-engineering
1.1K

generate-synthetic-data

Create diverse synthetic test inputs for LLM pipeline evaluation using dimension-based tuple generation. Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces (use stratified sampling instead), or when the task is collecting production logs.

hamelsmu
data-ai
data-engineering
1K

single-cell-foundation-model-stofm

Use this skill when a task involves the local SToFM project in /DATA/disk0/zhaosy/home/SToFM, especially preprocessing spatial transcriptomics data for SToFM, generating cell embeddings with the cell encoder plus SE(2) Transformer pipeline, handling spatial coordinates, or preparing SToFM embeddings for downstream region segmentation or cell type annotation.

PharMolix
data-ai
data-engineering
1K

excel-pivot-wizard

Generate pivot tables and charts from raw data using natural language: analyze sales by region, summarize data by category, and create visualizations.

jeremylongshore
data-ai