data-engineer
Build ETL pipelines, data warehouses, and streaming architectures. Implement Spark jobs, Airflow DAGs, and Kafka streams. Use PROACTIVELY for data pipeline design or analytics infrastructure.
Write efficient resolvers with DataLoader, batching, and N+1 prevention
Filter and search event datasets (logs) using OPAL. Use when you need to find specific log events by text search, regex patterns, or field values. Covers contains(), the tilde operator ~, field comparisons, boolean logic, and limit for sampling results. Does NOT cover aggregation (see aggregating-event-datasets skill).
Comprehensive Python data engineering patterns for AWS Data Lake, including PySpark, Pandas, Apache Airflow, AWS Glue, ETL pipelines, data quality, schema management, performance optimization, FastAPI services, streaming with Kafka/Kinesis, data validation with Great Expectations, testing strategies, error handling, logging, and production deployment on AWS EMR and Glue.
Use when loading data into BigQuery from CSV, JSON, Avro, Parquet files, Cloud Storage, or local files. Covers bq load command, source formats, schema detection, incremental loading, and handling parsing errors.
Error handling patterns for ERPNext/Frappe database operations. Use when handling DoesNotExistError, DuplicateEntryError, transaction failures, and query errors. Covers retry patterns and data integrity. V14/V15/V16 compatible. Triggers: database error, DoesNotExistError, DuplicateEntryError, transaction failed, query error.
Transforms prediction market data to Nixtla format (unique_id, ds, y). Maps arbitrary column names to required schema. Validates date and numeric types. Use when preparing prediction market datasets for Nixtla forecasting tools. Trigger with "convert to Nixtla format", "schema mapping", "transform data".
Empirical validation workflow for blockchain data collection pipelines before production implementation. Use when validating data sources, testing DuckDB integration, building POC collectors, or verifying complete fetch-to-storage pipelines for blockchain data.
Data analysis workflows and patterns for exploring, transforming, and visualizing data. Use when working with data, creating reports, or when users mention "data analysis", "analyze data", "data exploration", or "reporting".
Executive-grade data analysis with pandas/polars and McKinsey-quality visualizations. Use when analyzing data, building dashboards, creating investor presentations, or calculating SaaS metrics.
Runs Exploratory Data Analysis (EDA) following the mandatory validation workflow. Use when performing data analysis, exploring datasets, validating data quality, or when the user mentions EDA, data exploration, sanity checks, or data validation. Always run before main analysis queries.
This skill should be used when the user asks to "create a BPA rule", "write a Best Practice Analyzer rule", "improve a BPA expression", "fix expression for BPA", "analyze BPA annotations", "check model for best practices", "audit BPA rules", "discover BPA rules", "list all BPA rules", "validate BPA rules", or mentions Tabular Editor BPA rules. Provides guidance for creating, improving, auditing, and understanding Best Practice Analyzer rules for Power BI semantic models.
See the main Data Freshness and Latency skill for comprehensive coverage of freshness monitoring and SLO tracking.
Techniques and tools for ensuring the accuracy, completeness, and reliability of data across the pipeline.
Mapping the flow of data from source to destination for transparency, impact analysis, and troubleshooting.
Perform an end-to-end process mining analysis via a command-line workflow that progressively ingests, profiles, cleans, mines and reports on event logs using PM4Py. The workflow generates stage-based artefacts (including versioned notebooks) and pauses at decision checkpoints so the user can validate findings and choose how to proceed.
Monitoring and optimizing how quickly data flows through pipelines and ensuring it meets timeliness requirements.
Expert guidance for Polars dataframe manipulation in Python. Use this skill when working with dataframes, data processing, ETL pipelines, or any task involving tabular data manipulation. Provides best practices, performance optimization patterns, and comprehensive API usage for the Polars library.
This skill should be used when performing local data exploration, profiling, quality analysis, or transformation tasks using DuckDB. It handles CSV, Parquet, and JSON files, provides automated data quality reports, supports complex JSON transformations, and generates interactive HTML reports for data analysis.
Analyze data files using SQL queries with DataQL. Use when working with CSV, JSON, Parquet, Excel files or when the user mentions data analysis, filtering, aggregation, or SQL queries on files.
Ensures accuracy for HTI compliance and performance dashboards through data validation, deduplication, normalization, and integrity checks. Critica...
Build semantic queries with measures, dimensions, filters, and time dimensions for Drizzle Cube.