Data Engineering Technical Assessment

Welcome to the data engineering technical assessment! This repository contains exercises designed to evaluate your data pipeline design skills, problem-solving ability, and production thinking.


📋 What's Included

Exercise 1: Batch Level 1 - Full ETL Pipeline Implementation (30-45 min)

Objective: Implement a complete ETL pipeline from raw data ingestion through staging transformations to production analytics tables.

What you'll do:

  • Ingest raw CSV data into BigQuery
  • Design and implement staging transformations (prepare data for analytics; a minimal sketch follows below)
  • Create production aggregations for business analytics
  • Choose between SQL or Python implementation approach
  • Write pseudocode (for the live technical interview)

Location: exercises/batch_level_1/
Instructions: See exercises/batch_level_1/README.md
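
To make the staging step concrete, here is a minimal sketch of one staging transformation issued from Python with the google-cloud-bigquery client. The column names and cleaning rules are illustrative assumptions, not the seed files' actual schema:

from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")

# Hypothetical example: clean and type the raw users table for analytics use.
staging_sql = """
CREATE OR REPLACE TABLE `candidate_yourname_staging.users` AS
SELECT
  SAFE_CAST(user_id AS INT64)                     AS user_id,
  LOWER(TRIM(email))                              AS email,
  COALESCE(NULLIF(TRIM(country), ''), 'unknown')  AS country,
  SAFE_CAST(signup_date AS DATE)                  AS signup_date
FROM `candidate_yourname_raw.users`
"""
client.query(staging_sql).result()  # wait for the job to finish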

Exercise 2: Batch Level 2 - Incremental Daily Build (20 min)

Objective: Design an idempotent daily ETL pipeline that handles late-arriving and duplicate data.

What you'll do:

  • Design incremental data processing logic
  • Handle late-arriving and duplicate data
  • Implement a deduplication strategy (see the sketch below)
  • Ensure idempotency
  • Choose between SQL or Python implementation approach

Location: exercises/batch_level_2/
Instructions: See exercises/batch_level_2/README.md
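
As a rough illustration of the deduplication idea in pandas (column names like event_id and ingested_at are assumptions; the exercise README defines the real schema):

import pandas as pd

# Hypothetical example: collapse resent events, keeping the latest copy of each.
new_files = ["events_part_001.csv", "events_part_002.csv"]  # placeholder paths
events = pd.concat([pd.read_csv(path) for path in new_files], ignore_index=True)
events = (
    events.sort_values("ingested_at")                       # newest version last
          .drop_duplicates(subset="event_id", keep="last")  # one row per event
)

Because the result depends only on the input rows, rerunning the same step over the same files yields the same output, which is what makes the daily build safe to replay.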


🚀 Getting Started

Prerequisites

  • Python 3.8+
  • Git

Repository Structure

├── .github/                         # Workflow configuration
├── config/                          # Config templates
├── data/                            # Sample data (if any)
├── docs/                            # Additional documentation
├── exercises/
│   ├── config.json
│   ├── batch_level_1/
│   │   ├── CANDIDATE_INSTRUCTIONS.md
│   │   ├── Dockerfile
│   │   ├── README.md
│   │   ├── README_CI_CD.md
│   │   ├── scripts/
│   │   ├── seeds/
│   │   ├── src/
│   │   ├── python/                  # Python approach (optional)
│   │   └── sql/                     # SQL approach (optional)
│   └── batch_level_2/
│       ├── Dockerfile
│       ├── README.md
│       ├── scripts/
│       ├── python/                 # Python approach
│       ├── sql/                    # SQL approach
│       ├── seeds/
│       └── src/
├── src/                             # Shared pipeline code
├── requirements.txt                 # Python dependencies
└── README.md                        # This file

Setup Instructions

  1. Clone this repository

    git clone https://github.com/dbarrios83/data-engineer-candidate-test.git
    cd data-engineer-candidate-test
  2. Set up Python environment

    Windows PowerShell:

    python -m venv .venv
    .venv\Scripts\Activate.ps1
    pip install -r requirements.txt

    Linux/Mac:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  3. Set your candidate ID (for Batch Level 1)

    Windows PowerShell:

    $env:CANDIDATE_ID = "candidate_yourname"

    Linux/Mac:

    export CANDIDATE_ID="candidate_yourname"

    This creates isolated BigQuery datasets: candidate_yourname_raw, candidate_yourname_staging, candidate_yourname_prod

  4. Verify setup

    cd exercises/batch_level_1/candidate_solution
    python scripts/test_bigquery_connection.py

📤 Submission Guidelines

Single Exercise Per Pull Request

Important: Each PR should address only one exercise at a time.

Why?

  • Cleaner code review
  • Independent evaluation feedback for each exercise
  • Easier to iterate and fix issues separately

How to structure your submissions:

✅ Correct approach:

PR #1: Implement Exercise 1 (batch_level_1)
  - Changes only in: exercises/batch_level_1/candidate_solution/

PR #2: Implement Exercise 2 (batch_level_2)
  - Changes only in: exercises/batch_level_2/

❌ Incorrect approach:

PR #1: Implement both exercises
  - Changes in: exercises/batch_level_1/candidate_solution/
  - AND exercises/batch_level_2/
  ❌ This PR will be rejected!

If you accidentally modify both exercises:

  1. Create a new branch with only Exercise 1 changes
  2. Create a separate branch with only Exercise 2 changes
  3. Open two separate PRs
  4. Close the original PR

🤖 Automated Code Evaluation

When you push your code and open a Pull Request, an automated evaluation workflow will:

  1. Clone your solution at the commit you specified
  2. Execute your code in an isolated Cloud Run environment
  3. Validate outputs against expected schemas and metrics
  4. Post results as a comment on your PR

What to expect:

  • Evaluation starts automatically when you open a PR
  • Results appear as a comment within a few minutes
  • Shows: ✅ Passed tests, ❌ Failed validations, 📊 Performance metrics
  • You can fix issues and push new commits for re-evaluation

How to trigger evaluation:

# Make changes to your solution
git add exercises/batch_level_*/candidate_solution/
git commit -m "feat: implement ETL pipeline"
git push origin your-branch

# Open a Pull Request on GitHub
# Evaluation starts automatically!

Viewing results:

  • Go to your Pull Request
  • Scroll to "Checks" or "Comments" section
  • See detailed feedback from the evaluator

📖 Working on Exercises

Batch Level 1 - Full Implementation

This exercise requires working code that runs in BigQuery:

cd exercises/batch_level_1/candidate_solution

# Choose your approach: SQL or Python (delete unused stubs)
# Option 1: SQL approach - Edit sql/*.sql files
# Option 2: Python approach - Edit python/*.py files

# Run the full pipeline
python -m src.pipeline full

# Verify results in BigQuery:
# - candidate_yourname_staging.users
# - candidate_yourname_staging.payments
# - candidate_yourname_prod.country_revenue
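
One possible shape for the production aggregation, issued from Python as a BigQuery SQL statement (column names such as country and amount are assumptions; use the actual staging schema you built):

from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")

# Hypothetical example: aggregate staged payments into country-level revenue.
query = """
CREATE OR REPLACE TABLE `candidate_yourname_prod.country_revenue` AS
SELECT
  u.country,
  COUNT(DISTINCT p.user_id) AS paying_users,
  SUM(p.amount)             AS total_revenue
FROM `candidate_yourname_staging.payments` AS p
JOIN `candidate_yourname_staging.users`    AS u
  ON p.user_id = u.user_id
GROUP BY u.country
"""
client.query(query).result()  # wait for the job to finish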

What to deliver:

  • Staging transformations (prepare raw data for analytics use)
  • Production aggregation (country-level revenue metrics)
  • Working pipeline that produces correct results

Batch Level 2 - Implementation or Pseudocode

This exercise evaluates both your implementation and design thinking:

cd exercises/batch_level_2

# Read the problem in README.md
# Implement SQL or Python transformations, or write pseudocode
# Focus on logic and correctness
# Explain your reasoning and trade-offs

Example approach:

FUNCTION process_incremental_data(run_date):
  // Step 1: Identify new files since last run
  new_files = list_gcs_files(path, after=last_processed_timestamp)
  
  // Step 2: Load and deduplicate
  events = load_csv_files(new_files)
            .deduplicate(on=event_id, keep=first)
  
  // Step 3: Merge into daily table (idempotent)
  MERGE INTO daily_metrics USING events
    ON daily_metrics.user_id = events.user_id AND daily_metrics.date = events.date
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT ...
  
  // Step 4: Record processed files
  update_metadata(last_processed_timestamp = now())
END

// Trade-offs:
// - Using MERGE ensures idempotency (safe to rerun)
// - Deduplication by event_id handles resent files
// - Limited lookback window (7 days) balances completeness vs performance
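
Translated into a runnable shape, the merge step from the pseudocode could be issued from Python roughly like this. Table and column names (daily_metrics, events, event_date, event_count) are assumptions for illustration; the pseudocode above remains the source of truth for the logic:

import datetime
from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")

# Hypothetical example: MERGE staged events into the daily table; reruns are safe.
merge_sql = """
MERGE `candidate_yourname_prod.daily_metrics` AS t
USING (
  SELECT user_id, event_date, COUNT(*) AS event_count
  FROM `candidate_yourname_staging.events`
  WHERE event_date = @run_date
  GROUP BY user_id, event_date
) AS s
ON t.user_id = s.user_id AND t.event_date = s.event_date
WHEN MATCHED THEN UPDATE SET event_count = s.event_count
WHEN NOT MATCHED THEN INSERT (user_id, event_date, event_count)
  VALUES (s.user_id, s.event_date, s.event_count)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("run_date", "DATE", datetime.date(2024, 1, 1))
    ]
)
client.query(merge_sql, job_config=job_config).result()

Because the MERGE keys on (user_id, event_date), running the job twice for the same date updates rows in place instead of duplicating them.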

📊 GCP Infrastructure Details

Project: data-engineer-recruitment
Location: EU

Your Datasets (isolated per candidate):

  • candidate_yourname_raw - Raw ingested data
  • candidate_yourname_staging - Prepared data (staging)
  • candidate_yourname_prod - Production analytics tables

Seed Data (for Batch Level 1):

  • Location: gs://de-recruitment-test-seeds-2026/
  • Files: users.csv, payments.csv
  • Auto-loaded by the ingestion script (a manual-load sketch follows below)
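
If you ever need to load the seed files by hand rather than through the ingestion script, a minimal sketch with the BigQuery client looks like this (schema autodetection is an assumption; the provided script may define explicit schemas):

from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,                                             # infer schema for the raw layer
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # reloads are idempotent
)
for name in ("users", "payments"):
    uri = f"gs://de-recruitment-test-seeds-2026/{name}.csv"
    client.load_table_from_uri(
        uri, f"candidate_yourname_raw.{name}", job_config=job_config
    ).result()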

💡 Assessment Guidelines

What We're Looking For

Technical Skills

  • Correct data combination logic (joins, lookups)
  • Proper null and missing data handling
  • Understanding of staging layer purpose (preparing data for analytics)
  • Understanding of idempotency and deduplication
  • Appropriate aggregation and window functions
  • Knowledge of BigQuery/SQL or pandas operations

Problem-Solving Approach

  • Asking clarifying questions
  • Identifying edge cases
  • Explaining trade-offs
  • Considering production concerns (scale, cost, monitoring)

Communication

  • Clear pseudocode with comments
  • Explaining reasoning behind design choices
  • Documenting assumptions

Tips for Success

  1. Read the problem carefully - Understand requirements before coding
  2. Ask questions - Clarify ambiguities (even for take-home exercises, document assumptions)
  3. Think production - Consider scale, failures, monitoring
  4. Explain your reasoning - Add comments about trade-offs
  5. Test your code - For Batch Level 1, verify your results

Pseudocode Guidelines

  • Syntax doesn't matter - SQL-like, Python-like, or plain English all work
  • Logic matters - Show clear data flow and transformations
  • Comment your reasoning - Explain why you chose this approach
  • Consider edge cases - What happens with nulls, duplicates, late data?

🔧 Troubleshooting

Can't connect to BigQuery

Check:

  1. Credentials file exists: config/data-engineer-recruitment.json
  2. CANDIDATE_ID environment variable is set
  3. Run the test script: python scripts/test_bigquery_connection.py (or try the manual check sketched below)
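
For a quick manual check outside the provided script, a minimal connectivity test might look like this (the credentials path mirrors the config file named above):

import os
from google.cloud import bigquery

os.environ.setdefault(
    "GOOGLE_APPLICATION_CREDENTIALS", "config/data-engineer-recruitment.json"
)
client = bigquery.Client(project="data-engineer-recruitment")
rows = list(client.query("SELECT 1 AS ok").result())
print(rows)  # a single row with ok = 1 confirms the connection works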

Datasets not created

Solution: Datasets are created automatically when you run the pipeline. If you need to create them manually:

bq mk --location=EU data-engineer-recruitment:candidate_yourname_raw
bq mk --location=EU data-engineer-recruitment:candidate_yourname_staging
bq mk --location=EU data-engineer-recruitment:candidate_yourname_prod
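
The same thing from Python, if you prefer the client library; exists_ok makes the call safe to repeat:

from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")
for suffix in ("raw", "staging", "prod"):
    dataset = bigquery.Dataset(f"data-engineer-recruitment.candidate_yourname_{suffix}")
    dataset.location = "EU"
    client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists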

Import errors

Solution: Make sure you're in the correct directory and the virtual environment is activated:

cd exercises/batch_level_1/candidate_solution
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\Activate.ps1  # Windows

📞 Support

Questions or technical issues?

  • Email: [email protected]
  • Include: Your candidate ID, error messages, and what you've tried

🔐 Security & Credentials

No GCP credentials are required for this assessment. If remote/cloud resources are used during a live interview, temporary access will be provided by the interviewer during the session.

Your Data:

  • All datasets (if any) are isolated per candidate
  • No access to other candidates' data
  • Any temporary datasets or resources will be cleaned up after the assessment


Batch Level 2: Deduplication Pipeline

  • Input: Event data with duplicates and complex rules
  • Output: Cleaned dataset with business logic applied
  • Focus: Deduplication strategies, data quality, incremental processing

Evaluation Criteria

Your solutions will be evaluated on:

  • Correctness: Does it produce the expected results?
  • Code Quality: Clean, readable, well-documented code
  • Problem Solving: Efficient algorithms and data structures
  • Production Readiness: Error handling, scalability considerations, idempotency
  • Documentation: Clear explanations of your approach

GCP Resources Available

  • BigQuery for data warehousing
  • Cloud Storage for file storage
  • Service account with necessary permissions
  • Sample datasets pre-loaded

Getting Help

  • Review the problem statements carefully
  • Ask clarifying questions if requirements are unclear
  • Test with the provided sample data
  • Document your assumptions and decisions

Submission

When complete, commit your changes and create a pull request, or follow the submission instructions provided by your interviewer.


Good luck! We're excited to see your data engineering solutions. 🚀
