Data Engineering Technical Assessment

Welcome to the data engineering technical assessment! This repository contains exercises designed to evaluate your data pipeline design skills, problem-solving ability, and production thinking.


📋 What's Included

Exercise 1: Batch Level 1 - Full ETL Pipeline Implementation (30-45 min)

Objective: Implement a complete ETL pipeline from raw data ingestion through staging transformations to production analytics tables.

What you'll do:

  • Ingest raw CSV data into BigQuery
  • Design and implement staging transformations (prepare data for analytics; a minimal sketch follows below)
  • Create production aggregations for business analytics
  • Choose between SQL or Python implementation approach
  • Write pseudocode (for the live technical interview)

Location: exercises/batch_level_1/
Instructions: See exercises/batch_level_1/README.md
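
To make the staging step concrete, here is a minimal sketch of one staging transformation issued from Python with the google-cloud-bigquery client. The column names and cleaning rules are illustrative assumptions, not the seed files' actual schema:

from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")

# Hypothetical example: clean and type the raw users table for analytics use.
staging_sql = """
CREATE OR REPLACE TABLE `candidate_yourname_staging.users` AS
SELECT
  SAFE_CAST(user_id AS INT64)                     AS user_id,
  LOWER(TRIM(email))                              AS email,
  COALESCE(NULLIF(TRIM(country), ''), 'unknown')  AS country,
  SAFE_CAST(signup_date AS DATE)                  AS signup_date
FROM `candidate_yourname_raw.users`
"""
client.query(staging_sql).result()  # wait for the job to finish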

Exercise 2: Batch Level 2 - Incremental Daily Build (20 min)

Objective: Design an idempotent daily ETL pipeline that handles late-arriving and duplicate data.

What you'll do:

  • Design incremental data processing logic
  • Handle late-arriving and duplicate data
  • Implement a deduplication strategy (see the sketch below)
  • Ensure idempotency
  • Choose between SQL or Python implementation approach

Location: exercises/batch_level_2/
Instructions: See exercises/batch_level_2/README.md
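
As a rough illustration of the deduplication idea in pandas (column names like event_id and ingested_at are assumptions; the exercise README defines the real schema):

import pandas as pd

# Hypothetical example: collapse resent events, keeping the latest copy of each.
new_files = ["events_part_001.csv", "events_part_002.csv"]  # placeholder paths
events = pd.concat([pd.read_csv(path) for path in new_files], ignore_index=True)
events = (
    events.sort_values("ingested_at")                       # newest version last
          .drop_duplicates(subset="event_id", keep="last")  # one row per event
)

Because the result depends only on the input rows, rerunning the same step over the same files yields the same output, which is what makes the daily build safe to replay.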


🚀 Getting Started

Prerequisites

  • Python 3.8+
  • Git

Repository Structure

├── .github/                         # Workflow configuration
├── config/                          # Config templates
├── data/                            # Sample data (if any)
├── docs/                            # Additional documentation
├── exercises/
│   ├── config.json
│   ├── batch_level_1/
│   │   ├── CANDIDATE_INSTRUCTIONS.md
│   │   ├── Dockerfile
│   │   ├── README.md
│   │   ├── README_CI_CD.md
│   │   ├── scripts/
│   │   ├── seeds/
│   │   ├── src/
│   │   ├── python/                  # Python approach (optional)
│   │   └── sql/                     # SQL approach (optional)
│   └── batch_level_2/
│       ├── Dockerfile
│       ├── README.md
│       ├── scripts/
│       ├── python/                 # Python approach
│       ├── sql/                    # SQL approach
│       ├── seeds/
│       └── src/
├── src/                             # Shared pipeline code
├── requirements.txt                 # Python dependencies
└── README.md                        # This file

Setup Instructions

  1. Clone this repository

    git clone https://github.com/dbarrios83/data-engineer-candidate-test.git
    cd data-engineer-candidate-test
  2. Set up Python environment

    Windows PowerShell:

    python -m venv .venv
    .venv\Scripts\Activate.ps1
    pip install -r requirements.txt

    Linux/Mac:

    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  3. Set your candidate ID (for Batch Level 1)

    Windows PowerShell:

    $env:CANDIDATE_ID = "candidate_yourname"

    Linux/Mac:

    export CANDIDATE_ID="candidate_yourname"

    This creates isolated BigQuery datasets: candidate_yourname_raw, candidate_yourname_staging, candidate_yourname_prod

  4. Verify setup

    cd exercises/batch_level_1/candidate_solution
    python scripts/test_bigquery_connection.py

📤 Submission Guidelines

Single Exercise Per Pull Request

Important: Each PR should address only one exercise at a time.

Why?

  • Cleaner code review
  • Independent evaluation feedback for each exercise
  • Easier to iterate and fix issues separately

How to structure your submissions:

✅ Correct approach:

PR #1: Implement Exercise 1 (batch_level_1)
  - Changes only in: exercises/batch_level_1/candidate_solution/

PR #2: Implement Exercise 2 (batch_level_2)
  - Changes only in: exercises/batch_level_2/

❌ Incorrect approach:

PR #1: Implement both exercises
  - Changes in: exercises/batch_level_1/candidate_solution/
  - AND exercises/batch_level_2/
  ❌ This PR will be rejected!

If you accidentally modify both exercises:

  1. Create a new branch with only Exercise 1 changes
  2. Create a separate branch with only Exercise 2 changes
  3. Open two separate PRs
  4. Close the original PR

🤖 Automated Code Evaluation

When you push your code and open a Pull Request, an automated evaluation workflow will:

  1. Clone your solution at the commit you specified
  2. Execute your code in an isolated Cloud Run environment
  3. Validate outputs against expected schemas and metrics
  4. Post results as a comment on your PR

What to expect:

  • Evaluation starts automatically when you open a PR
  • Results appear as a comment within a few minutes
  • Shows: ✅ Passed tests, ❌ Failed validations, 📊 Performance metrics
  • You can fix issues and push new commits for re-evaluation

How to trigger evaluation:

# Make changes to your solution
git add exercises/batch_level_*/candidate_solution/
git commit -m "feat: implement ETL pipeline"
git push origin your-branch

# Open a Pull Request on GitHub
# Evaluation starts automatically!

Viewing results:

  • Go to your Pull Request
  • Scroll to "Checks" or "Comments" section
  • See detailed feedback from the evaluator

📖 Working on Exercises

Batch Level 1 - Full Implementation

This exercise requires working code that runs in BigQuery:

cd exercises/batch_level_1/candidate_solution

# Choose your approach: SQL or Python (delete unused stubs)
# Option 1: SQL approach - Edit sql/*.sql files
# Option 2: Python approach - Edit python/*.py files

# Run the full pipeline
python -m src.pipeline full

# Verify results in BigQuery:
# - candidate_yourname_staging.users
# - candidate_yourname_staging.payments
# - candidate_yourname_prod.country_revenue
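
One possible shape for the production aggregation, issued from Python as a BigQuery SQL statement (column names such as country and amount are assumptions; use the actual staging schema you built):

from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")

# Hypothetical example: aggregate staged payments into country-level revenue.
query = """
CREATE OR REPLACE TABLE `candidate_yourname_prod.country_revenue` AS
SELECT
  u.country,
  COUNT(DISTINCT p.user_id) AS paying_users,
  SUM(p.amount)             AS total_revenue
FROM `candidate_yourname_staging.payments` AS p
JOIN `candidate_yourname_staging.users`    AS u
  ON p.user_id = u.user_id
GROUP BY u.country
"""
client.query(query).result()  # wait for the job to finish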

What to deliver:

  • Staging transformations (prepare raw data for analytics use)
  • Production aggregation (country-level revenue metrics)
  • Working pipeline that produces correct results

Batch Level 2 - Implementation or Pseudocode

This exercise evaluates both your implementation and design thinking:

cd exercises/batch_level_2

# Read the problem in README.md
# Implement SQL or Python transformations, or write pseudocode
# Focus on logic and correctness
# Explain your reasoning and trade-offs

Example approach:

FUNCTION process_incremental_data(run_date):
  // Step 1: Identify new files since last run
  new_files = list_gcs_files(path, after=last_processed_timestamp)
  
  // Step 2: Load and deduplicate
  events = load_csv_files(new_files)
            .deduplicate(on=event_id, keep=first)
  
  // Step 3: Merge into daily table (idempotent)
  MERGE INTO daily_metrics USING events
    ON daily_metrics.user_id = events.user_id AND daily_metrics.date = events.date
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT ...
  
  // Step 4: Record processed files
  update_metadata(last_processed_timestamp = now())
END

// Trade-offs:
// - Using MERGE ensures idempotency (safe to rerun)
// - Deduplication by event_id handles resent files
// - Limited lookback window (7 days) balances completeness vs performance
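
Translated into a runnable shape, the merge step from the pseudocode could be issued from Python roughly like this. Table and column names (daily_metrics, events, event_date, event_count) are assumptions for illustration; the pseudocode above remains the source of truth for the logic:

import datetime
from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")

# Hypothetical example: MERGE staged events into the daily table; reruns are safe.
merge_sql = """
MERGE `candidate_yourname_prod.daily_metrics` AS t
USING (
  SELECT user_id, event_date, COUNT(*) AS event_count
  FROM `candidate_yourname_staging.events`
  WHERE event_date = @run_date
  GROUP BY user_id, event_date
) AS s
ON t.user_id = s.user_id AND t.event_date = s.event_date
WHEN MATCHED THEN UPDATE SET event_count = s.event_count
WHEN NOT MATCHED THEN INSERT (user_id, event_date, event_count)
  VALUES (s.user_id, s.event_date, s.event_count)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("run_date", "DATE", datetime.date(2024, 1, 1))
    ]
)
client.query(merge_sql, job_config=job_config).result()

Because the MERGE keys on (user_id, event_date), running the job twice for the same date updates rows in place instead of duplicating them.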

📊 GCP Infrastructure Details

Project: data-engineer-recruitment
Location: EU

Your Datasets (isolated per candidate):

  • candidate_yourname_raw - Raw ingested data
  • candidate_yourname_staging - Prepared data (staging)
  • candidate_yourname_prod - Production analytics tables

Seed Data (for Batch Level 1):

  • Location: gs://de-recruitment-test-seeds-2026/
  • Files: users.csv, payments.csv
  • Auto-loaded by the ingestion script (a manual-load sketch follows below)
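
If you ever need to load the seed files by hand rather than through the ingestion script, a minimal sketch with the BigQuery client looks like this (schema autodetection is an assumption; the provided script may define explicit schemas):

from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,                                             # infer schema for the raw layer
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # reloads are idempotent
)
for name in ("users", "payments"):
    uri = f"gs://de-recruitment-test-seeds-2026/{name}.csv"
    client.load_table_from_uri(
        uri, f"candidate_yourname_raw.{name}", job_config=job_config
    ).result()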

💡 Assessment Guidelines

What We're Looking For

Technical Skills

  • Correct data combination logic (joins, lookups)
  • Proper null and missing data handling
  • Understanding of staging layer purpose (preparing data for analytics)
  • Understanding of idempotency and deduplication
  • Appropriate aggregation and window functions
  • Knowledge of BigQuery/SQL or pandas operations

Problem-Solving Approach

  • Asking clarifying questions
  • Identifying edge cases
  • Explaining trade-offs
  • Considering production concerns (scale, cost, monitoring)

Communication

  • Clear pseudocode with comments
  • Explaining reasoning behind design choices
  • Documenting assumptions

Tips for Success

  1. Read the problem carefully - Understand requirements before coding
  2. Ask questions - Clarify ambiguities (even for take-home exercises, document assumptions)
  3. Think production - Consider scale, failures, monitoring
  4. Explain your reasoning - Add comments about trade-offs
  5. Test your code - For Batch Level 1, verify your results

Pseudocode Guidelines

  • Syntax doesn't matter - SQL-like, Python-like, or plain English all work
  • Logic matters - Show clear data flow and transformations
  • Comment your reasoning - Explain why you chose this approach
  • Consider edge cases - What happens with nulls, duplicates, late data?

🔧 Troubleshooting

Can't connect to BigQuery

Check:

  1. Credentials file exists: config/data-engineer-recruitment.json
  2. CANDIDATE_ID environment variable is set
  3. Run the test script: python scripts/test_bigquery_connection.py (or try the manual check sketched below)
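
For a quick manual check outside the provided script, a minimal connectivity test might look like this (the credentials path mirrors the config file named above):

import os
from google.cloud import bigquery

os.environ.setdefault(
    "GOOGLE_APPLICATION_CREDENTIALS", "config/data-engineer-recruitment.json"
)
client = bigquery.Client(project="data-engineer-recruitment")
rows = list(client.query("SELECT 1 AS ok").result())
print(rows)  # a single row with ok = 1 confirms the connection works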

Datasets not created

Solution: Datasets are created automatically when you run the pipeline. If you need to create them manually:

bq mk --location=EU data-engineer-recruitment:candidate_yourname_raw
bq mk --location=EU data-engineer-recruitment:candidate_yourname_staging
bq mk --location=EU data-engineer-recruitment:candidate_yourname_prod
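
The same thing from Python, if you prefer the client library; exists_ok makes the call safe to repeat:

from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")
for suffix in ("raw", "staging", "prod"):
    dataset = bigquery.Dataset(f"data-engineer-recruitment.candidate_yourname_{suffix}")
    dataset.location = "EU"
    client.create_dataset(dataset, exists_ok=True)  # no-op if it already exists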

Import errors

Solution: Make sure you're in the correct directory and the virtual environment is activated:

cd exercises/batch_level_1/candidate_solution
source .venv/bin/activate  # Linux/Mac
# or
.venv\Scripts\Activate.ps1  # Windows

📞 Support

Questions or technical issues?

  • Email: [email protected]
  • Include: Your candidate ID, error messages, and what you've tried

🔐 Security & Credentials

No GCP credentials are required for this assessment. If remote/cloud resources are used during a live interview, temporary access will be provided by the interviewer during the session.

Your Data:

  • All datasets (if any) are isolated per candidate
  • No access to other candidates' data
  • Any temporary datasets or resources will be cleaned up after the assessment


Batch Level 2: Deduplication Pipeline

  • Input: Event data with duplicates and complex rules
  • Output: Cleaned dataset with business logic applied
  • Focus: Deduplication strategies, data quality, incremental processing

Evaluation Criteria

Your solutions will be evaluated on:

  • Correctness: Does it produce the expected results?
  • Code Quality: Clean, readable, well-documented code
  • Problem Solving: Efficient algorithms and data structures
  • Production Readiness: Error handling, scalability considerations, idempotency
  • Documentation: Clear explanations of your approach

GCP Resources Available

  • BigQuery for data warehousing
  • Cloud Storage for file storage
  • Service account with necessary permissions
  • Sample datasets pre-loaded

Getting Help

  • Review the problem statements carefully
  • Ask clarifying questions if requirements are unclear
  • Test with the provided sample data
  • Document your assumptions and decisions

Submission

When complete, commit your changes and create a pull request, or follow the submission instructions provided by your interviewer.


Good luck! We're excited to see your data engineering solutions. 🚀
