Welcome to the data engineering technical assessment! This repository contains exercises designed to evaluate your data pipeline design skills, problem-solving ability, and production thinking.
Objective: Implement a complete ETL pipeline from raw data ingestion through staging transformations to production analytics tables.
What you'll do:
- Ingest raw CSV data into BigQuery (see the ingestion sketch below)
- Design and implement staging transformations (prepare data for analytics)
- Create production aggregations for business analytics
- Choose between SQL or Python implementation approach
- Write pseudocode (for the live technical interview)
Location: exercises/batch_level_1/
Instructions: See exercises/batch_level_1/README.md
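The ingestion step listed above is handled by the repository's scripts, but for orientation, a minimal sketch of loading one of the seed CSVs into your raw dataset might look like the following. This is a hedged sketch that assumes the google-cloud-bigquery client and schema autodetection; the provided ingestion script is the source of truth.

```python
# Hedged sketch only - the repository's ingestion script is authoritative.
from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,        # skip the header row
    autodetect=True,            # assumption: let BigQuery infer the schema
    write_disposition="WRITE_TRUNCATE",
)

load_job = client.load_table_from_uri(
    "gs://de-recruitment-test-seeds-2026/users.csv",
    "data-engineer-recruitment.candidate_yourname_raw.users",
    job_config=job_config,
)
load_job.result()  # blocks until the load job finishes
```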
Objective: Design an idempotent daily ETL pipeline that handles late-arriving and duplicate data.
What you'll do:
- Design incremental data processing logic
- Handle late-arriving and duplicate data
- Implement deduplication strategy (see the sketch below)
- Ensure idempotency
- Choose between SQL or Python implementation approach
Location: exercises/batch_level_2/
Instructions: See exercises/batch_level_2/README.md
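As a quick illustration of the deduplication idea, here is a sketch in Python/pandas. The column names `event_id` and `event_timestamp` are assumptions for illustration, not the exercise schema.

```python
# Hedged sketch: keep the earliest copy of each event; column names are assumed.
import pandas as pd

def deduplicate_events(events: pd.DataFrame) -> pd.DataFrame:
    return (
        events.sort_values("event_timestamp")                 # earliest copy wins
              .drop_duplicates(subset="event_id", keep="first")
              .reset_index(drop=True)
    )
```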
- Python 3.8+
- Git
├── .github/              # Workflow configuration
├── config/               # Config templates
├── data/                 # Sample data (if any)
├── docs/                 # Additional documentation
├── exercises/
│   ├── config.json
│   ├── batch_level_1/
│   │   ├── CANDIDATE_INSTRUCTIONS.md
│   │   ├── Dockerfile
│   │   ├── README.md
│   │   ├── README_CI_CD.md
│   │   ├── scripts/
│   │   ├── seeds/
│   │   ├── src/
│   │   ├── python/       # Python approach (optional)
│   │   └── sql/          # SQL approach (optional)
│   └── batch_level_2/
│       ├── Dockerfile
│       ├── README.md
│       ├── scripts/
│       ├── python/       # Python approach
│       ├── sql/          # SQL approach
│       ├── seeds/
│       └── src/
├── src/                  # Shared pipeline code
├── requirements.txt      # Python dependencies
└── README.md             # This file
- Clone this repository

  git clone https://github.com/dbarrios83/data-engineer-candidate-test.git
  cd data-engineer-candidate-test
- Set up Python environment

  Windows PowerShell:

  python -m venv .venv
  .venv\Scripts\Activate.ps1
  pip install -r requirements.txt

  Linux/Mac:

  python3 -m venv .venv
  source .venv/bin/activate
  pip install -r requirements.txt
- Set your candidate ID (for Batch Level 1)

  Windows PowerShell:

  $env:CANDIDATE_ID = "candidate_yourname"

  Linux/Mac:

  export CANDIDATE_ID="candidate_yourname"

  This creates isolated BigQuery datasets:
  candidate_yourname_raw, candidate_yourname_staging, candidate_yourname_prod
- Verify setup

  cd exercises/batch_level_1/candidate_solution
  python scripts/test_bigquery_connection.py
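Beyond the provided test script, a minimal sanity check might look like the sketch below, assuming the google-cloud-bigquery package from requirements.txt and credentials configured as described under Troubleshooting.

```python
# Minimal connectivity check - a hedged sketch, not a replacement for
# scripts/test_bigquery_connection.py.
import os
from google.cloud import bigquery

candidate_id = os.environ["CANDIDATE_ID"]   # e.g. "candidate_yourname"
datasets = [f"{candidate_id}_raw", f"{candidate_id}_staging", f"{candidate_id}_prod"]

client = bigquery.Client(project="data-engineer-recruitment")
client.query("SELECT 1").result()           # raises if auth or connectivity fails
print("Connected. Expecting datasets:", ", ".join(datasets))
```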
Important: Each PR should address only one exercise at a time.
Why?
- Cleaner code review
- Independent evaluation feedback for each exercise
- Easier to iterate and fix issues separately
How to structure your submissions:
✅ Correct approach:
PR #1: Implement Exercise 1 (batch_level_1)
- Changes only in: exercises/batch_level_1/candidate_solution/
PR #2: Implement Exercise 2 (batch_level_2)
- Changes only in: exercises/batch_level_2/
❌ Incorrect approach:
PR #1: Implement both exercises
- Changes in: exercises/batch_level_1/candidate_solution/
- AND exercises/batch_level_2/
❌ This PR will be rejected!
If you accidentally modify both exercises:
- Create a new branch with only Exercise 1 changes
- Create a separate branch with only Exercise 2 changes
- Open two separate PRs
- Close the original PR
When you push your code and open a Pull Request, an automated evaluation workflow will:
- Clone your solution at the commit you specified
- Execute your code in an isolated Cloud Run environment
- Validate outputs against expected schemas and metrics
- Post results as a comment on your PR
What to expect:
- Evaluation starts automatically when you open a PR
- Results appear as a comment within a few minutes
- Shows: ✅ Passed tests, ❌ Failed validations, 📊 Performance metrics
- You can fix issues and push new commits for re-evaluation
How to trigger evaluation:
# Make changes to your solution
git add exercises/batch_level_*/candidate_solution/
git commit -m "feat: implement ETL pipeline"
git push origin your-branch
# Open a Pull Request on GitHub
# Evaluation starts automatically!

Viewing results:
- Go to your Pull Request
- Scroll to the "Checks" or "Comments" section
- See detailed feedback from the evaluator
This exercise requires working code that runs in BigQuery:
cd exercises/batch_level_1/candidate_solution
# Choose your approach: SQL or Python (delete unused stubs)
# Option 1: SQL approach - Edit sql/*.sql files
# Option 2: Python approach - Edit python/*.py files
# Run the full pipeline
python -m src.pipeline full
# Verify results in BigQuery:
# - candidate_yourname_staging.users
# - candidate_yourname_staging.payments
# - candidate_yourname_prod.country_revenue

What to deliver:
- Staging transformations (prepare raw data for analytics use)
- Production aggregation (country-level revenue metrics; see the sketch below)
- Working pipeline that produces correct results
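One possible shape for the country-level revenue aggregation is sketched below. The column names (`country`, `amount`, `user_id`) are assumptions for illustration; adapt them to the actual seed schema.

```python
# Hedged sketch of the production aggregation step; column names are assumed.
from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")
candidate = "candidate_yourname"  # replace with your CANDIDATE_ID

sql = f"""
CREATE OR REPLACE TABLE `{candidate}_prod.country_revenue` AS
SELECT
  u.country,
  COUNT(DISTINCT p.user_id) AS paying_users,
  SUM(p.amount)             AS total_revenue
FROM `{candidate}_staging.payments` AS p
JOIN `{candidate}_staging.users`    AS u
  ON p.user_id = u.user_id
GROUP BY u.country
"""
client.query(sql).result()
```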
This exercise evaluates both your implementation and design thinking:
cd exercises/batch_level_2
# Read the problem in README.md
# Implement SQL or Python transformations, or write pseudocode
# Focus on logic and correctness
# Explain your reasoning and trade-offs

Example approach:
FUNCTION process_incremental_data(run_date):
    // Step 1: Identify new files since last run
    new_files = list_gcs_files(path, after=last_processed_timestamp)

    // Step 2: Load and deduplicate
    events = load_csv_files(new_files)
                 .deduplicate(on=event_id, keep=first)

    // Step 3: Merge into daily table (idempotent)
    MERGE INTO daily_metrics USING events
    ON daily_metrics.user_id = events.user_id AND daily_metrics.date = events.date
    WHEN MATCHED THEN UPDATE SET ...
    WHEN NOT MATCHED THEN INSERT ...

    // Step 4: Record processed files
    update_metadata(last_processed_timestamp = now())
END

// Trade-offs:
// - Using MERGE ensures idempotency (safe to rerun)
// - Deduplication by event_id handles resent files
// - Limited lookback window (7 days) balances completeness vs performance
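If you implement this rather than pseudocode it, the MERGE step could be expressed against BigQuery roughly as below. This is a hedged sketch: the table and column names (`daily_metrics`, `events`, `event_id`, `ingested_at`, `event_timestamp`, `event_count`) are assumptions, not a prescribed schema.

```python
# Illustrative only - dataset, table, and column names are assumptions.
from google.cloud import bigquery

client = bigquery.Client(project="data-engineer-recruitment")
candidate = "candidate_yourname"  # replace with your CANDIDATE_ID

merge_sql = f"""
MERGE `{candidate}_prod.daily_metrics` AS t
USING (
  -- Deduplicate resent events, then aggregate to one row per user per day
  SELECT user_id, DATE(event_timestamp) AS date, COUNT(*) AS event_count
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY ingested_at) AS rn
    FROM `{candidate}_staging.events`
  )
  WHERE rn = 1
  GROUP BY user_id, date
) AS s
ON t.user_id = s.user_id AND t.date = s.date
WHEN MATCHED THEN
  UPDATE SET t.event_count = s.event_count
WHEN NOT MATCHED THEN
  INSERT (user_id, date, event_count) VALUES (s.user_id, s.date, s.event_count)
"""
client.query(merge_sql).result()  # MERGE keeps reruns idempotent
```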
Project: data-engineer-recruitment
Location: EU
Your Datasets (isolated per candidate):
- candidate_yourname_raw - Raw ingested data
- candidate_yourname_staging - Prepared data (staging)
- candidate_yourname_prod - Production analytics tables
Seed Data (for Batch Level 1):
- Location: gs://de-recruitment-test-seeds-2026/
- Files: users.csv, payments.csv
- Auto-loaded by ingestion script
Technical Skills ✅
- Correct data combination logic (joins, lookups)
- Proper null and missing data handling
- Understanding of staging layer purpose (preparing data for analytics)
- Understanding of idempotency and deduplication
- Appropriate aggregation and window functions
- Knowledge of BigQuery/SQL or pandas operations
Problem-Solving Approach ✅
- Asking clarifying questions
- Identifying edge cases
- Explaining trade-offs
- Considering production concerns (scale, cost, monitoring)
Communication ✅
- Clear pseudocode with comments
- Explaining reasoning behind design choices
- Documenting assumptions
- Read the problem carefully - Understand requirements before coding
- Ask questions - Clarify ambiguities (even for take-home exercises, document assumptions)
- Think production - Consider scale, failures, monitoring
- Explain your reasoning - Add comments about trade-offs
- Test your code - For Batch Level 1, verify your results
- Syntax doesn't matter - SQL-like, Python-like, or plain English all work
- Logic matters - Show clear data flow and transformations
- Comment your reasoning - Explain why you chose this approach
- Consider edge cases - What happens with nulls, duplicates, late data? (see the sketch below)
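For example, a few defensive moves in Python/pandas (the column names here are hypothetical):

```python
# Hypothetical columns; shows null handling, deduplication, and a bounded
# lookback window for late-arriving data.
import pandas as pd

def clean_events(events: pd.DataFrame, run_date: pd.Timestamp) -> pd.DataFrame:
    events = events.dropna(subset=["event_id"])                       # nulls: rows without a key are unusable
    events["amount"] = events["amount"].fillna(0.0)                   # nulls: default missing amounts
    events = events.drop_duplicates(subset="event_id", keep="first")  # duplicates
    lookback_start = run_date - pd.Timedelta(days=7)                  # late data: bounded reprocessing window
    return events[events["event_date"] >= lookback_start]             # assumes event_date is a datetime column
```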
Check:
- Credentials file exists: config/data-engineer-recruitment.json
- CANDIDATE_ID environment variable is set
- Run the test script: python scripts/test_bigquery_connection.py
Solution: Datasets are created automatically when you run the pipeline. If you need to create them manually:
bq mk --location=EU data-engineer-recruitment:candidate_yourname_raw
bq mk --location=EU data-engineer-recruitment:candidate_yourname_staging
bq mk --location=EU data-engineer-recruitment:candidate_yourname_prod

Solution: Make sure you're in the correct directory and the virtual environment is activated:
cd exercises/batch_level_1/candidate_solution
source .venv/bin/activate # Linux/Mac
# or
.venv\Scripts\Activate.ps1 # Windows

Questions or technical issues?
- Email: [email protected]
- Include: Your candidate ID, error messages, and what you've tried
No GCP credentials are required for this assessment. If remote/cloud resources are used during a live interview, temporary access will be provided by the interviewer during the session.
Your Data:
- All datasets (if any) are isolated per candidate
- No access to other candidates' data
- Any temporary datasets or resources will be cleaned up after the assessment
Good luck! We're excited to see your approach to these problems. 🚀
- Input: Event data with duplicates and complex rules
- Output: Cleaned dataset with business logic applied
- Focus: Deduplication strategies, data quality, incremental processing
Your solutions will be evaluated on:
- Correctness: Does it produce the expected results?
- Code Quality: Clean, readable, well-documented code
- Problem Solving: Efficient algorithms and data structures
- Production Readiness: Error handling, scalability considerations, idempotency
- Documentation: Clear explanations of your approach
- BigQuery for data warehousing
- Cloud Storage for file storage
- Service account with necessary permissions
- Sample datasets pre-loaded
- Review the problem statements carefully
- Ask clarifying questions if requirements are unclear
- Test with the provided sample data
- Document your assumptions and decisions
When complete, commit your changes and create a pull request, or follow the submission instructions provided by your interviewer.
Good luck! We're excited to see your data engineering solutions. 🚀