🛠️ Professional Experience
🏥 Population Data BC
ML Engineer Sep 2025 – Apr 2026
The Mission: Optimizing the bridge between massive provincial health datasets and academic research.
Key Contributions:
- Performance Engineering: Led the transition from legacy CSV ingestion to a high-performance Apache Parquet and Apache Arrow stack.
- Infrastructure & Automation: Re-engineered core pipelines into a container-ready architecture utilizing Airflow orchestration and DuckDB for lightning-fast, reproducible local processing.
- Intelligent Document Processing: Integrated OCR and Retrieval-Augmented Generation (RAG) via LlamaIndex to extract critical insights, generate key summaries, and flag deficiencies in research request applications.
- Statistical Modeling: Developed a high-fidelity synthetic data generator utilizing complex statistical distributions (including Log-Normal, Poisson, and Normal) to accurately model sensitive variable relationships for research requests.
Stack: Python, LlamaIndex, Airflow, DuckDB, Arrow, Parquet, OCR, SciPy
📱 Samsung R&D
Data Engineer Sep 2024 – Aug 2025
The Mission: Managing global-scale data infrastructure and privacy compliance.
Key Contributions:
- High-Volume Orchestration: Managed the real-time processing of 500M+ records per day. Deployed automated CI/CD pipelines via GitHub Actions that handled dynamic PII (Personally Identifiable Information) tagging across 10+ data streams.
- Data Reliability: Owned the orchestration of cross-region transfers for 200+ datasets, maintaining a strict 99% availability SLA for downstream analytics teams.
- Analytics Engineering: Leveraged dbt (data build tool) to transform raw FastAPI-backed application data into actionable insights, automating the workflows for over 50 executive Tableau dashboards.
Stack: AWS Redshift, dbt, Airflow, GitHub Actions, Tableau
Vancouver Coastal Health
Software Developer Intern May 2024 – Aug 2024
The Mission: Enhancing data integrity for clinical informatics systems.
Key Contributions:
- Validation Frameworks: Developed a custom R package using the testthat framework to enforce data quality at the ingestion layer. This prevented “dirty data” from entering downstream clinical pipelines.
- Modern Storage: Spearheaded the migration of historical clinical records from unstructured flat files to a centralized SQLite-based storage system, enabling faster retrieval and more complex analytical queries.
- Pipeline Maintenance: Maintained and optimized critical analytical pipelines that support real-time clinical informatics, ensuring healthcare providers had access to validated data.
Stack: R, testthat, SQLite, Clinical Informatics, Git