I am a recent biophysics PhD graduate looking for new opportunities in Data Engineering, Big Data, and Software Engineering. I recently completed the Data Engineering Fellowship at Insight Data Science. You can find me on LinkedIn.
As a Data Engineering fellow, I developed “LinkRun” a pipeline to determine popularity of millions of websites (at a subdomain granularity) by analyzing the number of times they are referenced in a data set of over 2.5 billion web pages (>17 terabytes compressed data). I processed the data using Spark deployed on an AWS EMR cluster and loaded results into Postgres (>47 million rows).
I also created an interactive web application to allow users to view and query this website popularity database.
During my graduate studies I used biophysical and molecular biology methods such as NMR and fluorescence to study intrinsically disordered proteins. I also created data analysis tools and pipelines to automate, improve, and simplify various tasks. Specifically, my research was focused on understanding how phosphorylation regulates a disordered transcription factor.
This website contains links to scientific resources that I found useful during my graduate studies and a few data wrangling tools that I created. These tools were mostly for internal lab use, but each tool comes with example input data that can be used to demonstrate how the tool works.
If you have any questions or would like to access password protected content on this site feel free to contact me.
LinkRun – ETL pipeline to determine website popularity
ETL pipeline to determine and compare popularity of millions of websites across the web. Analyzed > 17 terabytes of compressed data (~2.6 billion webpages) using Spark. Calculated popularity data for > 25 million websites.
AWS: S3, EMR, RDS, EC2
Python, Spark, SQL.
Last updated: 2019 Oct 1