TrendSci – Some stuff with potential use for scientists and myself

About me

I am a recent biophysics PhD graduate looking for new opportunities in Data Engineering, Big Data, and Software Engineering. I recently completed the Data Engineering Fellowship at Insight Data Science. You can find me on LinkedIn.

Data Engineering

As a Data Engineering fellow, I developed “LinkRun” a pipeline to determine popularity of millions of websites (at a subdomain granularity) by analyzing the number of times they are referenced in a data set of over 2.5 billion web pages (>17 terabytes compressed data). I processed the data using Spark deployed on an AWS EMR cluster and loaded results into Postgres (>47 million rows).
I also created an interactive web application to allow users to view and query this website popularity database.

Graduate studies

During my graduate studies I used biophysical and molecular biology methods such as NMR and fluorescence to study intrinsically disordered proteins. I also created data analysis tools and pipelines to automate, improve, and simplify various tasks. Specifically, my research was focused on understanding how phosphorylation regulates a disordered transcription factor.

This website

This website contains links to scientific resources that I found useful during my graduate studies and a few data wrangling tools that I created. These tools were mostly for internal lab use, but each tool comes with example input data that can be used to demonstrate how the tool works.

If you have any questions or would like to access password protected content on this site feel free to contact me.

Projects

LinkRun – ETL pipeline to determine website popularity

ETL pipeline to determine and compare popularity of millions of websites across the web. Analyzed > 17 terabytes of compressed data (~2.6 billion webpages) using Spark. Calculated popularity data for > 25 million websites.

GitHub repository

Technologies/Tools

AWS: S3, EMR, RDS, EC2
Python, Spark, SQL.

Pipeline

LinkRun Pipeline

Last updated: 2019 Oct 1

Home