Here are some projects I've worked on over the past few years (in reverse chronological order). If you'd like to know more, you can contact me.

TabAddict


Spring 2012

A lightweight session manager

Do you have 50+ tabs open at any given time? Do you fly into a rage when someone closes your browser? You may be a tab addict. Get help with my session management extension for Google Chrome.

TabAddict allows you to...

  • Capture the current session--URLs, in plain text format
  • Annotate sessions--keep track of what you were doing
  • Open saved sessions--it's as simple as drag and drop
  • Stay in control--your data can't get lost in the cloud

Download it from the Chrome Web Store, or check out the code on GitHub.

Web Scraping

Fall 2010 - Spring 2012

Query inference for robust scrapers

No one likes tedious, repetitive tasks. That's what computers are for, after all. But when those repetitive tasks involve retrieving and processing data from a website, what's a programmer to do? In the past, I relied on brittle, hand-crafted scrapers which bombed every time the target website deleted a div. Not cool. So I came up with a new way to scrape.

Can we build scrapers that figure the queries out for themselves?

That question has been on my mind since I tried to figure out what WWW::Mechanize did. Years later, when working on a freelance scraping gig, I finally came up with an answer. (Hint: It's "Yes".)

My implementation of this idea relies on the insight that semantics are stable, but format can change. The Rotten Tomatoes rating of BATTLEFIELD EARTH will always be 2%, but the XPath to that percentage could change at any time. So, instead of writing XPath queries by hand, you provide the "right answers" for a few sample pages, and the queries are deduced from those. This also doubles as a testing framework, checking your queries against all the nasty corner-case pages that have gummed up your scraper in the past.
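To make the idea concrete, here is a minimal sketch in Python. It is not the actual implementation, just an illustration under some loud assumptions: pages are well-formed XML, the "right answer" is the exact text of a single element, and the function names (paths_matching, infer_paths, check) and the two toy pages are made up for this example. Each sample page proposes every path whose text equals the answer, and only the paths that work on all samples survive.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy pages paired with their known "right answers".
SAMPLES = [
    ("<html><body><p>2%</p><div><span>2%</span></div></body></html>", "2%"),
    ("<html><body><p>ad</p><div><span>87%</span></div></body></html>", "87%"),
]


def paths_matching(page, answer):
    """Return the set of simple paths whose element text equals `answer`."""
    root = ET.fromstring(page)
    found = set()

    def walk(elem, path):
        if (elem.text or "").strip() == answer:
            found.add(path)
        # Number same-tag siblings so each path names exactly one element.
        seen = {}
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, ".")
    return found


def infer_paths(samples):
    """Keep only the paths that yield the right answer on every sample."""
    candidate_sets = [paths_matching(page, answer) for page, answer in samples]
    return set.intersection(*candidate_sets)


def check(path, samples):
    """Testing mode: list (expected, got) pairs where `path` gives a wrong answer."""
    failures = []
    for page, answer in samples:
        node = ET.fromstring(page).find(path)
        got = (node.text or "").strip() if node is not None else None
        if got != answer:
            failures.append((answer, got))
    return failures


if __name__ == "__main__":
    print(infer_paths(SAMPLES))
    print(check("./body[1]/p[1]", SAMPLES))
```

On the first toy page the answer shows up in two places, so that page alone is ambiguous; the second page rules out the wrong path. The check function is the "testing framework" half: run it over your saved corner-case pages whenever the site changes.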

Anyhow, I thought it was a cool idea and a fun project!

Netflix Challenge

Fall 2009 - Spring 2010

SVD for collaborative filtering

In 2006, Netflix began an open competition to improve their movie recommendation system, offering a prize of one million dollars to the winners. During my senior year in college, I worked on a group senior project on just that problem.

Unfortunately for us, someone won the prize a month before we started. Drat! There goes the million dollars!

Of course, that didn't stop us from working on the project. As far as senior projects go, it was a great challenge--putting the machine learning algorithms we had learned (and those we hadn't) into practice on a (big) real-world data set, with all the messiness that entails.

My part of the project was to implement the SVD-based methods that had served the winning team so well. You can see my slides from the talk we gave at the completion of our project. Also, if you'd like to learn more about SVD and collaborative filtering problems, check out some of my answers on Cross Validated, the Stack Exchange site for statistics and data mining. It's been a few years, but I still remember the math!

Bioinformatics Internship

Summer 2009

Applying NLP to medical records

As a medical researcher, wouldn't it be great if you had 20,000 participants in each of your studies? Well, today, electronic health records and genetic sequencing allow the researchers at the Marshfield Clinic Personalized Medicine Research Project to do genetic studies at just such a scale. I had a chance to contribute in a small way to one of their genome-wide association studies as a summer intern in the Bioinformatics Research Center.

When conducting a medical study, deciding which participants have the relevant condition and which don't wouldn't seem like a very difficult problem. Well, it sure is when you have 20,000 medical records to look through. So (partial) automation is a must.

But how can computers read medical records? What if you don't have a carefully coded database of patients (thankfully Marshfield is doing pretty well in this area)? Here are my slides for the presentation I gave at the end of the summer on how I approached some of the problems that come up when implementing a natural language processing pipeline for medical records.

If you're looking for projects related to Chinese, I'll be posting them here.