Sunday, April 07, 2013

Videolamer Search Project Redux

A little over a year ago (has it really been over a year?), I tried my hand at a personal coding project. I wanted to index all of the Videolamer archives into an Apache Solr instance, to allow full-text search across all of the content. I got pretty far, far enough to get a basic webpage up. Unfortunately, the work stopped cold one day.

This happens to me all the time, and I'm getting tired of it. Just once I'd like to finish something I start, so I'm picking this project back up again, from scratch. I'm going to do things differently this time, to see if it makes a difference.

  1. I'm going to break up the work using the Pomodoro system. This will entail breaking the work up into subtasks, as well as making time estimates.
  2. I'm going to try and document the process from the very start.
  3. I'm not going to be impatient. If there is one problem which I know has undone me in the past, it is sloppy results due to a lack of patience.

Here is my initial breakdown of the major tasks for this project (these will be broken down further into subtasks):

  • Export the data from the VL site.
  • Install a local WordPress instance.
  • Import VL data into the local WordPress instance.
  • Clean up VL articles.  This is something I learned from the last attempt - the formatting used in each article is not consistent.  This will be detrimental when it comes to indexing the articles in Solr.  I need to make sure they all have proper HTML markup, and that each intro paragraph has some sort of "intro" id that I can use to return the opening paragraph in search results.
  • Index local SQL database (containing all the articles) into Solr.
  • Test article search using the Solr interface.  Tweak if necessary.
  • Prepare the website.  I will once again try and just use ajax-solr for this part.  If it doesn't work well enough, I'll consider switching over to a Rails app.
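
The cleanup and indexing steps above can be sketched in a few lines of Python. This is only a rough illustration, not necessarily how the project will end up doing it: the function names and the Solr field names (`id`, `title`, `content`) are made up for the example, and a regex is a fragile way to handle real-world HTML, so a proper parser would likely be safer for inconsistent articles.

```python
import json
import re

# Matches the first opening <p> tag, with or without attributes.
# It will not match tags such as <pre> or <param>.
FIRST_P = re.compile(r"<p(\s[^>]*)?>", re.IGNORECASE)

# Detects an existing id="intro" anywhere in the markup.
HAS_INTRO = re.compile(r"id\s*=\s*[\"']intro[\"']", re.IGNORECASE)


def mark_intro(html):
    """Add id="intro" to the first <p> tag unless the article already has one."""
    if HAS_INTRO.search(html):
        return html

    def add_id(match):
        attrs = match.group(1) or ""
        return '<p id="intro"' + attrs + ">"

    return FIRST_P.sub(add_id, html, count=1)


def to_solr_doc(post_id, title, html):
    """Build one document dict; the field names here are hypothetical."""
    return {"id": str(post_id), "title": title, "content": mark_intro(html)}


def to_solr_json(posts):
    """Serialize (post_id, title, html) tuples as a JSON array of documents."""
    return json.dumps([to_solr_doc(*post) for post in posts])
```

For example, `mark_intro("<p>Hello</p><p>World</p>")` tags only the first paragraph, and `to_solr_json` produces a JSON array of documents that could be adapted to whatever schema the Solr instance ends up using.
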
PS - This post took less than 1 Pomodoro to write.
