Better late than never (A Curious Moon)

I keep saying that you never know when some knowledge would be useful. So there’s no such thing as useless knowledge. You’ve just not found usage for it yet. Some time ago I was praising The Imposter’s Handbook. I’ve liked it so much that I’ve bought and started reading A Curious Moon of the same author. I was doing some SQL at the time, but it wasn’t my major task. And I knew nothing about PostgreSQL, apart that it existed. But I’ve read good chunk of this book (sadly I’ve never finished it, but I will, someday, I promise) and decided that it’s cool, but that I probably won’t use this knowledge. My company is using MS SQL and won’t switch to PostgreSQL in any foreseeable future, or ever.

I was always interested in movies. I love them. And as I’m doing a lot of work from command line interface I wanted to be able to check on movies without leaving it. I already check my Twitter stream from CLI and as I follow One Perfect Shot I wanted to be able to check on movies they mention without launching IMDb app on my phone. I knew that IMDb provides dumps of its DB, but I’ve decided it’s too hard and cumbersome to use it (that was before I’ve read mentioned book). And soon I’ve forgotten all about it. But I’ve written a small app to utilise API. The service isn’t great however. For instance if you’ll search for “rambo”: you’ll also get things like “Syndicate Sadists” or “R… Rajkumar”. Well…

Recently I was reminded about IMDb dumps and with the knowledge gained from Curious Moon I’ve decided to give it a go. It didn’t take me long and I had an app, created database and proper tables, and even were able to update the data periodically (IMDb publishes dumps daily, but I don’t need it to be so fresh) via cron job. All thanks to Rob Conery and his little book. That’s really comprehensive take on technology. As it’s not only teaching you how to use PostgreSQL query wise (which isn’t so different from SQL), but also how to normalise data before import and how to automate the whole process of maintaining the database. Honestly I don’t know if I would ever think of using makefiles for databases. It’s one of these things which seem obvious once you know it, but wasn’t much before.

A word about normalisation. IMDb uses IDs like nm0000001 and tt0000001 in its dumps, which obviously are strings. But in DB it’s better to use IDs which are numbers. So one of the things I’m doing while normalising data before import is to convert these text IDs into numeric ones. This way I’ve learnt hard lesson of why you should use iterators rather than operating on array as a whole. Transforming 35,982,380 or more rows on my desktop wasn’t big issue, but it’s killed my VPS. It’s simply drained all memory:(

And that’s not the only usage I’ve found for PostgreSQL. Some time ago I’ve found that my government publishes data about surnames (among many other things) of people in my country. It’s in the form of CSV, so I could’ve used Excel to play with it. Or even create a small app. But why?! If I have a database engine already installed on my computer (and I don’t have Excel). OK, maybe I’m becoming a bit addicted. And actually analysing data on frequency of surnames in your country is really interesting. Not to mention it helped me to destroy false conviction about which names are popular and which aren’t. Don’t trust your guts on anything kids, check the data.

I’ve also learnt hard lesson about not pushing external data straight to “production”. At some point somebody at IMDb has made a mistake, so some dumps had duplicated rows. As I’m using unique keys in my tables which correspond to IDs in dump, obviously import failed. And as I’m pruning all my data first and only then importing dumps to fresh tables. Well… I had a day without access to movie database;)

A Curious Moon tells a story of female engineer who’s tasked with processing data from Cassini mission which gathered data about Enceladus moon. Which is quite interesting and gives another level to this book. It doesn’t feel like a technical book. Well, almost, as you still have the code parts. I highly recommend it.

BTW I’ve already bought The Imposter’s Handbook, Season 2. So far I’ve got a bit bored by chapters on boolean algebra and logic, maybe because I’ve read a lot on these subjects already. But I bet it gets better.