UPDATE: You can now get the DBIx::Class::CompressColumns module on CPAN, or install it via the CPAN command-line shell.
I’m going to get geeky on you for a minute, but you should find this interesting. One of the challenges I’ve had with SocialToo recently has been the massive amount of Social Graph data we’ve had to store, process, and track. We cache a lot of the data so we don’t have to hit Twitter’s servers as often, and also so we can regularly track new follows and unfollows on behalf of our users.
If you are a SocialToo user you may have noticed that your data hasn’t been as accurate lately as it should be. The reason is that a) we have 20,000+ users all wanting to auto-follow or have their follower base tracked, and b) each of those users has anywhere from 100 to nearly 1 million followers that we have to store and process. It’s not an easy task! And our database, set up in a relational manner with followers mapped to users, just wasn’t cutting it when it came to retrieving and processing so many followers at a time.
So I took a cue from Bret Taylor and FriendFeed, who have talked about how they denormalized their database and now reference “bags” of data that they process in their code. I went for a hybrid model: each user row now has a single BLOB column that contains all the social graph data for that user. In Perl, I simply create a hash structure of the data, freeze it, and store it in the database in our social graph column. To retrieve it, we pull it from the database, thaw it, and we have an entire social graph we can play around with and do with as we please.
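For those who haven’t used freeze and thaw before, here is a minimal sketch of that round trip using the core Storable module. The shape of the hash and the follower IDs are made up for illustration; they are not SocialToo’s actual structure.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Hypothetical social graph for one user (illustrative IDs only).
my %graph = (
    followers => { 1001 => 1, 1002 => 1, 1003 => 1 },
    following => { 2001 => 1 },
);

# freeze() serializes the structure into a binary string that can be
# written to a BLOB column.
my $blob = freeze(\%graph);

# thaw() rebuilds the original structure from that string.
my $restored = thaw($blob);

print scalar(keys %{ $restored->{followers} }), " followers\n";
```

If the frozen data might be read on a machine with a different architecture, Storable’s nfreeze is the portable variant of freeze, and thaw reads both.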
The issue I was running into, however, is that the uncompressed data stored in a single column for a user with 1 million followers gets to be quite a large amount of data to pull through the pipes. I needed an easy way to compress the data before inserting it into the database, store it in binary format, and decompress it on the way back out. I also wanted it to be automatic, so no coder would ever have to worry about this extra step. It would just happen magically.
So today I’m releasing DBIx::Class::CompressColumns for all you Perl coders out there. This module sits on top of Perl’s DBIx::Class database abstraction libraries and lets you monitor a single column. Any inserts or updates into that column get compressed in Zlib format, and any selects/get_column calls to that data (you must use get_column) get decompressed. That means you don’t have to worry at all about the extra step, the data has a significantly smaller footprint, and much less data travels to and from the database, which puts much less load on it. For one million followers, I measured just 4 megabytes that have to go in and out of MySQL.
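To illustrate the idea (not the module’s actual code or API), here is a rough, self-contained sketch of compress-on-write and decompress-on-read. GraphStore is a hypothetical in-memory stand-in for a database row, and its set_column/get_column methods are named only to echo the description above. It uses the core Storable and Compress::Zlib modules.

```perl
use strict;
use warnings;
use Storable qw(nfreeze thaw);
use Compress::Zlib qw(compress uncompress);

package GraphStore;    # hypothetical stand-in for a table row, not DBIx::Class

sub new { return bless { columns => {} }, shift }

# Serialize, then zlib-compress, before "storing" the value.
sub set_column {
    my ($self, $name, $data) = @_;
    my $packed = Compress::Zlib::compress(Storable::nfreeze($data));
    die "compression failed" unless defined $packed;
    $self->{columns}{$name} = $packed;
    return length $packed;    # bytes actually stored
}

# Decompress, then deserialize, when reading the value back.
sub get_column {
    my ($self, $name) = @_;
    my $raw = Compress::Zlib::uncompress($self->{columns}{$name});
    die "decompression failed" unless defined $raw;
    return Storable::thaw($raw);
}

package main;

# Made-up follower IDs, purely for illustration.
my %graph = (followers => [ 1 .. 100_000 ]);

my $row    = GraphStore->new;
my $stored = $row->set_column(social_graph => \%graph);
my $back   = $row->get_column('social_graph');

printf "uncompressed: %d bytes, stored: %d bytes, followers read back: %d\n",
    length(nfreeze(\%graph)), $stored, scalar @{ $back->{followers} };
```

The calling code only ever sees the plain hash; the compression lives entirely inside the two accessors, which is the "it just happens" behavior described above. How well real follower data compresses will depend on the data itself.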
Approaching Graph optimization in this manner has significantly sped up our processes, and I’m already seeing huge benefits from it. There is much less load on the database, it’s much faster to retrieve and process the data, and we’re getting through our users’ followers much faster now.
The module namespace is currently being applied for on CPAN, and I’ll post a link there as soon as it is approved, but for now you can download the Makefile-compatible gzipped library from the link below. I hope some of you find this useful, and please feel free to modify it or send me any updates or bugs you think I missed!
The link for the download is http://socialtoo.com/DBIx-Class-CompressColumns-0.01000.tar.gz
Oh, and TMTOWTDI, so if you have better ways of approaching this, I’d love to hear your ideas!
Photo courtesy rp72