Monday, May 28, 2007

Data packrat

I'm finally moving forward with my plan to get all my life data into digital form.  The infrastructure is there (see earlier posts), with lots of redundant disk space and subversion repositories.  Now I just have to clean out the file cabinets.

All the various entities that send me monthly statements (banks, brokerages, utilities) are trying to get me onto some sort of "paperless" plan (and, IMO, going about it in a pretty stupid way) by offering some sort of online statements.  That's great going forward, but what about existing statements?  (And, how long should I really hold on to these records?  Can I throw away those tax returns from 1987?  Won't they be useful to my biographer?)  They all seem to offer some set of past statements in downloadable form; some going back only a few months, some going back seven years. 

Most will only let you see the older statements if you agree to let them stop mailing you statements.  (You can rescind that agreement at any time, so you know what I did.  I don't really _want_ the paper statements if I can have good PDFs, but e-mail is so unreliable that I hesitate to let them use e-mail to send me important notices that might be indicators of identity theft or other bad things.)  So I downloaded all the statements I could find, scanned some of the others I thought were worth having, and relegated the paper copies to a box in the basement that I wouldn't be upset to lose.

Not one of the dozen banks, utilities, or brokerages has the statement download thing right.  None of them has a "download all my statements" feature, which makes downloading seven years' worth pretty annoying.  (And all are implemented in ways that prevent you from shortcutting around their bad UIs or scripting it yourself.)  None offers any sort of scriptable interface for downloading statements, so if you want to continue to gather statements, you have to visit twelve web sites.  (I'd like to have the PDFs delivered right into my Quicken; they've been talking about electronic bill presentment for years but I don't see it here yet.)  Some make it easier by offering an option to e-mail you the PDF monthly in addition to the physical delivery; some only offer that as an alternative to the physical delivery.  Some (Wells Fargo) won't even let you download any e-statements unless you consent to online-only delivery (and the online statements don't have the check images that the physical statements do).  Guess I'll be "consenting" for them five minutes a year to get the past year's statements, yuck.

Bulk scanning turns out to be not so easy with cheap consumer-grade scanners.  I bought a Visioneer RoadWarrior for receipts and such, but use the scanning features of my HP LaserJet 3050 for bulk scanning because it has a document feeder.  But it's still pretty slow, and the software sucks.  (I'm surprised that the throughput with the RoadWarrior is bound not by the physical scanning speed, but by the software that turns the scan into the appropriate file format and drops it into a drop folder.)  So I ended up not scanning everything I thought I would, at least not in the first round.  Slowly migrating...

Subterranean data center, part II

Moving the server and all the network hardware to the basement was great -- it got it out of my closet -- but that presented a problem for the wireless, because the wireless router wasn't strong enough to get a signal up to the back bedroom on the second floor, where we have a squeezebox and need a steady supply of bits.  I have a couple of the cheap Linksys routers (I am always surprised at how useful it turns out to have various extra computer parts lying around.)  Unfortunately, the Linksys firmware doesn't do what I wanted -- which was to have one box do the gateway router stuff (NAT, DHCP, etc) and another act as a wireless access point.  They want you to buy the more expensive Access Point version of the box, which is identical except for the firmware.

So, I installed the DD-WRT firmware on one of my Linksys routers, which lets it act as an access point -- among many other things.  DD-WRT is a Linux-based distribution for cheap hardware routers, which includes all sorts of networking software not supported by the out-of-the-box firmware (e.g., access point and access point client modes, IPv6, VPN, WPA (client and server), port forwarding, QoS management, SNMP, DMZ, etc.)

As often happens, the road was bumpy but in the end everything worked fine.  There are half a dozen different versions of the popular WRT54G, so the instructions might not fit your version exactly.  (I learned this on the part where it says "pull firmly to remove the bezel", and my version had screws holding the circuit board to the bezel...and pulling firmly ripped them out.)  Despite following all the directions carefully, the first flashing attempt failed, and I had "bricked" my router.  I followed the various "debricking" instructions, and eventually had to resort to the most extreme, where you have to short a few pins on the flash chip to restore it to its default state...and eventually I got DD-WRT loaded onto the box.  From there, it was smooth sailing; the web-based admin GUI was easy.

Once I had DD-WRT running, it was easy to configure it as an access point -- and if I need better coverage, I can just add more.  Also, a cheap router + DD-WRT is the cheapest way to put a wired ethernet device onto a wireless network; run it in "access point client" mode.  Much cheaper than buying a device designed for this purpose...

Subterranean data center

I've been getting paranoid lately about data loss.  This was almost certainly prompted by a disk failure at my old business that caused some actual data loss.  As seems to happen a lot, the disk failure also disclosed a failure in our otherwise sensible-seeming backup program, with the result being that I lost several months of archived e-mail, among other things.  Disk failure rates seem to be on the rise; the combination of rising areal densities and the public's clear choice of "cheap" over "reliable" virtually guarantees it.  (See, for example, http://australianit.news.com.au/story/0,24897,21553519-15321,00.html.)  And with larger capacities, the negative consequences of a disk failure are that much greater.

So, about six months ago I embarked on a domestic data infrastructure program to reduce my risk.  This includes:

  • Relocating my server system to the basement, where the temperature is probably more to its liking (and where additional noise was not going to bother me).  This necessitated running lots of Cat 6 cable through the walls; I put a gigabit switch in the basement and ran cable runs to most of the rooms where data would be needed.  (Lesson learned: no matter how many cable runs you think you need, run more.  Pulling 2 wires is only marginally more expensive than pulling one...) 

  • Attaching a RAID array to the server system.

  • Getting a hosting provider with reasonable storage limits where I can put some of my data so it is accessible from off my own private network.

  • Migrating all critical data into SVN repositories.


For the RAID system, I put a 3Ware 9500S-4LP hardware RAID card in my Linux system (about $300). This has four SATA ports.  The reason I went with hardware RAID instead of motherboard RAID (sometimes called "fake hardware raid") is that the hardware solution seemed to offer more in the way of hot migration and upgrades.  Building my own RAID system turned out to be more of a hassle than expected, mostly because I ended up ordering parts from multiple vendors because no one carried all the parts I needed.  I bought the RAID card and the drives (three 500G drives) for a total of $850.  I bought a four-bay enclosure from Addonics.  I opted to spring for "multilane SATA", which allows multiple SATA drives to be connected over a single cable; this required adapters at both the enclosure side and the system side, since both the enclosure and the system just had four regular internal SATA connectors.  (Running four cables from system to enclosure seemed like it was asking for trouble.)  The trickiest part turned out to be getting the right SATA multilane cable; turns out there are two different types of SATA multilane connectors (screw type and latch type), and many enclosures and adapters are vague about which kind they need.  So I ended up buying the wrong cable first, and then had to buy the right kind from sataparts.com.  Once I got the RAID system physically put together, it was pretty easy.  My Linux distro already had the right 3Ware driver installed, and the controller had a nice web interface that let me configure the volume set.  With RAID 5, the three 500G drives show up as a 1TB SCSI disk, which I partitioned using LVM.

I could have bought a NAS box, but six months ago the choices were pretty weak. (I suspect this has gotten slightly better.)  Would have been less hassle to put together, and maybe cheaper, but I'm sure there would have been compromises too.  I'm pretty happy with the hardware RAID solution, and I've got a choice of upgrade paths.  (I could throw another 500G in, and have it rebalance the data across four drives giving me 1.5TB, or when the cheap 1TB drives come out, I can pull one 500G out, let the array run in "degraded mode", throw two 1TBs in, create a "degraded" RAID set from them, move the data, then pull the 500G drives and put the third TB drive in giving me 2TB.) 
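
For anyone checking the capacity numbers above, here is a minimal sketch of the RAID 5 arithmetic: one drive's worth of space goes to parity, so usable space is (number of drives - 1) times the smallest drive.  The function name is made up for illustration, and sizes are in round decimal gigabytes, the same rounding used above.

```python
def raid5_usable_gb(drive_sizes_gb):
    """Usable capacity of a RAID 5 set, in GB.

    RAID 5 spends one drive's worth of space on parity, and every member
    contributes only as much as the smallest drive in the set.
    """
    if len(drive_sizes_gb) < 3:
        raise ValueError("RAID 5 needs at least three drives")
    return (len(drive_sizes_gb) - 1) * min(drive_sizes_gb)


# The array as built: three 500G drives.
current = raid5_usable_gb([500, 500, 500])

# Upgrade path 1: add a fourth 500G drive.
four_drives = raid5_usable_gb([500, 500, 500, 500])

# Upgrade path 2: replace everything with three 1TB drives.
three_tb_drives = raid5_usable_gb([1000, 1000, 1000])

print(current, four_drives, three_tb_drives)
```

This comes out to 1000, 1500, and 2000 GB, matching the 1TB, 1.5TB, and 2TB figures in the text.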

My system is on my home network, which is connected to the internet via a consumer-grade NAT firewall.  So getting out is easy, but getting in is hard.  I could have gone the dynamic DNS route, but I chose instead to get a hosting provider for files that I wanted access to from outside.  I set up a hosting account at www.textdrive.com, which is great.  They make it really easy to set up SVN, WebDAV, etc, so I set up two SVN repositories on my hosted system for files I need roving access to (such as presentation slides, in case I get to a conference and my laptop doesn't.)  I set up two because SVN doesn't have good support for actually removing things from repositories, so they tend to grow over time.  So there's a "permanent" and "transient" repository; the transient repository is for short-lived projects where after some point I won't need the history any more.  SVN turns out to be a reasonably nice solution for accessing the same file from multiple systems, since I tend to either be at home and use my desktop system exclusively, or be on the road and use my laptop exclusively.

I decided to get all my data into SVN, after being inspired by this article from Jason Hunter.  Even for data that you don't think is ever going to change, like photos (hey, what about Photoshop?), SVN turns out to be a pretty good solution.  If you get a new computer, you can just do one checkout and all your data is there.  Keeping an up-to-date checkout on your home and laptop systems (in addition to the server) mitigates a number of data loss scenarios.  I'm not there yet -- I'm still migrating, but I'm making progress.

The big question mark now is the backup strategy -- backing up a terabyte is pretty hard.
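
To put a rough number on "pretty hard", here is a back-of-the-envelope sketch of how long it takes just to move a terabyte.  The two link speeds are assumptions picked for illustration (a 1 Mbit/s uplink to an off-site host, and the gigabit switch in the basement), not measurements, and real throughput would be lower than the raw link rate.

```python
SECONDS_PER_DAY = 86400
ONE_TB_BYTES = 10**12  # decimal terabyte


def transfer_days(size_bytes, rate_bits_per_sec):
    """Days needed to move size_bytes at a sustained raw link rate."""
    return size_bytes * 8 / rate_bits_per_sec / SECONDS_PER_DAY


# Assumed 1 Mbit/s uplink (hypothetical figure for a home connection).
offsite_days = transfer_days(ONE_TB_BYTES, 1_000_000)

# Assumed gigabit LAN running at full line rate (optimistic).
lan_hours = transfer_days(ONE_TB_BYTES, 1_000_000_000) * 24

print(f"off-site: {offsite_days:.1f} days, LAN: {lan_hours:.1f} hours")
```

Under those assumptions a full off-site copy takes about three months, while a local copy takes a couple of hours, which is why the first full backup is the painful part.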

Thursday, May 24, 2007

Squeezebox update: iTunes support redux

Well, the vbrfix program didn't work quite as advertised, and borked the mp3s I ran it on.  Fortunately, they had all been converted from FLAC, so restoring was simply a matter of re-running the flac-to-mp3 script (which does take a while to chew on 300G of music.)  But the "Fix MP3 Header" option in "foobar2000" does the trick, and now iTunes is happy with my MP3s.  But since fb2k is a Windows app, it means that the incremental conversion process when new FLAC files are added has a manual step, rather than one I can script.
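
The part of the incremental process that can still be scripted is working out which FLAC files need (re)converting.  Here is a minimal sketch of that step only, assuming the MP3 tree mirrors the FLAC tree's directory layout; the function name and layout are hypothetical, and this is not the actual flac-to-mp3 script.  It does not run an encoder, and it does nothing about the foobar2000 header fix.

```python
from pathlib import Path


def pending_conversions(flac_root, mp3_root):
    """Return (flac_path, mp3_path) pairs that still need converting.

    A FLAC file is pending if its mirrored MP3 is missing, or if the MP3
    is older than the FLAC (i.e. the FLAC was re-ripped or re-tagged).
    """
    flac_root = Path(flac_root)
    mp3_root = Path(mp3_root)
    pending = []
    for flac in sorted(flac_root.rglob("*.flac")):
        # Mirror the relative path under the MP3 tree, swapping the suffix.
        mp3 = mp3_root / flac.relative_to(flac_root).with_suffix(".mp3")
        if not mp3.exists() or mp3.stat().st_mtime < flac.stat().st_mtime:
            pending.append((flac, mp3))
    return pending
```

Each returned pair would then be handed to whatever encoder the conversion script uses, and the resulting MP3s are the ones that need the manual header fix.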

Thursday, May 17, 2007

JCiP best seller at JavaOne 2007

For the second year in a row, Java Concurrency in Practice was the best selling book at the JavaOne bookstore...thanks everyone!