Adventures in LTO Tape Drives
Where this started...
I have a home NAS. It has around 38TB of usable capacity backed by various ZFS pools, all with redundancy (so the raw storage is much larger) - I'm no /r/datahoarder by any definition, but it's substantial.
A common backup approach is 3-2-1: three copies, two different media types, and one off-site. With 38TB, it's difficult even to approach two copies, never mind one off-site.
So I compromised. Only about 2TB of that data has sentimental value and couldn't be recovered (though I have some pretty hard-to-find media, like audio copies of Douglas Adams reading The Hitchhiker's Guide to the Galaxy). That data is rcloned to a cloud object store - I also ship a full index of everything else.
A long time ago...
Over twenty years ago, I had my first job as a lowly IT technician - I answered the phone, did basic tasks, reimaged PCs, and was a tape monkey. Every morning I would take the tape out of the DLT 20/40 drive and put it in the postroom to be sent to another building.
I kinda loved tape drives - they seemed like a relic of the past (even then) compared to spinning rust, never mind SSDs that would become common decades later. The software sucked like hell, though. ArcServe and BackupExec, I'm looking at you (if you still use them and think they suck, imagine what they were like 20 years ago).
I moved into a different part of the IT industry, but I was always aware that tape backup continued to exist - while in machine rooms working on networking, I occasionally caught the equivalent of a 20-years-younger me pulling tapes out of libraries and taking them away.
One evening, in a fit of annoyance while looking at the pricing of Amazon S3, Backblaze B2, and Wasabi Storage, I decided to go eBay hunting for tape drives - what was the deal these days? There are rumours that Amazon S3 Glacier is partially built on tape, so it couldn't be that dead.
I came across LTO (Linear Tape-Open), which appeared to be the most developed and highest-capacity backup tape system available. It has different generations of specifications, from LTO-1 (100GB) to the latest LTO-9 (18TB).
eBay in the UK tends to be full of LTO-5 (1.5TB), with some LTO-4 (800GB) and some expensive LTO-6 (2.5TB). With the most competition among sellers in the LTO-5 family, it seems to be the best value for a hobbyist to get involved in.
Hunting For Drives
This isn't easy - I tried purchasing two different drives, both IBM models - one was an ex-tape library drive, the other an internal unit. Neither of them worked properly. They'd stop writing properly on the tape - even though they were being fed enough data.
smartd might cause issues with LTO tape drives. I'll now need to retest to see if there's any truth in that.
If you don't feed data to an LTO drive fast enough, the drive has to keep stopping and repositioning, running the tape backwards and forwards over the head. It's known as shoe-shining - this can cause wear on the heads and drive.
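One way to sanity-check whether a source can keep a drive streaming is to time a sequential read before writing anything to tape. This is only a rough sketch: `read_throughput` and `can_stream` are hypothetical helper names, and the minimum streaming rate has to come from your own drive's documentation (no figure is assumed here).

```python
import time


def read_throughput(path: str, chunk_size: int = 1024 * 1024) -> float:
    """Read a file start to finish and return the average rate in bytes/second."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        # Read in fixed-size chunks until read() returns an empty bytes object.
        while chunk := f.read(chunk_size):
            total += len(chunk)
    elapsed = time.perf_counter() - start
    return total / elapsed if elapsed > 0 else float("inf")


def can_stream(rate_bytes_per_s: float, drive_min_bytes_per_s: float) -> bool:
    """True if the measured source rate meets the drive's minimum streaming rate.

    drive_min_bytes_per_s is an assumption the caller supplies from the
    drive's documentation; it is not a value taken from this post.
    """
    return rate_bytes_per_s >= drive_min_bytes_per_s
```

A second read of the same file may be served from the OS cache and look much faster than the disks really are, so treat a single measurement as optimistic.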
New tapes, old tapes, cleaning tapes - nothing made a difference. The project sat in a corner - dejected, I gave up.
Except I didn't - about six months later, I looked at a huge box of tapes; it taunted me, and I got angry. Screw it - I'd give it one more go.
Failing straight out of the gate
I decided I'd give HP a shot. Both IBM and HP manufacture drives - I'd always gone for the IBM models as they have the smooth loading door.
I found an eBay listing, and I was excited - the picture showed the drive sitting on what appeared to be an original LTO static bag. It was apparently in good condition, though untested.
I bid - I won. I received it... it was not in good condition. There is no way the seller didn't know about this damage.
Anyone working on PCBs knows this isn't great - electrical connection pads had been ripped off the board. This drive had been abused - but only its connector. Otherwise, it seemed spotless.
Given my hobby of building PCBs, I thought I could fix this. I set out to see what the actual damage was. It turned out that while both mounting pads had ripped up, the only functional pads that were damaged were the SAS differential data lines.
I spent an entire evening trying to repair the connector - in the end, I managed it with jumper wire, solder bridges, and replacement capacitors.
I checked my connections again, traced the data lines, verified there was a solid connection and then applied large quantities of hot glue - a temporary fix until I could find a matching replacement connector (then epoxy).
Success
I plugged it into the SAS bus, and it magically appeared - giddy with excitement, I wrote data to the drive... it wrote at a solid 140MB/s for the whole 1.5TB of the tape. I finally had an LTO-5 tape drive that worked.
It turns out the tape drive had barely been used. It had an hour of power-on time when I first connected it. It was new - apart from the damaged connector. I can only assume some sysadmin installed or tested it on a bench and then broke the connector. That'd explain why it was still in the original bag - someone had just put it back in the bag and onto the shelf.
Since then, it's written terabytes of my data with zero write errors and no errors reading back from the tape. I'm pretty happy that my temporary fix will work for a while.
So I've started writing regular tape backups of my most important documents - it takes a little under three hours to fill a tape, but thankfully my documents are less than 1TB, and my photos are less than 800GB - so two tapes on separate nights.
My video collection, on the other hand, is now at 20TB - or about 15 tapes. Swapping one per night would take over two weeks to complete the backup. Even if I could change the tape instantly when full, it'd be under two days.
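The numbers above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch that assumes decimal units (1TB = 10^12 bytes), no compression, and a tape filled to its native capacity at a constant rate; the function names are just illustrative.

```python
import math

TB = 10**12  # decimal terabyte, as used for tape capacities
MB = 10**6   # decimal megabyte


def tapes_needed(data_bytes: float, tape_bytes: float) -> int:
    """Whole tapes required, assuming every tape is filled to native capacity."""
    return math.ceil(data_bytes / tape_bytes)


def hours_to_fill(tape_bytes: float, rate_bytes_per_s: float) -> float:
    """Hours to write one full tape at a constant sustained rate."""
    return tape_bytes / rate_bytes_per_s / 3600


tape = 1.5 * TB   # LTO-5 native capacity
rate = 140 * MB   # sustained write rate quoted in the post
videos = 20 * TB  # size of the video collection

count = tapes_needed(videos, tape)
per_tape = hours_to_fill(tape, rate)
print(f"{count} tapes, {per_tape:.2f} h per tape, {count * per_tape:.1f} h back to back")
```

The ideal count comes out one lower than "about 15", which is reasonable: files don't pack perfectly onto tapes, so a little slack is a sensible estimate.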
I need a robot!
So my next idea was to purchase a tape robot, or more likely, a tape library. These are large boxes that hold tapes as well as drives, and there's a little transport mechanism that moves tapes between the drives and storage slots.
Off to eBay I go again - I manage to find one that's a reasonable price not too far from me (they are huge, and postage would be more than the unit).
Now this unit came with a tape drive already present, an LTO-4 drive - which I don't have much use for. I had made an educated guess from the documentation that I'd probably be able to swap the drives - though the documentation on that is very poor.
The dirty secret is that the chassis of the MSL2024 is resold by many different vendors, each with its own vendor firmware - so all the manufacturers' drives must be compatible to a point.
Tape libraries use the Automation Control Interface (ACI) to talk to tape drives. It's fundamentally an RS-422 serial port. It allows them to swap information and commands and exposes the library over the tape drive's SAS interface. It also contains pins for negotiating power-on behaviour and tape drive presence for the library.
I'd previously been exposed to the serial port on the back of the LTO-5 drives, as I'd had to change an IBM drive to run in standalone mode - rather than library mode.
The problem is that the ACI connector was not standardised until LTO-5 - so the connector on the LTO-4 drive is only 14 pins, and the connector on the LTO-5 drive is 16 pins.
There is no public documentation on the layout of the LTO-4 ACI connector for this particular drive or, in fact, any drive I could find. So I had no option but to try and reverse engineer it.
The first thing to note on the board is the presence of a MAX3488 chip, which is also present on the LTO-5 drive. The chip is an RS-485/RS-422 transceiver, and it converts the differential serial signal to TTL UART signals.
The A/B and Y/Z pairs have meaning for RS-485: receiving and transmitting, respectively. It was easy to identify which connector pins connected to the transceiver and make observations about the other pins.
Pin | Purpose |
---|---|
1 | N/C |
2 | N/C |
3 | N/C |
4 | N/C |
5 | GND |
6 | 3.3V Pulled Up |
7 | N/C |
8 | 3.3V Pulled Up |
9 | GND |
10 | RS-422 Y |
11 | RS-422 Z |
12 | GND |
13 | RS-422 B |
14 | RS-422 A |
Using this list and referring to documentation of the ACI connector in an LTO-5 drive manual, I started to make some further assumptions. The GND and 3.3V Pulled Up lines seem to match up with the additional lines documented in the LTO-5 manual, just in reverse order and without skipping a pin.
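For anyone repeating this, it can help to keep the observed pinout as data so the groups are easy to compare against a manual. The sketch below only encodes the table above; it makes no claim about the LTO-5 layout, and `pins_by_purpose` is a hypothetical helper name.

```python
# Observed LTO-4 ACI connector pinout, copied from the table above.
LTO4_ACI_PINS = {
    1: "N/C", 2: "N/C", 3: "N/C", 4: "N/C",
    5: "GND", 6: "3.3V Pulled Up", 7: "N/C", 8: "3.3V Pulled Up",
    9: "GND", 10: "RS-422 Y", 11: "RS-422 Z", 12: "GND",
    13: "RS-422 B", 14: "RS-422 A",
}


def pins_by_purpose(pinout: dict[int, str]) -> dict[str, list[int]]:
    """Group pin numbers by their observed purpose, in ascending pin order."""
    groups: dict[str, list[int]] = {}
    for pin, purpose in sorted(pinout.items()):
        groups.setdefault(purpose, []).append(pin)
    return groups


groups = pins_by_purpose(LTO4_ACI_PINS)
# Y/Z is the transceiver's transmit pair, A/B its receive pair.
transmit_pair = (groups["RS-422 Y"][0], groups["RS-422 Z"][0])
receive_pair = (groups["RS-422 A"][0], groups["RS-422 B"][0])
print("GND:", groups["GND"], "TX:", transmit_pair, "RX:", receive_pair)
```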
Testing
So I reassembled the connector in the layout I believed it should be, assembled the drive, slotted it into the library, hit power, and waited. After a lot of fan noise and a minute's wait, Drive 1 RDY appeared on the front panel! It seemed like the drive had at least negotiated correctly.
To test it, I ordered the library to insert a tape into the drive, which it did. I then ordered it to move the tape back, and it did. The latter test told me the library could instruct the tape drive to eject the tape.
The final thing to do was to connect to it via SAS and query it. The first minor problem is that if the library has not completely booted, FreeBSD times out waiting for the changer to come online.
But if I wait long enough for the library to perform a full inventory and then connect the SAS connector, everything appears correctly - I'm not the first to see this.
ses0: sa0,pass23 in Enclosure Services Controller Electronics 1, SAS Port: 1 phys
ses0: phy 0: id 16 connector 0 other 8
ses0: phy 0: addr 50014380171db966
sa0 at mps0 bus 0 scbus16 target 30 lun 0
sa0: <HP Ultrium 5-SCSI Z6ED> Removable Sequential Access SPC-4 SCSI device
sa0: Serial Number HUE6331JKH
sa0: 600.000MB/s transfers
sa0: Command Queueing enabled
ch0 at mps0 bus 0 scbus16 target 30 lun 1
ch0: <HP MSL G3 Series 7.20> Removable Changer SPC-3 SCSI device
ch0: Serial Number MXA105Z059
ch0: 600.000MB/s transfers
ch0: Command Queueing enabled
ch0: 23 slots, 1 drive, 1 picker, 1 portal
Success!
root@argon:~ # chio status -v
picker 0: voltag: <:0>
slot 0: <ACCESS,FULL> voltag: <A18596L5:0>
slot 1: <ACCESS,FULL> voltag: <A23018L5:0>
slot 2: <ACCESS,FULL> voltag: <A22931L5:0>
slot 3: <ACCESS,FULL> voltag: <A24698L5:0>
slot 4: <ACCESS,FULL> voltag: <A04814L5:0>
slot 5: <ACCESS,FULL> voltag: <A23116L5:0>
slot 6: <ACCESS,FULL> voltag: <A22751L5:0>
slot 7: <ACCESS,FULL> voltag: <A24455L5:0>
slot 8: <ACCESS,FULL> voltag: <A24108L5:0>
slot 9: <ACCESS,FULL> voltag: <A23135L5:0>
slot 10: <ACCESS,FULL> voltag: <A22292L5:0>
slot 11: <ACCESS,FULL> voltag: <A21661L5:0>
slot 12: <ACCESS,FULL> voltag: <A21490L5:0>
slot 13: <ACCESS,FULL> voltag: <A20036L5:0>
slot 14: <ACCESS,FULL> voltag: <A21905L5:0>
slot 15: <ACCESS,FULL> voltag: <A18087L5:0>
slot 16: <ACCESS,FULL> voltag: <A20796L5:0>
slot 17: <ACCESS,FULL> voltag: <A20003L5:0>
slot 18: <ACCESS,FULL> voltag: <A19399L5:0>
slot 19: <ACCESS,FULL> voltag: <CLNU16L1:0>
slot 20: <ACCESS,FULL> voltag: <A20920L5:0>
slot 21: <ACCESS,FULL> voltag: <A19867L5:0>
slot 22: <ACCESS,FULL> voltag: <A19686L5:0>
portal 0: <INENAB,EXENAB,ACCESS> voltag: <:0>
drive 0: <ACCESS> voltag: <:0> serial number: <HP Ultrium 5-SCSI HUE6331JKH>
More success!
root@argon:~ # chio move slot 0 drive 1
chio: /dev/ch0: CHIOMOVE: Operation not supported by device
I'm sorry, what?! I couldn't help but feel defeated - had this been all for nothing?
Searching online, I found no references as to why this might occur. The only hint I found was another tool called mtx - it seemed to work in a different way at the SAS level.
root@argon:~ # mtx -f /dev/pass24 status
Storage Changer /dev/pass24:1 Drives, 24 Slots ( 1 Import/Export )
Data Transfer Element 0:Empty
Storage Element 1:Full :VolumeTag=A18596L5
Storage Element 2:Full :VolumeTag=A23018L5
Storage Element 3:Full :VolumeTag=A22931L5
Storage Element 4:Full :VolumeTag=A24698L5
Storage Element 5:Full :VolumeTag=A04814L5
Storage Element 6:Full :VolumeTag=A23116L5
Storage Element 7:Full :VolumeTag=A22751L5
Storage Element 8:Full :VolumeTag=A24455L5
Storage Element 9:Full :VolumeTag=A24108L5
Storage Element 10:Full :VolumeTag=A23135L5
Storage Element 11:Full :VolumeTag=A22292L5
Storage Element 12:Full :VolumeTag=A21661L5
Storage Element 13:Full :VolumeTag=A21490L5
Storage Element 14:Full :VolumeTag=A20036L5
Storage Element 15:Full :VolumeTag=A21905L5
Storage Element 16:Full :VolumeTag=A18087L5
Storage Element 17:Full :VolumeTag=A20796L5
Storage Element 18:Full :VolumeTag=A20003L5
Storage Element 19:Full :VolumeTag=A19399L5
Storage Element 20:Full :VolumeTag=CLNU16L1
Storage Element 21:Full :VolumeTag=A20920L5
Storage Element 22:Full :VolumeTag=A19867L5
Storage Element 23:Full :VolumeTag=A19686L5
Storage Element 24 IMPORT/EXPORT:Empty
Okay - great, we're using /dev/pass24 instead of /dev/ch0 - they do both belong to the MSL2024, though. Time to try loading a tape this way.
root@argon:~ # mtx -f /dev/pass24 load 1
Loading media from Storage Element 1 into drive 0...done
root@argon:~ # sg_read_attr -f 0x0806 /dev/nsa0
Barcode: A18596L5
root@argon:~ # mtx -f /dev/pass24 unload 1
Unloading drive 0 into Storage Element 1...done
It looks like everything actually works!
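The mtx status output is regular enough to parse, which is handy for scripting loads later. This is a minimal sketch, assuming the line format shown above; treating barcodes that start with CLN as cleaning cartridges is an assumption based on the label in slot 20, and the helper names are hypothetical.

```python
import re

# Matches lines such as:
#   Storage Element 1:Full :VolumeTag=A18596L5
#   Storage Element 24 IMPORT/EXPORT:Empty
SLOT_RE = re.compile(
    r"^\s*Storage Element (\d+)( IMPORT/EXPORT)?:(Full|Empty)"
    r"(?:\s*:VolumeTag=(\S+))?"
)


def parse_mtx_status(text: str) -> dict[int, dict]:
    """Map storage element number to its state, barcode, and portal flag."""
    slots = {}
    for line in text.splitlines():
        m = SLOT_RE.match(line)
        if m:
            number, portal, state, tag = m.groups()
            slots[int(number)] = {
                "full": state == "Full",
                "barcode": tag,  # None when no VolumeTag is reported
                "import_export": portal is not None,
            }
    return slots


def data_tapes(slots: dict[int, dict]) -> list[int]:
    """Full slots whose barcode does not start with CLN (assumed cleaning tapes)."""
    return sorted(
        n for n, s in slots.items()
        if s["full"] and s["barcode"] and not s["barcode"].startswith("CLN")
    )


def load_command(device: str, slot: int, drive: int = 0) -> list[str]:
    """Build (but do not run) an mtx load command for subprocess.run()."""
    return ["mtx", "-f", device, "load", str(slot), str(drive)]
```

Feeding it the captured output of `mtx -f /dev/pass24 status` gives a slot-to-barcode map, and `load_command("/dev/pass24", 1)` produces the argument list for the load shown above.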
Tweaking
It turns out that in the service mode of the MSL2024, an option allows you to disable the inventory scan on start.
This does actually allow the changer to be discovered by FreeBSD within its timeout, but it results in an "Incompatible Magazine" status. The first time you move a tape, the library is rescanned and never properly recovers.
Quiet down!
Unsurprisingly, the tape library and drive are loud - they aren't built for home environments, even the utility room. Cue the replacement of its two fans, the first on the back of the tape drive.
The new tape drive fan is 7.5mm shorter than the original, so I 3D printed a shim to take up the slack. Then onto the second in the back of the power supply.
Both replacement fans have a reduced CFM value, but monitoring suggests very little difference in the internal temperatures of the drive and library. Neither of the original fans was ever driven at full speed (you could hear them throttle down after initialisation).
I suspect the replacement fans are running closer to their limits - though, being Noctua NF series, you'd never know.
Security
This thing has none - there is notionally a PIN code for administrator access, with 100 million combinations.
However, this is rendered useless as the MSL2024 has telnet listening - with hardcoded static credentials. I could accept a challenge and response system for their support teams, but this is ridiculous. I won't give away the passwords here for 7.2, but the users are tadmin, tsupport and secret - the passwords aren't good.
What's left?
Pretty much the only thing left to do is set up a piece of backup software in a way that I like.
The modern trendy tool is Veeam, but as far as I know, they don't have a tape drive agent for FreeBSD.
Bacula or Amanda are the common go-to tools in the OSS community - but both have things I'm not hugely keen on. Now that I have the library, I need software that will let me easily span backups across multiple tapes.
We'll see what I decide on...