Adventures in LTO Tape Drives

A box of used LTO-5 tapes - you can tell by the colour and the L5 at the end of the barcode.

Where this started...

I have a home NAS. It has around 38TB of usable capacity backed by various ZFS pools, all with redundancy (so the raw capacity is much more) - I'm no /r/datahoarder by any definition, but it's substantial.

"Two is One and One is None" - Usually attributed to the US military.

A common backup approach is 3-2-1: three copies, two different media types and one off-site. With 38TB, it becomes difficult even to approach two copies, never mind one off-site.

So I compromised. Only about 2TB of that data has sentimental value that couldn't be recovered (though I have some pretty hard-to-find media, like audio copies of Douglas Adams reading The Hitchhiker's Guide to the Galaxy). That data is rcloned to a cloud object store - I also ship a full index of everything else.
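How that index is built isn't described here, so the following is only a minimal sketch of one way to do it: walk a directory tree and write one CSV row per file. The function name `write_index` and the column layout are assumptions made for illustration.

```python
import csv
import os


def write_index(root, index_path):
    """Walk `root` and write one CSV row per file: relative path, size, mtime.

    Returns the number of files indexed. Hypothetical helper, not the
    method used in this article.
    """
    count = 0
    with open(index_path, "w", newline="", encoding="utf-8",
              errors="surrogateescape") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "size_bytes", "mtime_epoch"])
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # visit subdirectories in a stable order
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                try:
                    st = os.stat(full)
                except OSError:
                    continue  # unreadable or vanished file: leave it out
                writer.writerow(
                    [os.path.relpath(full, root), st.st_size, int(st.st_mtime)]
                )
                count += 1
    return count
```

An index like this is tiny compared with the data it describes, so it is cheap to keep off-site alongside the irreplaceable files.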

A long time ago...

Over twenty years ago, I had my first job as a lowly IT technician - I answered the phone, did basic tasks, reimaged PCs, and was a tape monkey. Every morning I would take the tape out of the DLT 20/40 drive and put it in the postroom to be sent to another building.

I kinda loved tape drives - they seemed like a relic of the past (even then) compared to spinning rust, never mind SSDs that would become common decades later. The software sucked like hell, though. ArcServe and BackupExec, I'm looking at you (if you still use them and think they suck, imagine what they were like 20 years ago).

I moved into a different part of the IT industry, but I was always aware that tape backup continued to exist - while in machine rooms working on networking, I occasionally caught the equivalent of 20 years younger me pulling tapes out of libraries and taking them away.

One evening, in a fit of annoyance while looking at the pricing of Amazon S3, Backblaze B2, and Wasabi Storage, I decided to go eBay hunting for tape drives - what was the deal these days? There are rumours that Amazon S3 Glacier is partially built on tape, so it couldn't be that dead.

I came across LTO (Linear Tape-Open), which appeared to be the most developed and high-capacity backup tape system available. It has different generations of specifications, starting from LTO-1 (100GB) to the latest LTO-9 (18TB).

eBay in the UK tends to be full of LTO-5 (1.5TB), with some LTO-4 (800GB) and some expensive LTO-6 (2.5TB). With most sellers competing in the LTO-5 generation, it seems to be the best value for a hobbyist to get involved in.

Hunting For Drives

This isn't easy - I tried purchasing two different drives, both IBM models - one was an ex-tape library drive, the other an internal unit. Neither of them worked properly. They'd stop writing properly on the tape - even though they were being fed enough data.

While reading up on a couple of points I'd forgotten for this article, there was a suggestion that smartd might cause issues with LTO tape drives. I'll now need to retest to see if there's any truth in that.

If you don't feed data to an LTO drive fast enough, the drive has to keep stopping, backing up and restarting, so the tape ends up running backwards and forwards over the head. It's known as shoe shining - this can cause wear on the heads, the drive and the tape.
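The usual defence against shoe shining is to put a large memory buffer between the data source and the drive, which is what tools such as mbuffer do. As a minimal sketch of the idea (not something this article says it used), a reader thread can fill a bounded queue while the main thread drains it to the output in fixed-size blocks. The name `buffered_copy` and the default sizes are assumptions chosen for illustration.

```python
import queue
import threading


def buffered_copy(src, dst, block_size=512 * 1024, max_blocks=2048):
    """Copy `src` to `dst` through a bounded in-memory queue.

    The reader thread keeps the queue topped up, so short stalls on the
    source side are absorbed by the buffer instead of starving the writer.
    Returns the number of bytes written.
    """
    blocks = queue.Queue(maxsize=max_blocks)
    errors = []

    def reader():
        try:
            while True:
                chunk = src.read(block_size)
                if not chunk:
                    break
                blocks.put(chunk)  # blocks while the buffer is full
        except Exception as exc:  # hand read errors to the main thread
            errors.append(exc)
        finally:
            blocks.put(None)  # sentinel: no more data

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    total = 0
    while True:
        chunk = blocks.get()
        if chunk is None:
            break
        dst.write(chunk)
        total += len(chunk)
    thread.join()
    if errors:
        raise errors[0]
    return total
```

With the defaults the buffer holds up to 1GiB. This sketch starts writing as soon as the first block arrives; a real tool would typically also wait for the buffer to fill to some threshold before starting the drive.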

New tapes, old tapes, cleaning tapes - nothing made a difference. The project sat in a corner - dejected, I gave up.

Except I didn't - about six months later, I looked at a huge box of tapes; it taunted me, and I got angry. Screw it - I'd give it one more go.

Failing straight out of the gate

Picture from the eBay listing.

I decided I'd give HP a shot. Both IBM and HP manufacture drives - I'd always gone for the IBM models as they have the smooth loading door.

I found an eBay listing, and I was excited - the picture showed the drive sitting on what appeared to be an original LTO static bag. It was apparently in good condition, though untested.

I bid - I won. I received it... it was not in good condition. There is no way the seller didn't know about this damage.

Damage showing the SAS connector having been lifted from the board... badly.

Anyone working on PCBs knows this isn't great - electrical connection pads had been ripped off the board. This drive had been abused - but only its connector. Otherwise, it seemed spotless.

Given my hobby of building PCBs - I thought I could fix this. I set out to see what the actual damage was. It turned out that while the mounting pads had both ripped up, the only functional pads that were damaged were the SAS differential data lines.

After "repair", I'll admit it doesn't look great.

I spent an entire evening trying to repair the connector - in the end, I managed it with solder bridges on top of jumper wire, plus replacement capacitors.

I checked my connections again, traced the data lines, verified there was a solid connection and then applied large quantities of hot glue - a temporary fix until I could find a matching replacement connector (then epoxy).

Success

I plugged it into the SAS bus, and it magically appeared - giddy with excitement, I wrote data to the drive... it wrote at 140MB/s solidly for the whole 1.5TB of the tape. I finally had an LTO-5 tape drive that worked.

Device statistics page (ssc-3 and adc)
  Lifetime media loads: 9
  Lifetime cleaning operations: 0
  Lifetime power on hours: 25
  Lifetime media motion (head) hours: 11
  Lifetime metres of tape processed: 232880
  Lifetime power cycles: 8
Usage statistics from the tape drive about 24 hours after I fixed it.

It turns out the tape drive had barely been used. It had an hour of power on use when I first connected it. It was new - apart from the damaged connector. I can only assume some sysadmin installed or tested it on a bench and then broke the connector. That'd explain why it was still in the original bag - someone had just put it back in the bag and onto the shelf.
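For anyone wanting to track these counters over time, the statistics page above is simple "name: value" text. A small sketch, assuming the output format shown above stays stable (the function name `parse_stats` is made up for illustration):

```python
STATS = """\
Device statistics page (ssc-3 and adc)
  Lifetime media loads: 9
  Lifetime cleaning operations: 0
  Lifetime power on hours: 25
  Lifetime media motion (head) hours: 11
  Lifetime metres of tape processed: 232880
  Lifetime power cycles: 8
"""


def parse_stats(text):
    """Turn 'name: integer' lines into a dict; other lines are ignored."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(": ")
        if sep and value.isdigit():
            stats[name] = int(value)
    return stats


stats = parse_stats(STATS)
# Average tape travel per hour of head motion, from the two counters above.
metres_per_motion_hour = (
    stats["Lifetime metres of tape processed"]
    / stats["Lifetime media motion (head) hours"]
)
```

Logging the parsed values after each backup would show whether the load count, motion hours and cleaning operations grow as expected.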

Since then, it's written terabytes of my data with zero write errors and no errors reading back from the tape. I'm pretty happy that my temporary fix will work for a while.

So I've started writing regular tape backups of my most important documents - it takes a little under three hours to fill a tape, but thankfully my documents are less than 1TB, and my photos are less than 800GB - so two tapes on separate nights.

My video collection, on the other hand, is now at 20TB - or about 15 tapes. Swapping one per night would take over two weeks to complete the backup. Even if I could change the tape instantly when full, it'd be under two days.
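Those figures can be checked with a little arithmetic. The sketch below assumes decimal units (1TB = 10^12 bytes), the native LTO-5 capacity of 1.5TB, and the sustained write rate of 140 megabytes per second seen earlier; the variable names are just for illustration.

```python
import math

TAPE_CAPACITY_BYTES = 1500 * 10**9   # LTO-5 native capacity, 1.5TB
WRITE_RATE_BYTES_S = 140 * 10**6     # sustained rate observed on this drive
COLLECTION_BYTES = 20 * 10**12       # video collection, 20TB


def hours_to_write(n_bytes, rate=WRITE_RATE_BYTES_S):
    """Hours needed to stream n_bytes at a constant rate."""
    return n_bytes / rate / 3600


tapes_needed = math.ceil(COLLECTION_BYTES / TAPE_CAPACITY_BYTES)
hours_per_tape = hours_to_write(TAPE_CAPACITY_BYTES)
hours_total = hours_to_write(COLLECTION_BYTES)

print(f"tapes needed:   {tapes_needed}")
print(f"hours per tape: {hours_per_tape:.2f}")
print(f"hours in total: {hours_total:.1f}")
```

That works out to 14 full tapes at just under three hours each, in line with the "about 15" estimate once some headroom is allowed, and roughly 40 hours of continuous writing if tape changes were instant.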

I need a robot!

So my next idea was to purchase a tape robot, or more likely, a tape library. These are large boxes that hold tapes as well as drives, and there's a little transport mechanism that moves tapes between the drives and storage slots.

Off to eBay I go again - I manage to find one that's a reasonable price not too far from me (they are huge, and postage would be more than the unit).

My HP StorageWorks MSL2024, 2U holding "24" tapes - it's really 23 with a mailslot.

Now this unit came with a tape drive already present, an LTO-4 drive - which I don't have much use for. I had made an educated guess from the documentation that I'd probably be able to swap the drives - though the documentation on that is very poor.

An LTO drive that has been removed from a tape library, not mine specifically.

The dirty secret is that the chassis of the MSL2024 is resold by many different vendors, each with its own vendor firmware - so all the manufacturers' drives must be compatible to a point.

Tape libraries use the Automation Control Interface (ACI) to talk to tape drives. It's fundamentally an RS-422 serial port. It allows them to swap information and commands and exposes the library over the tape drive's SAS interface. It also contains pins for negotiating power-on behaviour and tape drive presence for the library.

I'd previously been exposed to the serial port on the back of the LTO-5 drives, as I'd had to change an IBM drive to run in standalone mode - rather than library mode.

ACI interface on the back of the LTO-4 drive that came with the library.

The problem is that the ACI connector was not standardised until LTO-5 - so the connector on the LTO-4 drive is only 14 pins, and the connector on the LTO-5 drive is 16 pins.

Original wiring of library serial cable in the drive sled.

There is no public documentation on the layout of the LTO-4 ACI connector for this particular drive or, in fact, any drive I could find. So I had no option but to try and reverse engineer it.

The ACI port on the PCB of the LTO-4 tape drive.

The first thing to note on the board is the presence of a MAX3488 chip, which is also present on the LTO-5 drive. The chip is an RS-485/422 transceiver, and it converts the differential serial signal to TTL UART signals.

On the transceiver, the A/B pair is the receiver input and the Y/Z pair is the driver output, so they carry the receive and transmit signals, respectively. It was easy to identify which connector pins connected to the transceiver and make observations about the other pins.

Pin  Purpose
1    N/C
2    N/C
3    N/C
4    N/C
5    GND
6    3.3V Pulled Up
7    N/C
8    3.3V Pulled Up
9    GND
10   RS-422 Y
11   RS-422 Z
12   GND
13   RS-422 B
14   RS-422 A
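When rewiring a connector it helps to have the observed pinout in a form that can be queried and compared against a target layout. A minimal sketch that simply encodes the table above (the names `LTO4_ACI_PINS` and `pins_with` are made up for illustration, and the pinout is only what was observed on this one drive):

```python
# Observed 14-pin ACI connector on this particular LTO-4 drive.
LTO4_ACI_PINS = {
    1: "N/C",
    2: "N/C",
    3: "N/C",
    4: "N/C",
    5: "GND",
    6: "3.3V Pulled Up",
    7: "N/C",
    8: "3.3V Pulled Up",
    9: "GND",
    10: "RS-422 Y",
    11: "RS-422 Z",
    12: "GND",
    13: "RS-422 B",
    14: "RS-422 A",
}


def pins_with(prefix, pinout=LTO4_ACI_PINS):
    """Return the sorted pin numbers whose purpose starts with `prefix`."""
    return sorted(
        pin for pin, purpose in pinout.items() if purpose.startswith(prefix)
    )


serial_pins = pins_with("RS-422")    # pins wired to the transceiver
ground_pins = pins_with("GND")
pulled_up_pins = pins_with("3.3V")   # candidate sense/negotiation lines
```

Only four pins carry the serial link; the pulled-up lines are the ones that need matching against the LTO-5 documentation.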

Using this list and referring to documentation of the ACI connector in an LTO-5 drive manual, I started to make some further assumptions. The GND and 3.3V Pulled Up lines seem to match up with the additional lines documented in the LTO-5 manual, just in reverse order and not skipping a pin.

https://docs.oracle.com/cd/E21419_04/en/LTO5_Vol1_E4/LTO5_Vol1_E4.pdf

Testing

So I reassembled the connector in the layout I believed it should be, assembled the drive, slotted it into the library, hit power, and waited. After a lot of fan noise and a minute's wait, Drive 1 RDY. appeared on the front panel! It seemed like the drive had at least negotiated correctly.

Success on the web interface.

To test it, I ordered the library to insert a tape into the drive, which it did. I then ordered it to move the tape back, and it did. The latter test told me the library could instruct the tape drive to eject the tape.

The final thing to do was to connect to it via SAS and query it. The first minor problem is that if the library has not completely booted, FreeBSD times out waiting for the changer to come online.

(ch0:mps0:0:30:1): MODE SENSE(6). CDB: 1a 00 1d 00 20 00
(ch0:mps0:0:30:1): SCSI sense: NOT READY asc:4,1 (Logical unit is in process of becoming ready)
(ch0:mps0:0:30:1): fatal error, failed to attach to device
FreeBSD is getting bored waiting for the library to come online.

But if I wait long enough for the library to perform a full inventory and then connect the SAS connector, everything appears correctly - I'm not the first to see this.

ses0: sa0,pass23 in Enclosure Services Controller Electronics 1, SAS Port: 1 phys
ses0:  phy 0: id 16 connector 0 other 8
ses0:  phy 0: addr 50014380171db966
sa0 at mps0 bus 0 scbus16 target 30 lun 0
sa0: <HP Ultrium 5-SCSI Z6ED> Removable Sequential Access SPC-4 SCSI device
sa0: Serial Number HUE6331JKH
sa0: 600.000MB/s transfers
sa0: Command Queueing enabled
ch0 at mps0 bus 0 scbus16 target 30 lun 1
ch0: <HP MSL G3 Series 7.20> Removable Changer SPC-3 SCSI device
ch0: Serial Number MXA105Z059
ch0: 600.000MB/s transfers
ch0: Command Queueing enabled
ch0: 23 slots, 1 drive, 1 picker, 1 portal

Success!
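Waiting by hand for the inventory to finish could in principle be scripted. The sketch below is a generic poll-until-ready helper, not something tested against this library: the `probe` callable is hypothetical and stands in for whatever check reports that the changer is answering.

```python
import time


def wait_until_ready(probe, timeout_s=600.0, interval_s=10.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Call `probe()` until it returns True or `timeout_s` has elapsed.

    Returns True if the probe succeeded, False on timeout. `clock` and
    `sleep` are injectable so the helper can be exercised without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

On FreeBSD, the probe might wrap a status query against the changer, and a `camcontrol rescan all` once it succeeds may let ch0 attach without replugging the SAS cable - both are assumptions rather than results from this setup.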

root@argon:~ # chio status -v
picker 0:  voltag: <:0>
slot 0: <ACCESS,FULL> voltag: <A18596L5:0>
slot 1: <ACCESS,FULL> voltag: <A23018L5:0>
slot 2: <ACCESS,FULL> voltag: <A22931L5:0>
slot 3: <ACCESS,FULL> voltag: <A24698L5:0>
slot 4: <ACCESS,FULL> voltag: <A04814L5:0>
slot 5: <ACCESS,FULL> voltag: <A23116L5:0>
slot 6: <ACCESS,FULL> voltag: <A22751L5:0>
slot 7: <ACCESS,FULL> voltag: <A24455L5:0>
slot 8: <ACCESS,FULL> voltag: <A24108L5:0>
slot 9: <ACCESS,FULL> voltag: <A23135L5:0>
slot 10: <ACCESS,FULL> voltag: <A22292L5:0>
slot 11: <ACCESS,FULL> voltag: <A21661L5:0>
slot 12: <ACCESS,FULL> voltag: <A21490L5:0>
slot 13: <ACCESS,FULL> voltag: <A20036L5:0>
slot 14: <ACCESS,FULL> voltag: <A21905L5:0>
slot 15: <ACCESS,FULL> voltag: <A18087L5:0>
slot 16: <ACCESS,FULL> voltag: <A20796L5:0>
slot 17: <ACCESS,FULL> voltag: <A20003L5:0>
slot 18: <ACCESS,FULL> voltag: <A19399L5:0>
slot 19: <ACCESS,FULL> voltag: <CLNU16L1:0>
slot 20: <ACCESS,FULL> voltag: <A20920L5:0>
slot 21: <ACCESS,FULL> voltag: <A19867L5:0>
slot 22: <ACCESS,FULL> voltag: <A19686L5:0>
portal 0: <INENAB,EXENAB,ACCESS> voltag: <:0>
drive 0: <ACCESS> voltag: <:0> serial number: <HP      Ultrium 5-SCSI  HUE6331JKH>

More success!

root@argon:~ # chio move slot 0 drive 1
chio: /dev/ch0: CHIOMOVE: Operation not supported by device

I'm sorry, what?! I couldn't help but feel defeated - had this been all for nothing?

Searching online, I found no references as to why this might occur. The only hint I found was another tool called mtx - it seemed to work in a different way at the SAS level.

root@argon:~ # mtx -f /dev/pass24 status
  Storage Changer /dev/pass24:1 Drives, 24 Slots ( 1 Import/Export )
Data Transfer Element 0:Empty
      Storage Element 1:Full :VolumeTag=A18596L5
      Storage Element 2:Full :VolumeTag=A23018L5
      Storage Element 3:Full :VolumeTag=A22931L5
      Storage Element 4:Full :VolumeTag=A24698L5
      Storage Element 5:Full :VolumeTag=A04814L5
      Storage Element 6:Full :VolumeTag=A23116L5
      Storage Element 7:Full :VolumeTag=A22751L5
      Storage Element 8:Full :VolumeTag=A24455L5
      Storage Element 9:Full :VolumeTag=A24108L5
      Storage Element 10:Full :VolumeTag=A23135L5
      Storage Element 11:Full :VolumeTag=A22292L5
      Storage Element 12:Full :VolumeTag=A21661L5
      Storage Element 13:Full :VolumeTag=A21490L5
      Storage Element 14:Full :VolumeTag=A20036L5
      Storage Element 15:Full :VolumeTag=A21905L5
      Storage Element 16:Full :VolumeTag=A18087L5
      Storage Element 17:Full :VolumeTag=A20796L5
      Storage Element 18:Full :VolumeTag=A20003L5
      Storage Element 19:Full :VolumeTag=A19399L5
      Storage Element 20:Full :VolumeTag=CLNU16L1
      Storage Element 21:Full :VolumeTag=A20920L5
      Storage Element 22:Full :VolumeTag=A19867L5
      Storage Element 23:Full :VolumeTag=A19686L5
      Storage Element 24 IMPORT/EXPORT:Empty

Okay - great, we're using /dev/pass24 instead of /dev/ch0 - they do both belong to the MSL2024, though. Time to try loading a tape this way.
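Comparing the two listings, the same barcodes appear in both, but mtx numbers storage elements from 1 while chio numbers slots from 0. A small sketch for turning the mtx status text into a lookup table and translating between the two numbering schemes, assuming the output format shown above (the names `parse_mtx_slots` and `chio_slot` are made up for illustration):

```python
import re

# Matches lines such as:
#   Storage Element 1:Full :VolumeTag=A18596L5
#   Storage Element 24 IMPORT/EXPORT:Empty
SLOT_RE = re.compile(
    r"Storage Element (\d+)( IMPORT/EXPORT)?:(Full|Empty)"
    r"(?:\s*:VolumeTag=(\S+))?"
)


def parse_mtx_slots(text):
    """Map mtx storage element number -> portal flag, full flag, barcode."""
    slots = {}
    for match in SLOT_RE.finditer(text):
        slots[int(match.group(1))] = {
            "portal": match.group(2) is not None,
            "full": match.group(3) == "Full",
            "barcode": match.group(4),  # None when no tag is reported
        }
    return slots


def chio_slot(mtx_element):
    """Translate a 1-based mtx storage element to chio's 0-based slot.

    Only meaningful for ordinary slots; chio lists the mailslot separately
    as 'portal 0'.
    """
    return mtx_element - 1
```

Both tools number the single drive 0 (`drive 0` in chio, `Data Transfer Element 0` in mtx), so the `drive 1` in the failed chio command does not correspond to anything in either listing. Whether that explains the CHIOMOVE error is not established here.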

root@argon:~ # mtx -f /dev/pass24 load 1
Loading media from Storage Element 1 into drive 0...done
root@argon:~ # sg_read_attr -f 0x0806 /dev/nsa0
  Barcode: A18596L5
root@argon:~ # mtx -f /dev/pass24 unload 1
Unloading drive 0 into Storage Element 1...done

It looks like everything actually works!

Tweaking

It turns out that in the service mode of the MSL2024, an option allows you to disable the inventory scan on start.

Super secret options in the service user that absolutely doesn't have a pin of 42311324.

This does actually allow the changer to be discovered by FreeBSD within its timeout, but it results in an "Incompatible Magazine" status. The first time you move a tape, the library is rescanned and never properly recovers.

Quiet down!

Unsurprisingly, the tape library and drive are loud - they aren't built for home environments, even the utility room. Cue the replacement of its two fans, the first on the back of the tape drive.

Noctua NF-A4x20 FLX retrofit to the back of the LTO-5 drive.

The new tape drive fan is 7.5mm shorter than the original, so I 3D printed a shim to take up the slack. Then onto the second in the back of the power supply.

Noctua NF-A6x25 FLX retrofit in the back of the library PSU.

Both replacement fans have a reduced CFM value, but monitoring suggests very little difference to the internal temperatures of the drive and library. Neither of the original fans was ever driven at full speed (you could hear them throttle down after initialisation).

I suspect the replacement fans are running closer to their limits - though, being Noctua NF series, you'd never know.

Security

This thing has none - there is notionally an eight-digit pin code for administrator access, so 100 million combinations.

However, this is rendered useless as the MSL2024 has telnet listening - with hardcoded static credentials. I could accept a challenge and response system for their support teams, but this is ridiculous. I won't give away the passwords here for 7.2, but the users are tadmin, tsupport and secret - the passwords aren't good.

What's left?

Pretty much the only thing left to do is set up a piece of backup software in a way that I like.

The modern trendy tool is Veeam, but as far as I know, they don't have a tape drive agent for FreeBSD.

Bacula or Amanda are the common go-to tools in the OSS community - but both have things I'm not hugely keen on. Now that I have the library, I need software that will let me easily span backups across multiple tapes.

We'll see what I decide on...