I’ve completed a fresh install of Ubuntu on an old box that used to run Solaris. The system disk died, and after installing a new (old) drive and then installing Ubuntu, I found that two of the disks still in the computer were part of a ZFS RAID. I’m not sure how many disks were in the raid, but I’m curious as to what is on this file system, which has been shut down and inaccessible for 7–8 years. (I later confirmed that most files on the system are from 2008 and earlier.)
There were only 4 SATA connectors available on the board. Two were used by the drives already in there, and it didn’t take long to find the matching drives to plug in.
Installing ZFS
I began by installing ZFS, following the instructions here:
Install ZFS on Ubuntu—Server as Code
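For reference, on the Ubuntu release I was running the install boiled down to something like the following. (This is a sketch assuming the zfs-native PPA that guides of the day pointed at; the linked article may differ, and newer Ubuntu releases ship zfsutils-linux in the main archive instead.)

```
# add the ZFS on Linux PPA and install the packages
sudo apt-get install software-properties-common
sudo add-apt-repository ppa:zfs-native/stable
sudo apt-get update
sudo apt-get install ubuntu-zfs

# confirm the kernel module loads
sudo modprobe zfs
lsmod | grep zfs
```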
Installing SSH Server
I also had to install an SSH server, because this box is located remotely. Any generic OpenSSH Server install guide will do; the one I used was SSH/OpenSSH/Configuring.
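The install itself boils down to a couple of commands; roughly:

```
# install the OpenSSH server and check it is running
sudo apt-get install openssh-server
sudo service ssh status
```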
Because my server is behind a firewall and not publicly accessible, I haven’t worried too much about logging in via SSH keys. I have done it before, and I do recommend it, but this box is only going to be a temporary install. A couple of related tutorials on passwordless SSH key access to servers (the short version is sketched after the links):
https://help.ubuntu.com/community/SSH/OpenSSH/Keys << a good resource, includes troubleshooting
http://www.mccarroll.net/blog/rpi_cluster2/index.html
https://www.howtoforge.com/tutorial/ssh-and-scp-with-public-key-authentication/
https://www.raspberrypi.org/documentation/remote-access/ssh/passwordless.md
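If you do want key-based login, the short version looks something like this (user and hostname below are placeholders):

```
# on the client: generate a key pair (if you don't already have one)
ssh-keygen -t rsa -b 4096

# copy the public key to the server
ssh-copy-id user@server

# then log in without a password
ssh user@server
```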
Installing Samba
I also installed Samba, following these instructions: How to Create a Network Share Via Samba Via CLI (Command-line interface/Linux Terminal) – Uncomplicated, Simple and Brief Way!
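In brief, and assuming the pool ends up mounted at /solaraid, the Samba side looks roughly like this (the share name and user below are just examples, not what the guide prescribes):

```
sudo apt-get install samba

# add a read-only share to the end of /etc/samba/smb.conf, e.g.:
#   [solaraid]
#       path = /solaraid
#       read only = yes
#       browsable = yes

# give an existing Linux user a Samba password, then restart the service
sudo smbpasswd -a myuser
sudo service smbd restart
```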
With SSH, Samba and the ZFS modules installed, configured and running… let’s try and rebuild this raid :)
Let’s have a look!
disks:
```
$ lsblk
NAME                          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                             8:0    0 111.8G  0 disk
├─sda1                          8:1    0   243M  0 part /boot
├─sda2                          8:2    0     1K  0 part
└─sda5                          8:5    0 111.6G  0 part
  ├─ubuntu--vg-root (dm-0)    252:0    0 110.6G  0 lvm  /
  └─ubuntu--vg-swap_1 (dm-1)  252:1    0  1016M  0 lvm  [SWAP]
sdb                             8:16   0 698.7G  0 disk
├─sdb1                          8:17   0 698.6G  0 part
└─sdb9                          8:25   0     8M  0 part
sdc                             8:32   0 698.7G  0 disk
├─sdc1                          8:33   0 698.6G  0 part
└─sdc9                          8:41   0     8M  0 part
sdd                             8:48   0 698.7G  0 disk
├─sdd1                          8:49   0 698.6G  0 part
└─sdd9                          8:57   0     8M  0 part
sde                             8:64   0 698.7G  0 disk
├─sde1                          8:65   0 698.6G  0 part
└─sde9                          8:73   0     8M  0 part
sr0                            11:0    1  1024M  0 rom
```
and for pools specifically:
```
$ sudo zpool import
   pool: solaraid
     id: 10786192747791980338
  state: ONLINE
 status: The pool is formatted using a legacy on-disk version.
 action: The pool can be imported using its name or numeric identifier, though
         some features will not be available without an explicit 'zpool upgrade'.
 config:

        solaraid                                ONLINE
          raidz1-0                              ONLINE
            ata-WDC_WD7500AACS-00C7B0_WD-WCASN  ONLINE
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE
```
OK, this PC only had 4 SATA data drives in it, and it appears I’ve found the correct ones. Things are looking good from the start.
Let’s do it!
```
:~$ sudo zpool import solaraid
:~$ sudo zpool status
  pool: solaraid
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Nov 11 21:37:59 2015
    11.5M scanned out of 1.63T at 1.15M/s, 412h58m to go
    2.60M resilvered, 0.00% done
config:

        NAME                                    STATE     READ WRITE CKSUM
        solaraid                                ONLINE       0     0     0
          raidz1-0                              ONLINE       0     0     0
            ata-WDC_WD7500AACS-00C7B0_WD-WCASN  ONLINE       0     0     0
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     2  (resilvering)

errors: No known data errors
```
YOUCH! 412 hours… That’s 17 days! I gave it a couple of seconds to stabilise and ran it again, and it came up with an error:
```
:~$ sudo zpool status
  pool: solaraid
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: resilvered 2.60M in 0h0m with 0 errors on Wed Nov 11 21:38:24 2015
config:

        NAME                                    STATE     READ WRITE CKSUM
        solaraid                                ONLINE       0     0     0
          raidz1-0                              ONLINE       0     0     0
            ata-WDC_WD7500AACS-00C7B0_WD-WCASN  ONLINE       0     0     0
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     2

errors: No known data errors
```
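As the output’s action line suggests, if those two checksum errors turn out to be a one-off hiccup the counters can simply be reset rather than replacing anything; something like:

```
# reset the pool's error counters once the cause is understood
sudo zpool clear solaraid
```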
Reboot and Ubuntu bootup fail
(long winded fluff about nothing, you can scroll down to “Resilvering” if you’d like to skip this)
I do recall I had issues with this raid. I could certainly go the route of upgrading it first and trying again, but once I did a
tree -L 2 /solaraid/
and saw things I had long thought were gone, I decided to back this up first :)
The only problem is, the raid is installed in a box with only a 10/100 network card :(
I’ll let it run overnight, taking off only what I need, and see how we go. This has been a good find.
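With only a 10/100 link, I was only pulling across the bits I actually cared about. A rough sketch of the sort of copy I’d run from the other end (user, host and paths below are placeholders):

```
# pull a directory tree from the box over SSH; resumable if the link drops
rsync -avz --partial --progress user@server:/solaraid/photos/ ~/recovered/photos/
```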
At this point I was working from the house and the server is located off-site. I had several ssh/terminal windows open to the box, and as I was working away I kept getting the message to reboot the system. I issued the relevant reboot command and set off a ping to tell me when it came back online… It didn’t come back online.
I went to the server and found it was still booting. This was after more than half an hour, and eventually it gave up, crashed, and restarted again.
For the next couple of hours I could not get the machine to boot, and I was blaming the old boot drive. But after eventually getting into the “Try Ubuntu” mode from the DVD, I found that one of the drives was not being reported in the system. Another was coming up as totally unknown, and two were seen as part of a set. It took several hours to get to the bottom of it. Eventually, through the BIOS, I could see one of the drives wasn’t being detected.
A couple of SATA cable changes and some swapping of power cables around, and I was back in business.
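For anyone chasing a similar fault: matching the serial numbers in /dev/disk/by-id against what the BIOS reports is a quick way to spot which physical drive has dropped out, e.g.:

```
# list drives by serial-numbered name and see which /dev/sdX each maps to
ls -l /dev/disk/by-id/ata-*
```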
Resilvering
```
:~$ sudo zpool status
  pool: solaraid
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    1.73G scanned out of 1.63T at 32.3M/s, 14h40m to go
    12.5K repaired, 0.10% done
config:

        NAME                                    STATE     READ WRITE CKSUM
        solaraid                                ONLINE       0     0     0
          raidz1-0                              ONLINE       0     0     0
            ata-WDC_WD7500AACS-00C7B0_WD-WCASN  ONLINE       0     0     0
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0    13  (repairing)
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0

errors: No known data errors
```
Over the next few minutes I kept polling the status and it was picking up speed.
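Rather than re-typing the command, something like this does the polling (the interval is arbitrary; prefix with sudo if your setup needs it):

```
# re-run zpool status every 60 seconds and show just the scan lines
watch -n 60 "zpool status solaraid | grep -A 2 scan:"
```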
```
  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    24.3G scanned out of 1.63T at 84.5M/s, 5h31m to go
    12.5K repaired, 1.46% done

  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    32.1G scanned out of 1.63T at 97.0M/s, 4h47m to go
    12.5K repaired, 1.93% done

  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    56.9G scanned out of 1.63T at 102M/s, 4h29m to go
    12.5K repaired, 3.42% done

  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    184G scanned out of 1.63T at 137M/s, 3h4m to go
    184K repaired, 11.01% done
```
It’s starting to slow down again, and we’re seeing more errors!
```
  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    195G scanned out of 1.63T at 93.6M/s, 4h28m to go
    354K repaired, 11.68% done
```
The next day, something had gone wrong. I’m still unsure what happened, but the whole `solaraid` pool became unresponsive. Where the scrub had got to the evening/morning before, 195GB in, was where it still was when I checked it later in the day, and the pool was otherwise not responding. I tried a remote reboot and again it hung.
At present I’m still putting it down to hardware, but the root of the problem really is unknown.
It’s been back up and running for a while now, and its present status is:
```
  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    1.41T scanned out of 1.63T at 160M/s, 0h23m to go
    1.23M repaired, 86.89% done
config:

        NAME                                    STATE     READ WRITE CKSUM
        solaraid                                ONLINE       0     0     0
          raidz1-0                              ONLINE       0     0     0
            ata-WDC_WD7500AACS-00C7B0_WD-WCASN  ONLINE       0     0     0
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0   251  (repairing)
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     6  (repairing)

errors: No known data errors
```
When I captured the above, I hadn’t realised the process was almost finished until I pasted and read over it.
```
  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    1.62T scanned out of 1.63T at 158M/s, 0h0m to go
    1.29M repaired, 99.46% done
```
As I write this the process has been running for exactly 24 hours. We have 0.5% left.
The final capture:
```
:/solaraid$ sudo zpool status
  pool: solaraid
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 1.29M in 12h0m with 0 errors on Thu Nov 12 15:13:05 2015
config:

        NAME                                    STATE     READ WRITE CKSUM
        solaraid                                ONLINE       0     0     0
          raidz1-0                              ONLINE       0     0     0
            ata-WDC_WD7500AACS-00C7B0_WD-WCASN  ONLINE       0     0     0
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0   291
            ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0    12

errors: No known data errors
```
The scrub has checked every checksum across 1.6TB of data and repaired 1.29MB. The process took exactly 12 hours (with a reboot thrown in there JUST to push the boundaries a little bit).
Next is to get off any data we want… Let’s grab some directory information.
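Before dragging files across that slow link, a manifest of what’s actually on the pool is handy; something along these lines (the output file names are arbitrary):

```
# dump a full file listing with sizes and modification dates to read through later
find /solaraid -type f -printf '%s\t%TY-%Tm-%Td\t%p\n' > ~/solaraid-manifest.txt

# or just the top couple of directory levels, as before
tree -L 2 /solaraid/ > ~/solaraid-tree.txt
```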
To be continued…
Amazon has now introduced Glacier, and it is priced at 1/10 the storage cost of EBS. If your cloud backup model is like ours, in that it is there for DR purposes and really won’t be needed for any file-level recovery, Glacier is a much more appealing proposition. I haven’t tested Glacier yet, but from what I can see it is driven by the Amazon API in Java or .NET. They provide simple Java and .NET code samples to push data up to and retrieve data from Glacier.

I would attempt to implement this for ZFS by choosing an upload repeat cycle based on the rate of change of the data I’m backing up. For example, push a full ZFS send once a month, then daily increments. Since there is no ZFS receive on Glacier, you would have one large archive plus smaller daily archives. After a month, a full send would happen again, and once confirmed successful by Amazon the previous cycle gets deleted. If a restore is needed, the whole batch of archives would need to be retrieved and the ZFS pool reassembled from them. I would probably add a check before deleting that the new data set is larger than the previous month’s, to prevent something really bad from happening in an automated way.

As they market it, Glacier is not good for regular retrievals of data; it will take 3 to 5 hours to get data ready to download. I designed the architecture of our ZFS folders and snapshot schedules so that all of our backups are in snapshots. Anything that needs to be kept for long periods of time is in folders whose snapshots stick around for years or longer; this is kept separate from data with higher change rates and shorter snapshot deletion cycles. If lots of your data is changing frequently, Glacier is most likely not the backup solution for you.
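A rough sketch of that send cycle (the dataset and snapshot names below are placeholders, and the actual Glacier upload would be whatever the Java/.NET samples or your own tooling provide):

```
# monthly: snapshot and stream a full send into an archive for upload
zfs snapshot tank/backup@monthly-2015-11
zfs send tank/backup@monthly-2015-11 | gzip > /staging/full-2015-11.zfs.gz
# ...upload full-2015-11.zfs.gz to Glacier...

# daily: incremental send relative to the monthly snapshot
zfs snapshot tank/backup@daily-2015-11-13
zfs send -i tank/backup@monthly-2015-11 tank/backup@daily-2015-11-13 \
  | gzip > /staging/inc-2015-11-13.zfs.gz

# restore: retrieve the archives from Glacier, then replay them in order, e.g.
#   gunzip -c full-2015-11.zfs.gz     | zfs receive tank/restored
#   gunzip -c inc-2015-11-13.zfs.gz   | zfs receive tank/restored
```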