Reset admin account login

I’ve had to do this more than I care to mention. I should really know it by heart by now, but the short version is:

  1. boot into the GRUB menu and select Advanced options
  2. head into recovery mode
  3. drop to the root shell prompt
  4. you’ll need to remount the root filesystem, which is currently mounted read-only:
    mount -o rw,remount /
  5. change the password:
    passwd <your-username>
  6. exit
  7. you’re done, reboot.
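
Put together, the root-shell part is just this (a minimal sketch; I’m assuming the read-only filesystem being remounted is the root filesystem, which it is in the standard recovery-mode case):

mount -o rw,remount /
passwd <your-username>
exit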

For a guide with pictures, see:

How to reset your password in Ubuntu

Flush DNS cache in Ubuntu

According to quite a few users on Stack Exchange, Ubuntu doesn’t cache DNS queries, but I got into an awkward situation where I couldn’t reach an internal server because my installation was convinced it was being hosted on an external IP.

It wasn’t entirely incorrect, because the server is (sometimes) available externally, but only as needed. I have a few web services that need to be reached from outside, and although I could use different ports, for ease I just change the rules according to which service I want to use that day. I know, it’s very awkward, and all I really need to do is set up a reverse proxy, but I am yet to get that done. It’s on “the list”! lol

I am using pfsense and the way I have it set up, the services I develop on need to resolve internally and externally. That means when I type https://subdomain.example.com/ and I am inside my network, I need it to resolve to 10.0.0.10 (for example), but when I am not on the network I need it to resolve to my public IP address.

pfsense handles this perfectly.

What I didn’t realise is that, because I mis-configured something, local queries were hitting my public IP address and being served that way. Meaning, my DNS requests were being directed to my external IP address instead of my internal one. When I changed the order of the rules to allow a different service access to the HTTPS port, I could no longer access my original service.

Long story, short: my internal server was being accessed by my computer via the public IP address instead of the internal one. 

I fixed the entries in the firewall, and nslookups and digs on the URL now returned the correct internal IP.

Cool!

But it didn’t work. I still couldn’t access the service I needed, because Ubuntu was still accessing it from the external IP address.

Not Cool!

I restarted the resolver in pfsense in case for some reason it hadn’t stuck, but that wasn’t it. As I said, digging the firewall returned the correct IP address, but pinging returned the external one. It took me some time to work out that it was Ubuntu that had cached the lookup, and that it wasn’t the firewall now stopping me from accessing that page.

Let me show you by example:

subdomain2.example.url (all URLs and IP addresses have been changed to protect the innocent) is responding with an internal IP address.

dave@home:~$ dig subdomain2.example.url

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> subdomain2.example.url
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63440
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;subdomain2.example.url.			IN	A

;; ANSWER SECTION:
subdomain2.example.url.		1655	IN	A	192.168.1.78

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Sun Aug 26 23:58:37 AEST 2018
;; MSG SIZE  rcvd: 57

subdomain1.example.url is returning a public IP address (that’s not desired).

dave@home:~$ dig subdomain1.example.url

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> subdomain1.example.url
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2906
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;subdomain1.example.url.		IN	A

;; ANSWER SECTION:
subdomain1.example.url.	7112	IN	A	69.68.67.66

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Sun Aug 26 23:58:48 AEST 2018
;; MSG SIZE  rcvd: 61

I knew this wasn’t quite right, but I couldn’t work out exactly why. So I asked my router where it thought the domain belonged:

dave@home:~$ dig @192.168.1.1 subdomain1.example.url

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> @192.168.1.1 subdomain1.example.url
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 1846
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;subdomain1.example.url.		IN	A

;; ANSWER SECTION:
subdomain1.example.url.	3600	IN	A	192.168.2.111

;; Query time: 0 msec
;; SERVER: 192.168.1.1#53(192.168.1.1)
;; WHEN: Mon Aug 27 00:08:02 AEST 2018
;; MSG SIZE  rcvd: 61

As you can see, the router knows the correct internal IP address, but Ubuntu is getting it from somewhere else. 

dave@home:~$ dig subdomain1.example.url

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> subdomain1.example.url
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 33482
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;subdomain1.example.url.		IN	A

;; ANSWER SECTION:
subdomain1.example.url.	6551	IN	A	69.68.67.66

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Mon Aug 27 00:08:09 AEST 2018
;; MSG SIZE  rcvd: 61

Then, when I googled and found the Stack Exchange post referenced above, where most answers were saying Ubuntu does not cache DNS requests, I just got confused. Not being confident of what I was looking at, I didn’t understand that the server address (127.0.0.53#53) was a loopback address for the local DNS resolver. I knew the port meant it was DNS, but I didn’t understand where the answer was being served from. Yes, I should have quickly realised it was only a loopback address, but I thought it was coming from the router (because of the DNS port entry).

Hence my confusion: when I asked the router directly, it reported correctly, but the computer was getting its answer from somewhere else.

And that TTL was killing me. I knew if I rebooted the computer it would more than likely fix the issue (it would have), but I had too many things open and I just didn’t want to go down that path. 

Eventually the light bulb turned on and I realised that Ubuntu simply must be caching it itself and I needed to identify the service and restart it. 

Scrolling right down the Stack post I found my answer. Although there are several answers with differing methods of restarting the service, I went for the systemctl approach.

sudo systemctl restart systemd-resolved.service

a ping and a dig confirm we’re good to go:

dave@redbox1804:~$ ping subdomain1.example.url
PING subdomain1.example.url (192.168.2.111) 56(84) bytes of data.
64 bytes from ubology.dav3 (192.168.2.111): icmp_seq=1 ttl=64 time=0.874 ms
64 bytes from ubology.dav3 (192.168.2.111): icmp_seq=2 ttl=64 time=0.295 ms
64 bytes from ubology.dav3 (192.168.2.111): icmp_seq=3 ttl=64 time=0.268 ms
^C
--- subdomain1.example.url ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2006ms
rtt min/avg/max/mdev = 0.268/0.479/0.874/0.279 ms
dave@redbox1804:~$ dig subdomain1.example.url

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> subdomain1.example.url
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21571
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;subdomain1.example.url.		IN	A

;; ANSWER SECTION:
subdomain1.example.url.	3589	IN	A	192.168.2.111

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Mon Aug 27 00:09:44 AEST 2018
;; MSG SIZE  rcvd: 61

dave@redbox1804:~$ 

Some other untested suggestions in the Stack post revolve around restarting the network manager, killing the process, flushing the cache using a command-line switch (I probably should have tried that one, just out of curiosity), restarting the service via init.d, and others. Posted for posterity and “next time”:

sudo service dns-clean restart
sudo service network-manager restart
sudo /etc/init.d/nscd restart
sudo kill -HUP $(pgrep dnsmasq)
sudo pkill -HUP dnsmasq
sudo systemd-resolve --flush-caches
sudo systemctl restart systemd-resolved.service
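
As an aside (I didn’t try this at the time, so treat it as a pointer rather than part of the fix): systemd-resolved can report its cache statistics, which is a quick way to confirm it really is caching and that a flush emptied it:

sudo systemd-resolve --statistics

The “Current Cache Size” figure in the output should drop to zero immediately after a flush.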

Problem solved!

Now, for the next one:

I need to get a reverse proxy up and running. I did give it a go with squid, but I was obviously not getting something right. Further reading suggests HAProxy might be the way to go. That will probably be my next pfsense adventure.

tar tips

You’d think I’d just learn how to use this, but I suppose I use it so infrequently that I just can’t remember it. So here are my quick “go-to” tar references.

uncompressing

tar -xzvf filename.tar.gz

x:- eXtract
z:- pass through gzip
v:- verbose (show files)
f:- use the named archive file

archiving/compressing

tar -cf archive.tar file1 file2 dir1
tar -czf archive.tar.gz file1 file2 dir1

notes to above:

  • you don’t HAVE to run the archive through gzip, although there’s no real reason not to. If you choose just to archive without compression, it merely means your file will be larger. This may or may not be a big deal
  • you MUST specify an archive file (the -f option); not specifying a file is an unrecoverable error and tar will exit
  • recursion is on by default, so if a directory name is specified, recursion will occur. To override that, specify --no-recursion. Alternatively, if recursion has been turned off within the environment, it can be reinstated with --recursion (see the sketch below)
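
A quick sketch of that last point (the archive name here is made up): with --no-recursion, tar stores only the directory entries themselves, not their contents:

tar -cf dirs-only.tar --no-recursion dir1 dir2
tar -tvf dirs-only.tar   # lists just the two directory entries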

show me the contents

What if you just want to look at what’s in the archive?

Do a test (-t / --list) run.

tar -tvf archive.tar

This will output (list) the files to stdout without extracting the contents. Useful to see what’s in the archive.

Other useful options:

Here are a few other options and command-line usages of tar that I find useful.

-C :- change to the given directory and extract at that location
--strip-components=1 :- use this if you need to remove the top-level directory from the archive (example below)
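
For example (the archive name and target directory are made up), extracting into /tmp/app while dropping the archive’s top-level directory:

mkdir -p /tmp/app
tar -xzf archive.tar.gz -C /tmp/app --strip-components=1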

using variables on the command line

Following on from my last example of copying an SSH public key to a remote computer, this is something I need to do when setting up a new computer. Setting up private/public keys for SSH just makes logging in that little bit smoother.

When you need to rerun the command, you need to load it up, edit it and resubmit it. Unfortunately (although it’s probably possible) I don’t know an easy way to bring up a previous command and edit it in-line so that I can send it again without actually running it first.

Instead, load a variable into the command line and change it next time.

-- 11:03:01 -- MBP:~ madivad$ ssh minixbmc
Password:
Last login: Mon Apr 25 18:23:18 2016
minixbmc:~ madivad$ exit
logout
Connection to minixbmc closed.
-- 11:03:17 -- MBP:~ madivad$ remote=minixbmc
-- 11:03:26 -- MBP:~ madivad$  history | grep remote
  439  remote=he1000
  440  cat ~/.ssh/id_rsa.pub | ssh madivad@$remote "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
  502  remote=minixbmc
-- 11:03:34 -- MBP:~ madivad$ !440
cat ~/.ssh/id_rsa.pub | ssh madivad@$remote "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys"
Password:
-- 11:03:40 -- MBP:~ madivad$ ssh minixbmc
Last login: Tue Apr 26 11:03:12 2016 from mbp.fritz.box
minixbmc:~ madivad$

For example, in the above session, for simple commands I would normally bring the history up, reissue line 440, edit it, then issue it again. In this situation, though, that would have the effect of loading the key again before I could edit it, and that’s not what I want to do.

  • Breaking it down, I logged into the remote machine and realised a password was needed,
  • I logged out,
  • I set the “remote” variable,
  • looked for the relevant history command (I knew it had the word “remote” on it),
  • I re-issued that line, and
  • then tested the login.
  • No password was needed, the command was a success.

This could be done with other things as well, where you’re always changing one element on the line (or multiple elements, using multiple variables).

For a simpler (and sillier) example, let’s create a quick update-and-install one-liner for Ubuntu:

upstall='htop multiwatch'
sudo apt update && sudo apt install $upstall

Instead of typing the whole line next time, I can just put the new apps to install in the “upstall” variable and reissue the command (in this case, using arrow-up a couple of times, or grabbing the index from the history file).

$ sudo apt update && sudo apt install $upstall
[sudo] password for madivad:
Hit:1 http://au.archive.ubuntu.com/ubuntu xenial InRelease
Get:2 http://au.archive.ubuntu.com/ubuntu xenial-updates InRelease [92.2 kB]
Hit:3 http://au.archive.ubuntu.com/ubuntu xenial-backports InRelease
Get:4 http://security.ubuntu.com/ubuntu xenial-security InRelease [92.2 kB]
Fetched 184 kB in 1s (101 kB/s)
Reading package lists... Done
Building dependency tree
Reading state information... Done
All packages are up to date.
Reading package lists... Done
Building dependency tree
Reading state information... Done
byobu is already the newest version (5.106-0ubuntu1).
htop is already the newest version (2.0.1-1).
multiwatch is already the newest version (1.0.0-rc1+really1.0.0-1).
0 to upgrade, 0 to newly install, 0 to remove and 0 not to upgrade.

If I then later want to do another update and install something else, I can re-set the “upstall” variable and arrow up or grab it out of history.

11:53:44 madivad@he1000:~$ upstall=jq
12:03:44 madivad@he1000:~$ sudo apt update && sudo apt install $upstall
Hit:1 http://au.archive.ubuntu.com/ubuntu xenial InRelease
Get:2 http://au.archive.ubuntu.com/ubuntu xenial-updates InRelease [92.2 kB]
Hit:3 http://au.archive.ubuntu.com/ubuntu xenial-backports InRelease
Get:4 http://security.ubuntu.com/ubuntu xenial-security InRelease [92.2 kB]
Fetched 184 kB in 2s (91.0 kB/s)
Reading package lists... Done
Building dependency tree
Reading state information... Done
All packages are up to date.
Reading package lists... Done
Building dependency tree
Reading state information... Done
jq is already the newest version (1.5+dfsg-1).
0 to upgrade, 0 to newly install, 0 to remove and 0 not to upgrade.

I'm a simple man, I like simplicity. And although there are probably better ways to do this, for the time being, this is how I'm getting the job done. It works well for me, but I'm open to any suggestions and/or improvements.

As I said, not the best example, but hopefully you get the idea.

What files is my program trying to access?

In this example I’m using hashdeep. I’m redirecting the output of two hash sets to two different files. I am doing that with the following command:

hashdeep -rj0 /path-to-drive-1 > hashes.drive1

and

hashdeep -rj0 /path-to-drive-2 > hashes.drive2

I have those running in their own terminal windows. I then optionally have another two windows open running a tail on them so I can monitor the files:

tail -f hashes.drive1

The hard drives are located in an external multi-bay enclosure and all hard drive LEDs are flashing away like mad. A good sign. But every now and then I’ll run an ‘ls’ to see where the files are at (checking file size) or alternatively (and usually better but more resource intensive) a line count of the hash files. Given I know how many files there should be, the line count gives a fair indication of the progress of the whole process.

wc -l hashes.drive*

In today’s example I was simply doing a file size comparison of the two hashes vs a known hashset of one of the drives that was a month old. The sizes should be relatively similar. I was getting results similar to:

madivad@server:~$ ls -al hash*
-rw-rw-r-- 1 madivad madivad 330483319 Feb 11 09:26 hash.drive1.1602
-rw-rw-r-- 1 madivad madivad 341570757 Mar 23 12:09 hash.drive1.1603
-rw-rw-r-- 1 madivad madivad 243344728 Mar 23 11:18 hash.drive2.1603

The fact that drive1.1603 is larger is of no consequence; there are just more files to consider.

After running the above check for some time, I realised that one of the files (in this case drive1.1603) had stalled for several hours. I’m not exactly sure when it stopped growing, but doing a tail of the file confirmed it had. The last output was an inconsequential .DS_Store file roughly 6K in size. After physically monitoring it for some time I began to get concerned. I could see all 4 RAID drives getting activity, but nothing was being recorded. The 5th drive, the backup, was hashing away without a problem and its log file was growing as expected.

After some quick research I came across this Stack Exchange Q&A: How do I know which file a program is trying to access?

The first answer provided a solution that worked best with my scenario:

lsof -c hashdeep

I’d never seen this output before but very quickly I could see the important pieces of information it had dumped out. Namely:

madivad@server:~$ lsof -c hashdeep
COMMAND  PID  USER    FD TYPE DEVICE     SIZE/OFF  NODE       NAME
hashdeep 2539 madivad 1w REG  252,0     243344728  5535319    /home/madivad/hash.drive1.1603
hashdeep 2539 madivad 3r REG  259,0  499418030080  113639426  /path1/largeFiles/a-very-big-image-of-500GB.img
hashdeep 2552 madivad 1w REG  252,0     341611062  5535320    /home/madivad/hash.drive2.1603
hashdeep 2552 madivad 3r REG  8,33     3152347139  126025746  /path2/misc/random.file

The ‘w’ in the FD column (‘1w’) signifies the file is being written, and the file being written was hash.drive1.1603.

The ‘r’ in the FD column (‘3r’) signifies the file is being read for hashing purposes, and that file is a very large one that I know is around 500GB. Running the command again showed that the file being read by the second process had changed, yet the first had stayed the same.
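
As an aside (not something I ran at the time, just standard lsof usage): the -a switch ANDs filters together, so you can narrow the listing to only the files being hashed, i.e. file descriptor 3 of the hashdeep processes:

lsof -a -c hashdeep -d 3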

Given the file is very large and will take considerable time to hash, and that the hard drive LEDs were flashing, I realised all was good in the world and I could move on with the day’s activities.

UPDATE: after reading the man page on lsof I found a better way to monitor its continual progress: run it with the -r “repeat” switch, which defaults to 15 seconds and can be made more or less frequent by adding a numerical argument:

lsof -r 5 -c hashdeep

How to set up a custom BASH prompt in Ubuntu

I wanted two things:

  • the time in my prompt
  • a colour prompt

Basics:

  • it’s configured in: ~/.bashrc
  • uses the environment variable: PS1
  • time is inserted using: \t (see the sketch below)
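
A minimal sketch of the sort of thing this gives you (the colours here are illustrative, not my exact prompt): \t inserts the 24-hour time, \u, \h and \w are user, host and working directory, and the \[ \] pairs tell bash the escape codes take up no width:

PS1='\[\e[0;32m\]\t\[\e[0m\] \u@\h:\[\e[0;34m\]\w\[\e[0m\]\$ '

Add that to the end of ~/.bashrc and open a new terminal (or source ~/.bashrc) to see it take effect.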

time

I started here: http://www.cyberciti.biz/tips/howto-linux-unix-bash-shell-setup-prompt.html

colour

http://www.cyberciti.biz/faq/bash-shell-change-the-color-of-my-shell-prompt-under-linux-or-unix/

PixelBeat discussion on coloured command prompts

http://www.pixelbeat.org/docs/terminal_colours/

how-to guide for customising the command prompt

http://tldp.org/HOWTO/Bash-Prompt-HOWTO/

ZFS on Ubuntu 14.04

I’ve completed a fresh install of Ubuntu on an old box that used to run Solaris. The system disk died, and after installing a new (old) drive and then installing Ubuntu, I found that two of the disks still in the computer were part of a ZFS raid. I’m not sure how many disks were in the raid, but I’m curious as to what was on this filesystem, which has been shut down and inaccessible for more than 7-8 years. (I later confirmed that most files on the system are from 2008 and earlier.)

There were only 4 available sata connectors to the board. Two were used for the drives in there, it didn’t take long to find the matching drives to plug in.

Installing ZFS

I began with installing zfs by following the instructions here:
Install ZFS on Ubuntu—Server as Code

Installing SSH Server

I also had to install an SSH server because this box is located remotely. Follow any generic install guide for the OpenSSH server. The one I used was SSH/OpenSSH/Configuring.

Because my server is behind a firewall and not publicly accessible, I haven’t worried too much about login via SSH keys. I have done it before, and this box is only going to be a temporary install, but I do recommend you set them up. A couple of other related tutorials on passwordless SSH key access to servers:
https://help.ubuntu.com/community/SSH/OpenSSH/Keys << a good resource, includes troubleshooting
http://www.mccarroll.net/blog/rpi_cluster2/index.html
https://www.howtoforge.com/tutorial/ssh-and-scp-with-public-key-authentication/
https://www.raspberrypi.org/documentation/remote-access/ssh/passwordless.md

Installing Samba

I also installed samba, following these instructions: How to Create a Network Share Via Samba Via CLI (Command-line interface/Linux Terminal) – Uncomplicated, Simple and Brief Way!

With SSH, Samba and the ZFS modules installed, configured and running… let’s try and rebuild this raid :)

Let’s have a look!

disks:

lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0 111.8G  0 disk 
├─sda1                         8:1    0   243M  0 part /boot
├─sda2                         8:2    0     1K  0 part 
└─sda5                         8:5    0 111.6G  0 part 
  ├─ubuntu--vg-root (dm-0)   252:0    0 110.6G  0 lvm  /
  └─ubuntu--vg-swap_1 (dm-1) 252:1    0  1016M  0 lvm  [SWAP]
sdb                            8:16   0 698.7G  0 disk 
├─sdb1                         8:17   0 698.6G  0 part 
└─sdb9                         8:25   0     8M  0 part 
sdc                            8:32   0 698.7G  0 disk 
├─sdc1                         8:33   0 698.6G  0 part 
└─sdc9                         8:41   0     8M  0 part 
sdd                            8:48   0 698.7G  0 disk 
├─sdd1                         8:49   0 698.6G  0 part 
└─sdd9                         8:57   0     8M  0 part 
sde                            8:64   0 698.7G  0 disk 
├─sde1                         8:65   0 698.6G  0 part 
└─sde9                         8:73   0     8M  0 part 
sr0                           11:0    1  1024M  0 rom

and for pools specifically:

$ sudo zpool import
   pool: solaraid
     id: 10786192747791980338
  state: ONLINE
 status: The pool is formatted using a legacy on-disk version.
 action: The pool can be imported using its name or numeric identifier, though
	some features will not be available without an explicit 'zpool upgrade'.
 config:

	solaraid                                       ONLINE
	  raidz1-0                                     ONLINE
	    ata-WDC_WD7500AACS-00C7B0_WD-WCASN         ONLINE
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT         ONLINE
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT         ONLINE
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT         ONLINE

Ok, this PC only had 4x SATA drives and it appears I’ve found the correct drives. Things are looking good from the start.

Let’s do it!

:~$ sudo zpool import solaraid
:~$ sudo zpool status
  pool: solaraid
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Nov 11 21:37:59 2015
    11.5M scanned out of 1.63T at 1.15M/s, 412h58m to go
    2.60M resilvered, 0.00% done
config:

	NAME                                    STATE     READ WRITE CKSUM
	solaraid                                ONLINE       0     0     0
	  raidz1-0                              ONLINE       0     0     0
	    ata-WDC_WD7500AACS-00C7B0_WD-WCASN  ONLINE       0     0     0
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     2  (resilvering)

errors: No known data errors

YOUCH! 412 HOURS… that’s 17 days! I gave it a couple of seconds to stabilise and ran it again; this time it came up with an error:

:~$ sudo zpool status
  pool: solaraid
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: resilvered 2.60M in 0h0m with 0 errors on Wed Nov 11 21:38:24 2015
config:

	NAME                                    STATE     READ WRITE CKSUM
	solaraid                                ONLINE       0     0     0
	  raidz1-0                              ONLINE       0     0     0
	    ata-WDC_WD7500AACS-00C7B0_WD-WCASN  ONLINE       0     0     0
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     2

errors: No known data errors

Reboot and Ubuntu bootup fail

(long winded fluff about nothing, you can scroll down to “Resilvering” if you’d like to skip this)

I do recall I had issues with this raid. I could certainly go the route of upgrading it first and trying again, but once I did a

tree -L 2 /solaraid/

and saw things I had long thought were gone, I decided I’m going to back this up first :)

The only problem is, the raid is installed in a box with only a 10/100 network card :(

I’ll let it run overnight, taking off only what I need, and see how we go. This has been a good find.
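
(For the record, a hypothetical sketch of pulling directories off over SSH; the hostname and path are made up, and rsync’s -avP flags give resumable copies with progress, which matters on a 10/100 link:)

rsync -avP madivad@server:/solaraid/photos/ ./solaraid-backup/photos/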

At this point I was working from the house and the server was located off-site. I had several ssh/terminal windows open to the box, and as I was working away I kept getting the message to reboot the system. I issued the relevant reboot command and set off a ping to tell me when it came back online… It didn’t come back online.

I went to the server and found it was still booting. This was after more than half an hour; eventually it gave up, crashed, and restarted again.

For the next couple of hours I could not get the box to boot, and I was blaming the old boot drive. But after eventually getting into the “Try Ubuntu” mode from the DVD, I found that one of the drives was not being reported in the system. Another was coming up as totally unknown, and two were seen as part of a set. It took several hours to get to the bottom of it; eventually, through the BIOS, I could see one of the drives wasn’t being detected.

A couple of SATA cable changes and swapping power cables around, and I was back in business.

Resilvering

:~$ sudo zpool status
  pool: solaraid
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    1.73G scanned out of 1.63T at 32.3M/s, 14h40m to go
    12.5K repaired, 0.10% done
config:

	NAME                                    STATE     READ WRITE CKSUM
	solaraid                                ONLINE       0     0     0
	  raidz1-0                              ONLINE       0     0     0
	    ata-WDC_WD7500AACS-00C7B0_WD-WCASN  ONLINE       0     0     0
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0    13  (repairing)
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0

errors: No known data errors

Over the next few minutes I kept polling the status and it was picking up speed.

  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    24.3G scanned out of 1.63T at 84.5M/s, 5h31m to go
    12.5K repaired, 1.46% done
  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    32.1G scanned out of 1.63T at 97.0M/s, 4h47m to go
    12.5K repaired, 1.93% done
  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    56.9G scanned out of 1.63T at 102M/s, 4h29m to go
    12.5K repaired, 3.42% done
  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    184G scanned out of 1.63T at 137M/s, 3h4m to go
    184K repaired, 11.01% done

It’s starting to slow down again, and we’re seeing more errors!

  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    195G scanned out of 1.63T at 93.6M/s, 4h28m to go
    354K repaired, 11.68% done

The next day, something had gone wrong. I’m still unsure what happened, but the whole `solaraid` pool became unresponsive. When I checked later in the day, the scrub was still at the 195G it had reached the evening (/morning) before, and the pool was otherwise not responding. I remotely tried to reboot, and again it hung.

At this present time, I’m still putting it down to hardware, but it’s really unknown what’s at the root of the problem.

It’s been back up and running for a while now, and its present status is:

  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    1.41T scanned out of 1.63T at 160M/s, 0h23m to go
    1.23M repaired, 86.89% done
config:

	NAME                                    STATE     READ WRITE CKSUM
	solaraid                                ONLINE       0     0     0
	  raidz1-0                              ONLINE       0     0     0
	    ata-WDC_WD7500AACS-00C7B0_WD-WCASN  ONLINE       0     0     0
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0   251  (repairing)
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     6  (repairing)

errors: No known data errors

When I captured the above, I hadn’t realised the process was almost finished until I pasted and read over it.

  scan: scrub in progress since Thu Nov 12 03:12:54 2015
    1.62T scanned out of 1.63T at 158M/s, 0h0m to go
    1.29M repaired, 99.46% done

As I write this the process has been running for exactly 24 hours. We have 0.5% left.

The final capture:

:/solaraid$ sudo zpool status 
  pool: solaraid
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 1.29M in 12h0m with 0 errors on Thu Nov 12 15:13:05 2015
config:

	NAME                                    STATE     READ WRITE CKSUM
	solaraid                                ONLINE       0     0     0
	  raidz1-0                              ONLINE       0     0     0
	    ata-WDC_WD7500AACS-00C7B0_WD-WCASN  ONLINE       0     0     0
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0     0
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0   291
	    ata-WDC_WD7500AAKS-00RBA0_WD-WCAPT  ONLINE       0     0    12

errors: No known data errors

The scrub has checked every checksum across 1.6TB of data and repaired 1.29MB. The process took exactly 12 hours (with a reboot thrown in there JUST to push the boundaries a little bit).

Next we’re going to get any data off that we want… Let’s grab some directory information.
To be continued…