What a Resilver Looks Like in ZFS (and a Bug and/or Feature)

At home I have an (admittedly small) ZFS array set up to experiment with this awesome newish RAID technology. I think it has been around long enough that it can now be used in production, but I’m still getting used to the little bugs/features, and here is one that I just found.

After 2 of the 3 1TB Seagate Barracuda hard drives in the array failed, I had to write the entire array off as a loss and test out my backup strategy. Fortunately it worked and there was no data loss. After receiving the replacement drives from Seagate, I rebuilt the ZFS array (using raidz again) and went along my merry way. After another 6 months or so, I started getting some funky results from the remaining original drive. Thinking it might have the same issue as the others, I removed the drive and ran SeaTools on it (by the way, SeaTools doesn’t offer a 64-bit Windows version; what year is this?).

The drive didn’t show any signs of failure, so I decided to wipe it and add it back into the array to see what happens. That, of course, is easier said than done.

One of the problems I ran into is that I am using Ubuntu and FUSE to run ZFS. Ubuntu has this nasty habit of changing around drive identifiers when USB devices are plugged in. So when I plugged this drive back in, it showed up as /dev/sde instead of /dev/sdd, which was now taken by a USB-attached drive.
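A quick way to see which names stay put across such a reshuffle is to look at the symlinks that udev maintains, which is also what one of the commenters below recommends. The following is only a sketch: it assumes a udev-based Linux system that populates /dev/disk/by-id, and the function name list_persistent_names is made up for illustration.

```shell
#!/bin/sh
# Print each persistent name next to the kernel device node it currently
# resolves to. The directory defaults to /dev/disk/by-id (assumed to exist
# on a udev-based system); pass another directory to inspect it instead.
list_persistent_names() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        # Skip anything that is not a symlink (including an unmatched glob).
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "$(basename "$link")" "$(readlink -f "$link")"
    done
}

list_persistent_names
```

The names in the left column do not depend on the order in which drives are detected, so they should be a safer thing to hand to zpool than /dev/sdX.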

No problem, I figure, I’ll offline the bad drive in the zpool and replace it with the new drive location. No such luck.

First I offlined the drive using zpool offline media /dev/sdd:

dave@cerberus:~$ sudo zpool status
  pool: media
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        media       DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sdd     OFFLINE      0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

Now that it’s offline, I thought I should be able to detach it. No such luck: detach only applies to mirror and replacing vdevs, and this drive is a member of a raidz vdev, so ZFS does not allow you to remove it.

dave@cerberus:~$ sudo zpool detach media /dev/sdd
cannot detach /dev/sdd: only applicable to mirror and replacing vdevs

What they want you to do is replace the drive with another drive. This drive (the same drive, with all info wiped from it) is now on /dev/sde. I try to replace it:

dave@cerberus:~$ sudo zpool replace media /dev/sdd /dev/sde
invalid vdev specification
use '-f' to override the following errors:
/dev/sde is part of active pool 'media'
dave@cerberus:~$ sudo zpool replace -f media /dev/sdd /dev/sde
invalid vdev specification
the following errors must be manually repaired:
/dev/sde is part of active pool 'media'

Even with -f it doesn’t allow the replacement, because the system thinks that the drive is still part of the active pool ‘media’.

So basically you are stuck if you try to test a replacement with a drive that has already been used in the pool. I’m sure I could replace it with another 1TB disk, but what is the point of that?

I ended up resolving the problem by removing the external USB drive, therefore putting the drive back into the original /dev/sdd slot. Without issuing any commands, the system now sees the drive as the old one, and starts resilvering the drive.

root@cerberus:/home/dave# zpool status
  pool: media
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver in progress for 0h9m, 4.62% done, 3h18m to go
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdd     ONLINE       0     0    13  30.2G resilvered
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

It is interesting to see what the resilver looks like from an I/O perspective. The system reads from the two good drives and writes to the new (bad) one. Using iostat -x:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          29.77    0.00   13.81   32.81    0.00   23.60

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.80    0.00    33.60     0.00    42.00     0.01   15.00  15.00   1.20
sdb               0.00     0.00  625.00    0.00 108033.20     0.00   172.85     0.56    0.90   0.49  30.80
sdc               0.00     0.00  624.20    0.00 107828.40     0.00   172.75     0.50    0.81   0.47  29.60
sdd               0.00     1.20    0.00  504.40     0.00 107729.60   213.58     9.52   18.85   1.98 100.00
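To put the rsec/s and wsec/s columns into more familiar units, here is a rough conversion. It assumes the sectors iostat reports are 512 bytes each, which is how the sysstat documentation describes them; the helper name sectors_to_mib is mine and not part of any tool.

```shell
#!/bin/sh
# Convert a sectors-per-second figure from iostat into MiB/s.
# Assumption: one sector is 512 bytes.
sectors_to_mib() {
    awk -v s="$1" 'BEGIN { printf "%.1f\n", s * 512 / 1048576 }'
}

sectors_to_mib 108033.20   # sdb reads
sectors_to_mib 107828.40   # sdc reads
sectors_to_mib 107729.60   # sdd writes
```

All three come out at roughly 53 MiB/s, so each good drive is being read about as fast as the drive being rebuilt can be written, and sdd at 100% utilization looks like the bottleneck.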

It seems that ZFS is able to identify a hard drive by GUID somehow but doesn’t automatically use it in the pool. This makes it so that you can’t test a drive by removing it, formatting it, and putting it into a new location. Basically, ZFS assumes that your drives are always going to be in the same /dev location, which isn’t always true. As soon as you attach a USB drive in Ubuntu, things are going to shift around.

After the resilver is complete, the zpool status is:

root@cerberus:/home/dave# zpool status
  pool: media
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed after 0h16m with 0 errors on Sun May 15 07:35:46 2011
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdd     ONLINE       0     0    13  50.0G resilvered
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

errors: No known data errors
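As a rough sanity check on those numbers, the write rate from the iostat sample can be multiplied by the 16-minute resilver time. This is a sketch only: it assumes 512-byte sectors and that the single sample of 107729.60 wsec/s held for the whole run, which it almost certainly did not exactly. The helper name estimate_resilvered_gib is hypothetical.

```shell
#!/bin/sh
# Estimate how much data was written during a resilver.
#   $1 = write rate in sectors per second (assumed 512-byte sectors)
#   $2 = duration in minutes
# Prints the estimate in GiB.
estimate_resilvered_gib() {
    awk -v s="$1" -v m="$2" \
        'BEGIN { printf "%.1f\n", s * 512 * m * 60 / (1024 * 1024 * 1024) }'
}

estimate_resilvered_gib 107729.60 16   # prints 49.3
```

That lands close to the 50.0G that zpool status reports as resilvered, so the iostat snapshot seems to have been fairly representative of the whole run.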

You can now clear the error with:

root@cerberus:/home/dave# zpool clear media
root@cerberus:/home/dave#

Zpool status now shows no errors:

root@cerberus:/home/dave# zpool status
  pool: media
 state: ONLINE
 scrub: resilver completed after 0h16m with 0 errors on Sun May 15 07:35:46 2011
config:

        NAME        STATE     READ WRITE CKSUM
        media       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            sdd     ONLINE       0     0     0  50.0G resilvered
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

errors: No known data errors

So now the question I have is this: Are you able to manually update or remove the drive status somewhere in your system? How did ZFS know that this drive already had a pool installed on it? I zeroed the drive and verified with fdisk that there were no partitions on it. Is there a file somewhere on the system that stores this information, or is it written somewhere on the drive?

ZFS is great, but it still has some little issues like this that give me pause before using it in a production system. Then again, I suppose all massive disk array systems have their little quirks!

8 comments

Robbyt says:

May 15, 2011 at 1:19 pm

It sounds like your zfs problems are all due to using the beta implementation on Linux. zfs itself is really solid when running on Solaris. You might want to give OpenIndiana a shot.

Dave Drager says:

May 15, 2011 at 1:27 pm

Thanks, will take a look at it. I am probably going to format the system and use FreeBSD, as I hear their implementation of ZFS is pretty solid. ZFS on Linux has actually come a long way and has a POSIX-enabled implementation available as of May 5th!

Giovanni Tirloni says:

May 16, 2011 at 9:42 am

Wrong. ZFS doesn’t rely on devices being at the same location all the time. In fact, I’ve had to swap chassis and removed all disks, added them in random fashion and ZFS detected everything just fine. With the zdb -l command you can see how it marks each disk.

You should change the title to “What a Resilver Looks like in FUSE-ZFS” because it seems you’re hitting bugs in Fuse’s implementation.

Dave Drager says:

May 16, 2011 at 8:40 pm

Thanks for your feedback Giovanni!

Anonymous says:

August 10, 2011 at 7:34 pm

I think you can work around this issue for NEW raidz pools. It won’t help with your test pool, but “zpool create” is perfectly happy working with symlinks. Just create something like:
ln -s /dev/sda2 /my-drive2
ln -s /dev/sda3 /my-drive3
ln -s /dev/sda4 /my-drive4

Then create your pool with my-drive2, my-drive3, and my-drive4.

Rudd-O says:

October 20, 2011 at 7:58 pm

I’m a zfs-fuse and zfs in kernel developer.

You should add your drives using the /dev/disk/by-id directory. That way the drive will be picked up consistently regardless of the device node in /dev/.

Benjamin Close says:

December 21, 2011 at 8:04 pm

If you want to make it so zfs doesn’t recognise the drive, you have to zero out the area zfs uses to store the drive’s GUID. Once that’s done you can reuse it in the pool: dd if=/dev/zero of=/dev/sde in this case. It even tells you this must be done manually:

the following errors must be manually repaired:

/dev/sde is part of active pool ‘media’

Though it doesn’t tell you how to fix it. ZFS pretty much is preventing you shooting yourself in the foot by nuking what could be an active drive.

ZFS also doesn’t care where the disks are; it relies on GUIDs in the metadata it places on the disk. Shift your disks around all you like and it’ll find them wherever they are and reconstruct the pool accordingly. Try running zdb poolname and you’ll see what I mean.

Egurzi says:

March 2, 2012 at 1:23 pm

how many compartments and what kind of wood backing and what does it have to do with those orange traffic cones?