Monday, 26 August 2013

VMFS-formatted USB sticks in VMware vSphere 5.1

UPDATE: vSphere 5.5 behaves in the same way - still no solution other than trying to find older USB keys on eBay and hoping for the best! :(

Over the last couple of weeks I've been building a VMware home lab (it'll probably get a post of its own in the near future) to replace 'eddie', my ageing Core 2 Duo server. One of the constraints I set myself for the build was that the lab should be 'self-consuming'.

By that I mean it mustn't rely on external devices for DHCP/DNS or storage - the lab will be using power all the time so I wanted to minimise the amount of running equipment.

For now I want to focus on using VMFS-formatted partitions on USB keys / sticks / thumb drives with ESXi 5.1, because I assumed it would be straightforward. The summary is simple: ESXi is just as fussy with USB devices as it is with everything else.


Things started off well with a 4GB Kingston DataTraveler G3 (Vendor 0951, Product 1643):

# cd /dev/disks
# /etc/init.d/usbarbitrator stop

The usbarbitrator service needs to be stopped, otherwise ESXi will make the USB stick available as a passthrough device for use by VMs.

Now insert the 4GB USB stick.
# esxcfg-scsidevs -c | grep USB
mpx.vmhba33:C0:T0:L0                                                      Direct-Access    /vmfs/devices/disks/mpx.vmhba33:C0:T0:L0                                                      988MB     NMP     Local USB Direct-Access (mpx.vmhba33:C0:T0:L0)
mpx.vmhba39:C0:T0:L0                                                      Direct-Access    /vmfs/devices/disks/mpx.vmhba39:C0:T0:L0                                                      3690MB    NMP     Local USB Direct-Access (mpx.vmhba39:C0:T0:L0)

Now I can see two USB devices - a 1GB and a 4GB. I am using the 1GB stick to boot the ESXi host, which is no problem since the boot device has no VMFS partitions - just the small boot partition, a few FAT partitions and a VMkernel diagnostic partition. The interesting stuff is on the 4GB device. Here are the commands I used to create a new GPT disk label and a VMFS5 partition on it:


# partedUtil mklabel mpx.vmhba39\:C0\:T0\:L0 gpt
# partedUtil setptbl mpx.vmhba39\:C0\:T0\:L0 gpt "1 2048 7550550 AA31E02A400F11DB9590000C2911D1B8 0"
gpt  
0 0 0 0 
1 2048 7550550 AA31E02A400F11DB9590000C2911D1B8 0
# vmkfstools -C vmfs5 mpx.vmhba39\:C0\:T0\:L0:1  
create fs deviceName:'mpx.vmhba39:C0:T0:L0:1', fsShortName:'vmfs5', fsName:'(null)' 
deviceFullPath:/dev/disks/mpx.vmhba39:C0:T0:L0:1 deviceFile:mpx.vmhba39:C0:T0:L0:1 
Checking if remote hosts are using this device as a valid file system. This may take a few seconds... 
Creating vmfs5 file system on "mpx.vmhba39:C0:T0:L0:1" with blockSize 1048576 and volume label "none". 
Successfully created new volume: 521b9571-a9085b97-0a71-0015176fe0a0
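
As a sanity check on the numbers in that setptbl call, here is a minimal Python sketch (my own, not something ESXi ships) of the GPT arithmetic. It assumes 512-byte sectors, the standard 128-entry GPT partition array, and that the 3690MB and 3748MB figures from esxcfg-scsidevs are exact MiB values, which they probably aren't quite. On the host itself, 'partedUtil getUsableSectors' should give the authoritative range.

```python
# Sanity check for the partedUtil setptbl numbers used in this post.
# Assumptions: 512-byte sectors, a standard 128-entry GPT partition array,
# and device sizes in MiB (esxcfg-scsidevs rounds these, so treat as approximate).
SECTOR_BYTES = 512
ENTRY_ARRAY_SECTORS = 128 * 128 // SECTOR_BYTES  # 128 entries x 128 bytes = 32 sectors
VMFS_GUID = "AA31E02A400F11DB9590000C2911D1B8"    # VMFS partition type GUID from the post


def gpt_usable_range(size_mib):
    """Return (first, last) usable LBA on a GPT disk of size_mib MiB."""
    total = size_mib * 1024 * 1024 // SECTOR_BYTES
    first = 2 + ENTRY_ARRAY_SECTORS         # LBA 0 protective MBR, LBA 1 header, then entries
    last = total - 2 - ENTRY_ARRAY_SECTORS  # last LBA holds backup header, backup entries before it
    return first, last


def setptbl_spec(start, end, number=1, attrs=0):
    """Build the quoted partition argument passed to 'partedUtil setptbl <dev> gpt'."""
    return f"{number} {start} {end} {VMFS_GUID} {attrs}"


if __name__ == "__main__":
    start, end = 2048, 7550550  # 1 MiB-aligned start and the end sector from the post
    for name, size_mib in [("Kingston DataTraveler G3", 3690), ("Lexar JumpDrive", 3748)]:
        first, last = gpt_usable_range(size_mib)
        fits = first <= start <= end <= last
        print(f"{name}: usable LBAs {first}-{last}, partition fits: {fits}")
    print(setptbl_spec(start, end))
```

Both sticks have room for the 2048-7550550 partition, which fits with the Lexar trouble below showing up at the vmkfstools step rather than when the partition table is written.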

At this point, you can simply click 'Refresh' in the VMware vSphere Client and the new volume will appear. Rename it to something sensible (or use the -S option in vmkfstools) and we're done.

Great. I got on with my life for a week and then came back to this work with the next USB stick I had lying around: a 4GB Lexar JumpDrive TwistTurn (Vendor 05dc, Product a815).

http://www.lexar.com/products/lexar-jumpdrive-twistturn-usb-flash-drive

So, same again, right?

# esxcfg-scsidevs -c | grep USB 
mpx.vmhba33:C0:T0:L0                                                      Direct-Access    /vmfs/devices/disks/mpx.vmhba33:C0:T0:L0                                                      988MB     NMP     Local USB Direct-Access (mpx.vmhba33:C0:T0:L0) 
mpx.vmhba39:C0:T0:L0                                                      Direct-Access    /vmfs/devices/disks/mpx.vmhba39:C0:T0:L0                                                      3690MB    NMP     Local USB Direct-Access (mpx.vmhba39:C0:T0:L0) 
mpx.vmhba40:C0:T0:L0                                                      Direct-Access    /vmfs/devices/disks/mpx.vmhba40:C0:T0:L0                                                      3748MB    NMP     Local USB Direct-Access (mpx.vmhba40:C0:T0:L0)

Ah OK, the Lexar is a bit bigger: not all 4GB sticks are the same. Anyway, let's continue as before:

# partedUtil mklabel mpx.vmhba40:C0:T0:L0 gpt 
# partedUtil setptbl mpx.vmhba40:C0:T0:L0 gpt "1 2048 7550550 AA31E02A400F11DB9590000C2911D1B8 0" 
gpt 
0 0 0 0 
1 2048 7550550 AA31E02A400F11DB9590000C2911D1B8 0   
# vmkfstools -C vmfs5 mpx.vmhba40:C0:T0:L0:1  
create fs deviceName:'mpx.vmhba40:C0:T0:L0:1', fsShortName:'vmfs5', fsName:'(null)' 
deviceFullPath:/dev/disks/mpx.vmhba40:C0:T0:L0:1 deviceFile:mpx.vmhba40:C0:T0:L0:1 
Checking if remote hosts are using this device as a valid file system. This may take a few seconds... 
Creating vmfs5 file system on "mpx.vmhba40:C0:T0:L0:1" with blockSize 1048576 and volume label "none". 
Usage: vmkfstools -C [vmfs3|vmfs5] /vmfs/devices/disks/vml... or,
       vmkfstools -C [vmfs3|vmfs5] /vmfs/devices/disks/naa... or,
       vmkfstools -C [vmfs3|vmfs5] /vmfs/devices/disks/mpx.vmhbaA:T:L:P
Error: vmkfstools failed: vmkernel is not loaded or call not implemented.

wat

I had a look in vmkernel.log (typically via the output of 'dmesg') and found a couple of lines which seemed relevant, since they appeared at the moment I issued the 'vmkfstools' command:
2013-08-26T17:40:09.176Z cpu1:2049)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x1a (0x41240080c0c0, 0) to dev "mpx.vmhba40:C0:T0:L0" on path "vmhba40:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE
2013-08-26T17:40:09.176Z cpu1:2049)ScsiDeviceIO: 2331: Cmd(0x41240080c0c0) 0x1a, CmdSN 0x1a4b from world 0 to dev "mpx.vmhba40:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

The sense data seems interesting. The VMware KB article 289902 discusses the sense keys - 0x5 is "ILLEGAL REQUEST" and the linked page on the T10 website goes into further detail that '0x24 0x0' is INVALID FIELD IN CDB. 
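
To make log lines like these less cryptic, a small Python sketch can pull the fields apart. The lookup tables hold only the codes that appear in this post (plus GOOD for contrast); one detail the logs don't spell out is that opcode 0x1a is MODE SENSE(6). The regular expression is written against the NMP line format shown above and is an assumption about that format, not a documented interface.

```python
import re

# Only the codes seen in this post; VMware KB 289902 and the T10 tables list the rest.
OPCODES = {0x1A: "MODE SENSE(6)"}
DEVICE_STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION"}
SENSE_KEYS = {0x5: "ILLEGAL REQUEST"}
ASC_ASCQ = {(0x24, 0x00): "INVALID FIELD IN CDB"}

# Matches the NMP line format shown above (an assumption, not a documented interface).
NMP_RE = re.compile(
    r"Cmd 0x(?P<op>[0-9a-f]+) .*?"
    r"H:0x(?P<host>[0-9a-f]+) D:0x(?P<dev>[0-9a-f]+) P:0x(?P<plugin>[0-9a-f]+) "
    r"Valid sense data: 0x(?P<key>[0-9a-f]+) 0x(?P<asc>[0-9a-f]+) 0x(?P<ascq>[0-9a-f]+)",
    re.IGNORECASE,
)


def decode_nmp_line(line):
    """Return a dict of decoded fields, or None if the line is not an NMP failure line."""
    m = NMP_RE.search(line)
    if m is None:
        return None
    v = {name: int(text, 16) for name, text in m.groupdict().items()}
    return {
        "command": OPCODES.get(v["op"], hex(v["op"])),
        "host_status": v["host"],
        "device_status": DEVICE_STATUS.get(v["dev"], hex(v["dev"])),
        "plugin_status": v["plugin"],
        "sense_key": SENSE_KEYS.get(v["key"], hex(v["key"])),
        "asc_ascq": ASC_ASCQ.get((v["asc"], v["ascq"]), f"{v['asc']:#04x}/{v['ascq']:#04x}"),
    }


if __name__ == "__main__":
    sample = ('NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x1a (0x41240080c0c0, 0) to dev '
              '"mpx.vmhba40:C0:T0:L0" on path "vmhba40:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 '
              'Valid sense data: 0x5 0x24 0x0. Act:NONE')
    print(decode_nmp_line(sample))
```

Note that the ScsiDeviceIO variant of the message ("Cmd(0x...) 0x1a") deliberately doesn't match; it carries the same information.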

However, this turns out to be a red herring: the DataTraveler G3 logs exactly the same errors and works fine:


2013-08-26T18:53:31.861Z cpu1:2049)NMP: nmp_ThrottleLogForDevice:2319: Cmd 0x1a (0x412400810cc0, 0) to dev "mpx.vmhba32:C0:T0:L0" on path "vmhba39:C0:T0:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0. Act:NONE
2013-08-26T18:53:31.861Z cpu1:2049)ScsiDeviceIO: 2331: Cmd(0x412400810cc0) 0x1a, CmdSN 0x4b6 from world 0 to dev "mpx.vmhba39:C0:T0:L0" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0.

Conclusion

The conclusion is a very unsatisfactory "ESXi 5.1 is fussy", which is really no news at all. I also have no idea how to tell in advance which USB sticks will work and which will fail, so I've just gone and bought another three DataTraveler G3s, since I need them to complete the lab build.

Comments welcome! :)

<blink>HILARIOUS UPDATE</blink>

Those three identical 4GB sticks I ordered? They arrived. One of them is already out of its packet and currently in the first MicroServer.

It exhibits exactly the same behaviour as the Lexar stick.

There are a number of differences:
  • Product IDs: the working stick is Vendor 0951, Product 1643 whilst the new one is Vendor 0951, Product 1665
  • Model strings: 'G3' in the working and '2.0' in the non-working, despite both bearing the 'G3' moniker on the plastic casing
  • Usable capacity: 3690MB in the working and 3695MB in the non-working
  • SCSI revision: the working stick natively reports version 2, whereas the non-working one natively reports version 4, and the kernel logs a message about patching its inquiry data so that it appears to be version 2

The SCSI revision number appears to be the crux of the matter.


Working Kingston DataTraveler:

2013-08-29T20:18:13.102Z cpu0:210103)usb-storage: detected SCSI revision number 2 on vmhba32 
2013-08-29T20:18:13.102Z cpu0:210105)ScsiScan: 888: Path 'vmhba32:C0:T0:L0': Vendor: 'Kingston'  Model: 'DataTraveler G3 '  Rev: '1.00' 
2013-08-29T20:18:13.102Z cpu0:210105)ScsiScan: 891: Path 'vmhba32:C0:T0:L0': Type: 0x0, ANSI rev: 2, TPGS: 0 (none)

Non-working Kingston DataTraveler:

2013-08-29T19:47:33.356Z cpu1:29884)usb-storage: detected SCSI revision number 4 on vmhba38
2013-08-29T19:47:33.356Z cpu1:29884)usb-storage: patching inquiry data to change SCSI revision number from 4 to 2 on vmhba38
2013-08-29T19:47:33.357Z cpu1:29885)ScsiScan: 888: Path 'vmhba38:C0:T0:L0': Vendor: 'Kingston'  Model: 'DataTraveler 2.0'  Rev: '1.00' 
2013-08-29T19:47:33.357Z cpu1:29885)ScsiScan: 891: Path 'vmhba38:C0:T0:L0': Type: 0x0, ANSI rev: 2, TPGS: 0 (none)

Non-working Lexar JumpDrive:

2013-08-29T20:30:25.002Z cpu0:210739)usb-storage: detected SCSI revision number 4 on vmhba40 
2013-08-29T20:30:25.002Z cpu0:210739)usb-storage: patching inquiry data to change SCSI revision number from 4 to 2 on vmhba40 
2013-08-29T20:30:25.002Z cpu0:210739)usb-storage: patching inquiry data to clear TPGS (3) on vmhba40 
2013-08-29T20:30:25.002Z cpu0:210740)ScsiScan: 888: Path 'vmhba40:C0:T0:L0': Vendor: 'Lexar   '  Model: 'USB Flash Drive '  Rev: '1100' 
2013-08-29T20:30:25.002Z cpu0:210740)ScsiScan: 891: Path 'vmhba40:C0:T0:L0': Type: 0x0, ANSI rev: 2, TPGS: 0 (none)
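
Comparing these by eye gets tedious with more than a couple of sticks. Here is a minimal Python 3 sketch, meant to be run against a copy of vmkernel.log on another machine, that lists which adapters had their SCSI revision patched by usb-storage. The regular expressions simply match the message text above and assume it doesn't change between builds; the script name in the usage comment is hypothetical.

```python
import re

DETECTED_RE = re.compile(r"detected SCSI revision number (\d+) on (vmhba\d+)")
PATCHED_RE = re.compile(
    r"patching inquiry data to change SCSI revision number from (\d+) to (\d+) on (vmhba\d+)"
)


def scsi_revisions(lines):
    """Map each vmhba to its native SCSI revision and, if patched, the revision presented."""
    revisions = {}
    for line in lines:
        m = DETECTED_RE.search(line)
        if m:
            revisions.setdefault(m.group(2), {})["native"] = int(m.group(1))
            continue
        m = PATCHED_RE.search(line)
        if m:
            revisions.setdefault(m.group(3), {})["patched_to"] = int(m.group(2))
    return revisions


def patched_adapters(revisions):
    """Adapters whose inquiry data usb-storage rewrote (in this post, the failing sticks)."""
    return sorted(hba for hba, info in revisions.items() if "patched_to" in info)


if __name__ == "__main__":
    import sys
    # Usage: python3 scan_revisions.py /path/to/copy/of/vmkernel.log
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as log:
            revs = scsi_revisions(log)
        for hba in patched_adapters(revs):
            print(hba, revs[hba])
```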

I did find that the 'usb-storage' kernel module being used supports per-device 'quirks':

~ # esxcfg-module -i usb-storage
esxcfg-module module information input file: /usr/lib/vmware/vmkmod/usb-storage 
License: GPL
Version: Version 1.0-3vmw, Build: 1065491, Interface: 9.2  
Built on: Mar 23 2013
[...]
quirks: string   supplemental list of device IDs and their quirks

The usb-storage driver is GPL-licensed and derived from its Linux namesake, so the Linux source at http://lxr.free-electrons.com/source/drivers/usb/storage/usb.c gave me a clue as to what the various quirks mean. I tried setting all the available options for the Lexar stick except for 'Ignore Device' (option 'i') with the following command:


# esxcfg-module -s quirks=5dc:a815:abcdehlnoprsw usb-storage
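
The quirks value is just 'VID:PID:flags'. A small Python sketch for building it, assuming ESXi's usb-storage parses the parameter the same way as the Linux driver it came from. Hex IDs without the leading zero are evidently accepted (the log below matches vid 05dc), while comma-separated multiple entries are Linux behaviour I haven't verified on ESXi. The function names are hypothetical helpers, not part of any VMware tool.

```python
def quirks_entry(vid, pid, flags):
    """Format one usb-storage quirks entry as VID:PID:FLAGS (lower-case hex IDs)."""
    if "i" in flags:
        # 'i' is Ignore Device: the stick would disappear from ESXi altogether.
        raise ValueError("the 'i' (ignore device) flag is deliberately not allowed")
    letters = "".join(sorted(set(flags)))
    return f"{vid:x}:{pid:x}:{letters}"


def quirks_param(entries):
    """Join entries with commas, as the Linux usb-storage quirks parameter expects."""
    return "quirks=" + ",".join(quirks_entry(vid, pid, flags) for vid, pid, flags in entries)


if __name__ == "__main__":
    # The Lexar stick from this post; the resulting string is what goes to 'esxcfg-module -s'.
    print(quirks_param([(0x05DC, 0xA815, "abcdehlnoprsw")]))
```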

I know the module settings took effect because this was logged:



2013-08-29T21:16:22.137Z cpu0:2489)<6>usb-storage 1-4:1.0: Quirks match for vid 05dc pid a815: 392b1

However, even after a reboot of that ESXi host, there was no change in the behaviour :( Boo! Let's hope that the kernel with vSphere 5.5 is a little more forgiving!


Thursday, 22 August 2013

Shrinking a Windows boot drive in AWS

I recently deployed a couple of Windows domain controllers from a CloudFormation template and only after I'd done reasonable work to them did I realise that the boot drives were 100GB rather than a more sensible 40GB.

Now, that additional 60GB isn't exactly going to break the bank (it works out at about £50 per year), but I'm committed to provisioning the correct size.

I investigated a few different options for shrinking a Windows boot drive in AWS, and this is the one I found clearest.

In this walkthrough, DC01 is the hostname of the Windows machine in question.

AWS Console

  • Take a snapshot of DC01

Windows

  • Start -> Administrative Tools -> Computer Management -> Storage -> Disk Management
  • Right-click on Volume C: -> Shrink Volume so the 'Total size after shrink' is no more than 39000 MB



  • Start -> Shut down. 
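
Before shutting down, it's worth checking that C: really will end inside the new volume. A minimal Python sketch of that arithmetic, assuming Disk Management's 'MB' and EBS's 'GB' are both binary units (MiB and GiB) and a hypothetical 1 MiB offset for the start of C:. Check the real offset in Disk Management or diskpart, especially if there's a System Reserved partition in front of it.

```python
def partition_end_mib(offset_mib, size_mib):
    """End of a partition, in MiB from the start of the disk."""
    return offset_mib + size_mib


def headroom_mib(offset_mib, size_mib, target_gib):
    """Space left after the partition on a target_gib volume (negative means it won't fit)."""
    return target_gib * 1024 - partition_end_mib(offset_mib, size_mib)


def fits_on_volume(offset_mib, size_mib, target_gib):
    """True if the whole partition lies within the first target_gib of the disk."""
    return headroom_mib(offset_mib, size_mib, target_gib) >= 0


if __name__ == "__main__":
    offset = 1  # hypothetical: C: starting 1 MiB into the disk
    print(headroom_mib(offset, 39000, 40), "MiB to spare on a 40GB volume")
```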

AWS Console


  • Determine which Availability Zone DC01 is in. In this case it's eu-west-1a.

  • Create a new Amazon Linux EC2 instance in the same zone.
    • I chose an EBS-optimised m1.large to improve the disk speed
  • Detach the 100GB root volume from DC01
  • Attach the 100GB root volume to the new Amazon Linux EC2 instance as device /dev/sdf (the default as of writing)
  • Create a new 40GB standard EBS volume in the same zone and attach it to the Amazon Linux EC2 instance as device /dev/sdg (be careful - it defaults to /dev/sdf)

Amazon Linux EC2 instance

  • Check that the volumes are attached as expected with 'sudo fdisk -l'.

  • Yes, /dev/sdf is our 100G source drive and /dev/sdg is the new 40G destination
  • Use 'dd' to copy the contents of the source to the destination
    • sudo dd if=/dev/sdf of=/dev/sdg bs=1M
    • The 'bs=1M' tells dd to use a block size of one megabyte - this is faster and also reduces the number of I/O operations. AWS charges per I/O operation :)
    • BONUS: this writes every block of the new volume, which also pre-warms it to maximise EBS performance
  • The 'no space left on device' message is perfectly normal - 100GB of data will not fit into 40GB. The important part is that the shrunk C: partition, and everything in front of it, lies within the first 40GB, so nothing useful is lost
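
What dd does here can be sketched in a few lines of Python: copy fixed-size blocks until the destination is full and silently leave the rest behind. The second function shows why bs matters for the request count, since dd issues one read and one write per block and its default block size is 512 bytes. This counts requests at dd's level only; the kernel and EBS may split or merge them, so treat it as an illustration rather than a billing estimate. The function names are illustrative.

```python
import io


def block_copy(src, dst, dst_size, block_size=1024 * 1024):
    """Copy src into dst in block_size chunks until dst_size bytes have been written.

    Mimics dd filling a smaller destination: data beyond dst_size is simply not copied,
    which is harmless as long as every partition ends before that point.
    """
    copied = 0
    while copied < dst_size:
        chunk = src.read(min(block_size, dst_size - copied))
        if not chunk:
            break  # source exhausted before the destination filled up
        dst.write(chunk)
        copied += len(chunk)
    return copied


def write_requests(total_bytes, block_size):
    """How many write() calls dd makes to cover total_bytes at a given bs (ceiling division)."""
    return -(-total_bytes // block_size)


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the 100GB source and the 40GB destination.
    src = io.BytesIO(bytes(range(256)) * 40)  # 10240 bytes
    dst = io.BytesIO()
    n = block_copy(src, dst, dst_size=4096, block_size=1024)
    print(n, dst.getvalue() == src.getvalue()[:4096])

    forty_gib = 40 * 1024 ** 3
    print("bs=1M:", write_requests(forty_gib, 1024 ** 2))
    print("bs=512 (dd default):", write_requests(forty_gib, 512))
```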

AWS Console

  • Stop the Amazon Linux EC2 instance
  • Detach both the 100G and 40G volumes
  • Attach the 40G volume as device /dev/sda1 of DC01
  • Start DC01

Windows

  • Verify that the instance comes up and the root device is 40GB as expected.


AWS Console

  • Delete the 100GB snapshot
  • Delete the original 100GB volume from DC01
  • Terminate the Amazon Linux EC2 instance

Tada!

    Wednesday, 21 August 2013

    PuTTY as a quick VPN into an Amazon Web Services VPC

    A few weeks ago I was designing a client solution based on Amazon Elastic Beanstalk. The client initially had purely public-facing web servers in a single subnet, each with its own Elastic IP address, and as discussions continued it became clear that an Active Directory environment with two domain controllers and private subnets would be necessary.


    Logical diagram: Elastic Beanstalk and VPC topology with a bastion host (from http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/AWSHowTo-vpc-bh.html)
    The generic diagram above (as provided by AWS themselves) shows two web servers sharing a single private subnet whereas the client's are in separate private subnets in different Availability Zones, but the core idea is the same.  

    The key point came when the client asked about all the other 'ancillary' instances surrounding the environment, i.e.

    • Internet Gateway
    • Bastion Host
    • NAT Instance

    These are fair points to make since every client should be concerned about cost. The Internet Gateway in a VPC is provided at zero hourly charge (only charged by traffic passed) but the Bastion Hosts and NAT Instances are EC2 instances charged exactly as for any other instance.

    The live environment spans two AZs, so running four extra instances (a bastion host and a NAT instance in each AZ) just to get data in and out was difficult for the client to accept.

    I came up with a sensible workaround based on previous experience with SSH port forwarding. The NAT instances are regular EC2 instances in the public subnet running Amazon Linux, so they already run an SSH daemon. That meant I could connect securely to instances in the private subnet via the NAT instance and get rid of the bastion hosts completely!

    The configuration is fairly straightforward - there's more noise in the screenshots than anything else - yay Windows.

    • The default username for all Amazon Linux instances is 'ec2-user', so we connect to ec2-user@ELASTIC.IP.ADDRESS




    • Before we click 'Open', visit Connection -> SSH -> Auth and tell PuTTY where your PPK private keyfile is. 
      • If you only have the .PEM key downloaded from AWS, you will need to convert it to .PPK format with PuTTYgen first.



    • The magic takes place in Connection -> SSH -> Tunnels:
    • We have mapped local address 127.0.0.2:22 through to 10.0.1.11:3389, an instance in the private VPC subnet. I am using port 22 on a loopback address rather than 3389 because Windows 7 binds port 3389 on all local interfaces; I could have chosen any other free local port.
    • Go back to the 'Session' menu (top of the 'Category' tree view), give the session a name and click Save. 
    • Now click 'Open' and we can see we are connected to the NAT instance in the public 10.0.0.0/24 subnet.

    • There's nothing obvious to say that the tunnel configuration has been successful - we need to right-click on the PuTTY title bar and select 'Event Log' for that.

    • We can now launch the MS Remote Desktop client and point it at 127.0.0.2:22.
    • Tada! 
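
The same tunnel can be described without the screenshots. A minimal Python sketch, assuming plink (PuTTY's command-line companion), whose -L option takes [listen-IP:]listen-port:host:port. The key filename is a placeholder, and is_listening is a hypothetical helper that simply checks whether anything accepts connections on the local end of the tunnel, which is a quicker confirmation than digging through the Event Log.

```python
import socket


def forward_spec(listen_ip, listen_port, dest_host, dest_port):
    """Format a local forward the way plink's -L option expects it."""
    return f"{listen_ip}:{listen_port}:{dest_host}:{dest_port}"


def plink_args(key_file, user, host, spec):
    """Command line for the tunnel; key_file and host are placeholders to substitute."""
    return ["plink", "-ssh", "-i", key_file, "-L", spec, f"{user}@{host}"]


def is_listening(host, port, timeout=2.0):
    """True if something accepts TCP connections on host:port, e.g. the tunnel's local end."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    spec = forward_spec("127.0.0.2", 22, "10.0.1.11", 3389)
    print(" ".join(plink_args("mykey.ppk", "ec2-user", "ELASTIC.IP.ADDRESS", spec)))
    # Once the session is open, Remote Desktop can be pointed at 127.0.0.2:22.
    print("tunnel up:", is_listening("127.0.0.2", 22))
```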

    Final note - be sure to update PuTTY to 0.63 to benefit from improvements in the port-forwarding code, and to receive fixes for a bunch of security holes: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html