Seven rules to help you prepare for problems on storage area networks

I wrote this article because every SAN administrator should know how to be better prepared for problems on their storage area network. Now that central storage is the rule rather than the exception, storage area networking plays a really important role in everyday computing.

Note that these seven rules are more my suggestions than generic rules to be followed strictly. So try to find at least something that you can implement in your environment, and please let me know if there are more relevant things than these, or something to add.

Rule #1 – Implement NTP

In my opinion this is the most important thing. Anyone who has had to find out what happened in an environment where every clock shows a different time understands the relevance of NTP. When all your devices are on the same time, finding the root cause is usually much easier. Implement an NTP server on your management network and sync all your devices from there. You can keep your management network's NTP server on the proper time by syncing it from the internet, but it's more important that all devices are on the same time than that they are exactly on the right second of the world clock.
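As a rough illustration of why a common time base matters, the Python sketch below compares clock readings from several devices against the management station's reference and lists the ones that have drifted. The function name, the device names and the two-second tolerance are assumptions made for this example; how you actually read a device's clock depends on the device.

```python
from datetime import datetime, timedelta


def find_skewed_devices(reference, device_clocks, tolerance=timedelta(seconds=2)):
    """Return {device: offset} for clocks that differ from the reference
    by more than the tolerance. A negative offset means the device is behind."""
    skewed = {}
    for name, reading in device_clocks.items():
        offset = reading - reference
        if abs(offset) > tolerance:
            skewed[name] = offset
    return skewed


if __name__ == "__main__":
    # Hypothetical readings taken at the same moment.
    reference = datetime(2024, 1, 1, 12, 0, 0)
    readings = {
        "switch-a": reference + timedelta(seconds=1),
        "switch-b": reference - timedelta(minutes=5),
    }
    for device, offset in find_skewed_devices(reference, readings).items():
        print(device, "is off by", offset.total_seconds(), "seconds")
```

A device that shows up here is one whose log timestamps you cannot line up with the others without manual correction.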

Rule #2 – Implement good naming schema

In the past servers usually got their names from action heroes and stars. This might be nice, but if you want easy-to-remember names, use them as CNAMEs rather than as proper names. In a problem situation it is nice to see from the name exactly where your device or server is located, so I suggest you use something like Helsinki-DC1-BC01-BL01-spiderman rather than just spiderman. In this example you can easily see that your server is located in Helsinki, in datacenter 1, in blade chassis one, and that it is blade number one.
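A structured name like this can also be read by scripts. The Python sketch below splits such a name into its parts. It assumes the exact site-DC-BC-BL-nickname layout of the example above; the regular expression and the field names are choices made for this illustration, so adjust them to your own schema.

```python
import re
from typing import NamedTuple, Optional


class DeviceLocation(NamedTuple):
    site: str
    datacenter: int
    chassis: int
    blade: int
    nickname: str


# Assumed layout: <site>-DC<n>-BC<n>-BL<n>-<nickname>
_NAME_RE = re.compile(
    r"^(?P<site>[A-Za-z]+)-DC(?P<dc>\d+)-BC(?P<bc>\d+)-BL(?P<bl>\d+)-(?P<nick>[A-Za-z0-9]+)$"
)


def parse_device_name(name: str) -> Optional[DeviceLocation]:
    """Return the location encoded in the name, or None if it does not follow the schema."""
    match = _NAME_RE.match(name)
    if match is None:
        return None
    return DeviceLocation(
        site=match["site"],
        datacenter=int(match["dc"]),
        chassis=int(match["bc"]),
        blade=int(match["bl"]),
        nickname=match["nick"],
    )


if __name__ == "__main__":
    print(parse_device_name("Helsinki-DC1-BC01-BL01-spiderman"))
    print(parse_device_name("spiderman"))  # not in the schema
```

Running the same check over your whole inventory is a quick way to find devices that were named before the schema existed.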

Use consistent naming in zoning. I usually name zones ZA_host1_host2. This shows immediately that it is a zone on fabric A between host1 and host2. On a SAN I always prefer that aliases follow the same kind of naming schema: AA_host1 is the alias on fabric A for host1.
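If zones and aliases are created by scripts, it helps to generate the names in one place so they cannot drift apart. A minimal Python sketch of the ZA_/AA_ convention described above follows; the function names and the single-letter fabric check are assumptions for this example.

```python
def _check_fabric(fabric: str) -> str:
    """Normalise the fabric letter; this sketch assumes fabrics are named A, B, ..."""
    if len(fabric) != 1 or not fabric.isalpha():
        raise ValueError(f"fabric must be a single letter, got {fabric!r}")
    return fabric.upper()


def alias_name(fabric: str, host: str) -> str:
    """Alias for one host on one fabric, for example AA_host1."""
    return f"A{_check_fabric(fabric)}_{host}"


def zone_name(fabric: str, host1: str, host2: str) -> str:
    """Zone between two hosts on one fabric, for example ZA_host1_host2."""
    return f"Z{_check_fabric(fabric)}_{host1}_{host2}"


if __name__ == "__main__":
    print(alias_name("A", "host1"))
    print(zone_name("A", "host1", "host2"))
```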

For storage area networking a domain ID is like a phone number: domain IDs should always be unique. This is usually not a problem if you have separate SANs, but if you move something between SANs, having unique IDs is crucial, so use unique IDs from the beginning. The domain ID is also used for several other things, such as the Fibre Channel addresses of device ports.
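To make the link between domain IDs and addresses concrete: a 24-bit Fibre Channel address carries the domain ID in its top byte, followed by an area byte and a port byte. The Python sketch below splits an address into those three bytes and looks for domain IDs used by more than one switch. The switch names and the inventory dictionary are hypothetical.

```python
from collections import defaultdict


def split_fc_address(fcid: int):
    """Split a 24-bit Fibre Channel address into (domain, area, port) bytes."""
    return (fcid >> 16) & 0xFF, (fcid >> 8) & 0xFF, fcid & 0xFF


def find_duplicate_domain_ids(switch_domains):
    """Given {switch_name: domain_id}, return {domain_id: [switches]} for every
    domain ID that is used by more than one switch."""
    by_id = defaultdict(list)
    for switch, domain_id in switch_domains.items():
        by_id[domain_id].append(switch)
    return {d: sorted(names) for d, names in by_id.items() if len(names) > 1}


if __name__ == "__main__":
    print(split_fc_address(0x010200))
    # Hypothetical inventory covering both fabrics.
    inventory = {"fabA-sw1": 1, "fabA-sw2": 2, "fabB-sw1": 1}
    print(find_duplicate_domain_ids(inventory))
```

In this made-up inventory, fabA-sw1 and fabB-sw1 share domain ID 1, which is harmless while the fabrics stay separate and a problem the day one of the switches is moved.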

Rule #3 – Create generic SAN management station

This is usually done in all bigger environments, but every now and then I see environments with no generic SAN management station. Almost every company has implemented virtualization at least at some level, so creating a generic SAN management station should not be a problem. You can easily go with a virtualized Windows Server, or maybe even just a virtualized Windows 7 with RDP enabled, but I would go with a server so that more than one admin can be on the station at a time.

This station should have at least these:

  • An SSH and telnet tool which allows you to log session output to a text file; on Windows environments I usually go with PuTTY
  • An FTP server (and maybe TFTP also). I usually go with FileZilla Server, which is really easy to configure and use
  • An NTP server for your SAN environment
  • Management tools for your SAN (Fabric Manager for Cisco and DCFM for Brocade). This is really important in larger environments for troubleshooting
  • Enough disk space to store all firmware images and log files from switches (Rule #4)
  • …and access to the internet for cases where you need to download something new or just use Google when sitting on fire 😉

Rule #4 – Implement monitoring on your SAN environment

This can be done with the same software you use for your server environment, but I would go with Fabric Manager on Cisco SANs and DCFM on Brocade SANs, because these also include other features and are really useful when your environment gets bigger. Configure your management software to send email/SMS when something happens; don't just trust your eyes!

You should also implement automatic log collection for your environment. This helps a lot, for example, when you are trying to find physical link problems or slow drain devices. Configure your management station to pull all logs from the switches daily or weekly and then clear all counters, so that the next log starts from empty counters. This can be implemented with a few lines of Perl and an SSH library, and there are plenty of existing scripts on Google if you don't know how to do it with Perl.
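The collection loop described above can be sketched in a few lines. The example below uses Python rather than Perl and calls the system ssh client through subprocess instead of an SSH library. The default commands (porterrshow and statsclear, as found on Brocade switches) are assumptions: check your own switch's CLI reference before using them. The host names, the log directory and the file naming are also hypothetical, and key-based SSH login from the management station is assumed.

```python
import subprocess
from datetime import date
from pathlib import Path


def run_over_ssh(host, command):
    """Run one command on a switch via the system ssh client and return its output.
    Assumes key-based login is already set up for the management station."""
    result = subprocess.run(
        ["ssh", host, command],
        capture_output=True, text=True, check=True, timeout=120,
    )
    return result.stdout


def collect_logs(hosts, log_dir, show_cmd="porterrshow", clear_cmd="statsclear",
                 run=run_over_ssh, today=None):
    """Save the counter output of every switch to a dated file, then clear the counters.
    show_cmd and clear_cmd are assumed defaults; replace them for your platform."""
    today = today or date.today()
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for host in hosts:
        output = run(host, show_cmd)
        target = log_dir / f"{host}_{today.isoformat()}.log"
        target.write_text(output)
        # Clear counters only after the log is on disk, so nothing is lost on failure.
        run(host, clear_cmd)
        saved.append(target)
    return saved


if __name__ == "__main__":
    # Hypothetical switch names; schedule this script daily or weekly.
    for path in collect_logs(["fabA-sw1", "fabB-sw1"], "san-logs"):
        print("saved", path)
```

The run parameter exists so that the loop can be tried with a fake runner before it is pointed at production switches.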

Rule #5 – Design your SAN layout properly

This is really easy to achieve and doesn't take much time to keep up to date. Create a layout sketch of your SAN, even in smaller environments, and share it with all admins. You don't need to have all servers on this sketch; include just your SAN switches and storage systems. If you want you can include your servers as well, but this usually makes the sketch quite big and unreadable. In environments with two SANs (having two separate SANs should be the de facto standard!) always plug your servers and storage into the same ports, so if you connect your storage system to ports 1-4 on switch one in fabric A, connect it to ports 1-4 on the corresponding switch in fabric B as well.
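The same-ports convention is easy to verify once the port assignments are written down. The Python sketch below compares the port maps of two corresponding switches and reports every port whose device differs between fabric A and fabric B. The port maps here are typed in by hand and the device names are hypothetical; in practice you would fill them from your own documentation or switch output.

```python
def find_port_mismatches(fabric_a, fabric_b):
    """Compare {port: device} maps of two corresponding switches.
    Returns a sorted list of (port, device_on_a, device_on_b) for ports that differ;
    None means nothing is connected on that side."""
    mismatches = []
    for port in sorted(set(fabric_a) | set(fabric_b)):
        device_a = fabric_a.get(port)
        device_b = fabric_b.get(port)
        if device_a != device_b:
            mismatches.append((port, device_a, device_b))
    return mismatches


if __name__ == "__main__":
    # Hypothetical port maps: storage on ports 1-4 of both switches, one host misplaced.
    switch_a = {1: "storage1", 2: "storage1", 3: "storage1", 4: "storage1", 5: "host1"}
    switch_b = {1: "storage1", 2: "storage1", 3: "storage1", 4: "storage1", 6: "host1"}
    for port, dev_a, dev_b in find_port_mismatches(switch_a, switch_b):
        print(f"port {port}: fabric A has {dev_a}, fabric B has {dev_b}")
```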

Rule #6 – Update your firmware

Don't just hang on to firmware because it works. No software is absolutely free of bugs, which is why you should update your firmware regularly. I am not saying that you should go with a new release as soon as it reaches the downloads page, but try to be on as new a version as you can. Lots of storage systems set requirements for firmware levels, so always follow your manufacturer's advice. If your manufacturer doesn't support anything newer than something released a year ago, it might be time to change your vendor!

If you have a properly designed SAN with two separate networks, you can do firmware upgrades without any breaks in production. Most enterprise-class SAN switches (usually called SAN directors) also have two redundant controllers, so you can update them on the fly without any interruption to your production!

Rule #7 – Do backups!!!

Take this seriously. Taking backups is not hard. You can build this into your daily statistics collection scripts or do it periodically by hand; whichever way you choose, take your backups regularly. I have seen lots of cases where there were no backups of a switch, and after a crash the admins needed to recreate everything from scratch. Do this for your storage systems too if possible; at least IBM's high-end storage systems have features which allow you to take backups of their configs. Config files are usually really small, and there shouldn't be a place without disk or tape space for backups of things as important as SAN switches and storage systems. For SAN switches you might also want to keep a backup of your license files, as getting new license files from Cisco/Brocade can take a while.
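Whichever way the backups are taken, it is worth checking that they really exist and are recent. The Python sketch below looks through a backup directory and lists the switches with no config backup newer than a given age. The file naming (switch name, underscore, anything, .cfg), the seven-day limit and the switch names are assumptions for this example; how the config gets off the switch is vendor specific and not shown.

```python
import time
from pathlib import Path


def find_stale_backups(backup_dir, switches, max_age_days=7, now=None):
    """Return the switches that have no backup file newer than max_age_days.
    Backups are assumed to be stored as <switch>_<anything>.cfg in backup_dir."""
    now = time.time() if now is None else now
    limit = max_age_days * 86400  # seconds
    stale = []
    for switch in switches:
        files = Path(backup_dir).glob(f"{switch}_*.cfg")
        newest = max((f.stat().st_mtime for f in files), default=None)
        if newest is None or now - newest > limit:
            stale.append(switch)
    return stale


if __name__ == "__main__":
    # Hypothetical directory and switch names.
    missing = find_stale_backups("san-backups", ["fabA-sw1", "fabB-sw1"])
    if missing:
        print("No recent config backup for:", ", ".join(missing))
```

A check like this fits well into the monitoring from Rule #4, so that a silently failing backup job gets noticed before the crash and not after it.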
