A war story
One thing you never run out of when working in IT is war stories. Everybody has a few, and with more years under your belt, more and more follow.
Most of these stories are pretty hermetic, but for those who get them, they make the hair on the back of your neck stand up.
I'd like to share one of mine.
A few years ago I had the pleasure of running an install at a certain customer for a certain company. It was a nice trip, even though I'm not known for being fond of travelling away from home.
Besides the views and the good hotel, I had an objective. I had been sent to assist a hardware partner for our product, who was struggling to get the system and the gear to work for the customer. The customer was a financial institution with a tight schedule for the installation. The usual workflow for this reseller was to put all the gear together in their lab, install the product, configure it, test it, then dismantle everything, transport it to the customer's DC and slap it all back together. It used to work like a charm. Not this time.
Right after the system booted, the second pair of network ports was configured properly. The SFPs were plugged in and access was tested. Then the systems were rebooted to test various functionality, and neither network port came back. On two identical servers.
By the time I was flown in to assist with the troubleshooting, the situation had gotten pretty heated. All the functionality was apparently very satisfying to the customer, and they were more than willing to buy even more once the setup was successfully deployed. The problem was that you couldn't reboot the gear unless you were willing to go down to the DC, unplug the Ethernet, reboot, and plug the Ethernet back in. Unacceptable, as you may understand.
A little bit of background may be required. The software product installed on top of the gear is a bit of a niche thing, based neither on Linux nor on the Free/Net/OpenBSD family of systems. Thus people skilled in reading its logs and understanding them are a bit scarce. Tied to that is the fact that while the community is pretty active and strong, it is relatively tiny compared to others.
It didn't take me long to see that we hit the problem at the moment the network card drivers were probing for SFPs.
If the SFPs were not plugged in during boot, the drivers loaded properly, the interfaces were configured, and later on, during system operation, we could use them. If the SFPs were plugged in, the driver refused to load for those NIC ports. Digging a little deeper, we discovered that the driver itself was checking the SFP vendor and would only load for vendors on a certain list. For any other it would refuse. Googling the problem, we found similar reports for Linux systems, which were quickly solved by adding a line to the driver configuration file that bypassed the check. The reasoning was that if you wanted the reliability guaranteed by the network card vendor, you had to comply and use only SFPs from their list of supported ones. But if you were willing to battle the hardware on your own, you could disable the check and sail the waters.
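For readers who prefer code to prose, the behaviour we observed can be sketched roughly like this. It is an illustration in Python, not the driver's actual code: the vendor names, the `Port` type, the `probe_port` function and the `allow_unsupported` flag are all made up for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical allowlist; a real driver ships its own list of qualified modules.
SUPPORTED_SFP_VENDORS = frozenset({"VendorA", "VendorB"})


@dataclass(frozen=True)
class Port:
    name: str
    sfp_vendor: Optional[str]  # None means the cage was empty at probe time


def probe_port(port: Port, allow_unsupported: bool = False) -> bool:
    """Return True if the (imaginary) driver would attach to this port."""
    if port.sfp_vendor is None:
        # No module present during boot: there is no vendor to check,
        # so the driver attaches and the port can be used later.
        return True
    if port.sfp_vendor in SUPPORTED_SFP_VENDORS:
        return True
    # Unknown vendor: refuse unless the admin explicitly opted out of the check.
    return allow_unsupported


if __name__ == "__main__":
    ports = [Port("port2", "SomeOtherVendor"), Port("port3", None)]
    for p in ports:
        print(p.name, "default:", probe_port(p),
              "with bypass:", probe_port(p, allow_unsupported=True))
```

The one-line fix on the Linux side amounts to flipping the equivalent of `allow_unsupported` in the driver configuration: the check stays in place by default, and whoever disables it takes over responsibility for the hardware.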
I made the discovery 4 hours before the deadline set by the customer. It was obviously too early for shouts of joy. From a discovered issue, to engineering accepting it as an RFE, to a newly compiled version of the driver is a long road, one we might not be able to travel within 4 hours.
Anybody who has been personally responsible for a deal worth hundreds of thousands of euros, standing face to face with the end customer and the direct company customer (the product reseller) and telling them we may not make it, knows the kind of experience I was going through. Nothing I'd want to do again anytime soon.
All in all, we did it. Somebody up in the company pushed the one-liner through the corporate paths at express speed. One hour before the deadline set by the customer, we were able to finish all the tests, including the reboot of the system. The deal was accepted.
So what is the lesson learned from this experience? There are actually a few.
First one - open source is incredible. We were able to figure out both the problem and the solution based on a quite different operating system.
Second one - the internet is incredible. We were able to find the solution at all only because someone had already gone down this path, and search engines served the solution on a silver plate.
Third one - for all the Software Defined movement, which is meant to liberate you from the clutches of hardware lock-in so that you can slap the software product of your choice on the hardware of your choice - the combinations may not be as many as they seem at first, and *someone* needs to do the compatibility matrix. And let it not be the customer.