Monday, July 6, 2009

Update: Why Bing Travel Went Down

You might consider this a redundant question of sorts, but here is why Bing Travel went down for some 36 hours over the July 4th weekend, when a blown transformer knocked out much of the power at the Fisher Plaza facility in Seattle.

In the heat of the crisis over the weekend, much of the detail behind the incident was unavailable, but Bing provided me with some background today.

Bing Travel, known as Farecast before it was rebranded with the launch of Microsoft's Bing search engine, had "redundant counterpoints" for its core networking systems, databases and Web/application servers, but all of them were located within Fisher Plaza, said Bing spokesperson Whitney Burk.

"Microsoft has invested deeply in fortifying the Bing Travel infrastructure to handle high-volume server loads and emergency failure situations, and the amount of redundancy was increased in anticipation of the Bing launch [which took place in the beginning of June]," Burk said.

When the incident occurred last weekend, Microsoft already was in the process of transitioning the "legacy Farecast servers" to Microsoft's cloud computing platform, "the goal of which is to prevent this sort of unforeseen situation from impacting our customers," the spokesperson said.

Citing the complexity of the task, Burk said Microsoft hopes to have Bing Travel moved to cloud computing by the fall.

I will let Burk explain how cloud computing gives computing management systems new, well, "repurpose," and lessens reliance on specific hardware and specialized configurations in disaster-recovery situations.

Says Burk: "Fundamentally, cloud computing platforms offer a layer of hardware abstraction to applications that run on them. This means applications, when retooled to run on the cloud, no longer need to be sensitive or aware of the number of machines available to them running in production. Another distinction of cloud computing technology is that configurations with specialized hardware are avoided. When individual machines fail (which they sometimes do), the cloud computing management systems will automatically detect the failures and repurpose new or existing machines to run the applications.

"Because cloud computing platforms usually consist of thousands or tens of thousands of machines, adding capacity to handle increased load is simply a matter of changing some basic configurations and deploying more copies of the services and applications to additional machines. This also true for geo-redundancy. A new instance of an entire website can be brought online relatively quickly; it’s simply a matter of configuring a set of available machines in the cloud and letting the management software perform the software deployments and traffic routing."

With its data-mining and predictive technology, Bing Travel does indeed appear to run on a complex system, and it reportedly was the last of the affected websites in Fisher Plaza to begin humming again.

Real estate website Redfin, which had redundancy capabilities at a second location, reportedly was back online in about five hours.

Meanwhile, asked about the business impact of the outage on Bing Travel, Burk declined to comment.
