BLOG

In Clouds we Trust

September 8th, 2009

Gmail’s recent breakdown has once again brought the lack of trust in cloud computing to the forefront. The Webster dictionary defines trust as the “assured reliance on the character, ability, strength, or truth of someone or something” or “one in which confidence is placed”.

The main reasons for the lack of confidence in cloud computing can be summed up in the following questions people typically have:

  1. Security – How can I trust that my data will be secure when it is not in my physical control?
  2. Reliability – Can I truly rely on the cloud to be there when I need it? What if it goes down? The track records of Google Apps, Amazon EC2, Microsoft Azure and the rest do not inspire confidence.
  3. Performance – Will the cloud be able to adequately service my specific performance needs? Doesn’t the cloud affect the latency for my applications?
  4. Lock-in/Portability – How confident can I be that I will be able to move my applications and data around seamlessly between providers?

Until the industry is able to address all of these issues satisfactorily, the adoption of clouds will grow only marginally. The key is to address all of them; addressing a subset will simply not suffice. After all, what good is a high performance cloud infrastructure that is insecure, or a secure cloud that is unreliable or proprietary?

So how far is the current state of the art in cloud computing from trustable cloud computing? In the figure below, I have tried to list all the critical requirements for enabling trust in cloud computing. I have also color coded each of them to indicate current progress within the industry towards achieving these requirements. Green -> What has already been achieved; Yellow -> Work in progress; Orange -> What is yet to be achieved. Below the figure, I have some additional thoughts on what needs to be done to address current deficiencies for each of these requirements.

[Figure: cloud-trust-3 – critical requirements for trust in cloud computing, color coded by industry progress]

Scalability

One of the basic requirements for cloud computing is the ability of the infrastructure to scale. After all, if you aren’t able to adequately service incremental demand then trust in clouds is a non-starter. Today, Amazon, Google, Microsoft and many others have clearly demonstrated the ability to scale compute infrastructure for cloud computing. So the industry is already well on the way with respect to this requirement for trusted clouds.

Cost

Cost, or rather value, is absolutely a factor in enabling trust in clouds. Affordability drives adoption, and perceived value influences whether customers rely on it longer term. There is no absolute low cost threshold to aim for. The reality is that different customers will have different requirements, and what they are willing to pay will depend on the value (SLAs, QoS, features etc.) they receive. The key is that customers not only have these choices but that the offerings reflect customer expectations of lower costs derived from economies of scale. For example, if cloud providers were to start charging more as the size of clouds grows (to cover increasing management costs), then customers are going to be skeptical of the paradigm.

The economies of scale in large public clouds are already driving costs down, but there are still numerous avenues for improvement. To understand those areas of improvement, let’s first look at what drives costs today. The cost of cloud computing infrastructure is a function of a number of factors including:

  1. Hardware costs
  2. Software costs (license and maintenance)
  3. Installation services and administration services costs
  4. Cost due to inefficiencies resulting from
    1. Under-utilization
    2. Infrastructure complexity
    3. Shelf-ware, both hardware and software (a $120K software package sitting on the shelf is equivalent to the cost of a person-year)
    4. Human latency

Costs for #1 through #3 are driven largely by vendor economies of scale and market forces, i.e. the larger the market and the greater the competition, the lower the prices. So let us look more closely at #4 – the costs due to inefficiencies. It is well understood that existing compute resources are not being utilized effectively, and this is a major contributor to high data center costs in terms of power, cooling, floor space, software and administration. Virtualization is helping address some of the inefficiencies on the server side, but still not all servers are virtualized, whether for technical reasons (e.g. I/O contention, performance), software reasons (design or licensing hurdles) or, as in the case of private clouds, even organizational reasons (inter-departmental accounting, chargebacks and politics). Beyond servers we also have to deal with inefficiencies in storage. Thin provisioning and deduplication are just a couple of solutions here, but we still lack, for example, a good approach to properly reclaim and consolidate capacity when it is no longer used. Clearly there is a lot more room for efficiency gains through the optimization of utilization.

Another area of inefficiency is the management complexity that has been introduced into the infrastructure within our datacenters. The figure below highlights the silos and the duplicated management functionality within those silos that need to be consolidated in order to address inefficiencies due to redundancy. Customers are paying for management functionality that is unnecessarily being duplicated all over.

[Figure: cloud-complexity – datacenter silos and the management functionality duplicated within them]

Finally, there is the cost due to “human latency.” Without cloud infrastructure that can be automatically and optimally reconfigured in response to changing demand patterns, we are at the mercy of human administrators and experts. To accurately account for the cost for humans, we must not only factor what it costs to pay for them but also the cost of the time it takes them to recognize and react to an event requiring action.
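To make the human-latency argument a little more concrete, here is a back-of-the-envelope sketch in Python. It is only an illustration: the function name, its parameters and every figure in it are hypothetical assumptions of mine, not measurements. It simply adds the two terms described above, what we pay the person and what the business loses while waiting for that person to recognize and react to the event.

```python
def human_latency_cost(hourly_rate, hours_to_detect, hours_to_react,
                       hourly_cost_of_degraded_service):
    """Rough cost of one incident that waits on a human administrator.

    All parameters are hypothetical inputs chosen for illustration.
    """
    # Total time between the event and the corrective action.
    elapsed = hours_to_detect + hours_to_react
    # Term 1: what it costs to pay the administrator for that time.
    labor = hourly_rate * elapsed
    # Term 2: what the business loses while the event goes unhandled.
    waiting = hourly_cost_of_degraded_service * elapsed
    return labor + waiting


# Hypothetical figures, for illustration only.
cost = human_latency_cost(hourly_rate=60.0,
                          hours_to_detect=0.5,
                          hours_to_react=1.5,
                          hourly_cost_of_degraded_service=1000.0)
print(cost)  # 2120.0
```

With these made-up numbers the waiting term dwarfs the labor term, which is the point: the salary is usually the smaller part of the bill.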

Reliability, Availability and Performance

While the prevalent cloud computing platforms of the day have demonstrated their ability to scale, reliability, availability and performance are sorely lacking. This is amply evident in the numerous stories on cloud service outages that routinely make the tech press these days. Reliability, availability and performance are fundamental to establishing trust. What this means is that we need application QoS for the cloud.

Application QoS can be defined as the ability of the cloud to satisfactorily service numerous applications with different latency tolerance requirements by monitoring the applications in real-time and then dynamically allocating or assigning compute resources based on business priorities. Latency tolerance may vary depending on the type of application and the application’s business priority. For example, latency tolerance for a video streaming application may be less than that for an email server. However, the email server may be deemed more business critical than the video streaming application. The point here is that, ultimately, latency tolerance should be specified by the application, and that should dictate how all available compute resources are assigned dynamically to ensure application QoS.

Today’s cloud computing infrastructure is not there yet. We do have a number of companies that recognize this as an opportunity, i.e. the likes of RightScale, Elastra, 3Tera and numerous others. What these companies do is employ server virtualization technologies, which serve to abstract server software and the operating system from the server hardware, as the basis for enabling dynamic resource allocation. In essence, more server resources can be dynamically allocated to an application as demand changes. This however does not address the need for application QoS. The key point to note here is that the dynamic resourcing in offerings from the above mentioned companies is limited to “server resources”.

To truly enable QoS for applications in the trusted cloud, all resources, not just server resources, must be dynamically assignable to match application requirements. Why? Because every compute resource contributes to the overall latency for an application. Storage and network performance, in addition to server performance, help determine an application’s overall performance. Consequently, the ability to dynamically provision storage IOPS and bandwidth in response to changing application needs is also required to deploy truly dynamic cloud infrastructure. This is essentially extending to storage what can be done today for servers using virtualization. This capability simply does not exist in present day storage systems and will be required for cloud computing infrastructure that can be trusted. I’ll have more detail on this topic in future posts.
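As a rough illustration of what “all resources, not just server resources” might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `App` class, the `allocate` function, the resource names and the numbers are hypothetical, and a simple greedy pass in business-priority order stands in for whatever real policy a cloud would use. The only point it makes is that CPU, storage IOPS and network bandwidth are granted together, per application.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    priority: int   # hypothetical scale: lower number = more business critical
    demand: dict    # resource name -> units requested right now


def allocate(apps, capacity):
    """Grant every kind of resource to apps in business-priority order."""
    remaining = dict(capacity)  # copy so the caller's capacity is untouched
    grants = {}
    for app in sorted(apps, key=lambda a: a.priority):
        grant = {}
        for resource, wanted in app.demand.items():
            # Give what is asked for, capped by what is still free.
            given = min(wanted, remaining.get(resource, 0))
            remaining[resource] = remaining.get(resource, 0) - given
            grant[resource] = given
        grants[app.name] = grant
    return grants


# Hypothetical pool and demands, for illustration only.
capacity = {"cpu_cores": 16, "storage_iops": 10000, "network_mbps": 1000}
apps = [
    App("video", priority=2,
        demand={"cpu_cores": 8, "storage_iops": 5000, "network_mbps": 800}),
    App("email", priority=1,
        demand={"cpu_cores": 4, "storage_iops": 6000, "network_mbps": 100}),
]
grants = allocate(apps, capacity)
print(grants["email"])  # {'cpu_cores': 4, 'storage_iops': 6000, 'network_mbps': 100}
print(grants["video"])  # {'cpu_cores': 8, 'storage_iops': 4000, 'network_mbps': 800}
```

In this toy run the video application gets all the CPU and bandwidth it asked for but is short on storage IOPS, so a server-only scheduler would never notice that it is being starved.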

Security

There are two key elements to security in the context of trust in clouds. These are:

  1. End-to-end security of the data path between application and compute resources
  2. Security of the data that resides in the cloud

Earlier on we talked about the need for dynamic infrastructure that is able to assign compute resources to applications in real-time based on changing demand patterns. This dynamic infrastructure also needs to ensure security of that dynamically assigned connection end-to-end. Currently, we are in the early stages of this with Amazon’s recently introduced VPC (Virtual Private Cloud) demonstrating how to securely bridge traffic between a private datacenter and a public cloud using an encrypted VPN connection. All this does is securely transport data between a private datacenter and a public cloud infrastructure. The data on the public cloud infrastructure is still shared and there is nothing in Amazon VPC that secures the data once it is in the cloud. What is required is the ability to enable security for dynamic connections between the applications and compute resources. Finally, fine grained access control and encryption-based security must be implemented to lock down access to data that resides in the cloud.

Global Interoperability

In order to trust the clouds, customers will need the confidence that the investment they made in getting applications to deploy with a particular cloud infrastructure provider is not locked in. Accomplishing this will require the industry to create and adopt open standards for global interoperability to ensure portability of applications, data and management paradigms. This is similar to what happened during the evolution of the telecom and internet industries. In the telecom world, the ITU ultimately helped establish worldwide standards that enabled seamless interconnection and interoperability between disparate communications systems. In the Internet world, the IETF laid the foundation for interoperability by coordinating the creation of standards between customers, operators and vendors. A similar organization needs to drive an open standards effort in the cloud computing space. The CCIF (Cloud Computing Interoperability Forum) is an early example of such an effort. However, it is still in the stage of trying to standardize taxonomy and create a common architectural framework. We are still a ways away from global interoperability.

In summary, it all boils down to assuring application QoS by enabling the end-to-end visibility, fine-grained control and security requirements for each of the cloud infrastructure stakeholders while also offering choice. Specifically:

  1. Service Providers require end-to-end visibility and control  to optimally manage resource utilization
  2. Service developers require that adequate resources will be assigned for applications they develop along with the ability to pick and switch between service providers without fear of vendor lock-in
  3. End users desire the ability to stipulate the SLAs they require, along with visibility, control and flexibility (choice) to manage to those SLAs

In my next post I will propose a reference architecture that could provide the basis for enabling the above.

Data, Electricity and Computing Clouds

August 1st, 2009

James Urquhart’s recent piece, “In Cloud Computing, Data is not Electricity”, rightly brought up the issue of “Trust” in Clouds. I wholeheartedly agree with him that Trust is the key issue to be solved with respect to clouds. It is a topic I’d like to discuss a bit more in an upcoming blog entry. For now, I just wanted to discuss James’ analogy of electricity and data a bit more. I think the analogy is just a bit off, although I do agree with James’ conclusion that “until there is some way of consuming the cloud with verifiable security, control, service levels and compliance, much of the most valuable data in business today will not move to external clouds.”

In comparing data and electricity, it is important to keep sight of what exactly the “utility” or “commodity” is here. Just as James contends that an “amp is an amp is an amp”, I’d say a bit is a bit is a bit. I use “bit” loosely here. To be more accurate I’d use Watt Hour for electricity and some yet-to-be-defined unit for measuring compute power that incorporates parameters such as processor cycles, memory, storage capacity, IOPS, bandwidth, throughput etc. Now you can certainly use those commodity compute units to generate some data that is valuable. However, the issue of storing, sharing and in general securing that resultant data is different from the underlying compute processing. I understand that for cloud computing to truly take off we have to address both of these issues. However, it is important that we recognize the distinction between those two elements. Let’s look at this another way. The electricity that we consume over the power grid is used to run various appliances, for example a lamp, a radio or a television. So here the commodity is the electricity but the true value we derive is something else, i.e. light, music or pictures. Similarly, the bits flowing across from a cloud are a commodity but they are consumed, processed, and reconstituted by an appliance – in this case a client computer or server – and the value we derive is some data or information pieced together from those bits.

The excitement with cloud computing today is basically around the realization that cloud computing could make computing power, i.e. CPU, memory, storage, bandwidth and throughput, a commodity. Clearly, a lot has yet to be done beyond that to enable globally interoperable and secure public cloud infrastructure. However, in the interim there is so much efficiency to be gained by rearchitecting the data centers as clouds, even if it’s just for internal or private clouds. To be fair, I believe James actually reaches the same conclusion, but I just felt that the “data is not electricity” statement warranted some clarification.

The Economic Imperative for Clouds

July 30th, 2009

While reading Nicholas Carr’s The Big Switch (p. 37), the following paragraph struck me.

“By 1905, a writer for Engineering Magazine felt comfortable declaring that “no one would now think of planning a new plant with other than electric driving.” In short order, electric power had gone from exotic to commonplace. But one thing didn’t change. Factories continued to build their own power-supply systems on their own premises.  Few manufacturers considered buying electricity from the small central stations,  like Edison’s Pearl Street plant,  that were popping up across the country. Designed to supply lighting to local homes and shops, the central stations had neither the size nor the skill to serve the needs of big factories.  And the factory owners, having always supplied their own power, were loath to entrust such a critical function to an outsider. They knew that a glitch in power supply would bring their operations to a halt – and that a lot of glitches might well mean bankruptcy.”

With the above paragraph, Carr does a good job of paralleling the adoption of electric power in 1905 with the current state of cloud computing adoption by enterprises. Elsewhere in his book (p. 22), Carr makes the case that “..the fostering of invention and the embrace of new technologies that result…are the consequence of economic forces that lie largely beyond our control.”

Carr is clearly making the case that similar “economic imperatives” will apply to the computing world.  In that context, I thought that it would be useful to look back at the evolution of the computing industry and see how we are progressing and understand the forces driving us towards those economic imperatives.

[Figure: tco-curves – rough chart of data center TCO over time]

Economies of Scale are often touted in the context of clouds, but we should really look at progress in the context of Total Cost of Ownership (TCO). TCO can be used to model total cost relative to benefits for a given technology. It typically includes the cost of acquiring, upgrading, deploying, operating, utilizing, maintaining, servicing and subsequently disposing of a technology. The graph above roughly charts TCO for data centers over time.
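The TCO definition above can be written down as a small Python sketch. Treat it as a toy under stated assumptions: the stage names are my shorthand for the lifecycle stages just listed, `cost_per_benefit` is one hypothetical way to express “total cost relative to benefits”, and the figures are invented purely for illustration.

```python
# Shorthand names for the lifecycle stages listed in the text.
LIFECYCLE_STAGES = ("acquire", "upgrade", "deploy", "operate",
                    "utilize", "maintain", "service", "dispose")


def total_cost_of_ownership(costs):
    """Sum the cost of every lifecycle stage; refuse to skip any of them."""
    missing = [stage for stage in LIFECYCLE_STAGES if stage not in costs]
    if missing:
        raise ValueError(f"missing lifecycle costs: {missing}")
    return sum(costs[stage] for stage in LIFECYCLE_STAGES)


def cost_per_benefit(costs, benefit_units):
    """TCO relative to benefit, in whatever unit of benefit you choose."""
    return total_cost_of_ownership(costs) / benefit_units


# Hypothetical costs in thousands of dollars, for illustration only.
costs = {"acquire": 100, "upgrade": 20, "deploy": 15, "operate": 60,
         "utilize": 10, "maintain": 25, "service": 15, "dispose": 5}
print(total_cost_of_ownership(costs))   # 250
print(cost_per_benefit(costs, 500))     # 0.5
```

The check for missing stages is deliberate: a technology that looks cheap because nobody counted the operating or servicing costs is exactly the trap described in the rest of this post.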

My sense is that over the course of the past 40 years of computing, we have seen only marginal improvements in TCO overall, with any significant gains offset by equally significant losses due to the increased complexity and the new management burden that accompanies every new data center technology. Let me highlight some of these vicissitudes in data center TCO over the years using the chart shown above.

Way to the left, we have the early days when all of the compute elements, i.e. CPU, memory and storage, were packaged into a box, i.e. the server. This was expedient as it allowed the fastest interconnects between the various components. In this phase, initial gains in TCO came from faster hardware and servers. The increasing number of servers, their physical dispersion and networking all added to the complexity, but for the most part we did OK with newer management tools.

Next, thanks to the phenomenon of bandwidth inversion (external FC connections with more throughput than the internal bus in a server), it became possible to have external storage connected to the server that could deliver better performance than internal storage. This facilitated shared storage infrastructure, i.e. SANs, which helped improve TCO by enabling better resource utilization, improved performance and significantly better Business Continuity. At the same time, SANs also increased management complexity and caused new problems. Now we had a whole new silo in the datacenter, other than the server, to be managed and optimized. This required new skills and tools, and added new costs. Additionally, all businesses or applications that shared storage could potentially be impacted if the shared storage was not ideally configured. (Just as a point of reference, a basic “health check” service for SANs, a prelude to more expensive tuning services, costs approximately $400 a switch port and typically runs in the neighborhood of $25,000 – $50,000 overall.) So in the overall measurement of progress of TCO, we gained some but also lost a fair bit of ground with SANs.

This trend continues today with virtualization. Like the SAN that preceded it and revolutionized storage, virtualization is enabling unprecedented sharing of server resources and delivering significant benefits in terms of resource utilization, consolidation and HA/DR. However, once again, we have introduced new layers of complexity and management. For example, now we have to contend with Virtual Machine Sprawl, I/O contention between applications, new security concerns etc.

We seem to be on a path that is not sustainable. Today’s data centers have become complex silos of resources (servers, storage and networks) with poor end-to-end provisioning, management and security. What is worse is that each silo often duplicates management functionality that is also present in the other silos, which is counterproductive. The customer unknowingly pays for duplicated and often counterproductive management tools and services peddled by vendors of products in each silo. This is why Cloud Computing, in my mind, represents an opportunity for a fundamental re-architecture of the data center as we know it today. Cloud Computing will require us to be application / service centric in terms of data center infrastructure and its management (as opposed to the resource-centric paradigm we are used to today), and you simply cannot deliver application-centricity (SLAs, QoS etc.) without end-to-end visibility and control. This is where it might be helpful to borrow some lessons from the telecom world, which also went through a similar evolution but was able to arrive at a scalable infrastructure architecture and management paradigm that consistently delivers a mission critical service, i.e. voice, reliably and securely. In particular, Telecom’s FCAPS based signaling, switching and service mediation concepts could be worth looking at.

The status quo, i.e. current IT silos and layer upon layer of management infrastructure, is inefficient and not sustainable. Virtualization and network computing have only been the first steps, but Cloud Computing, in my opinion, represents the economic imperative Carr mentions, one that will necessitate a rethink of current data center architecture and management practices. With computing, we are at that threshold – just like the one Carr paints from the world of electric power in 1905 – beyond which the cloud will become the only way to produce, deliver and consume ubiquitous, cheap, efficient, reliable and secure computing power.

The Case for Industry-Academia Collaboration

July 20th, 2009

Earlier this month, I had the opportunity to participate in and present at the First International IEEE Workshop on Collaboration & Cloud Computing that was held in Groningen, Netherlands this year. Like most of you, I probably would not have heard of this conference if not for my colleague Dr. Rao Mikkilineni, who chaired the workshop this year. After attending it I am convinced that this conference has some real potential and is just the kind of forum that the cloud computing industry needs in order to enhance its collective understanding, identify future directions, explore potential standards and generally foster better collaboration between industry and academics. “Why academics?” – you say. “Isn’t the corporate world doing just fine?” Maybe so. However, it is my sense that in the rush to grab corporate market share and establish early competitive advantage, the industry as a whole might miss out on an opportunity to do things right. Let me explain.

Cloud computing represents an opportunity to usher in new levels of efficiency by breaking down existing computing silos and enabling transparency in terms of visibility and control to users within and across datacenters. It also provides an opportunity to design massively scalable and globally interoperable computing clouds without vendor lock-in. That said, the danger I foresee is that an industry increasingly driven by short-term thinking, due to the need to generate ROI for its investors in the near term, could end up with restrictive, non-scalable architectures that are only incremental improvements over what we have today. What is needed is some good old fashioned no-holds-barred architectural brainstorming to generate new breakthrough computing models. This is where I believe credible organizations like the IEEE can facilitate collaboration between industry R&D labs and academia and uniquely provide a great forum where they can work together, without short term ROI pressures, to propose, deliberate and lay out a solid foundation for the industry’s future.

Personally, this year’s workshop was memorable and refreshing for at least a couple of reasons.  For one thing, I had the opportunity to visit the Netherlands – a place I hadn’t visited before. Besides that, the conference offered me the chance to re-visit the world of academic research; a world that – having spent my entire career after graduate school in the corporate world, has now almost become alien to me.  What struck me was the intimate academic environment of the workshop,  entirely devoid of the corporate agendas and marketing hype that drive almost all conferences and events in the corporate world. I found it refreshing to be able to objectively discuss current trends in Cloud Computing in this setting with other participants from various university and industry organizations, who had assembled and really had no other axe to grind than to identify long-term research themes and to collaborate on future research in the field.  The topics and themes discussed there were very interesting. Some of the key topics included:

  1. Is the cloud just an XaaS stack or is there more to it?
  2. Is Unified Computing simply throwing servers, network and storage in blades?
  3. Are lessons from the past i.e. from telecommunications and from the Internet, relevant to enabling massively scalable and globally interoperable clouds?
  4. How can infrastructure hardware vendors accelerate cloud deployment by including service enabling features?
  5. Is the current trend in throwing multiple OSes inside the server and including another networking abstraction layer that bridges these OSes the right model or should we look at a new network centric operating system that allows dynamic composition of distributed physical computing resources based on latency tolerance of services consuming the logical resources?
  6. Can we accelerate the creation of Computing Clouds through fresh ideas such as the concept of a virtual infrastructure fabric, a management services fabric and a business services fabric?

The last topic, the concept of a “fabric” for virtual infrastructure, management and business services – essentially a SOA architecture for datacenter infrastructure,  intrigues me very much. Take a look at this paper by Pankaj Goyal if you are intrigued as well. Actually, all the papers from the conference along with an overview of the workshop are here.

Overall, some really neat stuff that had me wondering why IEEE didn’t do more with the conference in the US? Turns out the conference used to be held mostly in the US (see past conferences)…till 9/11 changed things. Apparently US visa restrictions made it hard for many to travel to the country and since then the conference has mostly lived in Europe. A shame really as I think that this IEEE workshop can be a great forum and contribute immensely to the larger discussion on cloud computing.  In addition, as much as I enjoyed the academic setting, I think that more exposure and participation from the corporate world and from the US would only serve to enrich the discussions here. I have no doubt that IEEE’s rigorous vetting and review process will ensure that the discussions are substantive and not just marketing fluff driven by corporate agendas that often pervade other industry conferences.

If you are interested, the 2010 workshop is scheduled to be held in Larissa, Greece. Who knows, maybe the next one after that will be in the US.

Welcome to Out-of-the-Box Computing

July 16th, 2009

This is, broadly speaking, a blog about Cloud Computing ... and heaven knows you need another one of those as much as fish need bicycles. So to be more specific, this is a blog that will aim not just to inform, but also to muse and provoke thought on the kinds of datacenter infrastructure and architectures that will be required to truly deliver on the promise of Cloud Computing.

OK. So what is “Out-of-the-Box Computing?”

The unpredictable demands of the Web 2.0 era, in combination with the desire to better utilize IT resources, are driving the need for a more flexible (elastic) IT infrastructure. This need for flexibility is about to fundamentally alter the datacenter landscape and transform the computer as we’ve known it. The computer in the Cloud Computing era will no longer be defined in terms of a physical enclosure (box) that has traditionally housed the processor, memory, storage and associated components that constitute the “Computer”. Instead, in the Cloud Computing era, the notion of a computer will be rather nebulous (ok, at least I didn’t say cloudy). Actually, I rather like the idea of an Infrastructure “Fabric” and it is a concept that we will revisit often in this blog. The Fabric could comprise processors, storage and potentially components such as memory that are distributed physically but organized and orchestrated on demand into a massively scalable and dynamic (elastic) pool of resources that is reliable, secure and optimally utilized end-to-end. Now that doesn’t sound like something you could shoehorn into a “Box”, does it?
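For the programmers in the audience, here is one very loose way to picture the Fabric idea in Python. This is a thought experiment, not a design: the `Resource` class, the `compose` function, the node names and the quantities are all hypothetical, and the sketch ignores everything hard (latency, security, failures). It only shows a “computer” being assembled on demand from pieces that live in different physical places.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    kind: str    # e.g. "cpu", "memory" or "storage"
    node: str    # where the resource physically lives
    units: int   # free units available on that node


def compose(pool, request):
    """Plan a logical computer from a physically distributed pool.

    Returns a mapping of kind -> list of (node, units) picks. The pool
    itself is not modified; this only produces a plan.
    """
    plan = {}
    for kind, needed in request.items():
        picks = []
        for res in pool:
            if needed == 0:
                break
            if res.kind != kind or res.units == 0:
                continue
            take = min(needed, res.units)
            picks.append((res.node, take))
            needed -= take
        if needed > 0:
            raise RuntimeError(f"fabric cannot supply enough {kind}")
        plan[kind] = picks
    return plan


# Hypothetical fabric spread over three racks.
pool = [Resource("cpu", "rack-a", 8),
        Resource("cpu", "rack-b", 8),
        Resource("storage", "rack-c", 500)]
plan = compose(pool, {"cpu": 12, "storage": 200})
print(plan)
# {'cpu': [('rack-a', 8), ('rack-b', 4)], 'storage': [('rack-c', 200)]}
```

Notice that the resulting “computer” has its processors in two racks and its storage in a third, so there is no single enclosure you could point to.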

The “Box” is history. It’s time for a new computing paradigm.