Kindura


Funded by: the JISC Flexible Service Delivery programme.

Lead Institution: King's College London.

Partner Organisations: The Science & Technology Facilities Council (STFC) and DuraSpace.

Key Words: Cloud and shared services.

 

This case study was produced along with the 'Improving Organisational Efficiency' infoKit.

 

Background

 

Aims and Objectives

 

Kindura sought to pilot the use of a hybrid cloud, shared service and in-house model for providing repository-focused services to researchers across its partner institutions. These were to include services for:

 

 

We carried out our pilot using DuraCloud, which has been developed by DuraSpace. We used DuraCloud to broker between storage or compute resources supplied by external cloud services, shared services, or in-house services. The Fedora repository, which interoperates easily with DuraCloud, provided the researcher front-end for managing research outputs and information. The data management infrastructure was based on the iRODS grid-based storage system to leverage its facilities for automatic replication and server-side data processing workflows.

 

All services were built to appear “cloud-like”, even internal ones—it is thus a hybrid approach that combines the advantages of the commercial/external cloud with an institutional/consortium cloud.

 

The pilot aimed to provide cloud or cloud-like services at several levels. It provided Infrastructure as a Service (IaaS) components, via storage and compute services, but more importantly it aimed to combine these, using DuraCloud and the Fedora repository as enabling technologies, into an integrated Software as a Service (SaaS) package of repository-centric services.

 

Project partners

 

The Kindura project was a collaboration between the Centre for e-Research (CeRch) at King's College London, the Science and Technology Facilities Council (STFC) and DuraSpace.

 

CeRch has a focus on developing Information and Communication Technology (ICT) solutions for supporting research activities across the College, including digital repository infrastructures and other research information systems.

 

STFC provides access to world-class experimental facilities, e.g. ISIS and the Diamond Light Source. They also have extensive expertise in managing scientific research data and information, in particular using grid storage systems such as iRODS, and data replication and migration (e.g. for the Large Hadron Collider).

 

CeRch and STFC previously collaborated on the ASPiS project, which worked on using iRODS for storage and management of research data.

 

DuraSpace is a not-for-profit organisation that specialises in open source technologies in the fields of digital repositories and clouds, with a particular focus on the needs of HE/FE, research centres, libraries and cultural heritage. The organisation was created in 2009 from a merger of Fedora Commons and the DSpace Foundation. DuraSpace provided technical consultancy on this project.

 

Context

 

King’s College London is a large research and teaching institution in the centre of London. Research is carried out across a wide range of disciplines including science, humanities and medicine. The Information Systems and Services (ISS) organisation provides centralised IT services to the University's departments. Since King’s College London was formed from an amalgamation of previously autonomous institutions, there has been an ongoing process of integrating IT functions from the member organisations.

 

For the purposes of this project we identified a number of target groups of researchers who make use of large datasets and would be particularly suited to the use of cloud computing. They came from a mix of subject areas including Environmental Science, Financial Mathematics, Biomedical sciences and Humanities. Once the technical and legal issues are fully understood and overcome, cloud storage is likely to be applied to a much wider group of researchers requiring the storage of large datasets.

 
Researchers are notoriously reluctant to spend time curating and archiving their research data. We drew upon our experiences in related JISC-funded projects such as BRIL and FISH.Net to help us consider how the process of archiving research data could be automated and integrated into research workflows. 

 

The scope of the project was to work with sample datasets from individual research teams in the target disciplines to evaluate the functionality and usability of the system as well as the technical capability in dealing with large collections of data. We planned to test the system's capability to archive both individual large files and large collections of small files. Such collections would typically be in the size range of 100 GB to 5 TB. It was not within the scope of this pilot to move the system into service within a given department. On the other hand, we intended to make the system as robust as possible, using production quality software components, in order to enable a rollout in a subsequent project.

 

The business case

 

Cloud technology is increasingly being used for flexible service provision; the large number of projects focusing on cloud technology, as well as the commercial service providers, show that this is a viable business strategy. There is a clear opportunity in the academic environment to increase flexibility in data management by storing data “in the cloud.” This combines elastic commercial or academic public clouds with internal archival storage in Kindura services. In turn, a Kindura service will enable an institution to make use of its own unused storage; replication of data within Kindura itself between federated iRODS services will enable some level of dynamic service provision.


As for the cost, all commercial cloud providers publish prices, and some (e.g. Microsoft Azure) provide “calculators” to compare the cost effectiveness of their services with that of in-house services.


Without picking any particular provider, a typical storage rate is about £0.06 per Gigabyte (GB) per month (down from about £0.10 in 2010), which for 1 Terabyte (TB) for a year is £720. On top of that, add the same amount in transfer costs (at least).
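
The arithmetic behind these figures can be sketched as follows. This is an illustration only: the rate is the typical one quoted above, the terabyte is taken as 1000 GB, and the transfer cost is assumed to equal the storage cost, which the text gives as a lower bound. The function names are hypothetical.

```python
# Illustrative figures only, taken from the paragraph above.
PENCE_PER_GB_MONTH = 6   # £0.06 per GB per month (typical rate quoted in the text)
GB_PER_TB = 1000         # decimal terabyte, as implied by the £720 figure


def annual_storage_pence(terabytes: int, pence_per_gb_month: int = PENCE_PER_GB_MONTH) -> int:
    """Storage-only cost, in pence, of holding `terabytes` for 12 months."""
    return terabytes * GB_PER_TB * pence_per_gb_month * 12


def annual_total_pence(terabytes: int) -> int:
    """Storage plus transfer, ASSUMING transfer costs equal storage costs."""
    return 2 * annual_storage_pence(terabytes)


storage = annual_storage_pence(1)
total = annual_total_pence(1)
print(f"Storage: £{storage / 100:.2f}, with transfer: £{total / 100:.2f}")
```

Working in whole pence avoids floating-point rounding in the multiplication. For 1 TB this gives £720 for storage alone and £1,440 once the assumed transfer costs are included.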


Meanwhile, the cost of provisioning in-house services (at the same availability as the cloud prices quoted above) is often underestimated—it is far more than “just buying a 1 TB disk for £75.” Depending on the infrastructure already available at the institution, professionally-run services with backups can be run by IT services, or by central data centres. The staffing, power, hardware, software, maintenance, backup and facilities costs all need to be taken into account in making a realistic comparison between in-house and cloud storage.

 

The elasticity provided by public clouds is difficult to replicate using in-house services. Provisioning of in-house or data centre-hosted storage often takes place over a period of months, requiring financial, administrative and technical processes to be completed. Cloud storage can be provisioned in a matter of hours to cope with sudden increases in demand, and decommissioned if it is no longer required.  


There is thus a good business case for being able to mix and match services, and an opportunity for cloud services to complement internal institutional resources and traditional data centres, to the benefit of all researchers in the UK.

 

Key drivers

 

Your Mission/Vision—why this issue is important to your institution

 

King’s College London has grown from the amalgamation of previously autonomous institutions. As a result, the IT infrastructure is fragmented. Cloud provision and shared services provide a unified structure. Firstly, they enable existing resources spread across a number of geographic locations to be combined into a single resource with a “cloud-like” interface. This will enable storage and computing resources to be used more efficiently, where previously systems might have been either overloaded or underutilised. Secondly, they provide a mechanism to flexibly extend storage and computing resources to meet varying demands by migrating data to external cloud infrastructure.

 

STFC’s mission statement calls for providing resources to support UK-funded research. As researchers increasingly explore new avenues for making use of cloud-based computation, and STFC already provides data archives for many areas of research, the outcome of Kindura fits with STFC’s strategic aims.

 

Departmental and/or institutional strategies (i.e. policy issues that you considered to be relevant)


King’s College London has identified the strategic goal of enabling the preservation of research outputs, including both documents and data, through the provision of repository services. Research publications and research data represent some of the key outputs of a research institution. Existing systems and processes that were primarily designed to manage the preservation of paper-based outputs are no longer fit for purpose. Many journals now require the research data supporting a publication to be retained, so that the results can subsequently be verified and the data can serve as a resource for other researchers.

 

Funding bodies such as the Engineering and Physical Sciences Research Council (EPSRC) are increasingly mandating that research data is retained for a period of 10 years or more as a condition of funding. The costs of retaining data outputs beyond the lifetime of research projects need to be met by the host institution, resulting in a critical requirement for reliable and cost-effective repository storage.

 

STFC’s e-Science centre runs the datastore as well as large scale computation resources.

 

Financial considerations


King's College London is in the second phase of a £1bn redevelopment programme which is transforming its estate. The strategy runs over a period of ten years. The pilot provided an opportunity to demonstrate a model for future infrastructure that makes best use of internal and external infrastructure in order to provide repository services to researchers.

 

Within the HE sector, there is growing pressure to increase fees charged to students in order to meet the funding shortfall from central government. As a consequence, students are becoming demanding customers of higher education services. Provision of a modern IT infrastructure to support both research and teaching is a key expectation, and is likely to influence the recruitment of high quality applicants.


Public cloud storage providers, such as Amazon, offer resources which, for small volumes of data (a few gigabytes), are a cost-effective alternative to in-house storage. For larger volumes of data, or “hot” data (data which is accessed or transferred frequently), traditional data centres become more cost effective, at the cost of losing the elasticity provided by the clouds. Enabling researchers to mix and match in-house, commercial, and data centre providers according to their needs and financial constraints will provide new flexibility in research data management. The use of third party cloud services raises many technical, legal and governance issues that need to be addressed to make this a viable solution.

 

Technical considerations


Continued expansion of IT infrastructure at King’s College London is difficult due to the nature of the buildings (many are listed) and the space, cost and power supply constraints of the central London location. A more cost-effective solution is therefore likely to be reliance on offsite data centres and pooled resources. King’s College London has already transferred some servers to the University of London Computer Centre (ULCC), a resource that is shared with other institutions in London.


Confidentiality and security of data is a major concern, particularly when outsourcing both data storage and computing to third parties. This applies especially to personal data, medical records, and information covered by Non-Disclosure Agreements (NDAs).


Continuity and reliability of third party services is an issue, since there would be large costs associated with a sustained IT outage.  Existing Service Level Agreements (SLAs) provided by individual cloud resource suppliers may not guarantee the quality of service required by a Higher Education Institution (HEI).


When flexible services are provided, interoperation and standards become extremely important; standards promote interoperation, and interoperation enables flexibility and prevents “lock-in” to a single provider. Although standards exist (or are emerging), e.g. OCCI for computing and resources, and CDMI for storage, these are not yet universally supported. Kindura, therefore, chose the pragmatic approach of having a single layer which knows how to talk to different providers, and to interface it to the data storage layer via a widely implemented interface.

 

Other factors

 

There is already demand from researchers for the provision of cloud services to support their work. Indeed, many researchers are independently making use of cloud resources. Accounting for this usage is difficult since it is typically paid for by credit card. Use of cloud resources by researchers on a case-by-case basis does not represent the most efficient use of funding, since the unit costs of cloud resources fall as the volume of data and compute resources increases. Independent use of cloud resources also makes it difficult to provide governance for usage of such resources, and may result in unacceptable data loss or in the security of personal data, such as medical records, being compromised.


Nonetheless, the take-up of cloud-based services by individual researchers needs to be considered. Despite the hype, clouds are not easily accessible to some researchers, who are likely to be more interested in research than in configuring and deploying IaaS. In Kindura, two development activities improve usability. First, the brokered approach to storage and repository services provides a friendly front-end, which insulates the user from having to deal directly with storage infrastructure providers. Secondly, to the extent that we can automate the data management behind the scenes (e.g. collect files which have not been accessed for a while for archiving), usability is improved because researchers do not need to micro-manage their individual datasets.
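
As an illustration of the kind of behind-the-scenes automation mentioned above, the following minimal sketch lists files that have not been accessed for a given number of days. It is not the Kindura implementation: the function name, the 90-day threshold in the usage note and the reliance on file system access times are all assumptions made for the example.

```python
import time
from pathlib import Path


def find_archive_candidates(root, max_idle_days, now=None):
    """Return files under `root` whose last access is older than `max_idle_days`.

    Hypothetical helper: relies on the file system recording access times
    (st_atime), which may be disabled or coarse on some mounts.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # seconds in a day
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.stat().st_atime < cutoff
    )
```

A call such as find_archive_candidates("/data/project", 90) would return the files idle for more than 90 days, which a scheduled job could then hand to the archive. On file systems mounted with access-time updates disabled, a different signal (for example modification time, or repository access logs) would be needed.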


Flexibility and scalability of storage and computing resources are an issue for research disciplines requiring intermittent use of high powered computing resources or the storage of very large datasets. These cannot be easily planned for or supported by existing IT infrastructure. Kindura planned to deliver the archival storage based on iRODS services. Since iRODS servers can be federated and can manage their own data policies, we planned to replicate data between service providers. This provides some level of dynamic service provision in that the data is available as long as at least one provider, who holds it, is present in the infrastructure. Within Kindura, we provide automatic replication between King's College London and STFC (and the STFC provider will have backups on tape). So for a pilot service, we are already providing a good assurance of data availability.
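
The availability condition described above (data remains available as long as at least one provider holding a replica is present) can be expressed as a simple check. This is a sketch only: the provider and dataset names are hypothetical, and real iRODS replication is configured through iRODS rules rather than code like this.

```python
def is_available(replica_holders, online_providers):
    """True if at least one provider holding a replica is currently online."""
    return bool(set(replica_holders) & set(online_providers))


def at_risk(replicas, online_providers):
    """Datasets with exactly one online replica (one further failure loses access)."""
    online = set(online_providers)
    return sorted(
        name for name, holders in replicas.items()
        if len(set(holders) & online) == 1
    )


# Hypothetical replica map: which providers hold each dataset.
replicas = {
    "dataset-a": {"kcl-irods", "stfc-irods"},
    "dataset-b": {"kcl-irods"},
}

print(is_available(replicas["dataset-a"], {"stfc-irods"}))  # one holder is online
print(at_risk(replicas, {"kcl-irods", "stfc-irods"}))       # datasets with a single online copy
```

With both providers online, only the unreplicated dataset is flagged as at risk, which is the situation that automatic replication between the two sites is intended to avoid.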


This replication is similar to that provided by commercial cloud storage providers, with the notable exception that we know where the data is located. A commercial provider may have to make use of services elsewhere, e.g. in other countries, and cannot usually guarantee that sensitive data does not leave the country or region. Strict control over data placement is another advantage of Kindura-based services over many commercial providers.

 

We did not expect integration with the NGS to be difficult, but knew it would nonetheless require some thought. The purpose of this integration was to enable data held in Kindura to be migrated to NGS cloud resources for analysis. We planned to demonstrate this interaction through this project.

 

Establishing and maintaining senior management buy-in

 

At King's College London, Kindura worked closely with an initiative in ISS that was working on the specification and development of private cloud infrastructure.  The Kindura project provided a testing ground for investigating how private clouds, external clouds, grid and internal resources could be used to provide integrated storage services. In particular we considered user requirements, architecture, technology solutions, cost-benefit analysis and user evaluation. The outputs of the project were regularly fed back to the ISS leadership team.

 

STFC runs a multi-petabyte (PB) datastore, currently with 20-40 PB of tape capacity and 10 PB of disk. This datastore manages data for space science, the Diamond synchrotron, STFC’s instruments and facilities, other research councils, as well as the CERN Large Hadron Collider. Cost-effective management and analysis of data and metadata is an essential part of STFC’s services.

Kindura worked closely with target groups of researchers who have a requirement for flexible storage and computing facilities. At King’s College London, the researchers included groups in biophysics and financial mathematics. At STFC, we gathered requirements from environmental researchers working with the US-based Earth System Grid. Cloud technology and virtualisation technology are increasingly being used to provide resources for researchers, including by the NGS, although the NGS currently has no cloud storage activity. By gaining a clear understanding of the needs and concerns raised by researchers at all stages of the project, and by gaining the buy-in of the relevant departments, we can build a strong case to senior management.

 

Technologies used

 

The Kindura project made use of DuraCloud software, an open source Java application being developed by DuraSpace. DuraCloud is being used to provide a common “cloud-like” interface to storage and computing facilities.


To provide a pilot infrastructure, we made use of the iRODS storage system available at STFC and King’s College London, which was developed during the JISC-funded ASPiS project. We also investigated the use of Eucalyptus for the creation of a private cloud infrastructure.


We built a policy management layer on DuraCloud that enabled us to perform a brokerage across the available storage providers.
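
To give a flavour of what policy-based brokerage across storage providers might look like, here is a minimal sketch. It does not reproduce the Kindura policy layer or any DuraCloud API: the classes, the provider names, the prices and the rule that sensitive data must stay on internal UK storage are all hypothetical, chosen to reflect the concerns about cost and data location discussed earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    in_uk: bool               # is the data guaranteed to stay in the UK?
    external: bool            # third-party (commercial) provider?
    pence_per_gb_month: int   # made-up prices, for illustration only


@dataclass(frozen=True)
class Dataset:
    name: str
    sensitive: bool
    copies: int = 2           # number of replicas required


def choose_providers(dataset, providers):
    """Pick the cheapest eligible providers for a dataset.

    Hypothetical rule: sensitive data may only go to internal, UK-located storage.
    """
    eligible = [
        p for p in providers
        if not dataset.sensitive or (p.in_uk and not p.external)
    ]
    if len(eligible) < dataset.copies:
        raise ValueError(f"only {len(eligible)} eligible providers for {dataset.name}")
    ranked = sorted(eligible, key=lambda p: (p.pence_per_gb_month, p.name))
    return ranked[: dataset.copies]


PROVIDERS = [
    Provider("kcl-irods", in_uk=True, external=False, pence_per_gb_month=8),
    Provider("stfc-irods", in_uk=True, external=False, pence_per_gb_month=7),
    Provider("public-cloud", in_uk=False, external=True, pence_per_gb_month=6),
]

for ds in (Dataset("medical-records", sensitive=True), Dataset("climate-model", sensitive=False)):
    print(ds.name, "->", [p.name for p in choose_providers(ds, PROVIDERS)])
```

In this sketch the sensitive dataset is placed only on the two internal services, while the non-sensitive one is allowed to use the cheaper external provider. A real deployment would draw such rules from each institution's own policies.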


The Fedora Commons repository was integrated with DuraCloud to provide services for the archival of research data.

 

Outcomes

 

Achievements

 

Pilot system

 

 

Specific outputs

 

 

Cost and efficiency

 

 

Benefits

 

Tangible

 

Cost and efficiency benefits

 

 

Management and governance

 

 

New and improved capabilities

 

 

New skills

 

 

Intangible

 

The main anticipated benefit was that this project would contribute to improved archival practices of researchers at the partner institutions by the provision of central repository infrastructure. The project also aimed to demystify the use of commodity cloud infrastructure by providing a practical and usable solution.

 

 

Drawbacks

 

The concept of providing a centralised repository and storage services has great potential benefits for improving efficiency, standardising processes and realising cost savings. However, replicating and migrating data across different storage locations has the disadvantage of generating additional network traffic. Many migration operations can be performed overnight to reduce the impact on users. When moving larger datasets, it is necessary to consider how this can be scheduled to avoid causing network issues and to ensure that data is available at the required location at a specific time. As the size of datasets increases, improvements and upgrades to storage capacity should be considered in conjunction with network infrastructure upgrades. Further cost-benefit analysis may be required to determine the tradeoffs between network infrastructure upgrades and storage flexibility.
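
A rough estimate of transfer time helps when deciding whether a migration fits into an overnight window. The sketch below uses the dataset sizes mentioned in the project scope (100 GB to 5 TB). The 1 Gbit/s link speed, the 8-hour window and the assumption of a fully available link are illustrative values, not measurements from the project.

```python
def transfer_hours(size_tb: float, link_gbit_per_s: float, efficiency: float = 1.0) -> float:
    """Hours needed to move `size_tb` decimal terabytes over a link.

    `efficiency` is the assumed fraction of the nominal link speed actually
    achieved (1.0 means the whole link is available, which is optimistic).
    """
    bits = size_tb * 1e12 * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


def fits_overnight(size_tb: float, link_gbit_per_s: float, window_hours: float = 8.0) -> bool:
    """True if the transfer completes within the (assumed) overnight window."""
    return transfer_hours(size_tb, link_gbit_per_s) <= window_hours


for size in (0.1, 5.0):  # 100 GB and 5 TB
    print(f"{size} TB: {transfer_hours(size, 1.0):.1f} h, overnight: {fits_overnight(size, 1.0)}")
```

Under these assumptions a 100 GB collection moves in well under an hour, whereas a 5 TB collection needs about 11 hours and would overrun an 8-hour window, which illustrates why larger migrations need explicit scheduling or network upgrades.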

 

A hybrid cloud solution is inherently more complex than a purely cloud-based model in which all storage is outsourced. The hybrid model results in additional costs in managing both internal and external resources and in determining which content may be moved to external cloud providers. Given the current state of the cloud infrastructure market, we believe that these costs are justified in the short to medium term. However, in the longer term, it may prove feasible to adopt a purely outsourced storage model.

 

Key Lessons

 

 

Looking Ahead

 

Kindura was a pilot for hybrid cloud repository storage. We expect the project to be taken forward in a number of ways.

 

 

In its current form, Kindura provides a pilot repository platform for investigating and evaluating technologies. Further steps are required to move from a pilot to a production environment. These include:

 

 

Kindura was conceived as a pilot for increasing the capacity and flexibility of in-house storage using a hybrid cloud approach. Deploying Kindura as a shared service presents a number of challenges. There are outstanding technical issues with running DuraCloud and the Fedora repository in the cloud, which DuraSpace are currently addressing. Further, there would need to be a common approach to classification of content and storage brokerage across institutions, as well as a pooling of SLAs, to make this feasible.

 

The Centre for e-Research at King’s College London plans to carry out further projects on cloud computing infrastructure for use in research as well as digital repositories. This includes both work on future pilots and moving Kindura technologies into production environments. The knowledge gained in Kindura therefore makes a contribution to our ongoing programmes. The use cases developed during the Kindura project, and the community of researchers engaged in computationally and data-intensive research that has been built up around it, are likely to be of ongoing value.

 

The Kindura project has generated a great deal of interest amongst STFC researchers and collaborators, mainly in the technology. There are opportunities for collaboration with King's College London on further development as well as with other data centres who are running iRODS.

 

Sustainability

 

The existing infrastructure will be made available as open source for other institutions to download and evaluate. Kindura is built from a set of open source components that need to be downloaded and installed. Each institution needs to integrate DuraCloud with its own storage providers. Plug-ins for the large public cloud providers (Amazon, Azure, Rackspace) are available out of the box, as is a plug-in for iRODS. iRODS is available as an open source download. Integration of other storage providers may require additional development. Each institution needs to configure its own business rules for storage brokerage, depending on its specific internal policies. Sample rules developed in the project are provided as a guide.

 

Summary and reflection

 

Kindura has demonstrated that a hybrid cloud is a viable solution for HE institutions wishing to increase the flexibility of their repository storage provision by integrating existing storage with commercial cloud services and grid-based technologies such as iRODS.

 

Appendix

 

Project Website

 

http://kindura.cerch.kcl.ac.uk/