University of Glasgow - Business Intelligence

Engage - Using Data about Research Clusters to Enhance Collaboration

 

Case study written October 2012.

 


Background 

 

Aims and Objectives

 

Engage is a project born out of both top-down and bottom-up demand for classifying research efforts and expertise in a more sophisticated way than simple alignment with disciplines and organisational units. Its principal aims are twofold: firstly, to develop a prototype system for researchers, administrators and the wider world of HE, FE and business to define, apply, compare and visualise themes and associated relationships; and secondly, to present a compelling business and technical case for the integration of its demonstrated functionality into core systems at the University of Glasgow.

 

Business Case and Context

 

Our University strategic plan requires us to identify and deliver research that meets funder priorities, fosters inter-disciplinarity, and makes our achievements more widely available to a range of stakeholders including the public, and potential collaborators in business and academia.  It has a focus on minimising duplication of effort and maximising sharing of information and resources. Engage seeks to meet these requirements by bringing disciplines and researchers together via the use of non-discipline specific classifications, and providing comprehensible categories of research to facilitate publicity and broader collaboration. 

 

Key Drivers

 

  • Facilitating flexible collaboration across and independent of disciplines;
  • Facilitating discovery of research activity and outputs; 
  • Establishing agreed terminological bases for research themes;
  • Addressing inefficiencies in gathering information about research themes;
  • Complementing traditional organisational classifications;
  • Evaluating researcher attitudes to research theme knowledge engineering and exploitation.

 

Project approach

 

  • Analysis of research activities
    • Establish core focus group
    • Identify themes that should be defined 
    • Explore mechanics of their construction
    • Consider how to manage granularity/customisability 
    • Explore potential applications
  • Develop taxonomy(ies) for research theme clustering
  • Consider how one might embed taxonomy(ies) in core university systems
  • Conceive visualisers/applications for data exploitation
  • Evaluate in association with focus group

 

Scoping and Requirements Analysis 

 

In high level terms the project operated with the following self-imposed scoping constraints:

 

  • Develop a prototype for information aggregation and visualisation;
  • Explore the use of extracted keywords and methodological classifications in describing research;
  • Focus in the first instance on researchers in Arts and Social Sciences communities, but with a view to establishing a proposed model that can be rolled out across every discipline.

 

When first convened our focus group (comprising researchers and research administrators) were encouraged to consider the following questions in order to define the scope of our prototype application:

 

  • How would you like your research to be thematically described? (Flexibly/Modularly?; Aligned to pre-defined taxonomy?)
  • What value do themes offer for starting, undertaking, funding or supporting research?
  • What software functionality would you like to see?
  • How could the information be best presented?

 

The issue of themes and their association with corresponding descriptors was our main priority. To reflect the dynamic qualities of the research landscape themes would be fluid groupings of relevant keywords or metadata, with additional structure provided by more hierarchical disciplinary classifications. AHRC, GU College of Arts or ‘Emerging’ themes would be defined by grouping relevant keywords and aligning with relevant academic disciplines. Prescribed themes could therefore be mapped to investigators via shared keywords or discipline, illustrating strategic alignment and facilitating research administration. Furthermore, researchers would be able to identify benefits in defining their own ‘personal’ themes as the basis of their own collaboration activities. Themes can be thought of as similar to custom or saved searches from systems like Research Professional. 
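
As a rough illustration of the mapping described above, the sketch below treats a theme as a named set of keywords plus optional disciplines, and matches it to investigators through shared keywords or discipline. It is written in Python for brevity (the prototype itself was built in PHP), and every class name, function name and data value in it is hypothetical rather than taken from the Engage code.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    """A theme or a researcher: both are described in the same way."""
    name: str
    keywords: set = field(default_factory=set)
    disciplines: set = field(default_factory=set)


def matches(theme, researcher):
    """Return the shared keywords and disciplines, or None if there are none."""
    shared_keywords = theme.keywords & researcher.keywords
    shared_disciplines = theme.disciplines & researcher.disciplines
    if not shared_keywords and not shared_disciplines:
        return None
    return {"keywords": shared_keywords, "disciplines": shared_disciplines}


# Hypothetical example data, not taken from the Engage dataset.
theme = Profile("Urban waterways", {"rivers", "flooding"}, {"Geography"})
researcher_a = Profile("Researcher A", {"rivers", "shipbuilding"}, {"History"})
researcher_b = Profile("Researcher B", {"medieval poetry"}, {"Literature"})

print(matches(theme, researcher_a))  # shares the keyword "rivers"
print(matches(theme, researcher_b))  # None: nothing in common
```

Because themes and researchers share one structure, a researcher's own 'personal' theme could be compared with a prescribed theme using the same function.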

 

Our scope extended beyond alignments between individuals and themes, to support classification of a range of entities. For example, research projects, activities, publications or events could be tagged in the same way and their relevance to other entities within the system expressed through visualisation. Researchers could define their interests and keywords and receive notification of forthcoming events covering similar topics. 

 

Research has an important temporal dimension too, and our requirements analysis assumed that it was not only current research activity that may be of interest in the pursuit of collaboration opportunities. Both past and speculative research may provide valuable indications of similarities and compatibility. Impact requirements in particular demand an institutional awareness of research that may have been undertaken several years ago. Therefore it was important to be able to express changing interests over time and to manage themes to ensure their currency. To further support this and ensure they are represented correctly, we have a requirement for users to be able to manipulate weightings of their own descriptors. In terms of how such descriptors are initially assigned, workshop participants were asked whether they would prefer full control, or to choose from predefined keywords. Most favoured complete discretion, but we were conscious of the potential pitfalls: most notably, we might end up with few instances of common keywords, with users selecting terms highly specialised to their own domain. As described below, we decided to automate the process by mining publications and other textual content associated with each researcher. This satisfied our requirement for expressiveness and ensured that the collected keywords could be more usefully related.

 

While disciplinary classifications would inform our themes, relying on them alone faced constraints, given that they tend to enforce traditional disciplinary silos. A key goal of Engage has been to support the evolution and identification of new inter- and trans-disciplinary research opportunities. Our intention was to ensure that arbitrary keywords could illustrate relationships that transcended traditional groupings. Taking an example keyword as simple as "rivers" we might anticipate its appearance in research classifications from domains as diverse as geology, bacteriology, town planning, conflict history and naval engineering. Specialisations of that keyword (e.g., "River Clyde") may reveal common activity with opportunities for collaboration that may not have occurred to participating researchers. The addition of topical keywords enhances existing classification and points out links that might not be explicit. Some of the other types of relevant descriptors we had in mind when scoping the project included:

 

  • Temporal characteristics - when was the work done, or what period was being studied?
  • Geographical characteristics - where was the work done, or what place was being studied?
  • Methodological characteristics - what research methods were used?
  • Sources - what datasets were used or publications cited? 
  • Other topical descriptors (e.g. discipline/focus/subject/name)

 

The specific mechanism used to define and manage descriptors was given only limited coverage during initial scoping discussions. An ontology or taxonomy approach was considered to be viable. Important functionality requirements included support for relational definitions and fuzzy matching.

 

In terms of applications, within scope for the prototype was functionality for mapping research activities, matchmaking researchers with shared descriptors, a PhD supervision recommender tool and a suite of visualisations expressing relationships. Ideas that would not be implemented but were borne in mind throughout the project included a mailing interface whereby communications could be filtered according to research descriptors and tools to map researcher skills and interests to explicit funder calls. 

 

A critical requirement was that the system should be user friendly so that researchers could manage their profile, to ensure that the core dataset could be populated by more than just existing centrally administered data (e.g. projects, applications, and awards in the research support system). However, we also considered that, where appropriate, information should be collected in core systems for re-use, and therefore any functionality to allow researchers to manage their profile should be built with this in mind. In terms of incentivising researchers, it was pointed out that users would yield better matches with greater levels of self-description, and this was broadly accepted by our stakeholder group.

 

Governance 

 

The project team met on a fortnightly basis, with work reported via the project wiki and blog, and through telephone conversations with steering members at the University of Glasgow and at partner organisations Glasgow Life and the University of Aberystwyth. The project was managed using PRINCE2 principles and the project sponsor, Professor Steve Beaumont, was updated as appropriate.

 

Technologies & Standards Used

 

  • PHP 5
  • MySQL 5
  • jQuery 1.7 & jQuery UI 1.8
  • XHTML as far as possible, with CSS 2 for styling
  • Google Chart API for Venn diagrams
  • Google Maps API for geo-spatial visualisations

 

Data was extracted from the ePrints-based institutional repository (primarily by interrogating the underlying MySQL database, but also using RDF export) and from the research support system's BIQuery interface. We also did some testing on the Oracle-driven HR system.

 

General Technical Comments

 

  • Text extraction (from PDF) was less trivial than expected (Decoding streams, dealing with odd characters, etc.) 
  • Search and comparison algorithms have been improved by incorporation of stemming and fuzzy search

 

Stemming

 

  • We used a version of the Porter stemming algorithm to suggest keywords from publications and project descriptions. It proved more useful for conflating search results, so that more relevant results are presented to the user even where text strings do not match exactly.
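
To show what conflation by stemming looks like in practice, here is a minimal Python sketch. It is not the Porter algorithm and not the project's PHP code: it swaps in a deliberately crude suffix stripper so that the example stays short and self-contained. The suffix list and function names are assumptions made for illustration.

```python
from collections import defaultdict

# Deliberately small suffix list (an assumption for this sketch). The real
# Porter algorithm applies several ordered rule phases with conditions on
# the remaining stem, which this stand-in does not attempt.
SUFFIXES = ("ations", "ation", "ings", "ing", "ies", "es", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflate(terms):
    """Group terms that share a stem so they can be searched as one."""
    groups = defaultdict(list)
    for term in terms:
        groups[crude_stem(term)].append(term)
    return dict(groups)


print(conflate(["river", "rivers", "mapping", "mapped", "maps"]))
```

With this stand-in, "river" and "rivers" fall into one group and "mapping" and "mapped" into another, while "maps" is left on its own. A full Porter implementation also removes the doubled consonant, so it would be expected to bring all three "map" forms together.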

 

Fuzzy Search

 

  • We experimented with an implementation of Jaro–Winkler distance and with PHP’s built-in similar_text function, but chose Levenshtein distance after extensive testing with data revealed that it provided the most complete and relevant results. Following testing we configured the native PHP function with additional parameters for acceptable distances. Fuzzy search proved quite useful as another conflation tool, working behind the scenes to dynamically build profiles.
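
The following Python sketch illustrates the general idea of fuzzy matching with Levenshtein distance and a cut-off for acceptable distances. It is an illustration only: the prototype used PHP, the threshold of 2 is an assumed value (the case study does not state the distances that were configured), and the function names and sample keywords are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query, candidates, max_distance=2):
    """Return candidates within max_distance edits of the query, closest first.

    max_distance=2 is an assumed threshold chosen for this example.
    """
    query = query.lower()
    scored = [(levenshtein(query, c.lower()), c) for c in candidates]
    return [c for distance, c in sorted(scored) if distance <= max_distance]


# Hypothetical keyword list.
keywords = ["archaeology", "archeology", "architecture", "anthropology"]
print(fuzzy_matches("archaeology", keywords))
```

Here the variant spelling "archeology" is one edit away from the query and is therefore returned alongside the exact match, while the unrelated terms fall outside the threshold.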

 

Establishing and maintaining senior management buy-in 

 

The project was a direct response to strategic requirements for better mechanisms for research classification and clustering.  The project was supported by the Vice-Principal for Research and Enterprise. A formal proposal for the absorption of research themes into the existing suite of connected University systems will follow.

 

Outcomes

 

At the start of the project we rated the current maturity level of the business information held in relation to research themes as level 2 on the JISC InfoNet BI Maturity conceptual framework. Although data about projects and outputs were stored in centrally managed systems, the actual research clusters were not. Analysis of research clusters was done by manually compiling a list of relevant staff and extracting data. The prototype tool we have developed raises our standard closer to a stage 4 rating. Soft outcomes have included greater understanding between various support units, which is a necessary precursor to more concrete system integration. Our tool is demonstrably capable of informing strategic decision making and research development, and offers us a language to more effectively communicate our research strengths and synergies (something currently being exploited by the JISC Encapsulate project). More details about perceived BI maturity are available in the Impact Assessment appendix to this case study.

 

  • We aimed to drive process improvement by reducing the burden of identifying common research interests in a more intelligent and automated way
  • We used some innovative technologies and are pursuing these in partnership with suppliers (e.g. ePrints) and data suppliers (e.g. other HEIs involved in the Encapsulate project, and internal systems custodians)
  • We have considered change management and have suggested that re-visiting the business case and explaining the added value are essential if Senior Management wish to make a robust case, both financially and in terms of user buy-in, for implementing our ideas live in core systems.
  • The project was based on re-using existing internal data in a more sophisticated way to help address requirements stated in our institutional strategy.  There was on-going engagement with data suppliers from a range of core systems.
  • There were data quality and definition aspects to the project.  For example we considered how data could be maintained with relevant topics attached to entities and where data management responsibilities might be placed.
  • Within the specific area of research themes we could be considered a maturity exemplar.  We are not aware of many other organisations that are considering how to incorporate detailed research themes with staff, project, and output data and utilise this for improved discovery of research related information.

 

 

 

A research profile in Engage, which draws on data from the research support system (projects) and the institutional repository, Enlighten (publications). Keyword descriptors assigned to the researcher (by the researcher themselves, or by the Engage system) may be used to characterise the researcher’s work and, using an algorithm developed by the Engage team, identify fellow researchers with similar interests. Visualisations such as the Venn diagram shown here help to express the degree to which researchers’ interests might match. Using keywords and other descriptors (which may be automatically generated from data obtained from core systems such as Enlighten or the research support system) we will be able to identify emerging themes, and create new research themes based on patterns observed in the Engage data and on themes identified by other parties such as funders and governmental bodies.

 

 

One of the most commonly-requested visualisations is the tag cloud. This example shows the keywords used to describe researchers’ work, with Scottish Studies being the most common keyword. As one would expect, each keyword in the tag cloud is clickable, allowing the user to ‘drill down’ and find out which researchers (and, potentially, projects or funding bodies) have been tagged with a particular keyword. The test data being used for Engage comes from the College of Arts.
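
A tag cloud of this kind can be produced by counting how often each keyword is assigned and scaling the display size accordingly. The Python sketch below shows one simple way to do this. It makes no claim about how the Engage prototype renders its clouds: the linear scaling, the font size range and the sample data are all assumptions.

```python
from collections import Counter


def tag_cloud(profiles, min_size=10, max_size=32):
    """Map each keyword to a font size scaled linearly by how often it is used."""
    counts = Counter(k for keywords in profiles.values() for k in keywords)
    if not counts:
        return {}
    low, high = min(counts.values()), max(counts.values())
    span = (high - low) or 1  # avoid dividing by zero when all counts are equal
    return {
        keyword: round(min_size + (count - low) * (max_size - min_size) / span)
        for keyword, count in counts.items()
    }


# Hypothetical sample data: researcher -> assigned keywords.
profiles = {
    "Researcher A": ["Scottish Studies", "Gaelic"],
    "Researcher B": ["Scottish Studies", "Archaeology"],
    "Researcher C": ["Scottish Studies", "Gaelic"],
}
print(tag_cloud(profiles))
```

The most common keyword receives the largest size and the least common the smallest. A 'drill down' view would then list the researchers whose keyword lists contain the clicked term.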

 

 

The research map is envisaged as an easy-to-navigate hierarchical structure that draws on live (or almost live, from an overnight copy) data from sources such as the research support system and institutional repository to show what research is being carried out where in the University (the example here again draws on our Arts-based sample data). The data shown on the map is linked to the other visualisations and tools within the Engage system, with the aim of encouraging collaborations both within and without the University.

 

Currently, the Engage prototype can suggest keywords based on researchers' publications. A similar algorithm is applied to research projects held in the research support system.
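
The case study does not describe the suggestion algorithm in detail, so the Python sketch below only illustrates one common starting point: counting the most frequent non-trivial terms across the text associated with a researcher. The stop-word list, the minimum word length and the sample titles are all assumptions, and the function name is hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real system would need a fuller one.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "for", "with"}


def suggest_keywords(texts, limit=3):
    """Return the most frequent non-stop-word terms across a set of texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]


# Hypothetical publication titles.
titles = [
    "Shipbuilding on the River Clyde",
    "Labour and shipbuilding in Glasgow",
    "The Clyde and the city of Glasgow",
]
print(suggest_keywords(titles))
```

In practice the raw terms would be passed through stemming and fuzzy matching before being offered to the researcher, so that variants of the same word are counted together.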

 

 

Data for the Engage pilot were not generally pulled directly from the core systems. Instead, SQL dumps were imported into a local MySQL database that mirrors the structure of each core database. 

 

The Engage project has also undertaken some exploratory work with other potential stakeholders, such as the College of Arts Internationalisation office. As part of the College's recent international focus, there was a desire to be able to show our overseas research activities on a map. The result has supplied Engage with another interesting visualisation, and provided a trial run at using the Research Councils' hierarchical classification scheme to characterise our research. Modest internal testing of the map system suggests that using this research classification system makes sense to researchers, most of whom are cognisant of the value in aligning our work with the language used by those who fund us.

 

Perhaps our greatest achievement to date is the establishment of an internal discussion involving a range of University constituencies that share a common interest in the pursuit of research excellence, and maximising opportunities for this to be achieved. Engage’s conversation with units including Corporate Communications, Research and Enterprise, Human Resources, the custodians of the institutional repository and a number of individual research communities has illustrated a widespread demand for information about research themes, outputs, contributions and expertise. Their respective priorities and expectations are being realised in a web-based prototype application, which is the physical manifestation of this discussion. Our final proposal will recommend enhanced interconnectivity between respective systems, making conceptual joins increasingly tangible. 

 

In terms of benefits, we think the Engage project has the capacity to be tremendously useful. Participating researchers will be able to classify their expertise/profile (useful for self-promotion/personal home page applications), find and evaluate potential collaborators based on keywords, discipline, methodologies, common data sources or other facets. Research administrators will be able to meet strategic/funder research priorities and identify thematic trends in research that may be indicative of widespread impact or of silos of related activity that require only a nudge to effect interdisciplinarity. Data and publications managers are expected to be able to better classify and facilitate sharing of outputs using the implicit classifications. These benefits are expected to expand beyond the prototype stage as more data is extracted from core systems and made available via the institutional website to stakeholders. 

 

 

Research themes can be defined according to a range of descriptors, and we can easily add additional facets to enrich them. Themes are defined in the same way as researchers or outputs and therefore alignment and comparison is straightforward. For our prototype, we looked at a number of means of controlling the categorisation of research activity before finally settling on a classification scheme used by the UK Research Councils. The idea was that we should attempt to impose some formal, broadly recognisable structure on our research themes, to facilitate discussion and to provide a cross-disciplinary vocabulary that would allow us to make meaningful comparisons. As discussed previously, this formal structure will be supplemented by allowing for much less rigidly-controlled keywords, drawn from researcher publications and assigned by researchers themselves.

 

The rationale for choosing the Research Councils’ classification scheme related to the fact that it was:

 

  • Cross-disciplinary, meaning the Arts were as well represented as the Sciences, etc. 
  • Hierarchical, potentially allowing us to make more intuitive associations between related research 
  • Tied to a large percentage of funding, meaning that it was difficult to argue with its relevance

 

In addition, it should be noted that if we had discovered an alternative classification scheme already in widespread use across projects or institutions, this would likely have become our front-runner. However, our search for a de facto standard classification – which also satisfied the above requirements – did not uncover any obvious contenders. Meanwhile, the RCUK scheme has already proven popular with our sample stakeholders, not least because it can be mapped directly to our funders’ own classification (it is already employed on the Je-S system used to apply for funding from the AHRC, for example).

 

 

Relating entities within the system (clustering) enables groups of themes, researchers, outputs or projects to share a significance and makes further relationships easy to trace.

 

A "wizard" enables the definition of new themes based on existing entities within the system. Therefore, if a researcher appears representative of a new or high priority research area, their profile can be cloned and reclassified as a generic theme, which can in turn be related to other entities.

 

 

 

 

 

Available visualisations include tag clouds, Venn diagrams (illustrating unions, intersections and differences between researchers), connections, and motion charts, which display researchers' changing research keywords and, by extension, interests along a temporal axis.
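
The set relationships that a two-circle Venn diagram displays can be computed directly from two researchers' keyword sets. The Python sketch below is an illustration only. The similarity score it reports is the Jaccard index, which is an assumption chosen for simplicity: the matching algorithm developed by the Engage team is not described in this case study. The function name and sample data are hypothetical.

```python
def compare_interests(first, second):
    """Set relationships that a two-circle Venn diagram would display."""
    a, b = set(first), set(second)
    union = a | b
    return {
        "shared": a & b,
        "only_first": a - b,
        "only_second": b - a,
        # Jaccard index: shared keywords as a fraction of all keywords.
        "similarity": len(a & b) / len(union) if union else 0.0,
    }


# Hypothetical keyword sets for two researchers.
result = compare_interests(
    {"rivers", "flooding", "town planning"},
    {"rivers", "naval engineering"},
)
print(result["shared"], result["similarity"])
```

In this example the two researchers share one keyword out of four distinct keywords in total, giving a similarity of 0.25.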

 

For further information about the visualisations visit: http://researchclusters.wordpress.com/2012/07/31/feedback-on-the-engage-prototype/

 

Benefits (tangible)

 

  • Prototype system illustrates viability of research classification / comparison
  • Visualisation tools demonstrably assist in the identification and interpretation of research clusters 
  • Prototype applications provide evidence of value of fundamental use cases

 

Benefits (intangible)

 

  • Improved integration between core systems via the research description system
  • Improved culture of research awareness 

 

Key lessons learned

 

We have established a greater understanding of the relationships and potential relationships between roles, systems and data that contribute to our institutional perspective of what research is. How we classify our research activities has a number of implications, including the following:

 

  • evolution of new research opportunities (how new research partnerships form);
  • dissemination of research outcomes, in terms of how we present a cohesive and compelling account of research and
  • identification of areas of research strength.

 

We have quickly realised that there are many ways in which different groups would like to describe research activities for a range of purposes. Some of the descriptors we've included in our pilot implementation are RCUK theme classifiers (broadly corresponding to traditional domain categories), extracted key words and key phrases (from publications and other research-related literature) and associated research methodologies. To these we have proposed the addition of others such as expressions and measures of impact (the work done by the MICE project within JISC's RIM programme has provided a useful taxonomy for impact which we've been able to leverage) and common sources (including shared data, publication citations and potentially instrumentation or equipment) to a production system. Our intention has been to seek to transcend traditional means of information categorisation, and discussions with a wide array of contributors have revealed a range of use cases, from helping to populate researcher web pages/profiles to facilitating interdisciplinary access to research outputs.

 

We have also learned lessons that have propelled forward other existing or forthcoming project activities. Given the rich range of use cases that has emerged we were relatively unsurprised that when looking at other ongoing projects within the University we were able to identify some very complementary overlaps and some really fascinating opportunities for further development. For instance, the JISC MRD programme's C4D project is exploring mechanisms for classifying data for widespread availability and multi-disciplinary reuse, and we have been able to have some very useful conversations on this issue. The Encapsulate project (funded under JISC's Business Community Engagement programme) will develop a proof-of-concept demonstrator to explore the feasibility of using digital repositories of research outputs as a means of automatically identifying the relevant academic expertise within HEIs in response to online queries from businesses. This will benefit tremendously from the time already invested in Engage (which is partly focused on directing external parties towards our expertise and resources) and provides a vehicle to explore some of its core ideas, realising a continuum of activity we'd hoped to contribute to in our very first project discussions.

 

More practical lessons learned are widespread. Our most pressing issues are generally associated with information classification. We've learned that sometimes no existing system or data resource holds information that our users expect to be able to use (such as information about adopted research methodologies, which users have suggested may be a useful entry-point into inter-disciplinary collaboration, even where more topical relationships appear limited). Our first port of call has typically been existing systems and data, but where these are limited we have had to look elsewhere. We spent some time considering whether to allow expressive descriptions (which users appear to favour, but which limit opportunities for clustering), or to select a controlled vocabulary (functionally more straightforward but from a user perspective less satisfying), and realised that these are not necessarily mutually exclusive. Extracting text from publications has been a successful choice - using stemming and fuzzy search in information retrieval appeared to overcome perceived limitations of expressiveness.

 

 

A notable process-related lesson has been the importance of expectation management. Our diverse constituencies have common expectations we’re integrating for our prototype, but often these will be to a greater or lesser extent intertwined with more bespoke requirements. Continuing to manage expectations of what users will see emerge from the project is therefore critical – we are conscious that stakeholders' time is invested with the expectation of a useful outcome and we aim to ensure that this is the case for all who have contributed their perspectives and assistance.

 

Looking ahead

 

Engage is part of a portfolio of projects and activities at the University that collectively strive to enhance the research process from the perspectives of researcher, support services and third party collaborators and information users. These include the IRIOS2, CERIF for Data (C4D), CERIF in Action, and Encapsulate projects, funded by JISC under its Research Information Management, Managing Research Data, and Business Community & Engagement Programmes. The latter in particular is representative of continuity from Engage, since Encapsulate will reuse classifications devised in our project to help businesses to find relevant academic expertise when seeking HE support. In terms more closely related to Engage, our project will submit proposals for the integration of Engage into existing University infrastructure, extending the prototype's availability and coverage to a much wider community of academic, support and external user constituencies.

 

The project's sustainability depends very much on our ability to present a compelling case for the technology's wider integration into existing systems. Our intention throughout has been to make the tools we've developed as lightweight as possible - we are seeking to integrate and reuse where possible. This work is extremely closely aligned with University strategy. Furthermore, there is an existing and growing range of highly complementary project activities ongoing. With these things in mind it is clear that there is an appetite from senior management to develop research theme management at Glasgow University. We will be providing a summary of the project and our recommendations to senior management. This would include proposals for the following:

 

  • Store the actual themes in one of the core systems rather than in the prototype
  • In addition to existing well-established responsibilities for other data in core systems, make local staff responsible for the data they supply about research themes
  • Integrate themes into web extracts
  • Wide rollout of the functionality
  • Integrate reports into formal planning rounds
  • Continue discussions with the sector on the requirements for, and integration of, research themes into the research information management processes 

 

Summary and reflection 

 

Engage has provided the University of Glasgow with a tremendous opportunity to enhance systems that maintain information related to researchers, research and research outcomes, and to build a picture of how these can better inform the research process itself.  

 

We are now considering research themes in the wider business intelligence context.  For example, the university is exploring options for workload modelling and one option to facilitate better use of data for workload modelling is new data warehousing applications.  Research themes could be better applied and re-used in such an environment.

 

At a closing workshop attendees were encouraged to consider the project and its prototypical outputs in terms of their strengths, opportunities and shortcomings. The session brought together a group of about twenty-five staff from research organisations. This included database and system support staff, library staff, research administrators, business development staff, and academics. Overall, workshop attendees’ response to the software was very positive, with particular interest in our work on automatically extracting keywords from publications and other sources, and in the prototype visualisations.

 

Various applications for the visualisations were posited, with their usefulness in terms of reporting and marketing, and their ability to convey “research stories” striking particular chords with our attendees. This latter suggestion, that the visualisations could be used to tell the story of a researcher’s or a group of researchers’ work, was particularly intriguing, and has implications for how we might make our research outputs available to the general public. Related to this suggestion was the idea that we might create widgets based on some of these visualisations, allowing researchers to place up-to-date graphical representations of their work on their personal or departmental website, for example. 

 

Of the visualisations on show, the network diagram showing how researchers connect to one another was well-received, as were those which added a temporal dimension to the data, showing research interests over time. It was suggested that we could combine these approaches and come up with a way of visualising academics’ relatedness over time. We liked this suggestion a lot.

 

Some of the work we did in and around the keywords was also met with approval. In particular, the usefulness of the fuzzy search and stemming techniques to conflate search results and improve automatic suggestions was noted. This was encouraging, as it’s not always clear that this sort of “under the hood” functionality will be widely appreciated. The temptation, from a developer’s point of view, might have been to gloss over this less palpable functionality and concentrate on creating ever more spectacular visualisations or adding other more immediately obvious features to the system. 

 

In our demonstration, we alluded to an aspiration to explore more sophisticated weighting of keywords, and to allow researchers to apply their own weighting to the publications, research projects and other entities used to generate their research profiles. While researchers can currently delete unwanted or inaccurate keywords assigned to them by the Engage system, there was certainly interest in a more flexible approach to the weighting of keywords and themes. Also on the subject of keywords, an excellent suggestion was made with regard to how we handle those keywords a researcher has opted to remove from their profile. It seems obvious now, but if a researcher deletes a keyword that has been automatically derived from their publications and so forth, the system should avoid suggesting that keyword again in the future (for example, after the researcher has published a new paper that contains the unwanted keyword). Taking this further, there would be an opportunity for the system to ‘learn’ from these user interactions and subsequently improve on the themes and keywords it suggests. By taking into account contextual factors, it is possible that learning from user feedback in this way might offer a means of tackling the problem of homonyms.
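
The suggestion about removed keywords amounts to keeping a per-researcher list of rejected terms and filtering future automatic suggestions against it. The Python sketch below shows that idea in its simplest form. It is a hypothetical illustration of a feature that was proposed at the workshop, not a description of existing Engage functionality, and the class and method names are invented for the example.

```python
class KeywordProfile:
    """Tracks a researcher's keywords and the suggestions they have rejected."""

    def __init__(self):
        self.keywords = set()
        self.rejected = set()

    def suggest(self, candidates):
        """Add automatically derived keywords, skipping any rejected earlier."""
        accepted = {c for c in candidates if c not in self.rejected}
        self.keywords |= accepted
        return accepted

    def remove(self, keyword):
        """Delete a keyword and remember not to suggest it again."""
        self.keywords.discard(keyword)
        self.rejected.add(keyword)


profile = KeywordProfile()
profile.suggest({"rivers", "banks"})   # derived from an earlier publication
profile.remove("banks")                # the researcher deletes a poor match
added = profile.suggest({"banks", "flooding"})  # a new paper mentions "banks"
print(added, profile.keywords)
```

After the removal, the second round of suggestions adds only "flooding". The rejected set is also the kind of feedback from which a more sophisticated system could begin to 'learn'.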

 

This sort of functionality would require some careful thought and some considered development effort but opens up some exciting possibilities for where the Engage software could go, if it were to be taken forward as a live manifestation. It should not be possible to delete certain data, however, and a clear control mechanism would be required.

 

Finally, attendees pointed us towards a couple of interesting software resources. The GATE (General Architecture for Text Engineering) project provides a suite of text processing tools which would have probable value in developing a production version of our tool. The BatchGeo service was also mentioned as a possible means of geo-coding large quantities of place data for visualisation on a map. Currently, we use the Google Maps API for geo-coding but it would be worth looking into using BatchGeo to efficiently generate latitude and longitude values for existing data derived from our core systems.

 

The JISC BI maturity model has been extremely useful in informing our work, as it has prompted us to explore the relationships between disparate systems and conceive approaches that allow us to join things up more effectively. The project has revealed a lot of existing strengths in the University, and highlighted various ways we can make it work even better in future. Our first step was to evaluate our initial position on the model (in terms of research themes), and we quickly agreed that we were at stage 2. Much of what followed in the project development was essentially in response to this, and scoped in terms of how do we advance, and what are the potential barriers to accomplishing that? In that sense, the tool was very useful, focusing our minds and our activities to ensure more effective and more widespread data inter-connectivity and with a critical requirement to facilitate and promote adoption across our stakeholder communities. Longer term, as we seek to complete its journey from prototype to established part of University infrastructure, we are hopeful that the tool will play a fundamental part in evidence-based decision making, projections and planning, as described in Stage 6 of the BI model.

 

Video case study for this project