Web interface defines new paradigm for life science data sharing

(Nanowerk News) A new lightweight web service interface for accessing massive amounts of life science research data across multiple public and private domains has been developed by researchers at RIKEN, Japan's flagship research institute. Through the powerful RIKEN Scientists' Networking System (SciNetS), the service provides a secure, flexible and lightweight interface to millions of data records and their network of semantic relationships, ushering in a new era of collaboration, analysis and information-sharing for life science research and applied innovation.
Gene annotation, protein structure analysis, plant ontologies, transcriptomes - dramatic increases in the size, variety and complexity of data resources in the life sciences have accentuated the challenges of data analysis in the information age. Adding to these challenges, much of the data handled at each step of the research process is private, making integration with public data more difficult and hindering collaboration. Overcoming these challenges requires systems for securely integrating data resources and making their information widely available through a flexible interface.
The RIKEN Bioinformatics And Systems Engineering (BASE) division, Japan's leading research institute focusing on the integration and publication of life science research data, has now developed such an interface. Referred to as Semantic-JSON, the interface accesses a "virtual laboratory cloud centre" also developed at BASE named the Scientists' Networking System (SciNetS), which brings together, as of May 2011, a total of 192 public database projects both internal and external to RIKEN. SciNetS creates common ground for sharing life science data resources by linking these resources together in a network of semantic relationships based on standardized Semantic Web techniques.
Semantic-JSON provides a flexible interface to SciNetS on the web, enabling bioinformaticians to access specific data from across the SciNetS network using the programming languages and information tools they normally use in their research. The interface does so by defining a set of simple but relevant commands for accessing and searching SciNetS data and their semantic relationships, delivering results in the widely-used JavaScript Object Notation (JSON) format.
Already, RIKEN has successfully applied Semantic-JSON to a number of projects, including international data collaborations on mouse phenotypes, domestic integrated database projects, and the GenoCon International Rational Genome Design Contest. Looking ahead, RIKEN plans to use the interface to distribute life science data across its research centres and with international collaborators via the SciNetS project, broadening the life-sciences Semantic Web data universe and promising to achieve not just comprehensive understanding of various life phenomena, but also collaborative breakthroughs for medicine, industry and the environment.
This research result will appear in the online version of the British scientific journal Nucleic Acids Research on June 1 (2011, Vol. 37, No. 12 1-7.).
1. Overview
Life science research depends crucially on the availability of informatics infrastructure for systematically storing and integrating vast amounts of diverse bioinformatics data. Indeed, a deep understanding of data collected using today's cutting-edge bioinformatics technologies is impossible without this infrastructure, yet conventional databases are limited in the types of data they can handle. For more sophisticated processing and analysis, infrastructure is needed that can simultaneously sort and organize the vast variety of different types of life science data and make this data available for public use.
Virtual laboratory cloud centre: SciNetS. The SciNetS cloud service provides virtual laboratories that undertake advanced research activities by collaboration among scientists on the Web, achieving systematic sharing of life-science data resources obtained utilising the latest bioinformatics technologies.
At the RIKEN Bioinformatics And Systems Engineering (BASE) division, researchers have developed a novel research infrastructure around a set of virtual laboratories (collaboration via the cloud; see Note 1). It allows researchers to store massive amounts of life-sciences data and to organise the relationships between individual records, both schematically and semantically, in a virtually constructed, closed and secure data space. This collaboration centre, the Scientists' Networking System (SciNetS), does more than just publish data from RIKEN to the web. As an infrastructure for life science data sharing, it also encourages new forms of research collaboration, enabling scientific discoveries not possible through individual research activities alone.
Fully exploiting this collaborative potential, however, requires that SciNetS data be made available on the web through an easy-to-use interface, to be accessed and analysed via commonly-used programming languages. Semantic-JSON is the technological innovation which makes this possible.
Integrated databases on RIKEN SciNetS. Pink circles represent individual "virtual laboratory" projects. Yellow squares denote centres within RIKEN, and green circles denote organisations outside RIKEN. The thickness of each blue line is proportional to the number of links between data. Red lines show relationships between the organisations that produced the data, and green lines show comprehensive collaboration within RIKEN.
2. Semantic-JSON
To encourage its worldwide distribution and use, data organised in SciNetS is formatted according to the Semantic Web (Note 2) standard, a data format which is understandable not only to humans, but also to computers. The new Semantic-JSON programming interface (http://semantic-json.org), developed at BASE and made available for public use as of June 1, enables bioinformaticians (Note 3) to access this Semantic Web data on the web via the programming languages and information tools they normally use in their research. Data obtained through the interface is described in the widely-used, highly-portable JavaScript Object Notation (JSON; Note 4) format, freeing researchers from dependence on any specific programming language for their data analysis.
The Semantic-JSON concept. Semantic-JSON extends the concept of short URL services to the Semantic Web. It also provides data access control, data search and inference, and access to biomedical raw data such as DNA sequences.
Semantic-JSON also achieves a second major advance in life science research by bridging the gap between public data available for general use, and private data held by individual researchers or research groups. Researchers often need to unite public and private data for analysis; yet doing so is far from trivial due to differences in access permissions across virtual laboratories. Freely releasing such data, on the other hand, poses significant security issues. What is thus needed is a technology to enable virtual laboratories to manage their own data access permissions in a secure way, while also accessing relationship information and merging (public and private) original data from different virtual labs.
To accomplish this union of data, Semantic-JSON employs a trick similar to the URL shortening tools used on common social media services such as Twitter. The Semantic-JSON interface shrinks URLs for data internal and external to SciNetS into shorter identifiers, and uses these to look up permissions for specific data, returning only the data appropriate to the access privileges of a given user. Unlike conventional URL shortening services, however, a short identifier in Semantic-JSON points not only to a URL but also to a wealth of relationships between data, thus realising a unified domain semantic web structure.
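The short-identifier idea described above can be illustrated with a minimal Python sketch. It is not the Semantic-JSON implementation: the `Record` and `ShortIdRegistry` classes, the hash-based eight-character identifier, the per-laboratory membership model and the example URLs are all assumptions made purely for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Record:
    """One data item; every field name here is hypothetical."""
    url: str
    lab: str            # virtual laboratory that owns the record
    public: bool        # True if anyone may read it
    links: dict = field(default_factory=dict)  # relation name -> list of short IDs


class ShortIdRegistry:
    """Toy registry mapping short identifiers to records and permissions."""

    def __init__(self):
        self._records = {}   # short ID -> Record
        self._members = {}   # user -> set of labs the user belongs to

    def register(self, record):
        # Assumption: derive the short ID from a hash of the URL.
        short_id = hashlib.sha1(record.url.encode("utf-8")).hexdigest()[:8]
        self._records[short_id] = record
        return short_id

    def grant(self, user, lab):
        self._members.setdefault(user, set()).add(lab)

    def can_read(self, user, short_id):
        record = self._records[short_id]
        return record.public or record.lab in self._members.get(user, set())

    def resolve(self, user, short_id):
        """Return the record and its visible links, or None if not readable."""
        if short_id not in self._records or not self.can_read(user, short_id):
            return None
        record = self._records[short_id]
        visible_links = {
            relation: [t for t in targets
                       if t in self._records and self.can_read(user, t)]
            for relation, targets in record.links.items()
        }
        return {"id": short_id, "url": record.url, "links": visible_links}


if __name__ == "__main__":
    registry = ShortIdRegistry()
    pheno = registry.register(
        Record("https://example.org/lab-a/phenotype/1", "lab-a", False))
    gene = registry.register(
        Record("https://example.org/public/gene/1", "public", True,
               {"hasPhenotype": [pheno]}))
    registry.grant("alice", "lab-a")
    # A guest sees the public gene but not its link to the private phenotype.
    print(json.dumps(registry.resolve("guest", gene), indent=2))
    # A member of lab-a sees the same gene with the link included.
    print(json.dumps(registry.resolve("alice", gene), indent=2))
```

In this sketch the same identifier yields different results for different users: the relationship to private data is filtered out rather than reported as an error, so a caller without privileges cannot tell that the hidden record exists.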

By incorporating such security considerations, Semantic-JSON achieves a form of data access not implemented in conventional Semantic Web data tools. Researchers can thus access both public and private original data on SciNetS under data access control, and use Semantic-JSON to traverse individual virtual labs, obtaining relationships not only for public data but for private data as well. Simply by selecting a single data item, a user can access related public and (depending on their privileges) private data from different data constellations, enabling deeper integration of widely-dispersed data resources.
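Starting from a single data item and following relationships across virtual labs, as described above, amounts to a permission-aware graph traversal. The Python sketch below shows one possible way to do this with a breadth-first search. The function name `reachable`, the dictionary layout and the rule that `None` marks public data are assumptions for illustration, not part of the Semantic-JSON interface.

```python
from collections import deque


def reachable(start, links, owner, user_labs, max_depth=2):
    """List the items a user can reach from `start`, in breadth-first order.

    links:     item ID -> list of related item IDs (hypothetical layout)
    owner:     item ID -> owning virtual lab, or None for public data
    user_labs: set of labs whose private data the user may read
    """
    def visible(item):
        lab = owner[item]
        return lab is None or lab in user_labs

    if start not in owner or not visible(start):
        return []

    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        item, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for target in links.get(item, []):
            if target in seen or target not in owner or not visible(target):
                continue  # skip repeats, unknown items and hidden items
            seen.add(target)
            order.append(target)
            queue.append((target, depth + 1))
    return order


if __name__ == "__main__":
    owner = {"gene1": None, "pheno1": "lab-a", "seq1": None, "note1": "lab-b"}
    links = {"gene1": ["pheno1", "seq1"], "pheno1": ["note1"]}
    print(reachable("gene1", links, owner, set()))       # public data only
    print(reachable("gene1", links, owner, {"lab-a"}))   # plus lab-a data
```

One design choice in this sketch is that a hidden item also blocks the paths that pass through it, so a user never learns about data that is linked only via records they cannot read.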

RIKEN BASE has already applied Semantic-JSON to the implementation of a tool that allows users to create programs on their web browsers by accessing SciNetS data. This tool was successfully employed in 2010 by contestants in GenoCon, the first International Rational-Genome-Design Contest, for designing Arabidopsis plant genome sequences using data managed on SciNetS.

3. Future applications

Since its foundation in 2008, research at RIKEN BASE has focused on developing, through SciNetS, an infrastructure (the virtual laboratory centre) that enables collaboration between researchers. Internationally, BASE has played a key role in the release of data in Japan for an international collaboration on Arabidopsis and mouse phenotypes (Note 5). In Japan, BASE is one of the core institutions supporting activities of the Japan Science and Technology Agency (JST) Bio-sciences Database Centre. In each of these roles, the interface for data interchange is of key importance. By enabling this interchange for data published from virtual laboratories on SciNetS, Semantic-JSON achieves a major milestone, opening the door to data sharing via a variety of different devices such as mobile phones and PCs.

Through the use of SciNetS and Semantic-JSON, RIKEN aims to broaden the application of research results to society, developing the life-sciences information infrastructure necessary to accelerate data schematisation research both in Japan and across the world.
Notes
1) Cloud (cloud computing)
A technology that quickly and effectively satisfies vast information processing needs by organising a huge number of computation servers and data into an immense virtual computer. In SciNetS, cloud technology is used for distributed data storage and periodic automatic data processing.
2) The Semantic Web
The Semantic Web is an extension of the World Wide Web (WWW) proposed by Tim Berners-Lee, a computer scientist from England. The WWW connects resources such as documents located on the network by hyper-links, and has been explosively successful as the globally standardised information infrastructure of the Internet. However, although hyper-links are well suited to humans moving from document to document, a hyper-link simply connects two resources and does not describe the semantics of their relationship.
This leads to two recognised problems with the WWW. The meaning of a hyper-link can only be determined by a human reading the contents of the linked documents, and the link itself carries no information that computers could interpret as semantics for advanced knowledge processing.
The Semantic Web addresses these problems by attaching semantics to the Web itself. Its goal is to let computers manage and use information on a wider scale than human reading allows, by providing documents in a machine-readable format and assigning values that describe the semantics of the links to those documents.
3) Bioinformatics
Bioinformatics is a research field that solves biological problems by applying techniques from applied mathematics, informatics, statistics and computer science.
In recent years, genome projects and structural genome projects on various species have produced huge volumes of biological data, creating demand for new bioinformatics techniques that use these data for protein lineage analysis, structure prediction and interaction prediction.
4) JSON (JavaScript Object Notation)
JSON (JavaScript Object Notation) is a data description language based on the JavaScript notation for data objects. Because the format is simple and processing it is lightweight, many programming languages besides JavaScript support JSON.
5) Phenotype
A phenotype is a characteristic that appears as a difference between individuals as a result of individual gene variation and genome divergence. Gene variation gives rise to diversity among individuals through differences in gene expression, proteins and metabolites. In mice used as disease models, phenotypes appear as differences in pathological symptoms and behavioural characteristics. Because phenotypes are harder to quantify than other characteristics, comparing data and integrating databases among laboratories is difficult and requires dedicated research activity.
Source: RIKEN