surechem Archives - Digital Science

Digital Science donates SureChem data of >15 million chemical compounds and patents to EMBL-EBI

Katy Alexander — Wed, 11 Dec 2013 09:39:00 +0000

Big news from Digital Science: we are donating the SureChem collection of >15 million chemical structures from world patents into the public domain through the European Bioinformatics Institute (EMBL-EBI).

This is the very first time a world patent chemistry collection has been made publicly available, marking a significant advance in Open Data for use in drug discovery. This transfer will give researchers around the globe access to a vast new source of medicinally relevant compounds related to the curing of human disease.

SureChem, which was Digital Science’s very first portfolio company, extracts chemical structure data from the full text and images of patents. This makes it easier to check whether a newly developed drug or other product is actually novel. Previously held within commercial systems and inaccessible to most researchers, this important life science data source is now freely available from EMBL-EBI as SureChEMBL. You can explore these data at www.surechembl.org.

Digital Science’s Nicko Goncharoff, Head of Knowledge Discovery, says:

“Our mission is to give researchers better tools and services and from the start Digital Science has preferred solutions that support Open Science and Open Data communities whenever possible. By placing this collection into the trusted hands of EMBL-EBI, we’re opening up an entire new class of life science data to the public that has previously been locked behind paywalls, and inaccessible for data mining. We couldn’t think of a better home for SureChem, anywhere.”

John Overington, Head of Chemical Biology at EMBL-EBI, says:

“Patents are the foundation of high-tech enterprise and innovation and form the basis of the knowledge economy. We hope that making chemical patents more discoverable in the public domain will considerably speed up the identification of promising molecules. This new source of data will be a major boost to translational research and the discovery of novel bioactive molecules. By putting all this data together in a structured way with other EBI resources, we can help increase competitive innovation.”

Academic researchers particularly stand to benefit from SureChEMBL, notes chemistry luminary Christopher Lipinski, Scientific Advisor, Melior Discovery:

“Having the SureChem patented chemical structures freely available to researchers would by itself be an excellent idea. Having the interface through EMBL-EBI is an even better idea, since the new SureChem interface takes advantage of EMBL-EBI’s nearly 20 years’ expertise in technical and professional aspects of interfacing data sets, internal analysis and customer service to the broad genomic, chemo-bioinformatic, chemical biology and drug-discovery communities.”

SureChEMBL joins a wide array of connected life-science informatics resources at EMBL-EBI, which offers a comprehensive source of freely available molecular data. Today’s transfer opens the door to integrating disease and drug-target data in more meaningful ways, enhancing links between chemical structures and other biological data and their discoverability through the scientific literature.

The post Digital Science donates SureChem data of >15million chemical compounds and patents to EMBL-EBI appeared first on Digital Science.

Digital Science and SciBite form strategic partnership for life sciences text mining

Nicko Goncharoff — Tue, 26 Mar 2013 13:51:00 +0000

Some exciting news for Digital Science on the text mining front. We recently signed a strategic partnership agreement with SciBite, a UK startup that has put together an impressive scientific news and alerts service focused on drug discovery. SciBite is highly complementary to SureChem, our patent chemistry search offering, and we will be looking at ways of quickly combining them, particularly the direct data API services they both offer.

But this partnership is also significant in that it marks the expansion of Digital Science text mining beyond patents and chemistry to other data sources and biology. This agreement gives SciBite and Digital Science access to each other’s annotation and data integration technologies and content, enabling us to create a world-class life sciences text mining pipeline.

In addition to SureChem and SciBite, there are multiple Digital Science products that would benefit from this capability and as you might imagine, there are plenty of conversations under way already.

You can read more about the partnership on our press release page here. For more, follow us at @digitalsci, and do stay tuned for more in coming months.


SureChem Deposition to PubChem

Nicko Goncharoff — Mon, 10 Dec 2012 15:08:00 +0000

In a move that will boost open drug discovery, SureChem has deposited more than eight million chemical compounds into PubChem, the first time that any patent chemistry database has been made publicly available in its entirety. More than half of these are novel to PubChem – the world’s primary public chemistry database – providing a rich new source of medicinally relevant compounds for researchers worldwide. Users can view patents related to SureChem data in PubChem through links provided to SureChemOpen, a free patent chemistry search resource.

This data deposition is a key part of SureChem’s mission to integrate patent chemistry into the online research community, enabling access for a wider range of users. SureChem is also linked to the free ChemSpider database as well as chemical structure data in Royal Society of Chemistry journals. More links to other public and proprietary resources will follow in 2013.

For more information on the PubChem deposition, visit the SureChem blog or read today’s press release.

The SureChem patent chemistry line of products is an internal offering from Digital Science. SureChemOpen offers basic free patent chemistry search and document viewing. SureChemDirect provides batch access to its complete chemistry and full text data collection, enabling integration of patent chemistry with internal workflows and databases. Professional and frequent searchers will soon have access to SureChemPro, an end-user tool that provides powerful search, export and analysis tools. For more information, visit surechem.com.


Digital Science Hackbreaks: The SureChem Mobile App

José Airosa — Wed, 19 Sep 2012 09:17:00 +0000

Scientific Literature Survey using Machine Vision

Earlier this year, a group of Digital Science engineers decamped to a holiday home in Norfolk for three days of intensive hacking. This blog posting is about one of the applications developed at the Hackbreak: SureChem Mobile, a smartphone application to help chemists learn more about chemical compounds in printed matter. To recognize the compounds in photographs taken on the smartphone, we used a combination of Keymodule’s CLiDE machine vision tool plus several custom image pre-processing steps.

Here’s a video of the prototype SureChem mobile application in action. This first version links out to widely used, general purpose chemistry databases. In a future version we’ll be adding support for the more specialized data available in SureChem patent documents:

SureChemMobile Prototype Demo from Digital Science on Vimeo.

The aim of the application is to provide a new workflow for scientific survey:

1. Use a mobile phone camera to take a photograph of a chemical structure.
2. Upload the picture to the SureChem server for image processing and analysis.
3. After successful recognition, retrieve the resulting data from the server, including the Molfile and SMILES representations of the chemical structure, and chemical meta-data such as a generated name.
4. Query external chemistry resources to learn more, such as detailed chemical information from ChemSpider and journal articles from the Royal Society of Chemistry that refer to the compound.

The idea is to make it easier for a researcher to learn more about a compound after seeing it in a publication.

Implementation

The prototype mobile application was created over three days (and nights) in the holiday home shown below, by team J – that is, Jan Wedekind (expert in all things image processing), Jim Siddle (SureChem back end guru), and Jose Airosa (SureChem API master).

We used the PhoneGap toolkit to develop the mobile app, and a Sinatra server application was used to receive the queries and return the results via a RESTful API. Incoming images were preprocessed using the HornetsEye Ruby library and conversion of the chemical image to a searchable chemical structure was performed using Keymodule’s CLiDE. Chemical meta-data was generated using the ChemAxon JChemBase toolkit.

Below, we’ve included a series of images that together describe the preprocessing algorithm that identifies and isolates a chemical compound in an image. This algorithm is applied to images before they are passed to CLiDE for chemical recognition.

	The initial image. An enlarged version of the area marked by the red rectangle is shown in the top-right corner of the image.
	A dilated version of the input image. This is used as an estimate of the local background brightness.
	The difference of the (greyscale) input image and the dilated image serves as a basis for thresholding.
	Otsu’s method is used to reduce the data to a binary image.
	The input image is dilated and the connected components are identified.
	A weighted histogram is used to find a large component close to the centre of the image.
	The selected component of the dilated binary image and the initial binary image are used to extract the graphical representation of the chemical structure.
	The result is ready to be processed with optical structure recognition software.
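
The first four steps above (dilation as a background estimate, differencing, and Otsu thresholding) can be sketched in a few lines. This is a minimal pure-Python illustration, not the HornetsEye code used in the prototype: the function names, the square neighbourhood, and the default radius are all assumptions made for the example, and the component selection steps are left out.

```python
from typing import List

Image = List[List[int]]  # greyscale rows, values 0 (black) to 255 (white)


def dilate(img: Image, radius: int = 1) -> Image:
    """Grey-level dilation: each pixel becomes the maximum of its neighbourhood.
    On a photo of dark ink on paper this approximates the local paper brightness."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            row.append(max(img[j][i] for j in ys for i in xs))
        out.append(row)
    return out


def background_difference(img: Image, background: Image) -> Image:
    """Background minus input: ink pixels get large values, paper stays near 0."""
    return [[b - p for p, b in zip(irow, brow)]
            for irow, brow in zip(img, background)]


def otsu_threshold(img: Image) -> int:
    """Return the threshold t that maximises the between-class variance,
    where pixels <= t form one class and pixels > t the other."""
    hist = [0] * 256
    for row in img:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(value * count for value, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarise(img: Image, radius: int = 1) -> Image:
    """Steps 1-4: estimate background, subtract, threshold to a 0/1 image."""
    diff = background_difference(img, dilate(img, radius))
    t = otsu_threshold(diff)
    return [[1 if p > t else 0 for p in row] for row in diff]
```

Because the threshold is applied to the difference image rather than the raw photograph, uneven lighting across the page has less influence on which pixels are treated as ink. The dilation radius needs to be larger than the stroke width for the background estimate to be meaningful.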

Here’s a larger version of the input photograph, side-by-side with the image we pass for chemical recognition.

That’s all for now, but keep an eye out for future articles where we’ll describe other experiments and prototypes from Digital Science hackbreaks.


SureChem launches its first enterprise product, offering direct API access

Guest Author — Mon, 20 Aug 2012 13:00:00 +0000

SureChem, our patent chemistry product line, today launches a new enterprise product called ‘SureChemDirect’. The tool will provide API access to the content that powers SureChem: patent full text and metadata currently comprising 12 million structures, 90 million patent abstracts and full text records from US, European and WIPO patent authorities. SureChemDirect enables customers to incorporate sophisticated chemical patent search into their workflows, perform batch services on their own platforms, and enhance their internal databases. The announcement was made in line with this week’s American Chemical Society (ACS) meeting in Philadelphia, where the team is exhibiting and presenting.

SureChemDirect is the second product released in the SureChem product line, following the launch of SureChemOpen last March, their free portal for researchers. Both aid in making patent chemistry search (think: early stage drug development) more streamlined and approachable for researchers and businesses alike, putting high-powered tools and databases into the hands of those who use them.

The underlying technology works to make patent chemistry less painful and less reliant on brittle string matching and text string searches by allowing chemists to search how they think – in chemical structures – the interface allowing users to actually *draw* chemical structures and sift through the database that way. This functionality, paired with highly curated underlying content, makes for an extremely powerful tool – with that premium content now available for use via SureChemDirect’s API access.

For more information, visit www.surechem.com.


Mining patents in the Cloud (part 1): the SureChem data processing pipeline

james-siddle — Mon, 28 May 2012 12:00:00 +0000

Introduction

This is the first part of a three-part blog series on how the SureChem team have rebuilt their data processing pipeline using AWS tools.

Digital Science recently launched SureChemOpen, a free service to help research chemists to find interesting chemistry in patents. This article is about the text mining infrastructure that makes SureChemOpen possible.

To start with, we’ll talk about what SureChemOpen is, what it’s used for, how to use it, and the data required to enable patent chemistry searching. We’ll then describe the text mining process: i.e. how to go from a textual patent document, through annotation and chemistry detection, to building a database of patent chemistry. The SureChemOpen pipeline is built using Amazon Web Services technologies, and a future article will describe how we implemented the text mining pipeline in the cloud, using EC2, SQS, S3, and other technologies. Also to come is a design discussion about the scalability, reliability, complexity, data integrity, and the performance of our cloud-based data mining implementation.

What is SureChemOpen?

SureChemOpen is a search engine for chemists interested in patent chemistry, such as researchers at institutions working on drug discovery. Typical uses for SureChemOpen are to check if particular compounds have been protected (and thus may or may not be patentable), or to identify new or unexplored types of compound which may be candidates for research projects. SureChemOpen is a free version of the SureChem product portfolio, allowing low-intensity searches for free. Premium offerings (SureChemDirect and SureChemPro) are planned for release later this year.

At the core of SureChemOpen is a text mining pipeline, which we’ll describe in detail in the next section. But fundamentally, we start with a corpus of patent documents, run them all through a cloud-based data processing pipeline, and in the process build up a collection of chemical name annotations, chemicals found in images, and where possible, chemical structure data for names.

Every chemical we find is added to a searchable database, which allows chemists to find “interesting” chemistry. A typical chemistry search might involve entering one compound, and searching for any compounds with a 95%+ similarity level. Or a chemist may enter one part of a chemical compound, and search for all other compounds that contain that fragment as a “substructure”. After finding one or more interesting compounds, the chemist will naturally want to view the documents that contain them. Clicking one or more structures from a SureChemOpen structure results list shows a list of matching documents; these can then be opened, and the matching names or image annotations shown.
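
To make the “95%+ similarity” idea concrete, here is a small sketch of a thresholded similarity search. It assumes each compound is represented by a set of integer feature IDs (a fingerprint) and uses the Tanimoto coefficient, a common choice in cheminformatics; the article does not say which similarity measure or fingerprint SureChemOpen uses, so treat both as assumptions.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Number of shared features divided by number of distinct features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_compounds(
    query: Set[int],
    database: Dict[str, Set[int]],
    threshold: float = 0.95,
) -> List[Tuple[str, float]]:
    """Return (compound id, score) pairs at or above the threshold, best first."""
    scored = [(cid, tanimoto(query, fp)) for cid, fp in database.items()]
    hits = [(cid, score) for cid, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

A linear scan like this is only practical for small collections; a database of millions of structures would need an index or pre-filtering step in front of it.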

The Text Mining Process

So what is really involved in generating data for SureChemOpen? The data mining process can be broken down into the following discrete tasks.

1. Text annotation

The first step in the pipeline is text annotation. Here, we take the raw text of the document (typically HTML, provided by patent offices), and run it through a machine-learning based named-entity recognition tool, referred to as the SureChem Entity Extractor (EE). The tool is used to find “systematic” chemical names, for example:

1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine

The SureChem EE identifies chemical names in text by first tokenizing around white space and other significant separators, then calculating a probability for whether each token is chemical.

The probability is calculated based on which “n-grams” the token contains, where the presence of certain n-grams is a strong indicator that the full token is actually a systematic chemical name. An n-gram is a sequence of characters of length n; so for example the word “example” has the following 4-grams: exam, xamp, ampl, and mple. We identify the 4-grams of each potential chemical name, then combine the “chemical” probability for each of these 4-grams to get the overall likelihood that the name is chemical. A finely tuned threshold ensures a high “F-measure” score, meaning that we find the vast majority of chemical names, and very few false positives.
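
The n-gram scoring described above can be illustrated with a short sketch. The 4-gram extraction follows the article directly; the way per-gram probabilities are combined (summing log-odds, in the style of a naive Bayes classifier) and the neutral default of 0.5 for unseen 4-grams are assumptions, since the article does not specify how the Entity Extractor combines them.

```python
import math
from typing import Dict, List


def char_ngrams(token: str, n: int = 4) -> List[str]:
    """All contiguous character sequences of length n, in order."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def chemical_probability(
    token: str,
    ngram_probs: Dict[str, float],
    n: int = 4,
    default: float = 0.5,
) -> float:
    """Combine per-n-gram 'is chemical' probabilities into one score.

    Assumption: the probabilities are combined by summing their log-odds.
    Unseen n-grams use `default`, which contributes nothing at 0.5.
    """
    grams = char_ngrams(token.lower(), n)
    if not grams:
        return default  # token shorter than n characters: no evidence
    log_odds = 0.0
    for gram in grams:
        p = ngram_probs.get(gram, default)
        p = min(max(p, 1e-6), 1 - 1e-6)  # keep the logarithm finite
        log_odds += math.log(p / (1 - p))
    # Numerically stable logistic function, safe for very long names.
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1 + e)
```

A token would then be annotated as chemical when its score exceeds a tuned threshold. In practice the per-gram table would be learned from labelled chemical and non-chemical text.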

A machine-learning model isn’t all there is, however. We use dictionaries to find well known drug names, as well as heuristics and certain post-processing steps to improve the quality of our annotations.

There are two outputs of the annotation task: annotations and names. Annotations are simply start and end positions for the chemicals in the document; these are stored in a database and can be extracted later for rendering the document with chemistry. The names are sent on to the next “downstream” task for further processing.
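
As a rough illustration of the two outputs, the sketch below stores annotations as character offsets and derives both the names and a marked-up rendering from them. The `Annotation` class, the half-open offset convention, and the `[[ ]]` markers are hypothetical choices for this example, not the SureChem schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    start: int  # offset of the first character of the name
    end: int    # offset one past the last character (assumed convention)


def extract_names(text: str, annotations: List[Annotation]) -> List[str]:
    """The names that would be sent on to the downstream task."""
    return [text[a.start:a.end] for a in annotations]


def render(text: str, annotations: List[Annotation],
           open_tag: str = "[[", close_tag: str = "]]") -> str:
    """Rebuild the document with each (non-overlapping) annotation marked."""
    parts, pos = [], 0
    for a in sorted(annotations, key=lambda a: a.start):
        parts.append(text[pos:a.start])
        parts.append(open_tag + text[a.start:a.end] + close_tag)
        pos = a.end
    parts.append(text[pos:])
    return "".join(parts)
```

Storing offsets rather than modified text keeps the original document untouched, so annotations can be re-rendered or replaced later without reprocessing the source.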

2. Convert names to structures

Next, we try to generate chemical structures for every name detected in the previous step. There are several third-party tools (both commercial and open source) that take one or more names, then provide the chemical structure data that the name corresponds to. The most common structure format output by the tools is the .mol file.

We try our best to convert the name by passing it to five different tools. This can result in more than one chemical compound being generated for a given name; we capture everything and ensure that searching and exporting handle these cases appropriately.
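
A minimal sketch of this “try every tool and keep everything” approach follows. Each converter is modelled as a plain function from a name to a structure string or `None`; the tool names and the use of strings for structures are placeholders, since the article does not name the five tools or their APIs.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical converter interface: name in, structure string (or None) out.
Converter = Callable[[str], Optional[str]]


def convert_name(name: str, converters: Dict[str, Converter]) -> Dict[str, List[str]]:
    """Run every converter and group the tool names by the structure produced.

    More than one key in the result means the tools disagreed, and all of
    the generated structures are kept.
    """
    results: Dict[str, List[str]] = {}
    for tool, convert in converters.items():
        try:
            structure = convert(name)
        except Exception:
            continue  # one failing tool should not block the others
        if structure:
            results.setdefault(structure, []).append(tool)
    return results
```

An empty result marks the name as non-converting, which is the case handed on to the correction step.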

Names that can be converted are sent, with all generated chemistry, for standardization and storage (see step 4, below). Non-converting names, however, are sent for OCR correction…

3. Optical Character Recognition Correction

Unfortunately, not every name can be converted to a structure. Sometimes this is because the name just isn’t known to the tools, and occasionally because we’ve falsely identified some text as a chemical. But often, names don’t convert because they contain errors introduced through Optical Character Recognition (OCR).

Many patents (even newly published ones) are digitized using OCR, which can mean slight mistakes in chemical names because OCR classifiers are typically trained to recognize prose, rather than systematic names. Common OCR errors include spurious spaces being inserted into chemical names (often around commas, as is typical in prose), or certain numbers being changed to similar looking letters (the number 1 changed to the letter l, for example).

The next step in our pipeline is to try to correct these. We use a combination of heuristics, dictionary lookups, and third party tools to create correction candidates. Every correction candidate is sent back to the previous step and, if it can be converted to a structure, is treated like a non-corrected name and sent for structure standardization and storage.
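
The two OCR errors mentioned above (spurious spaces around commas, and the digit 1 read as the letter l) lend themselves to a small candidate generator. The regular expressions below are illustrative heuristics written for this example only; the actual SureChem rules, dictionaries and third party tools are not described in the article.

```python
import re
from typing import List


def correction_candidates(name: str) -> List[str]:
    """Generate plausible corrections for a name that failed to convert."""
    candidates: List[str] = []

    def add(candidate: str) -> None:
        if candidate != name and candidate not in candidates:
            candidates.append(candidate)

    # Heuristic 1: remove whitespace around commas and hyphens,
    # e.g. "1, 2-dichloro" becomes "1,2-dichloro".
    despaced = re.sub(r"\s*([,\-])\s*", r"\1", name)
    add(despaced)

    # Heuristic 2: a lone "l" or "I" directly before a comma or hyphen,
    # and not preceded by a letter, is probably the digit 1.
    for base in (name, despaced):
        add(re.sub(r"(?<![A-Za-z])[lI](?=[,\-])", "1", base))

    return candidates
```

Each candidate would then be passed back to the name-to-structure step, and only candidates that convert are kept.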

4. Structure Standardization and Storage

Every structure generated in the SureChemOpen data processing pipeline is ultimately processed by what we typically call our “Structure Handler”. The Structure Handler is responsible for processing every chemical generated by earlier steps. This means standardization, error checking, chemical property calculation and storage.

We use a third party standardizer and error checker provided by ChemAxon, which (using a custom configuration) ensures the output is a valid chemical in a consistent form. Automated chemistry extraction can generate spurious chemicals, so by running a series of careful checks (such as checking size, or atomic makeup) we can ensure the storage of meaningful chemistry. Similarly, standardization steps such as de-aromatization ensure that all chemicals are in a consistent form, making chemists’ lives easier and reducing duplication in our database (see below).

Chemicals that pass standardization and error checking are added to a searchable database, along with a number of derived properties. After being added to this database, the chemicals will appear in search results on SureChemOpen. Often, different names will generate the same chemical structure (think “water” and “dihydrogen monoxide”); in these cases we detect the duplicate chemistry and only store one searchable chemical.
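
Duplicate detection of this kind can be sketched as a lookup keyed on the standardized structure. The `StructureStore` class below is hypothetical and keeps everything in memory; it assumes the standardizer emits an identical string for identical chemistry (for example the SMILES `O` for both “water” and “dihydrogen monoxide”), which is the property standardization is meant to provide.

```python
from typing import Dict, Tuple


class StructureStore:
    """In-memory stand-in for the searchable chemistry database."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}

    def add(self, standardized_structure: str) -> Tuple[int, bool]:
        """Return (chemical id, is_new). Duplicates reuse the existing id."""
        if standardized_structure in self._ids:
            return self._ids[standardized_structure], False
        new_id = len(self._ids) + 1
        self._ids[standardized_structure] = new_id
        return new_id, True
```

Either way the caller gets back an ID, which matches the description that every successfully processed chemical ends up with a unique ID whether it was new or a duplicate.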

Each chemical successfully processed by the Structure Handler (whether resulting in a new chemical or recognition as a duplicate) will now have a unique ID, which will be sent on (with the originating name) to the Entity Mapper task.

5. Entity Mapping

The final step in the SureChemOpen text annotation pipeline is entity mapping. So far, we’ve seen that documents have been annotated with chemistry, names have been converted to chemical structures, and chemical structures have been stored in a searchable database. But what’s missing is a link between annotations in documents and the chemicals generated from them. Without this information, it’s impossible to find documents that match results from chemical searches; it also makes it hard to show chemical structures for annotations in documents.

The Entity Mapper, therefore, is passed pairs of names and chemical IDs, and updates the database of annotations to ensure that the relationship from annotation to chemistry is recorded.
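
The mapping step might look something like the sketch below, where annotations are plain dictionaries and the mapper attaches chemical IDs to every annotation with a matching name. The field names (`doc`, `name`, `chemical_ids`) are invented for the example and do not reflect the real database schema.

```python
from typing import Dict, Iterable, List, Set, Tuple


def map_entities(annotations: List[Dict], pairs: Iterable[Tuple[str, int]]) -> None:
    """Record, on each annotation, the IDs of the chemicals its name produced."""
    ids_by_name: Dict[str, Set[int]] = {}
    for name, chem_id in pairs:
        ids_by_name.setdefault(name, set()).add(chem_id)
    for annotation in annotations:
        annotation["chemical_ids"] = sorted(ids_by_name.get(annotation["name"], set()))


def documents_for(annotations: List[Dict], chem_id: int) -> List[str]:
    """Documents containing at least one annotation linked to the chemical."""
    return sorted({a["doc"] for a in annotations
                   if chem_id in a.get("chemical_ids", [])})
```

With the link recorded in both directions, a structure search hit can be traced back to its documents, and an annotation in a document can be rendered with its structure.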

Image Analysis

Another aspect to the data processing pipeline not mentioned above is extraction of chemistry from images. In the SureChemOpen pipeline, this is done in a similar way to name extraction. Documents are sent to a task that retrieves clipped images from patents, and processes them using CLiDE (a third party tool for detecting chemical compounds in images).

The resulting image annotations are stored in the database, chemistry is standardized and stored by the Structure Handler, and image annotations are associated with detected chemistry. The only significant difference from text processing is that chemistry from images is aggressively filtered because it’s very difficult to prevent non-chemical images from being processed, and false positives can easily be detected.

Intermission

Part two of this series will focus on how we’ve utilised AWS technologies to build the data processing pipeline. Stay tuned.


Drug discovery is no longer the province of pharma – Introducing SureChemOpen

Kaitlin Thaney — Mon, 26 Mar 2012 12:00:00 +0000

Today, thanks to our text mining team, users looking for better patent chemistry search have a free means of doing so. Announced in line with the American Chemical Society’s spring meeting in San Diego, Calif., SureChem (our patent chemistry product line here at Digital Science) has introduced a new open interface for researchers needing a more effective means of searching patent literature.

“SureChemOpen” is a free layer added to SureChem’s product line, building off the technology of their flagship product. To date, sophisticated patent search has been largely walled off through costly enterprise software most commonly associated with pharmaceutical research. However, drug discovery is no longer the province of pharma, as increasingly patents are being integrated more into other pockets of the research ecosystem. SureChemOpen is an important first step in making this sort of search more accessible to other members of the research community.

The underlying technology works to make patent chemistry less painful and less reliant on brittle string matching and text string searches by allowing chemists to search how they think – in chemical structures – the interface allowing users to actually *draw* chemical structures and sift through the database that way (ask a chemist, this is big). As seen in other pockets of life science research, structures or entities can often be classified under a myriad of names, inadequately linked to contextual information (if at all), all further exacerbating the sifting problem, and as a result, slowing down research dramatically.

There are other free chemistry databases, most notably resources like ChemSpider from the Royal Society of Chemistry and PubChem from the US National Library of Medicine. Whilst complementing these resources, SureChem is unique in that it’s the only free patent chemistry database searchable by chemical structure, each structure linked directly to the patent literature. This not only maps to how chemists think, but makes for an incredibly powerful resource and way of sifting through the wealth of information tucked away in patents.

SureChemOpen is the first part of a re-launch of the SureChem product line, with additional products planned to roll out over the coming months. For more information on the release, see our press release. And do let us know what you think of SureChemOpen. We’d love to hear your thoughts.
