Our NLP blog series Archives - Digital Science
https://www.digital-science.com/tags/nlpseries/

Artificial Intelligence and Peer Review
https://www.digital-science.com/blog/2020/09/nlp-series-ai-and-peer-review/ (23 Sep 2020)

Despite the fact that, for many people, it still feels like the middle of March, we have somehow made it to September and find ourselves celebrating the sixth annual Peer Review Week! This year’s theme is Trust, and what better way to celebrate than to look back on some of the amazing developments and discussions happening around peer review and natural language processing (NLP).

In April’s episode of RoRICast, the podcast produced by the Research on Research Institute that Digital Science co-founded a year ago, my co-host Adam Dinsmore and I chatted to Professor Karim Lakhani, the Charles E. Wilson Professor of Business Administration and the Dorothy and Michael Hintze Fellow at Harvard Business School. Karim is an expert in the application of artificial intelligence in research processes, from collaboration to peer review.

Karim joined us from his home in marvellous Massachusetts. Although an MIT graduate, Karim is now based across the river at Harvard Business School. His research involves analysing a range of open source systems to better understand how innovation in technology works. One of his specific research interests is in contest-driven open innovation and how, by throwing problems open to the wider world, we are often able to engage with a range of as yet unexplored solutions, owing to the different approaches a fresh perspective can bring.

Having determined that science is both a collaborative and competitive process, Karim and his team run experiments to better understand how teams are formed, and how different novel ideas are evaluated. Karim is also investigating the impact of artificial intelligence (AI) on organisations in terms of optimising scale and scope and gathering insights to help shape future business strategy.

Mirroring the experiences of Digital Science’s own Catalyst Grant judges and mentors, Karim has seen a rise in machine-learning based tech solutions at innovation contests. His latest book,  Competing in the Age of AI: Strategy and Leadership When Algorithms and Networks Run the World, includes examples of how AI is now not only having an impact on technology and innovation but also on our everyday lives. Karim’s work informs best practice in research and innovation by conducting research on research.

In this episode of RoRICast, Karim gave us some examples of how AI is not just confined to sci-fi movies and Neal Stephenson novels, though such stories give a great many examples of what is termed ‘strong AI’, capable of carrying out many tasks extremely efficiently. However, ‘weak AI’, that is tech that has been created to do one narrow task very well, has already permeated our everyday lives, whether that is through some of the NLP solutions we have discussed in this blog series, or whether it is something as commonplace as our voice-activated smart devices capable of playing Disney songs on demand, our email spam filters, or even our Netflix recommendations.

Karim discussed some of the potential applications of AI in research, from facilitating collaboration between researchers to writing papers. He also discussed how researchers can implement aspects of NLP within the parts of the research process that relate to peer review. For example, by using an NLP-driven tool such as Ripeta, researchers can receive recommendations on how to improve a paper prior to submission. Ripeta analyses the reproducibility and falsifiability of research, including everything from a well-reported methodology to the inclusion of data that adheres to FAIR principles.

With the rise of the open research movement, preprints have been gaining momentum as an important research output alongside the more traditional journal publications. This is particularly relevant in these current COVID-19 times, where research output is being produced at an unprecedentedly high volume, and many researchers are opting to share their work via preprints, which undergo an ongoing and dynamic review process, rather than through the more formal journal peer-review process.

A rise in preprint publication has been seen across almost all fields of research in 2020, in part because many research areas contribute to solving the challenge of a global pandemic. This has, however, led to some concern over preprints, and whether they are a trustworthy research output without more formal peer review practices. It is here that a tool like Ripeta could add some level of trust, transparency, robustness and reliability to research shared via preprint, even before the work is shared. The Ripeta team investigated this perceived lack of confidence in COVID-19 related preprints and found that although reporting habits in pandemic-related preprint publications demonstrated some room for improvement, overall the research being conducted and shared was sound.

The use of AI in peer review is a hot topic. There are many reasons to use AI in peer review, such as eliminating the potential conflict of interest posed by a reviewer in a very closely related field, or as a means to quickly assess the vast volume of submissions, again, for example, during a global pandemic. However, the technology has limitations, and we must consider whether an AI system could propagate and amplify bias within the process, simply by failing to account for, or to eliminate, the bias present in the training data fed to the programme. As Joris van Rossum explained in his article on the limitations of tech in peer review, AI that has learned from historic decisions is potentially able to reinforce imbalances and propagate the impact of unconscious biases in research.

Karim went on to describe how AI can be built to mitigate such circumstances and actually break down many barriers to inclusion, provided we as a community invest the time and effort in creating good data, testing the technology, and ensuring that programs work fairly and ethically; an aspect of social science research that RoRI is particularly interested in. Furthermore, complementary AI could be used in other parts of the research process to eliminate many stumbling blocks that could be presented by reviewers on submitting a paper.

Using AI in peer review is just one example of open innovation to improve an aspect of research, but when can we expect to see this and other AI solutions being widely adopted as part of the research process? There is already a lot of tech around us, but within the next few years, this field will expand further as we learn more about how research works. By conducting research on research, researchers like Karim can uncover trends and connections in a range of research processes, and work towards creating tech solutions that will alleviate the burden and increase the efficiency of research.

We would like to thank Professor Karim Lakhani for giving up his time to join us for this episode of RoRICast. You can hear this whole episode of RoRICast here.

We’ll be staying on the topic of peer review and pandemics by kicking off a mini-blog series tomorrow on the PREreview Project. Earlier this year a collaboration of publishers, industry experts and a preprint site (PREreview) joined together to respond to overwhelming levels of COVID-19 papers. Using information and feedback from the parties and reviewers involved, our authors Jon Treadway and Sarah Greaves examine what happened, whether the initiative succeeded, and what the results can tell us about peer review.

Using NLP to Build a Market Intelligence Platform for the Biotech Industry
https://www.digital-science.com/blog/2020/06/nlp-series-nlp-and-biopharma/ (3 Jun 2020)

Today’s chapter of our NLP blog series is written by Andrii Buvailo. Andrii is a co-founder and director at BPT Analytics, and also Editor at BiopharmaTrend.com, responsible for all content, analytics, and product development in the project. He has been writing about research and business trends in the pharmaceutical industry for over four years, mainly focusing on the digital transformation of drug discovery. Before moving to the pharma space, Andrii held a number of executive positions in various hi-tech companies. Prior to his industrial career, he spent years as a practising scientist, having participated in numerous research projects in Belgium, Germany, the United States, and Ukraine. Andrii holds a BSc and an MSc in Inorganic Chemistry, and a PhD in Physical Chemistry from Kyiv National Taras Shevchenko University. Outside his professional career, Andrii is a big fan of travel, chess, and digital drawing.

Building an initial knowledge base about the pharmaceutical industry

The BPT Analytics project started back in 2016 with a simple drug discovery market research blog at BiopharmaTrend.com, where we posted our own regular observations about innovations and technology trends in the pharmaceutical industry, focusing on what companies did to advance the field. At that time we started a systematic effort of collecting data about as many drug discovery and biotech companies and startups as we possibly could. The idea was to create a large enough database of properly labelled companies to see if we would be able to later train machine learning models on it. In 2019 we were awarded a Catalyst Grant to advance our efforts.

Today, we already have a database of more than 7,000 pharma/biotech companies and over 3,000 investors active in the area. The list of companies is matched with numerous other databases and information resources, including clinical trials, marketed drugs, research papers and patents, funding rounds, R&D partnerships, and other aspects important for understanding each company's role and position in the pharmaceutical landscape. We gather data from numerous sources, including our web-parsing engine, collection via external APIs, data supplied by users, and data collected manually.

Importantly, we have built the infrastructure for our freelancers to manually curate the incoming data, which has allowed us to accelerate and scale up our manual data curation effort. In order for this data to become useful to pharmaceutical professionals and other decision-makers, we are building a subscription-based web interface, BPT Analytics, where users can conduct their own market research using our data, with advanced filters and powerful visualization tools. The interface is currently in private beta testing with basic functionality.

On the horizon: using NLP to automate ontology construction 

While well-organised manual data curation is one way to build a useful market intelligence service for the pharma industry, it is certainly a limited value proposition. For example, our search is limited to exact keyword-based indexing, without any semantic search options. It means that we can only find information using exact terms and parameters. If a document contains a slight variation of the same term, our search will not be able to find that document.

Another limitation is that all labelling has to be done manually for each entity, and all entities have to be manually associated in the database, which is extremely resource-demanding and inefficient. In order to provide a new level of data mining capabilities for our future customers, we are now exploring ways to apply natural language processing (NLP) technologies in our project. One of the key tasks we hope to solve by implementing NLP models is automating domain-specific entity recognition – identifying biotech companies, drugs, diseases, therapeutic modalities, etc. – from vast amounts of mostly unstructured data, and grouping the entities by a number of requirements.
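To give a flavour of what entity recognition looks like in code, here is a minimal sketch using spaCy's general-purpose English model. It is an illustration only: the labels it produces are generic, the example text is made up, and a production system like the one described here would need a model trained on biomedical text.

```python
# Minimal sketch of named entity recognition with spaCy's generic English model.
# Illustrative only: a biotech pipeline would need a model trained on biomedical
# text to recognise drugs, diseases and therapeutic modalities reliably.
# Assumes the small model is installed: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("Acme Biotherapeutics announced a Phase II trial of its antibody "
        "candidate for rheumatoid arthritis in Boston.")

for ent in nlp(text).ents:
    print(ent.text, ent.label_)  # e.g. ORG, GPE; domain-specific labels require training
```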

In time this means we will be able to extract relations between the entities and build knowledge graphs; a key component in being able to understand the pharmaceutical R&D market and derive macro- and micro-trends and business insights for the user.

Challenges to overcome

Integrating NLP models into the existing project is a tricky endeavour, and we will need to expand our expertise substantially to achieve this goal. We have unique domain-specific expertise in the life sciences industry and biotech market, and a large corpus of quality data on which to train models. We are now exploring our potential customers' needs to formulate use cases and pipeline requirements for the NLP system and its output.

NLP Series: Natural Language Processing and Paper Digest
https://www.digital-science.com/blog/2020/05/nlp-series-nlp-and-paper-digest/ (20 May 2020)

The Paper Digest team

This latest article in our blog series on Natural Language Processing comes from the co-founders of 2019 Catalyst Grant winners Paper Digest. Dr Yasutomo Takano is a project researcher at the University of Tokyo. Dr Cristian Mejia is a specially appointed assistant professor at TokyoTech. Nobuko Miyairi is a strategic advisor at Paper Digest, and a scholarly communications consultant.

What is Paper Digest?

Paper Digest is an automated summarisation service specialised in academic literature. It aims to help non-native English-speaking researchers by reducing the burden of reading the ever-increasing pile of research articles written in English. As ‘English as a second language’ (ESL) researchers ourselves, we keenly felt this disadvantage, and decided to develop this tool. To our surprise, it has been well-received by native English speakers as well, because everyone can benefit from such a time-saving tool.

At its core, Paper Digest helps users assess whether a given academic article is worth their time for more careful reading. This is done by offering a list of sentences picked verbatim from the document, which are expected to provide more information than those shown in the abstract. In NLP parlance, this is known as extractive summarisation.

Paper Digest is a tool that summarises the key points of an academic paper using natural language processing and extractive summarisation.

How Paper Digest works

In order to find the key pieces of information, instead of reading the article in a linear manner from introduction to conclusion, we use the analogy of networks. Imagine that we decomposed the article into sentences and mixed them together in a box. As we draw sentences from the box, we use string to tie together those sentences that are similar to one previously drawn. By scrambling the sentences we have lost contextual information. However, we can still assess whether a pair of sentences is similar by looking at their vocabulary: do they use the same keywords or synonyms, or refer to the same concepts? The more similar they are, the shorter the string we use to tie them. Once the box is empty, we end up with what resembles a network of sentences. Here and there we may find some groups of sentences tightly connected, where at least one sentence plays a central role in keeping the bundle together. What Paper Digest presents, as a result, is a list of those central sentences.
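For readers who want to see the mechanics, the description above corresponds to a graph-based extractive method in the spirit of TextRank. The sketch below is a minimal illustration of that general idea, not Paper Digest's actual algorithm; TF-IDF similarity and PageRank centrality stand in for the measures a production system would tune.

```python
# Minimal sketch of graph-based extractive summarisation (TextRank-style).
# Illustrative only: TF-IDF similarity and PageRank stand in for the
# sentence-similarity and centrality measures a production system would tune.
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarise(sentences, n_keep=3):
    # Represent each sentence by its vocabulary (TF-IDF weights).
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    # Pairwise similarity plays the role of the "length of string" between sentences.
    sim = cosine_similarity(tfidf)
    # Build the sentence network and rank sentences by centrality.
    graph = nx.from_numpy_array(sim)
    scores = nx.pagerank(graph)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_keep]
    # Return the most central sentences in their original order.
    return [sentences[i] for i in sorted(top)]
```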

As a baseline, the above approach works, but a lot of effort has gone into optimising our methodology, from how to better split the document into sentences to better definitions of what being "similar" means. Typical NLP evaluation methods such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bilingual Evaluation Understudy (BLEU) show that our algorithm performs well. However, we refrain from relying on those evaluation scores because, in the end, there is still a gap between what a machine and a human understand a good summary to be.
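For reference, ROUGE-1 recall is simply the proportion of a human-written reference summary's words that also appear in the machine summary; a toy version (omitting the tokenisation and stemming refinements of standard implementations) looks like this:

```python
# Toy ROUGE-1 recall: fraction of reference unigrams also present in the summary.
# Standard implementations add tokenisation, stemming and other refinements.
from collections import Counter

def rouge1_recall(summary, reference):
    summary_counts = Counter(summary.lower().split())
    reference_counts = Counter(reference.lower().split())
    overlap = sum(min(count, summary_counts[word]) for word, count in reference_counts.items())
    return overlap / max(sum(reference_counts.values()), 1)
```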

The future of Paper Digest and NLP

Our current focus is to better understand our users’ needs and optimise the algorithm accordingly. For instance, a Ph.D. student who is interested in writing a review article might be looking for methods and statistical significance tests, while someone writing a science communication piece might be interested in other things. They may both have different opinions about what they deem to be the ‘most important’ sentences from the same document. To capture these nuances, we have put in place a feedback system in our interface, so that users can give a ‘like’ to each of the extracted sentences to indicate their agreement. By accumulating this feedback from our users over time, they play a huge role in helping us improve the algorithm, and making this application of NLP as useful for as broad a range of people as possible.

Natural Language Processing and Inclusion
https://www.digital-science.com/blog/2020/05/nlp-series-nlp-and-inclusion/ (6 May 2020)

Quick Read
  • Co-founders of Scismic discuss natural language processing
  • Scismic is a tool designed to remove bias from the recruitment process in the life sciences, to create a more diverse and representative workforce
  • Scismic uses a skills-based matching algorithm to ensure candidates are being assessed on non-biographical qualifications
  • NLP exists to make sense of the human language, which has vast potential to improve technical systems across industries

We continue our NLP blog series on Natural Language Processing with an article from the team behind Scismic, Dr Danika Khong and Dr Elizabeth Wu. Scismic is a tool designed to remove bias from the recruitment process in the life sciences, to create a more diverse and representative workforce. The founders both hail from Boston, USA. In April 2019 Scismic won a Catalyst Grant for the beta version of their product, and last month Digital Science welcomed Scismic into their family of portfolio tools to help make research the best it can be. In this post, we will hear more about what Scismic does, and how and why the team hope that they will soon be able to implement NLP techniques into their processes.

What is Scismic?

Scismic Job Seeker is an online hiring platform for scientists that works toward accelerating therapeutic development by delivering the people best suited to specific research positions. As part of our efforts, we are working to reduce human biases in the candidate evaluation process, which has been shown to exclude scientists of non-traditional backgrounds from the recruitment pipeline. Scismic was recently awarded a grant from the National Institutes of Health in the US to further develop the matching system towards increasing the number of underrepresented scientists invited for job interviews. As we build out our system, we are exploring the incorporation of natural language processing (NLP) to enhance our matching system.

How can Scismic help reduce or eliminate unconscious bias?

Scismic uses a skills-based matching algorithm to ensure candidates are being assessed on non-biographical qualifications. In the US, the Equal Employment Opportunity Commission (EEOC)'s Uniform Guidelines on Employment Selection Procedures stipulate that the outcome of hiring assessments should be based on three forms of evidence. Criterion validity is one such form of evidence, encompassing the assessment's ability to predict job outcomes (e.g. from skill sets). With its skills-based matching algorithm, Scismic bypasses the need for preliminary human screening of skills. In addition, Scismic's system is able to translate the words scientists use for their skills into the words talent acquisition teams use to describe the same skills. Scismic achieves this using a skills taxonomy that ensures the contextual definitions of individual words are preserved during this translation process. The taxonomy comprises words manually linked or strung together to define a precise scientific skillset, each carrying a defined functional importance within the field of life sciences. Matching these skill sets between job seekers and companies using Scismic's taxonomy establishes the criterion requirement while reducing human bias and enhancing both job-seeking and staffing efficiency.
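To make the idea concrete, here is a minimal sketch of taxonomy-based skill matching; the synonym map, skills and scoring below are illustrative assumptions, not Scismic's actual taxonomy or algorithm.

```python
# Minimal sketch of taxonomy-based skill matching. The synonym map and scoring
# are illustrative assumptions, not Scismic's actual taxonomy or algorithm.
SKILL_TAXONOMY = {
    "flow cytometry": {"flow cytometry", "facs", "fluorescence-activated cell sorting"},
    "gene editing": {"gene editing", "crispr", "crispr-cas9"},
    "western blot": {"western blot", "immunoblotting", "wb"},
}

def canonical_skills(raw_terms):
    """Map free-text skill terms onto canonical taxonomy entries."""
    terms = {t.lower().strip() for t in raw_terms}
    return {canon for canon, synonyms in SKILL_TAXONOMY.items() if terms & synonyms}

def match_score(candidate_terms, job_terms):
    """Fraction of the job's required skills covered by the candidate."""
    candidate, job = canonical_skills(candidate_terms), canonical_skills(job_terms)
    return len(candidate & job) / max(len(job), 1)

print(match_score(["FACS", "immunoblotting"], ["flow cytometry", "western blot", "CRISPR"]))
# ~0.67: two of the three required skills are matched despite different wording
```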

The Scismic recruitment process

How Scismic works

As Scismic scales towards wider and more diverse audiences, the manual efforts in building and maintaining this taxonomy will no longer be sustainable. Naturally, Scismic is preparing to turn to NLP to emulate recognition of these skill sets, their functional role in the life sciences, and therefore their relationships to each other.

NLP exists to make sense of human language, which has vast potential to improve technical systems across industries. The challenging aspect of this is the semantic analysis of words, i.e. understanding, and hence preserving, the meaning and interpretation of words and how sentences are structured. Scismic strives to build a more sophisticated taxonomy of skills that takes complex and highly specific scientific terms and translates them into simple but functionally similar terms through this AI tool.

Implementing AI and machine learning into an existing process

Incorporating AI is not without risk. One danger is the unintentional creation of bias within the system due to population biases among its users. Scismic will analyze strict metrics to proactively test the matching system for bias, so that potential biases can be evaluated and addressed.
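As one illustration of the kind of metric such testing can involve (an assumption on our part, not necessarily what Scismic uses), the EEOC's four-fifths rule flags any group whose selection rate falls below 80% of the highest group's rate:

```python
# Minimal sketch of a four-fifths (adverse impact) check on matching outcomes.
# The groups and counts are made up; this is not Scismic's actual metric suite.
def four_fifths_check(outcomes, threshold=0.8):
    """outcomes: {group: (n_selected, n_total)} -> {group: passes_check}"""
    rates = {group: selected / total for group, (selected, total) in outcomes.items()}
    best = max(rates.values())
    return {group: rate / best >= threshold for group, rate in rates.items()}

print(four_fifths_check({"group_a": (45, 100), "group_b": (30, 100)}))
# group_b fails: 0.30 / 0.45 is roughly 0.67, below the 0.8 threshold
```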

For Scismic, the adoption of NLP promises scale without compromising match accuracy. We walk down this road knowing that the process will not be easy and will require frequent assessments to validate that the incorporated AI remains accurate and introduces little analytical bias.

NLP Series: Speeding up academic writing
https://www.digital-science.com/blog/2020/04/nlp-series-nlp-in-academic-writing/ (21 Apr 2020)

In this week's edition of our blog series on Natural Language Processing, we hear from two members of the team at Writefull, the academic writing support tool. Dr Hilde van Zeeland is Chief Applied Linguist at Writefull. After completing an MSc and PhD in Applied Linguistics at the University of Nottingham, UK, she worked for several years as a language testing consultant and a scientific information specialist before joining Writefull. Dr Juan Castro is one of the founders of Writefull. He finished his PhD in Artificial Intelligence at the University of Nottingham, UK, and then did a few postdocs at the same university before founding Writefull.

Introducing Writefull

Writing is key to science. Whether it is journal articles, book chapters, reports or conference proceedings, most research is communicated through written texts. For most researchers however, writing takes up more time and effort than they would like. Fortunately, we now have Writefull: a tool that uses the latest Natural Language Processing (NLP) techniques to speed up the writing process.

Data, data, data, and models

NLP is a strand of Artificial Intelligence that refers to the automatic understanding and generation of human language. It can be applied to many purposes, such as predictive text, automatic translation, and text categorisation. Whatever the application of NLP, its techniques often rely on training models on vast amounts of data. As these models process batches of data, they acquire the knowledge needed for the task at hand. For predictive text, for example, they learn recurrent linguistic strings.

NLP models and Writefull

To help with academic writing, we need models to do three things:
1) to learn the recurrent patterns of academic texts;
2) to recognise when an author’s language does not follow these patterns, and;
3) to change such language so that it follows the expected patterns. 

Writefull can suggest changes to academic writing based on the likelihood of a word or sentence being correct.

At Writefull we have spent the last few years developing and training models that do just that. We offer an editor in which researchers can write their text. They then get automatic feedback on their writing, and can accept or reject Writefull’s suggestions. The models that Writefull uses to give feedback have been trained on millions of journal articles. Thanks to this, they can spot when the author’s writing deviates from the norm – that is, from the expected language patterns as acquired from our dataset. In many cases, such deviations will be grammatical errors, but they can also include things like awkward wording or unnecessary commas.

Why AI beats grammar rules

Traditional language checking software uses grammar rules to check for fixed elements in a sentence. For example, they might ensure that the right prepositions precede certain nouns by coding rules such as: correct ‘at progress’ into ‘in progress’. 

Programming rules is definitely easier than training models. However, once models work well, they are much more powerful. Rules are limited; even thousands of rules wouldn't cover all of the mistakes that authors can make, whereas models can cope with any input: their knowledge is generalisable to any sentence. To give you an example, Writefull recently corrected 'time of the day and day of the week' into 'time of day and day of the week'. Writefull knew that, in this context, 'the' precedes 'week', but not 'day'. There are many of these usage-based norms, and it is impossible to cover all of them in a rule set, but a model, if trained sufficiently, will eventually learn them.

Another downside of rules is their black-or-white nature. If an author's sentence triggers a rule, it will be corrected regardless of the context. This may lead to false corrections. Models, on the other hand, look at the context to judge what suggestions are needed and, based on this, can give nuanced feedback. When Writefull spots that something is off in a text, it often gives the author the probability of their phrase and compares this to alternatives. For example, when writing "He is sitting on the sun" in the Writefull editor, Writefull shows that "He is sitting in the sun" is a more probable alternative, with 82% likelihood for the latter versus 18% for the former. In cases like this, Writefull does not give a harsh correction, but an insight into the likelihood of the author's wording versus alternatives. Language correctness is, after all, not always black-or-white. Messiness and ambiguity, both inherent to language, are two key challenges in the field of NLP.
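A rough way to reproduce this kind of probability comparison is shown below, using a general-purpose masked language model from the Hugging Face transformers library. This is not Writefull's own technology, which is trained specifically on journal articles, so the exact numbers will differ; it only illustrates the principle of scoring alternatives in context.

```python
# Rough sketch: comparing candidate prepositions with a general-purpose masked
# language model. Writefull's own models, trained on journal articles, will
# produce different (and more discipline-appropriate) probabilities.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill("He is sitting [MASK] the sun.", targets=["in", "on"]):
    print(candidate["token_str"], round(candidate["score"], 3))
```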

The challenge of messy language

A challenge to Writefull – and to any NLP application – is noisy input. If an author writes sentences that are very different from the language that Writefull’s models know from training (i.e., from the journal articles), Writefull may fail to give accurate feedback. Think of an author messing up word order or making several serious grammar mistakes in one sentence. The challenge is therefore to identify those cases where it is best to not suggest anything, for a suggestion might turn out to be incorrect.

The possibilities are endless

At Writefull, we're continuously exploring avenues to make our feedback even more accurate and complete. While Writefull currently gives feedback on many language features, including the use of punctuation, prepositions, subject-verb agreement, etc., there are still plenty of science-specific features to cover. Academic writing might use virtually the same grammar as other genres, but it is highly specific in other respects, such as word use. We now have the technology in-house to expand – and in doing so, we're keeping a close eye on developments in the NLP field.

NLP Series: AI in Science; the Promise, the Challenge, and the Risk
https://www.digital-science.com/blog/2020/04/nlp-series-ai-in-science-promise-challenge-risk/ (7 Apr 2020)

Continuing our blog series on Natural Language Processing, Dr Joris van Rossum focuses on AI in science; the potential to make research better, but also the pitfalls that we must be wary of when creating and applying these new technologies. Joris has over 20 years of experience driving change in the publishing industry through new technologies and business models. His former roles include Director of Publishing Innovation at Elsevier and Director of Special Projects at Digital Science, a role in which he authored the Blockchain for Research report. He co-founded Peerwith in 2015, and currently serves as Research Data Director at STM, where he drives the adoption of sharing, linking and citing data in research publications.

Understanding the risks

According to Professor Thomas Malone, Director of the MIT Center for Collective Intelligence, AI should essentially be about connecting people and computers so that they collectively act more intelligently than any individual person, group or computer has ever done before. This connectivity is at the core of science and research. Science is a collective activity par excellence, connecting millions of minds in space as well as time. For hundreds of years, scientists have been collaborating and discussing their ideas and results in academic journals. Computers are increasingly important for researchers: in conducting experiments, collecting and analyzing data and, of course, in scholarly communication. Reflecting on this, it is perhaps surprising that AI does not play a bigger role in science today. Although computers are indispensable for modern scientists, the application of artificial intelligence lags behind other industries, such as social media and online search. Despite its huge potential, uptake of AI has been relatively slow. This is in part due to the nascent state of AI, but also to do with cultural and technological features of the scientific ecosystem. We must be aware of these in order to assess the risks associated with unreflectively applying artificial intelligence in science and research.

AI and NLP in healthcare

A logical source of data for intelligent machines is the corpus of scientific information that has been written down in millions of articles and books. This is the realm of Natural Language Processing (NLP). By processing and analyzing this information, computers could come to insights and conclusions that no human could ever reach individually. Relationships between fields of research could be identified, proposed theories corroborated or rejected based on an analysis of a broad corpus of information, and new answers to problems given.

This is what IBM’s Watson has attempted in the field of healthcare. Initiated in 2011, it aims to build a question-and-answer machine based on data derived from a wealth of written sources, helping physicians in clinical decisions. IBM has initiated several efforts to develop AI-powered medical technology, but many have struggled, and some have even failed spectacularly. What this lack of success shows is that it is still very hard for AI to make sense of complex medical texts. This will therefore most certainly also apply to other types of scientific and academic information. So far, no NLP technology has been able to match human beings in comprehension and insight.

Barriers to information

Another reason for the slow uptake of NLP in science is that scientific literature is still hard to access. The dominant subscription and copyright models make it impossible for machines to access the entire corpus of scientific information published in journals and books. One of the positive side effects of the move towards Open Access would be access to this information by AI engines, although a large challenge still lies in the immaturity of NLP when dealing with complex information.

More data give greater context

Despite the wealth of information captured in text, it is important to realize that the observational and experimental scientific data that stands at the basis of articles and books is potentially much more powerful for machines. In most branches of science the amount of information collected has increased with dazzling speed. Think about the vast amounts of data collected in fields like astronomy, physics and biology. This data would allow AI engines to fundamentally do much more than what is done today. In fact, the success that born-digital companies like Amazon and Google have had in applying AI is to a large extent due to the vast amount of data they have at their disposal. AI engines could create hypotheses on the genetic origins of diseases or the causes of global warming, test these hypotheses by plowing through the vast amounts of data produced daily, and so arrive at better and more detailed explanations of the world.

Shifting the culture around data sharing to create better AI

A challenge here is that sharing data is not yet part of the narrative-based scholarly culture. Traditionally, information is shared and credit earned in the form of published articles and books, not in the underlying observational and experimental data.

Important reasons for data not being made available are the fear of being scooped and the lack of incentives, as the latest State of Open Data report showed. Thankfully, in recent years efforts have been made to stimulate or even mandate the sharing of research data. Although these efforts are primarily driven by the need to make science more transparent and reproducible, enhancing the opportunity for AI engines to access this data is a promising and welcome side-effect.

Like the necessary advancement of NLP techniques, making research data structurally accessible and AI-ready will take years to come to fruition. In the meantime, AI is being applied in science and research in narrower domains, assisting scientists and publishers in specific steps of their workflows. AI can build better language editing tools, as in the case of Writefull, who we will hear from in the next article in this series. Publishers can apply AI to perform technical checks, as Unsilo does; to scan submitted methods sections to assess the reproducibility of research, the way Ripeta and SciScore do; and to analyze citations, like Scite. Tools are being developed to scan images in submitted manuscripts to detect manipulation and duplication, and of course scientists benefit from generic AI applications such as search engines and speech and image recognition tools. Experiments have also been done with tools that help editors make decisions to accept or reject papers. The chance of publishing a highly cited paper is predicted based on factors including the subject area, authorship and affiliation, and the use of language. This last application exposes an essential characteristic of machine learning that should make us cautious.

Breaking barriers, not reinforcing them

Roughly speaking, in machine learning, computers learn by means of identifying patterns in existing data. A program goes through vast numbers of texts to determine the predominant context in which words occur, and uses that knowledge to determine what words are likely to follow. In the case of the tools that support editors in their decision to accept or reject papers, it identifies factors that characterize successful papers, and makes predictions based on the occurrence of these factors in submitted papers. This logically implies that these patterns will be strengthened. If a word is frequently used in combination with another word, the engine subsequently suggesting this word to a user will lead to that word being used even more frequently. If an author was successful, or a particular theory or topic influential, AI will make these even more so. And if women or people from developing countries have historically published less than their male counterparts from Western countries, AI can keep them underperforming.

In other words, AI runs the risk of consolidating contemporary structures and paradigms. But as the philosopher of science Thomas Kuhn showed, real breakthroughs are characterized by breaking patterns and replacing paradigms with new ones. Think of the heliocentric worldview of Kepler, Copernicus and Galileo, Darwin's theory of natural selection, and Einstein's theory of relativity. Real progress in science takes place by means of the novel, the unexpected, and sometimes even the unwelcome. Humans are conservative and biased enough. We have to make sure that machines don't make us even more so.

DOI: https://doi.org/10.6084/m9.figshare.12092403.v1

NLP Series: Applying Natural Language Processing to a Global Patent Database
https://www.digital-science.com/blog/2020/03/nlp-series-nlp-and-digital-science-2/ (31 Mar 2020)

The latest article in our blog series on Natural Language Processing is from Catherine Suski, Director of Marketing at IFI CLAIMS Patent Services. Catherine has a passion for technology, and enjoys working in an area where she can see the direct impacts of implementing new tech into existing processes. Here Catherine will be talking about the benefits of using NLP to create an inclusive global patent database.

The Role of NLP in Inclusive Data Curation of Patent Information

CLAIMS Direct is a global patent database created by IFI CLAIMS Patent Services (IFI). NLP allows the vast amount of information contained in patents to be applied to many situations. Through data curation, such as the standardisation of organisation names, data can be amalgamated from a range of original sources. Using NLP, this patent information can also be translated into English from over 40 languages. By curating the data in this way, researchers can quickly access information from a broad range of original sources.

IFI receives inquiries from companies that require access to patent information for a range of use cases. From discovering important new invention types for use in investment decisions, to analysing the effects of government programmes on regional economic stimulus, the analysis of patent documents is becoming more widespread.

The growth of inexpensive and ever more powerful computing has led to easier methods for extracting meaningful data from patents, and NLP is a prime example of this. This technology is absolutely vital because, according to the 2019 report from the World Intellectual Property Organization (WIPO), 3.3 million patent applications were filed globally in 2018. This is almost twice the 1.85 million filed in 2008. There are more than 14 million active patents globally.  With this many applications, it would be impossible to manually search for relevant information. Enter NLP.

Using NLP to Overcome the Language Barrier of Global Patent Information

With so many global patents that can contain important information, accurate translations are a must. Machine translation, or the use of computer software to perform translations, has been used for decades to translate patents. Recent advances employing NLP are speeding up this process. Early attempts looked at each word or phrase and translated it in isolation; newer techniques look at the overall context to provide higher-quality results.

CLAIMS Direct, the global patent database and platform from IFI,  uses Google Translate to convert documents in 48 languages to English. Based on neural network technology, one of the several driving forces behind NLP, Google Translate offers an exceptional level of accuracy. It overcomes problems found in most older phrase-based machine translation systems that do not sample a large enough segment of text to produce a proper translation. Using a large end-to-end network, this technology translates whole sentences or paragraphs at a time to provide context, and uses machine learning to continually make improvements over time.
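At a much smaller scale, that translation step can be sketched with the Google Cloud Translation client library. The snippet below is illustrative only: it assumes Google Cloud credentials are configured in the environment, the example abstract is made up, and it is not a description of IFI's production pipeline.

```python
# Minimal sketch: translating a patent abstract to English with the Google Cloud
# Translation API. Illustrative only; assumes GOOGLE_APPLICATION_CREDENTIALS is
# set, and is not a description of IFI's production pipeline.
from google.cloud import translate_v2 as translate

client = translate.Client()
abstract_de = "Die Erfindung betrifft eine Vorrichtung zur Messung der Windgeschwindigkeit."

result = client.translate(abstract_de, target_language="en")
print(result["detectedSourceLanguage"], "->", result["translatedText"])
```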

Patent documents are often used by organisations and individuals who seek to patent something themselves. To be awarded a patent, the concept cannot infringe on another patent, and must also be a novel idea. Making a mistake by missing an existing publication or previously granted patent can lead to costly infringement lawsuits. The stakes are high and there will always be a big incentive to get it right. It is, therefore, common for many people to be involved in researching previous patent data, often employing multiple search methods.

While the exact format of a patent can vary by region, patents have a number of structured data elements in common, including invention title, inventor, submission date, active or inactive status, etc. This information, stored in named fields, is accessible in databases and is easy to search. However, the body of a patent can contain far more useful free-form text, or unstructured data, that is not parsed into fields and is difficult to search with keywords and legacy search engines.

Search tools that use NLP can reveal crucial ideas contained in patent literature more easily than traditional methods which rely on keyword matches. Patent documents can be written using language which is meant to obscure the true nature of the invention, with the aim of keeping the subject matter hidden from competitors. Sometimes even technical subject matter experts cannot clearly see the idea being put forward. With the use of semantic and NLP algorithms, improved accuracy is achieved by ingesting large spans of text, examining the context, and making connections that are not otherwise obvious. The use of synonyms can also uncover new and relevant documents. The user's search intent is better understood, and uniting all of these capabilities saves a huge amount of time.
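The embedding-based idea behind this kind of semantic search can be sketched as follows; the model and the toy patent snippets are illustrative assumptions, not CLAIMS Direct's actual engine or index.

```python
# Minimal sketch of embedding-based semantic search over patent text.
# Illustrative only; not CLAIMS Direct's actual engine or index.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "A rotor blade assembly for converting wind energy into electricity.",
    "A spring-driven mechanism for winding a mechanical clock.",
    "An anemometer for measuring wind speed at turbine hub height.",
]
query = "generating power from moving air"

doc_embeddings = model.encode(documents, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, doc_embeddings)[0].tolist()
for document, score in sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.2f}  {document}")  # note the query never uses the word "wind"
```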

Traditional Use Cases for NLP in Patent Documents

In business activities where intellectual property (IP) traditionally plays a large role, such as engineering and developing new drugs, some very successful new products incorporating NLP are improving the patent search process. Many clients of IFI have used CLAIMS Direct to build features such as:

  • Integration with other data sources: In addition to patents, searchable indexes can include scientific publications, internal research, websites, and other industry specific knowledge sources.
  • Text mining with specific vocabulary: This is especially important to the related industries of life sciences, biotechnology, and pharmaceuticals. For example, when developing new therapies, gene and disease target scanning can find research from another company that may be applicable to a new invention.
  • Clustering and categorisation: While patents from most countries use a common classification system, it is limited, and not industry specific. Some applications use pre-built tools tailored to different business requirements, while others allow users to set up their own requirements. The resulting visualisations provide quick insights about the latest inventions in any given field.
  • Relevancy scoring: With traditional search tools, results are ranked. Taking this a step further and providing a percentage score for relevancy shows the user a more finely-tuned answer.
  • Results delivered in an interactive framework: Search results can be refined by choosing “more like” in a field of related concepts. For example, when searching for “wind” a semantic application could give results that include wind turbines, wind-up clocks, and wind speed. The user can then select the most relevant category.

New Use Cases for NLP in Patents

Advances in NLP have resulted in it becoming a lot easier to extract important information which used to be hidden in patent documents. This has led to a range of new use cases.

Patents are making their way onto the trading floor. Fund managers want to know which technologies are on the verge of quick growth, and who owns them, in order to inform investment decisions. Here, well indexed, easy to search patent data is crucial. By adding a data source such as CLAIMS Direct to their fast-moving algorithmic trading systems, they are utilising NLP to find hidden tips, enabling analysts to create better reports.

Management consulting companies are getting in on the action too. They need to keep clients informed about the most up-to-date technology and competitive intelligence across the globe. Knowing when relevant patents have been published or granted can be a game changer. NLP offers consultants the ability to quickly uncover trends important to their clients, while improving efficiency through automated workflows. Clustering visualisations makes the information easier to understand.

As the technology continues to evolve, more use cases for patent information will emerge. We look forward to implementing these advances into our processes at IFI CLAIMS Patent Services, to continue to be as inclusive of, and useful to, the wider research community as we can possibly be.

NLP Series: NLP and Digital Science
https://www.digital-science.com/blog/2020/03/nlp-series-nlp-and-digital-science/ (27 Mar 2020)

Steve Scott, Director of Portfolio Development

Continuing our blog series on Natural Language Processing, today’s article is from Steve Scott, Director of Portfolio Development at Digital Science. As a member of the founding management team, Steve has been involved in the majority of Digital Science’s early-stage portfolio investments, taking founders through product and business model validation to launch and growth. He has given out 32 Catalyst Grant awards since their inception, with five recipients going on to become Digital Science portfolio companies. An entrepreneur himself, Steve has founded, or been involved in setting up, three of his own companies. In his spare time, Steve enjoys building and riding his own bikes.

The value of NLP

In 1950 Alan Turing wrote a paper, “Computing Machinery and Intelligence”, in which he outlined what we now know as the Turing Test. In it he says, “A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.” While examples of his test portrayed in films usually use speech as the communication mechanism, text, the main ingredient of NLP, is equally suitable for a Turing Test. If a computer can write a prize-winning novel, or can analyse a researcher’s writing and act as their editor, it would pass a form of Turing Test. As computer vision and speech recognition have improved dramatically over the last 10 years, NLP is widely seen as the next key challenge in deep learning: allowing computers to make sense of human language in ways that are valuable.

From a layman’s perspective, NLP allows non-programmers to extract useful information from computer systems. Think of the way Gmail automatically sorts your inbox into different categories and controls your spam folder, or how Alexa or Google Home can translate your voice into commands that play music, answer questions, or switch on a light in your home. Smart homes of the future will also be more energy-efficient as they learn their inhabitants’ patterns and behaviours.

NLP attempts to make sense of unstructured data, such as text, and that data comes in an almost endless variety of forms, including papers, emails, abstracts, grant applications, etc. Our challenge is to find real-world problems and apply NLP to help overcome them.

From a Digital Science perspective, the two companies that best highlight the application of NLP to research challenges, Dimensions and Ripeta, share a number of benefits and features that capitalise on NLP to benefit their customers.

Over the last 10 years, Digital Science has funded and supported solutions to address the rapid growth in data generated by scientific research. The application of AI and Machine Learning to this data, in the form of unstructured textual data, has become a key focus for us. Our solutions allow for, among other things, better job-matching, improved conference identification, improved written English in papers, and automated reports evaluating reproducibility. I want to focus on two examples of the application of NLP in action.

Dimensions

Dimensions is a scholarly search database that focuses on the broader set of use cases that academics now face. By including awarded grants, patents, and clinical trials alongside publication and Altmetric attention data, Dimensions goes beyond the standard publication-citation ecosystem to give the user a much greater sense of context of a piece of research. All entities in the knowledge graph may be linked to all other entities. Thus, a patent may be linked to a grant, if an appropriate reference is made. Books, book chapters, and conference proceedings are included in the publication index. All entities are treated as first-class objects and are mapped to a database of research institutions and a standard set of research classifications via machine-learning techniques.

One of the challenges faced by the Dimensions development team was how to classify publications, grants, policy papers, clinical trials, and patents using a common approach across types. This is key to allowing cross-referencing between multiple content types. In Dimensions, standardized and reproducible subject categorization is achieved algorithmically using an NLP approach. The team started by giving a subject expert the capacity to build a classification based on a set of search terms. Starting with a general search term, or a longer constructed search string, the expert starts to amass an inclusive set of objects that fall into the presumptive category. Concepts are extracted from the corpus that has been returned and the expert can then boost particular keywords, re-ranking the search results to produce a different relevance score, or they can exclude objects that include particular terms. After repeating this process the expert (who is an expert in the subject but not an expert in computer coding) can define a field in a way that a computer can understand. This approach allows the computer to codify a set of rules that can be applied reproducibly to any content.
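The end result of such a workflow can be thought of as a codified, reproducible rule. The sketch below shows the general shape of one, with made-up terms, weights and a threshold; it illustrates the idea only and is not Dimensions' actual classification code.

```python
# Minimal sketch of a codified category rule: seed terms, boosted keywords and
# exclusion terms combine into a relevance score. Terms, weights and threshold
# are made up for illustration; this is not Dimensions' actual implementation.
CATEGORY_RULE = {
    "include": {"photovoltaic": 1.0, "solar cell": 1.0, "perovskite": 2.0},  # boosted term
    "exclude": {"solar wind"},  # drop astrophysics records that merely mention "solar"
    "threshold": 1.5,
}

def relevance(text, rule=CATEGORY_RULE):
    text = text.lower()
    if any(term in text for term in rule["exclude"]):
        return 0.0
    return sum(weight for term, weight in rule["include"].items() if term in text)

def in_category(text, rule=CATEGORY_RULE):
    return relevance(text, rule) >= rule["threshold"]

print(in_category("Perovskite solar cell efficiency gains"))         # True
print(in_category("Solar wind interaction with the magnetosphere"))  # False
```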

One of the problems in categorizing articles is the human factor. We are constantly learning and changing our opinion, so in order to have a standardized basis for analysis of categories we need to remove the vagaries of the human classifier. Using NLP, we can build a useful, reproducible definition of an arbitrary categorization system that will automatically be applied to any new content that is brought into Dimensions.

Ripeta

In a similar fashion to Dimensions, Ripeta has been trained on research outputs. Ripeta aims to improve the transparent and responsible reporting of research, allowing stakeholders to effectively take stock of their reproducibility and responsible reporting programme and enhance their practices. Analysing over 100 variables within a text that relate to reproducibility, Ripeta gives the user an assessment of the likelihood of being able to reproduce the results of that paper. Looking for things like study purpose, code acknowledgements, data availability statements, and software programmes used (along with version numbers) gives what is in effect a credit score for that paper. Publishers and grant funding bodies can now analyse their archives and future grants and publications in order to ensure that funding is being used to conduct transparent and reproducible science.
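A toy version of this kind of surface check is sketched below; Ripeta's actual models analyse over 100 variables and go well beyond simple pattern matching, so the patterns and example text here are illustrative only.

```python
# Toy sketch of surface-level reproducibility checks on a paper's text.
# Ripeta's actual models analyse 100+ variables and go far beyond regexes.
import re

CHECKS = {
    "data_availability": r"data (are|is) available|data availability statement",
    "code_availability": r"code (is|are) available|github\.com",
    "software_version": r"\b(r|python|spss|stata)\s*(version|v)?\s*\d+(\.\d+)*",
}

def reproducibility_report(text):
    text = text.lower()
    return {check: bool(re.search(pattern, text)) for check, pattern in CHECKS.items()}

paper = "Analyses were run in Python 3.9. Data are available at doi:10/xyz."
print(reproducibility_report(paper))
# {'data_availability': True, 'code_availability': False, 'software_version': True}
```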

Looking ahead

What these companies offer are ways to increase efficiency, reduce costs and ultimately support better research. In the case of Dimensions that means giving the users a much greater sense of context of a piece of research. For Ripeta, that means shining a light on funded research to ensure it’s improving its efforts around reproducibility.

In the next ten years, we will see NLP capabilities expand and be embedded in new products and services, helping researchers navigate ever-expanding data outputs and allowing new ways to extract and interpret meaningful analysis from past and present papers.

NLP Series: What is Natural Language Processing?
https://www.digital-science.com/blog/2020/03/nlp-series-what-is-natural-language-processing/ (24 Mar 2020)

We’re continuing our blog series on Natural Language Processing with a brief guide to what it is, where it is being used, and why it is exciting news for research. Later this week, we will be hearing from Steve Scott, Director of Portfolio Development at Digital Science, to find out a bit more about why Digital Science are excited by NLP.

What is NLP?

Not to be confused with neuro-linguistic programming, natural language processing, or NLP, is the way technology can interact with humans through words. These words could be written, spoken or heard, as input or output. NLP is a subset of artificial intelligence and machine learning, whereby systems are, in this case, able to ‘learn’ words in a language by analysing a range of input sources, or training data. The system can start to make sense of the patterns in text and dialogue through statistical analysis and the formation of algorithms. The system does not need to be explicitly programmed with rules; it simply picks up the ability to produce a word or sequence of words that seems statistically likely given the contents of the training data and the context of the query.
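A toy illustration of "statistically likely" in this sense is a simple bigram model, which counts which word tends to follow which; real systems learn from vastly more data and far richer context, but the principle is the same.

```python
# Toy bigram model: predict the most likely next word from counted word pairs.
# Real systems learn from vastly more data and richer context than this.
from collections import Counter, defaultdict

def train_bigrams(corpus):
    counts = defaultdict(Counter)
    words = corpus.lower().split()
    for previous, following in zip(words, words[1:]):
        counts[previous][following] += 1
    return counts

model = train_bigrams("natural language processing helps computers process natural language")
print(model["natural"].most_common(1)[0][0])  # -> 'language'
```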

Where have I encountered it in my everyday life?

Though you may not have heard of the term NLP, you are highly likely to have used it in your everyday life. NLP can add a level of efficiency to many tasks. Take good old autocarrot. I mean, autocorrect! The little bit of tech that thinks it knows best can often be a useful tool when you mistype a word or aren't sure of the spelling. It is, however, widely regarded as a source of great hilarity when it does get things wrong, or just hasn't yet learned a new word in context; one example being my own name, Suze, which frequently autocorrects to 'Size' – ironic as I am of rather diminutive stature. Even spelling and grammar checkers are based on NLP technology. These programs constantly read the words we write and compare them against patterns learned from 'training data' drawn from a range of sources, flagging anything that looks unlikely to be correct. Similarly, having learned not only the spelling of words but also their likely order based on rules of grammar and sentence structure, predictive text and autocomplete are further examples of NLP in everyday life.
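For a sense of how a corrector can work, here is a toy spelling correction sketch. It is not how any particular product is implemented: real systems learn their vocabularies and word frequencies from huge amounts of text, whereas the word list here is an illustrative assumption.

# A toy spelling corrector (illustrative only): propose the most similar
# in-vocabulary word to whatever was typed.
from difflib import get_close_matches

# In a real system the vocabulary and word frequencies come from large
# amounts of training text; this list is purely illustrative.
vocabulary = ["suze", "size", "natural", "language", "processing", "autocorrect"]

def correct(word: str) -> str:
    matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.6)
    return matches[0] if matches else word

print(correct("autocorect"))   # -> "autocorrect"
print(correct("langauge"))     # -> "language"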

Beyond the ability to monitor and suggest, NLP also has the ability to translate, whether that is from speech into text, such as the dictation feature on many phones, or from one language to another via your favourite translation service. The latter is being used by IFI Claims to ensure that their patent database is as inclusive as possible, by extracting information in languages other than English and indexing that extracted information appropriately. The same application of NLP is used in apps that help you learn a new language. There are some limitations though, as a quick search for 'funny Duolingo phrases' will attest; while the sentence structure of some of Duolingo's best offerings makes sense, the meaning can sometimes be lost in translation, so these systems certainly wouldn't pass a Turing test any time soon!

The one where the Duolingo app has been using the sitcom Friends as training data.

Mimicking human conversation is, however, a common application of NLP. If you have recently asked for online help with an issue, you may have been directed to a live chat function that triages your query as best it can. Often these first stages are led entirely by NLP, for example when you are asked what your query is regarding, which order it relates to, and what the problem is. Based on your responses, it will offer up a range of solutions before asking whether your query has been resolved. Only if you are unsatisfied with the help offered will you be transferred to a human assistant, who is often already prepped with the key information about your query, increasing the efficiency of the service offered to you.
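A first triage step of this kind can be as simple as matching a message against keyword-based intents, as in the hypothetical sketch below; production chatbots typically use trained intent classifiers rather than hand-written keyword lists.

# A toy support-chat triage step (illustrative only): match the customer's
# message against keyword-based intents and route accordingly.
INTENTS = {
    "refund":   ["refund", "money back", "return"],
    "delivery": ["delivery", "shipping", "arrived", "tracking"],
    "account":  ["password", "login", "account"],
}

def triage(message: str) -> str:
    message = message.lower()
    for intent, keywords in INTENTS.items():
        if any(keyword in message for keyword in keywords):
            return intent
    return "human_agent"   # fall back to a person when no intent matches

print(triage("My parcel still hasn't arrived"))        # -> "delivery"
print(triage("I want to speak to someone about VAT"))  # -> "human_agent"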

This kind of efficiency gain, in processes that follow expected patterns and routines, is where NLP is most commonly applied, but where have we seen NLP being used in research?

Applications of NLP in research

NLP can be applied at many stages of research. Steve Scott, Digital Science's Director of Portfolio Development, will be diving into some of his favourite case studies from the Digital Science family of portfolio companies in our next article in the series. Steve will be covering everything from NLP's ability to pick out keywords in published research and form links between them, as seen within Dimensions, to the way that Ripeta can 'read' a research paper and look for key components that indicate the robustness and repeatability of the research carried out.

However, NLP features in many more ways across the Digital Science family, from the IFI Claims patent database, which translates patent information from a range of source languages to create the most inclusive resource possible, to Writefull's ability to suggest improvements to scientific writing based on similar text that it has 'read'. Catalyst Grant winner Paper Digest has built a tool that can also 'read' a journal article and produce a paragraph-length summary of its key points in lay terms, allowing researchers and communicators of research alike to determine quickly whether a paper is relevant to them.
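Summarisation tools of this kind are often neural, but the underlying idea can be illustrated with a very simple extractive approach: score each sentence by the words it contains and keep the highest-scoring ones. The sketch below is not Paper Digest's or Writefull's method, just a toy example with an invented three-sentence 'article'.

# A toy extractive summariser (illustrative only): score sentences by the
# frequency of the non-stopword terms they contain and keep the top ones.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "on", "was", "and", "in"}

def summarise(text: str, n_sentences: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    frequencies = Counter(words)
    def score(sentence: str) -> int:
        return sum(frequencies[w] for w in re.findall(r"[a-z']+", sentence.lower())
                   if w not in STOPWORDS)
    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Keep the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)

article = ("Natural language processing helps researchers. "
           "Researchers use natural language processing to summarise papers. "
           "The weather was pleasant on the day of the experiment.")
print(summarise(article, n_sentences=2))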

Some of our portfolio’s tools support the research community using NLP-based add-on programs, such as chemRxiv, powered by figshare, which utilised iThenticate to detect plagiarism in submitted articles by ‘reading’ the articles and comparing them to other available resources for matching sentences and paragraphs.

What can NLP do for me in the future?

The brains behind these amazing innovations will be contributing longer pieces to this blog series, diving into the successes and challenges of implementing NLP within their systems. We will also be hearing from Scismic, who will discuss how they hope to implement NLP in their inclusive research recruitment tool to make it even better, while Joris van Rossum will discuss some of the challenges we still face when using NLP, and how we can overcome them.

The ultimate goal of NLP is to make things more efficient, and therefore more productive: through more inclusive gathering and better linking of research information, by making research information quicker to understand, by improving the quality of research outputs through checks for repeatable research and appropriate use of scientific language, or even by checking for plagiarism. However, this is just the start. NLP is already being used as a research tool in its own right, to identify patterns and narrow down statistically likely positive results in a range of scenarios. At Digital Science, we can't wait to learn from, nurture and support the next wave of machine learning innovations, and to share the more productive research that results.


The post NLP Series: What is Natural Language Processing? appeared first on Digital Science.

Launching our blog series on Natural Language Processing (NLP) https://www.digital-science.com/blog/2020/03/launching-our-blog-series-on-natural-language-processing-nlp/ Wed, 04 Mar 2020 15:25:30 +0000


Today we launch our blog series on Natural Language Processing, or NLP. A facet of artificial intelligence, NLP is increasingly being used in many aspects of our everyday lives, and its capabilities are being applied in research innovation to improve the efficiency of many processes.

Over the next few months, we will be releasing a series of articles looking at NLP from a range of viewpoints, showcasing what NLP is, how it is being used, what its current limitations are, and how we can use NLP in the future. If you have any burning questions about NLP in research that you would like us to find answers to, please email us or send us a tweet. As new articles are released, we will add a link to them on this page.

Our first article is an overview from Isabel Thompson, Head of Data Platform at Digital Science. Her day job is also her personal passion: understanding the interplay of emerging technologies, strategy and psychology, to better support science. Isabel is on the Board of Directors of the Society for Scholarly Publishing (SSP), and won the SSP Emerging Leader Award in 2018. She is on Twitter as @IsabelT5000.

NLP is Here, it’s Now – and it’s Useful

I find Natural Language Processing (NLP) to be one of the most fascinating fields in current artificial intelligence. Take a moment to think about everywhere we use language: reading, writing, speaking, thinking – it permeates our consciousness and defines us as humans unlike anything else. Why? Because language is all about capturing and conveying complex concepts using symbols and socially agreed contracts – that is to say: language is the key means of transferring knowledge. It is therefore foundational to science.

We are now in the dawn of a new era. After years of promise and development, the latest NLP algorithms now regularly score more highly than humans on structured language analysis and comprehension tests. There are of course limitations, but these should not blind us to the possibilities. NLP is here, it’s now – and it’s useful.

NLP’s new era is already impacting our daily lives: we are seeing much more natural interactions with our computers (e.g. Alexa), better quality predictive text in our emails, and more accurate search and translation. However, this is just the tip of the iceberg. There are many applications beyond this – many areas where NLP makes the previously impossible, possible.

Perhaps most exciting for science at present is the expansion of language processing into big data techniques. Until now, the processing of language has been almost entirely dependent on the human mind – but no longer. Machines may not currently understand language in the same way that we do (and, let’s be clear, they do not), but they can analyse it and extract deep insights from it that are broader in nature and greater in scale than humans can achieve.

For example, NLP offers us the ability to run a semantic analysis over every bit of text written in the last two decades, and to get insight from it in seconds. This means we can now find relationships in corpora of text that would previously have taken a PhD's worth of work to discover. To be able to take this approach to science is powerful, and this is but one example – given that so much of science and its infrastructure is rooted in language, NLP opens up the possibility of an enormous range of new tools to support the development of scientific knowledge and insight.
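A small sketch of what 'finding relationships in a corpus' can mean in practice: represent each document as a vector of word weights and compare the vectors. This assumes scikit-learn is installed and uses an invented three-document corpus; real systems typically work with richer embeddings at far larger scale.

# Compare documents numerically with TF-IDF vectors and cosine similarity
# (assumes scikit-learn is installed; the corpus is illustrative).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Gene expression changes in response to drought stress in wheat",
    "Transcriptomic analysis of drought tolerance in cereal crops",
    "A survey of medieval manuscript illumination techniques",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(corpus)
similarity = cosine_similarity(vectors)

# The two drought papers score as far more related to each other than
# either does to the manuscript paper.
print(similarity.round(2))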

Google’s free NLP sentence parsing tool
Google’s free NLP sentence parsing tool

NLP is particularly interesting for the research sector because these techniques are – by all historical comparisons – highly accessible. The big players have been making their ever-improving models available to the public, ready to be adapted to specific use cases. For researchers, funding agencies, publishers and software providers, there is therefore a lot of opportunity to be had with relatively little technical overhead.
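As one example of that accessibility, a couple of lines of code are enough to run pre-trained models for summarisation or sentiment analysis, assuming the open-source Hugging Face transformers package is installed (its default models are downloaded on first use); the abstract below is invented for illustration.

# Running pre-trained NLP models via Hugging Face pipelines (assumes the
# `transformers` package is installed; default models download on first use).
from transformers import pipeline

summariser = pipeline("summarization")
classifier = pipeline("sentiment-analysis")

abstract = ("We present a method for large-scale analysis of research "
            "papers, demonstrating improved accuracy over prior work "
            "while reducing the manual effort required from experts.")

print(summariser(abstract, max_length=30, min_length=10)[0]["summary_text"])
print(classifier("The reviewers found the methodology convincing.")[0])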

Stepping back, it is worth noting that we have made such extreme advances in NLP in recent years due to the collaborative and open nature of AI research. Unlike any cutting edge discipline in science before, we are seeing the most powerful tools open sourced and available for massive and immediate use. This democratises the ability to build upon the work of others and to utilise these tools to create novel insights. This is the power of open science.

Here at Digital Science, we have been investigating and investing in NLP techniques for many years. In this blog series, we will share an overview of what NLP is, examine how its capabilities are developing, and look at specific use cases for research communication – to demonstrate that NLP is truly here. From offering researchers writing support and article summarisation, to assessing reproducibility and spotting new technology breakthroughs in patents, all the way through to the detection and reduction of bias in recruitment: this new era is just getting started – where it can go next is up to your imagination.

Look out for the next article in our series, “What is NLP?”, and follow the conversation using the hashtag #DSreports.

The post Launching our blog series on Natural Language Processing (NLP) appeared first on Digital Science.
