Worldwide Cancer Research

The Practicalities of Partnership
Worldwide Cancer Research was the first Grant Tracker client, and has used the platform since 2011 to manage its full grants lifecycle. During the 2023 User Day in London, we heard from Peter Fisher, Research Funding Manager at Worldwide Cancer Research, on the practicalities of partnership and how Grant Tracker can be used to identify, manage and track joint funding initiatives.
This case study covers:
How is UK Funding Allocated to Support Sustainability Research?
Juergen Wastl is Director of Academic Relations and Consultancy at Digital Science. He previously headed up the Research Information team at the University of Cambridge’s Research Strategy Office and worked for BASF, managing BMBF-funded projects internationally. Briony Fane is a Research Analyst at Digital Science. She has a higher education background, having gained a PhD from City, University of London, and has worked as both a researcher and a research manager. Bo Alroe has worked in research management and administration since 2004 and is currently Director of Strategy at Digital Science. Bo is from Aalborg, Denmark, where he studied and currently lives with his family.
Introduction
The United Nations Sustainable Development Goals (UN SDGs) are global targets set by the UN across 17 areas that will give rise to a better and more sustainable world for all. Research relating to these SDGs can therefore be seen as socially impactful, and analysing trends in SDG-related research can indicate how a researcher, institution, funder or country is contributing to meeting these targets.
Continuing our SDG blog series, this blog focuses on how research funding is allocated to the SDGs. Dimensions has broadened its categorisation of grants data to include classification by SDG codes, so it can now be used to gain insights into how competitive funding supports the Goals. By applying Dimensions’ SDG classifications to its grants database, we discovered over 6 million grants, worth more than £1.37 trillion, from over 600 funders worldwide that can be searched and analysed. We took a dive into the data to discover which UK research councils support SDG-related research, where funding is focused across the UN SDGs, how much is allocated to sustainability research, and more.
Figure 1: A Dimensions screenshot showing how a single grant can be assigned multiple SDG classifications; SDGs 2, 7, and 12 in this instance
How much UKRI funding supports SDG-related research?
Dimensions’ classification system was developed jointly by the Dutch universities (via the Association of Universities in the Netherlands, VSNU), Springer Nature, and Digital Science. The aim was not only to categorise grants that explicitly mention sustainability or the UN’s work, but also to assign SDG classifications to research more broadly – including grants and publications – to better support the goal of sustainability. Supervised machine learning was used to classify content in Dimensions. Merely mentioning sustainability or a related concept such as ‘pollution’ in a publication abstract or grant is not enough to earn an SDG classification. This means that Dimensions can identify grants that support sustainability improvements both explicitly – eg, by mentioning the UN’s Sustainable Development Goals – and implicitly.
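The post does not detail the model behind this classification, but the general shape of a supervised text-classification pipeline can be sketched as follows. This is a minimal illustration, not the actual Dimensions system: the training abstracts, labels and choice of model are invented for the example.

```python
# Minimal sketch of supervised SDG classification, assuming a small set of
# labelled grant abstracts. Illustrative only, not the Dimensions pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "Developing low-cost solar storage for off-grid communities",
    "Modelling flood risk under future climate change scenarios",
    "Improving maternal health outcomes in rural clinics",
]
train_labels = ["SDG7", "SDG13", "SDG3"]  # hypothetical labels supplied by experts

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(train_texts, train_labels)

# A new abstract is classified from learned patterns rather than a single keyword match.
print(model.predict(["Battery chemistry for affordable renewable energy storage"]))
```

In practice such a model would be trained on many thousands of expert-labelled records per Goal, and a single grant can receive multiple SDG labels, as Figure 1 shows.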
After extracting all UKRI grants from eight UK research councils indexed in Dimensions between 2011 and 2020, we applied the SDG classification to determine the proportion of UKRI funding that supports the SDGs.
Figure 2: The sum in GBP of SDG-classified UKRI grants awarded between 2011 and 2020
By applying the 17 SDG classifications to grant records and publications in Dimensions, we can evaluate how funders support research towards more sustainable development. Figure 2 provides an overview of the sum in GBP of UKRI grants that supported sustainable development research between 2011 and 2020. We selected a competitive public research funder, UKRI, as the exemplar for this analysis because it is the origin of most public research funding in the UK and arguably the most impactful strategic waypoint for research on sustainable development. Such funders have considerable influence on the type and focus of research conducted.
Figure 3: The total number of UKRI grants with and without an SDG classification awarded between 2011 and 2020
Figure 3 shows the total number of SDG-related UKRI grants versus all UKRI grants, using the same base data as Figure 2. There is a notable increase in the number of grants awarded after 2016, the year the UN SDGs came into effect. The graph also reveals that, on average, 24.9% of all UKRI grants each year aimed to support sustainable development research, and that the number of such grants grew by 218.0% over the period, compared with 128% growth for all UKRI grants. The Global Challenges Research Fund (GCRF), part of the UK aid strategy and administered through UKRI, committed £1.5bn between 2016 and 2021 to help address the UN SDGs. This has also contributed to both the number and value of grants for research with a focus on sustainability.
Figure 2 does not show a similar trend for the awarded amounts. This could indicate that UKRI awarded a growing number of grants with smaller sums per grant.
The proportion of UKRI funding with SDG classifications by year shows approximately linear growth, with SDG funding in 2020 almost triple that of 2011. This trend is very likely to continue, and we may see an even greater increase in funding as we move towards the 2030 deadline for achieving the Goals, especially as UKRI has committed to supporting the ambitions of the UK government’s aid strategy and progressing the UN SDGs [1]. Whether and how the UK government’s recent adjustments to GCRF funding will affect the grant landscape and its visibility remains to be seen.
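For readers who want to reproduce this kind of year-by-year breakdown, a rough sketch of the calculation is shown below. It assumes a hypothetical CSV export of grant records with a year, an awarded amount and an SDG flag; the column names and file name are assumptions, not a documented Dimensions export format.

```python
# Sketch: annual share of SDG-classified grants and growth over a period,
# from a hypothetical export. Column and file names are assumptions.
import pandas as pd

grants = pd.read_csv("ukri_grants_2011_2020.csv")  # columns: year, amount_gbp, has_sdg (bool)

grants["sdg_amount"] = grants["amount_gbp"].where(grants["has_sdg"], 0)
by_year = grants.groupby("year").agg(
    n_grants=("amount_gbp", "size"),
    n_sdg=("has_sdg", "sum"),
    sdg_amount_gbp=("sdg_amount", "sum"),
)
by_year["sdg_share_pct"] = 100 * by_year["n_sdg"] / by_year["n_grants"]

# Percentage increase in SDG-classified grant counts between the first and last year.
growth_pct = 100 * (by_year["n_sdg"].iloc[-1] / by_year["n_sdg"].iloc[0] - 1)
print(by_year)
print(round(growth_pct, 1))
```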
How is sustainability research funding distributed across the 17 SDGs?
Figure 4: The value and number of UKRI grants awarded by SDG classification between 2011 and 2020
Figure 4 sheds light on the focus of UKRI’s funding in support of the UN 2030 Agenda. The graph shows the total SDG funding amounts against the total number of SDG grants awarded, as classified in Dimensions. SDG7, ‘Affordable and Clean Energy’, has clearly been prioritised as a funding objective: it has the greatest number of grants awarded and the highest total funding amount of any SDG over the 10-year period. SDG13, ‘Climate Action’, is similarly prioritised. Given the climate crisis we face and the role that energy plays in it, it makes sense that funding would be concentrated in these areas, as the transition towards climate neutrality is now so urgent.
How is funding split across the three pillars of sustainability: societal, environmental, and economic?
We can analyse SDG-related UKRI funding in Dimensions through the lens of the three pillars of sustainability (societal, environmental and economic, also depicted as the ‘wedding cake’, as seen in a previous blog) as a means of assessing the proportion of UKRI research funding that is concentrated in these three components.
Figure 5 visualises the prioritisation of UKRI’s sustainability research funding by each pillar of sustainability. The size of each circle is directly proportional to the total amount of funding awarded to support the SDGs within that pillar. The big hitters in the societal pillar are SDG7, ‘Affordable and Clean Energy’, and SDG3, ‘Good Health and Well-being’. In the environmental pillar, funding is concentrated in SDG13, ‘Climate Action’, while in the economic pillar SDG8, ‘Decent Work and Economic Growth’, and SDG12, ‘Responsible Consumption and Production’, are the most highly funded research areas.
Figure 5: The value of UKRI’s sustainability research funding between 2011 and 2020, split by the three pillars of sustainability: societal, environmental and economic
Conclusion
UKRI allocated close to £10 billion, or 28% of all its awards from 2011 to 2020, in ways that support the SDGs. Grants appear more likely than publications to receive SDG classifications. One reason for this is that a grant abstract focuses on what needs to be achieved and the intention behind the funding, while a publication reports on what has been achieved, which may not be as comprehensive in terms of its SDG focus. The grants data showcased here show how this funding has been allocated across the SDGs.
This blog shows that funding for research aligned with sustainable development is prominent on UKRI’s funding programme agenda and that research capacity in this area is growing. With research and innovation having such a vital role to play in finding sustainable solutions to global challenges, it is reassuring to see that SDG-related research supported by a competitive funder is so extensive. It is also particularly gratifying to see that all 17 Goals are represented to some extent in UKRI funding.
References
[1] UKRI announces International Development Research Programme Awards
The State of Altmetrics
In honour of the tenth anniversary of the Altmetrics Manifesto, Altmetric has published The State of Altmetrics, which explores a decade of innovation and growth in the field. The report also features contributions from leading thinkers on topics including:
The State of Open Data 2020
The report is the fifth in the series and includes survey results from 4,500 participants alongside a collection of articles from global industry experts. Created in 2016 to examine the attitudes and experiences of researchers working with open data – sharing it, reusing it, and redistributing it – the survey is now the longest-running longitudinal study on the subject. We feel inspired and encouraged that most open data trends are heading in the right direction.
How Covid-19 is Changing Research Culture
The research world has moved faster in response to the current pandemic than many would have thought possible. In five months, a volume of work has been generated that even the most intensive of emergent fields would have taken years to create.
The Digital Science Consultancy Team investigates research landscape trends and cultural changes in response to COVID-19. The report includes an analysis of publication trends, geographic focal points of research, and collaboration patterns.
Contextualizing Sustainable Development Research
Our report, Contextualizing Sustainable Development Research, highlights the growth in research around the UN’s Sustainable Development Goals (SDGs). We believe the SDGs are now more relevant than ever, as they provide a framework for recovery from the current pandemic.
In the report we ask: if we are to have an impact agenda for research, should it not be one informed by the SDGs? And if so, should we not be actively measuring sustainable development as part of research evaluation?
NLP Series: AI in Science; the Promise, the Challenge, and the Risk
Continuing our blog series on Natural Language Processing, Dr Joris van Rossum focuses on AI in science: its potential to make research better, but also the pitfalls we must be wary of when creating and applying these new technologies. Joris has over 20 years of experience driving change in the publishing industry through new technologies and business models. His former roles include Director of Publishing Innovation at Elsevier and Director of Special Projects at Digital Science, a role in which he authored the Blockchain for Research report. He co-founded Peerwith in 2015, and currently serves as Research Data Director at STM, where he drives the adoption of sharing, linking and citing data in research publications.
According to Professor Thomas Malone, Director of the MIT Center for Collective Intelligence, AI should essentially be about connecting people and computers so that they collectively act more intelligently than any individual person, group or computer has ever done before. This connectivity is at the core of science and research. Science is a collective activity par excellence, connecting millions of minds in space as well as time. For hundreds of years, scientists have been collaborating and discussing their ideas and results in academic journals. Computers are increasingly important for researchers: in conducting experiments, collecting and analyzing data and, of course, in scholarly communication. Reflecting on this, it is perhaps surprising that AI does not play a bigger role in science today. Although computers are indispensable for modern scientists, the application of artificial intelligence lags behind other industries, such as social media and online search. Despite its huge potential, uptake of AI has been relatively slow. This is in part due to the nascent state of AI, but also to cultural and technological features of the scientific ecosystem. We must be aware of these in order to assess the risks associated with unreflectively applying artificial intelligence in science and research.
A logical source of data for intelligent machines is the corpus of scientific information that has been written down in millions of articles and books. This is the realm of Natural Language Processing (NLP). By processing and analyzing this information, computers could come to insights and conclusions that no human could ever reach individually. Relationships between fields of research could be identified, proposed theories corroborated or rejected based on an analysis of a broad corpus of information, and new answers to problems given.
This is what IBM’s Watson has attempted in the field of healthcare. Initiated in 2011, it aims to build a question-and-answer machine based on data derived from a wealth of written sources, helping physicians make clinical decisions. IBM has initiated several efforts to develop AI-powered medical technology, but many have struggled, and some have even failed spectacularly. What this lack of success shows is that it is still very hard for AI to make sense of complex medical texts. This will almost certainly also apply to other types of scientific and academic information. So far, no NLP technology has been able to match human beings in comprehension and insight.
Another reason for the slow uptake of NLP in science is that scientific literature is still hard to access. The dominant subscription and copyright models make it impossible for machines to access the entire corpus of scientific information published in journals and books. One of the positive side effects of the move towards Open Access would be access to this information by AI engines, although a large challenge still lies in the immaturity of NLP when dealing with complex information.
Despite the wealth of information captured in text, it is important to realize that the observational and experimental scientific data that stands at the basis of articles and books is potentially much more powerful for machines. In most branches of science the amount of information collected has increased with dazzling speed. Think about the vast amount of data collected in fields like astronomy, physics and biology. This data would allow AI engines to do fundamentally more than what is done today. In fact, the success that born-digital companies like Amazon and Google have had in applying AI is to a large extent due to the fact that they have a vast amount of data at their disposal. AI engines could create hypotheses on the genetic origin of diseases, or the causes of global warming, test these hypotheses by plowing through the vast amount of data that is produced on a daily basis, and so arrive at better and more detailed explanations of the world.
A challenge here is that sharing data is not yet part of the narrative-based scholarly culture. Traditionally, information is shared and credit earned in the form of published articles and books, not in the underlying observational and experimental data.
Important reasons for data not being made available are the fear of being scooped and a lack of incentives, as the latest State of Open Data report showed. Thankfully, in recent years efforts have been made to stimulate or even mandate the sharing of research data. Although these efforts are primarily driven by the need to make science more transparent and reproducible, enhancing the opportunity for AI engines to access this data is a promising and welcome side-effect.
Like the necessary advancement of NLP techniques, making research data structurally accessible and AI-ready will take years to come to fruition. In the meantime, AI is being applied in science and research in narrower domains, assisting scientists and publishers in specific steps of their workflows. AI can power better language editing tools, as in the case of Writefull, who we will hear from in the next article in this series. Publishers can apply AI to perform technical checks, as Unsilo does, scan submitted methods sections to assess the reproducibility of research, the way Ripeta and SciScore do, and analyze citations, like Scite. Tools are being developed to scan images in submitted manuscripts to detect manipulation and duplication, and of course scientists benefit from generic AI applications such as search engines and speech and image recognition tools. Experiments have also been done with tools that help editors decide whether to accept or reject papers, predicting the chance of a paper becoming highly cited based on factors including the subject area, authorship and affiliation, and the use of language. This last application exposes an essential characteristic of machine learning that should make us cautious.
Roughly speaking, in machine learning, computers learn by identifying patterns in existing data. A program goes through vast numbers of texts to determine the predominant contexts in which words occur, and uses that knowledge to determine which words are likely to follow. In the case of tools that support editors in their decision to accept or reject papers, the system identifies factors that characterize successful papers and makes predictions based on the occurrence of those factors in submitted papers. This logically implies that existing patterns will be strengthened. If a word is frequently used in combination with another word, the engine suggesting that word to users will lead to it being used even more frequently. If an author was successful, or a particular theory or topic influential, AI will make them even more so. And if women or people from developing countries have historically published less than their male counterparts from Western countries, AI can keep them underperforming.
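The reinforcement effect described above can be made concrete with a toy simulation, in which a suggestion engine always proposes the currently most frequent term and a fraction of users accept it. The starting counts and acceptance rate are arbitrary; the point is only that the initial imbalance grows.

```python
# Toy simulation of the feedback loop: always suggesting the most frequent term
# makes it even more dominant over time. All numbers here are arbitrary.
import random
from collections import Counter

random.seed(0)
counts = Counter({"established_term": 55, "novel_term": 45})  # initial usage split

for _ in range(1000):
    suggestion = counts.most_common(1)[0][0]      # engine proposes the front-runner
    if random.random() < 0.3:                     # some users accept the suggestion
        counts[suggestion] += 1
    else:                                         # others choose in proportion to current usage
        pick = random.choices(list(counts), weights=list(counts.values()))[0]
        counts[pick] += 1

print(counts)  # the initially more common term pulls further ahead
```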
In other words, AI risks consolidating contemporary structures and paradigms. But as the philosopher of science Thomas Kuhn showed, real breakthroughs are characterized by breaking patterns and replacing paradigms with new ones. Think of the heliocentric worldview of Kepler, Copernicus and Galileo, Darwin’s theory of natural selection, and Einstein’s theory of relativity. Real progress in science comes from the novel, the unexpected, and sometimes even the unwelcome. Humans are conservative and biased enough. We have to make sure that machines don’t make us even more so.
DOI: https://doi.org/10.6084/m9.figshare.12092403.v1
NLP Series: Applying Natural Language Processing to a Global Patent Database
The Role of NLP in Inclusive Data Curation of Patent Information
CLAIMS Direct is a global patent database created by IFI CLAIMS Patent Services (IFI). NLP allows the vast amount of information contained in patents to be applied to many situations. Through the curation of data, such as the standardisation of organisation names, data can be amalgamated from a range of original sources. Using NLP, this patent information can also be translated into English from over 40 languages. By curating the data in this way, researchers can quickly access information from a broad range of original sources.
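The post does not say how organisation names are standardised in CLAIMS Direct, but the general idea of mapping assignee name variants onto a canonical list can be illustrated with simple string similarity. The canonical names, threshold and matching method below are assumptions made for the sake of the example, not IFI’s actual method.

```python
# Illustrative only: mapping raw assignee name variants onto canonical
# organisation names with simple string similarity (not IFI's actual method).
from difflib import SequenceMatcher

canonical = ["International Business Machines Corporation", "Siemens Aktiengesellschaft"]

def standardise(raw_name, threshold=0.5):
    """Return the closest canonical name, or the raw name if nothing is close enough."""
    best = max(canonical, key=lambda c: SequenceMatcher(None, raw_name.lower(), c.lower()).ratio())
    score = SequenceMatcher(None, raw_name.lower(), best.lower()).ratio()
    return best if score >= threshold else raw_name

print(standardise("Intl Business Machines Corp"))
print(standardise("Siemens AG"))
```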
IFI receives inquiries from companies that require access to patent information for a range of use cases. From discovering important new invention types for use in investment decisions, to analysing the effects of government programmes on regional economic stimulus, the analysis of patent documents is becoming more widespread.
The growth of inexpensive and ever more powerful computing has led to easier methods for extracting meaningful data from patents, and NLP is a prime example of this. This technology is absolutely vital because, according to the 2019 report from the World Intellectual Property Organization (WIPO), 3.3 million patent applications were filed globally in 2018. This is almost twice the 1.85 million filed in 2008. There are more than 14 million active patents globally. With this many applications, it would be impossible to manually search for relevant information. Enter NLP.
Using NLP to Overcome the Language Barrier of Global Patent Information
With so many global patents that can contain important information, accurate translations are a must. Machine translation, or the use of computer software to perform translations, has been used for decades to translate patents, and recent advances employing NLP are speeding up the process. Early attempts looked at each word or phrase and translated it in isolation; newer techniques look at the overall context to produce higher quality results.
CLAIMS Direct, the global patent database and platform from IFI, uses Google Translate to convert documents in 48 languages to English. Based on neural network technology, one of several driving forces behind NLP, Google Translate offers an exceptional level of accuracy. It overcomes problems found in older phrase-based machine translation systems, which do not sample a large enough segment of text to produce a proper translation. Using a large end-to-end network, this technology translates whole sentences or paragraphs at a time to provide context, and uses machine learning to continually improve over time.
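The post names Google Translate but does not show the integration; for orientation, a call to the Google Cloud Translation API from Python looks roughly like the sketch below (basic v2 client). Authentication setup is omitted and the German abstract is invented; this is not CLAIMS Direct’s production code.

```python
# Rough sketch of machine-translating a patent abstract with the Google Cloud
# Translation API (basic v2 client). Credentials setup is omitted and the text
# is invented; this is not CLAIMS Direct's production integration.
from google.cloud import translate_v2 as translate

client = translate.Client()  # assumes GOOGLE_APPLICATION_CREDENTIALS is configured

abstract_de = "Eine Vorrichtung zur Speicherung elektrischer Energie mit verbesserter Kathode."
result = client.translate(abstract_de, target_language="en")

print(result["translatedText"])
print(result["detectedSourceLanguage"])  # e.g. "de"
```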
Patent documents are often used by organisations and individuals who seek to patent something themselves. To be awarded a patent, the concept cannot infringe on another patent, and must also be a novel idea. Making a mistake by missing an existing publication or previously granted patent can lead to costly infringement lawsuits. The stakes are high and there will always be a big incentive to get it right. It is, therefore, common for many people to be involved in researching previous patent data, often employing multiple search methods.
While the exact format of a patent can vary by region, patents have a number of structured data elements in common, including invention title, inventor, submission date, active or inactive status, etc. This information, stored in named fields, is accessible in databases and easy to search. However, the body of a patent can contain far more useful free-form text, or unstructured data, that is not parsed into fields and is difficult to search with keywords and legacy search engines.
Search tools that use NLP can reveal crucial ideas contained in patent literature more easily than traditional methods that rely on keyword matches. Patent documents can be written in language that is meant to obscure the true nature of the invention, with the aim of keeping the subject matter hidden from competitors. Sometimes even technical subject matter experts cannot clearly see the idea being put forward. Semantic and NLP algorithms achieve improved accuracy by ingesting larger spans of text, examining the context, and making connections that are not otherwise obvious. The use of synonyms can also uncover new and relevant documents. The search intent of the user is better understood, and uniting all of these capabilities saves a huge amount of time.
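As a much-simplified illustration of ranking by similarity rather than exact keyword matches, the sketch below scores a few invented patent passages against a query using TF-IDF vectors and cosine similarity. Production patent search engines use far richer semantic models (synonym expansion, embeddings and so on); this only shows the ranking idea.

```python
# Simplified illustration: rank documents by similarity to a query instead of
# exact keyword matching. Real patent search uses far richer semantic models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "An energy storage apparatus comprising a lithium-based cathode assembly.",
    "A method for cooling server racks using phase-change materials.",
    "Rechargeable cell with improved electrode coating for higher capacity.",
]
query = "battery electrode design"

vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(documents + [query])
scores = cosine_similarity(vectors[-1], vectors[:-1]).ravel()

for idx in scores.argsort()[::-1]:  # highest similarity first
    print(round(float(scores[idx]), 3), documents[idx])
```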
Traditional Use Cases for NLP in Patent Documents
In business activities where intellectual property (IP) traditionally plays a large role, such as engineering and developing new drugs, some very successful new products incorporating NLP are improving the patent search process. Many clients of IFI have used CLAIMS Direct to build features such as:
New Use Cases for NLP in Patents
Advances in NLP have made it much easier to extract important information that used to be hidden in patent documents. This has led to a range of new use cases.
Patents are making their way onto the trading floor. Fund managers want to know which technologies are on the verge of rapid growth, and who owns them, in order to inform investment decisions. Here, well-indexed, easy-to-search patent data is crucial. By adding a data source such as CLAIMS Direct to their fast-moving algorithmic trading systems, they are utilising NLP to find hidden tips, enabling analysts to create better reports.
Management consulting companies are getting in on the action too. They need to keep clients informed about the most up-to-date technology and competitive intelligence across the globe. Knowing when relevant patents have been published or granted can be a game changer. NLP offers consultants the ability to quickly uncover trends important to their clients, while improving efficiency through automated workflows. Clustering visualisations make the information easier to understand.
As the technology continues to evolve, more use cases for patent information will emerge. We look forward to implementing these advances into our processes at IFI CLAIMS Patent Services, to continue to be as inclusive of, and useful to, the wider research community as we can possibly be.
NLP Series: NLP and Digital Science
Continuing our blog series on Natural Language Processing, today’s article is from Steve Scott, Director of Portfolio Development at Digital Science. As a member of the founding management team, Steve has been involved in the majority of Digital Science’s early-stage portfolio investments, taking founders through product and business model validation to launch and growth. He has given out 32 Catalyst Grant awards since their inception, with five recipients going on to become Digital Science portfolio companies. An entrepreneur himself, Steve has founded, or been involved in setting up, three of his own companies. In his spare time, Steve enjoys building and riding his own bikes.
In 1950 Alan Turing wrote a paper, “Computing Machinery and Intelligence”, in which he outlined what we now know as the Turing Test. In it he says, “A computer would deserve to be called intelligent if it could deceive a human into believing that it was human.” While examples of his test portrayed in films usually use speech as the communication mechanism, text, the main ingredient for NLP, is equally suitable for a Turing Test. If a computer could write a prize-winning novel, or could analyse a researcher’s writing and act as their editor, it would pass a form of Turing Test. As computer vision and speech recognition have improved dramatically over the last 10 years, NLP is widely seen as the next key challenge in deep learning: allowing computers to make sense of human language in ways that are valuable.
From a layman’s perspective, NLP allows non-programmers to extract useful information from computer systems. Think of the way Gmail automatically sorts your inbox into different categories and controls your spam folder, or how Alexa or Google Home can translate your voice into commands that play music, answer questions, or switch on a light in your home. Smart homes of the future will also be more energy-efficient as they learn their inhabitants’ patterns and behaviours.
NLP attempts to make sense of unstructured data such as text, which comes in an almost endless variety of forms, including papers, emails, abstracts, grant applications, etc. Our challenge is to find real-world problems and apply NLP to help overcome them.
From a Digital Science perspective, the two companies that best highlight the application of NLP to research challenges, Dimensions and Ripeta, share a number of features that capitalise on NLP to benefit their customers.
Over the last 10 years, Digital Science has funded and supported solutions to address the rapid growth in data generated by scientific research. The application of AI and Machine Learning to this data, in the form of unstructured textual data, has become a key focus for us. Our solutions allow for, among other things, better job-matching, improved conference identification, improved written English in papers and automated reports evaluating reproducibility. I want to focus on two examples of the application of NLP in action.
Dimensions is a scholarly search database that focuses on the broader set of use cases that academics now face. By including awarded grants, patents, and clinical trials alongside publication and Altmetric attention data, Dimensions goes beyond the standard publication-citation ecosystem to give the user a much greater sense of context of a piece of research. All entities in the knowledge graph may be linked to all other entities. Thus, a patent may be linked to a grant, if an appropriate reference is made. Books, book chapters, and conference proceedings are included in the publication index. All entities are treated as first-class objects and are mapped to a database of research institutions and a standard set of research classifications via machine-learning techniques.
One of the challenges faced by the Dimensions development team was how to classify publications, grants, policy papers, clinical trials, and patents using a common approach across types. This is key to allowing cross-referencing between multiple content types. In Dimensions, standardized and reproducible subject categorization is achieved algorithmically using an NLP approach. The team started by giving a subject expert the capacity to build a classification based on a set of search terms. Starting with a general search term, or a longer constructed search string, the expert starts to amass an inclusive set of objects that fall into the presumptive category. Concepts are extracted from the corpus that has been returned and the expert can then boost particular keywords, re-ranking the search results to produce a different relevance score, or they can exclude objects that include particular terms. After repeating this process the expert (who is an expert in the subject but not an expert in computer coding) can define a field in a way that a computer can understand. This approach allows the computer to codify a set of rules that can be applied reproducibly to any content.
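The post does not spell out the underlying mechanics, but the boost-and-exclude step described here can be sketched as a simple reranking function. The example records, weights and terms are all assumptions, and the real system operates over concepts extracted at scale rather than raw substring matches.

```python
# Sketch of expert-driven reranking: boost records containing favoured terms and
# drop records containing excluded terms. Records, weights and terms are invented.
def rerank(records, boosts, exclude):
    """records: list of (text, base_relevance) pairs; returns (score, text) pairs, best first."""
    results = []
    for text, base in records:
        lowered = text.lower()
        if any(term in lowered for term in exclude):
            continue  # the expert excludes objects containing these terms
        score = base
        for term, weight in boosts.items():
            if term in lowered:
                score *= weight  # the expert boosts particular keywords
        results.append((score, text))
    return sorted(results, reverse=True)

records = [
    ("Photovoltaic materials for grid-scale energy storage", 0.72),
    ("Solar observation satellite instrumentation", 0.70),
    ("Wind turbine blade aerodynamics", 0.55),
]
print(rerank(records, boosts={"storage": 1.5}, exclude=["satellite"]))
```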
One of the problems in categorizing articles is the human factor. We are constantly learning and changing our opinion, so in order to have a standardized basis for analysis of categories we need to remove the vagaries of the human classifier. Using NLP, we can build a useful, reproducible definition of an arbitrary categorization system that will automatically be applied to any new content that is brought into Dimensions.
In a similar fashion to Dimensions, Ripeta has been trained on research outputs. Ripeta aims to improve the transparent and responsible reporting of research, allowing stakeholders to effectively take stock of their reproducibility and responsible reporting programme and enhance their practices. Analysing over 100 variables within a text that relate to reproducibility, Ripeta gives the user an assessment of the likelihood of being able to reproduce the results of that paper. Looking for things like study purpose, code acknowledgements, data availability statements, and software programmes used (along with version numbers) gives what is in effect a credit score for that paper. Publishers and grant funding bodies can now analyse their archives and future grants and publications to ensure that funding is being used to conduct transparent and reproducible science.
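Ripeta’s actual variables and models are not described in this post; as a loose, heavily simplified illustration of screening a manuscript for a handful of such indicators, a pattern-based check might look like the sketch below. The indicator names and regular expressions are invented and far cruder than a real analysis.

```python
# Loose illustration only: screening a manuscript for a few reproducibility
# indicators with regular expressions. Patterns and indicator names are invented
# and much cruder than Ripeta's actual analysis.
import re

INDICATORS = {
    "study_purpose": r"the (aim|purpose|objective) of this study",
    "data_availability": r"data (are|is) available|data availability statement",
    "code_availability": r"code (is|are) available|github\.com|gitlab\.com",
    "software_version": r"\b(R|Python|SPSS|Stata)\s+v?\d+(\.\d+)*",
}

def screen(manuscript_text):
    return {name: bool(re.search(pattern, manuscript_text, re.IGNORECASE))
            for name, pattern in INDICATORS.items()}

sample = "The aim of this study was X. Analyses used Python 3.9. Data are available at github.com/example."
print(screen(sample))
```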
What these companies offer are ways to increase efficiency, reduce costs and ultimately support better research. In the case of Dimensions, that means giving users a much greater sense of the context of a piece of research. For Ripeta, it means shining a light on funded research to ensure it is improving its efforts around reproducibility.
In the next ten years, we will see NLP capabilities expand and be embedded in new products and services, helping researchers navigate ever-expanding data outputs and allowing new ways to extract and interpret meaningful analysis from past and present papers.
NLP Series: What is Natural Language Processing?
We’re continuing our blog series on Natural Language Processing with a brief guide to what it is, where it is being used, and why it is exciting news for research. Later this week, we will be hearing from Steve Scott, Director of Portfolio Development at Digital Science, to find out a bit more about why Digital Science are excited by NLP.
Not to be confused with neuro-linguistic programming, natural language processing, or NLP, is the way technology can interact with humans through words. These words could be written, spoken or heard, as input or output. NLP is a subset of artificial intelligence and machine learning, whereby systems are able to ‘learn’ words in a language by analysing a range of input sources, or training data. The system can start to make sense of the patterns in text and dialogue through statistical analysis and the formation of algorithms. The system does not need to be explicitly programmed with rules; it simply picks up the ability to create words, or a sequence of words, that seem statistically likely given the contents of the training data and the context of the query.
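As a toy illustration of ‘statistically likely’ word sequences, the sketch below counts word pairs in a tiny training text and suggests the most frequent continuation. Real systems learn from vastly larger corpora with far more sophisticated models; the training sentence here is invented.

```python
# Toy bigram model: count word pairs in training text and suggest the most
# likely next word. Real predictive text uses much larger corpora and models.
from collections import Counter, defaultdict

training_text = (
    "natural language processing helps computers understand text "
    "natural language processing powers predictive text and translation"
)

bigrams = defaultdict(Counter)
words = training_text.split()
for current_word, next_word in zip(words, words[1:]):
    bigrams[current_word][next_word] += 1

def suggest(word):
    """Return the most frequent follower of `word` in the training data, if any."""
    followers = bigrams.get(word)
    return followers.most_common(1)[0][0] if followers else None

print(suggest("language"))    # "processing"
print(suggest("predictive"))  # "text"
```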
Though you may not have heard of the term NLP, you are highly likely to have used it in your everyday lives. NLP has the capacity to add a level of efficiency to many tasks. Take good old autocarrot. I mean, autocorrect! The little bit of tech that thinks it knows best can often be a useful tool when you mistype a word or aren’t sure of the spelling. It is, however, widely regarded as a source of great hilarity when it does get things wrong, or just hasn’t yet learned a new word in context; one example being my own name, Suze, which frequently autocorrects to ‘Size’ – ironic as I am of rather diminutive stature. Even spelling and grammar checkers are based on NLP technology. These programs are constantly reading and referencing the words we write, comparing them against patterns learned from reading ‘training data’ from a range of sources to judge how likely each word is to be correct. Similarly, having learned not only the spelling of words but the likely order of words based on rules of grammar and sentence structure, predictive text and autocomplete are also examples of NLP used in everyday life.
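The post does not point to any particular implementation, but the ‘compare against likely words’ idea behind a basic spellchecker can be sketched as a closest-match lookup over a known vocabulary. The vocabulary below is tiny and invented; real spellcheckers also weigh edit distance against word and context frequencies learned from training data.

```python
# Minimal sketch of spell correction: pick the closest known word from a tiny,
# invented vocabulary. Real spellcheckers also weigh word and context frequency.
import difflib

vocabulary = ["research", "funding", "processing", "language", "natural", "translation"]

def correct(word):
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.7)
    return matches[0] if matches else word  # leave unknown words unchanged

print(correct("langauge"))  # "language"
print(correct("reserch"))   # "research"
```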
Beyond the ability to monitor and suggest, NLP also has the ability to translate, whether that is from speech into text, such as the dictation feature on many phones, or your favourite language-to-language translation service. The latter is being used by IFI CLAIMS to ensure that their patent database is as inclusive as possible, by extracting information in languages other than English and indexing that extracted information appropriately. The same application of NLP is used in apps that help you learn a new language. There are some limitations though, as a quick search for ‘funny Duolingo phrases’ will attest; while the sentence structure of some of Duolingo’s best offerings makes sense, the meaning can sometimes be lost in translation, so it certainly wouldn’t be able to pass a Turing Test any time soon!
Mimicking human conversation is, however, a common application of NLP. If you have recently asked for online help with an issue, you may have been directed to a live chat function that will triage your query as best it can. Often these first stages are led entirely by NLP, for example when you are asked what your query is regarding, which order it relates to, and what the problem is. Based on your responses, it will offer up a range of solutions, before asking whether your query has been resolved. Only if you are unsatisfied with the help offered will you be transferred to an actual human assistant, who is often already prepped with the key information about your query, increasing the efficiency of the service offered to you.
This increase in efficiency of processes that can be analysed for expected patterns and routines is where NLP is most commonly applied, but where have we seen NLP being used in research?
NLP can be applied in many stages of research. Steve Scott, Digital Science’s Director of Portfolio Development, will be diving into some of his favourite case studies from the Digital Science family of portfolios in our next article in the series. Steve will be covering everything from NLP’s ability to pick out keywords in published research and form links, as seen within Dimensions, to the way that Ripeta can ‘read’ a research paper and look for key components that indicate the robustness and repeatability of the research carried out.
However, NLP features in many more ways across the Digital Science family, from the IFI CLAIMS patent database, which translates patent information from a range of source languages to create the most inclusive resource possible, to Writefull’s ability to suggest how scientific writing can be improved based on similar text that it has ‘read’. Catalyst Grant winner Paper Digest’s tool can also ‘read’ a journal article and create a paragraph-long abstract of the key points in layman’s terms, allowing researchers and communicators of research alike to quickly determine whether a paper is relevant to them or not.
Some of our portfolio’s tools support the research community using NLP-based add-on programs, such as chemRxiv, powered by figshare, which utilises iThenticate to detect plagiarism in submitted articles by ‘reading’ them and comparing them to other available resources for matching sentences and paragraphs.
The brains behind these amazing innovations will be contributing longer pieces to this blog series where they dive into their successes and challenges of implementing NLP within their systems. We will also be hearing from Scismic who will discuss how they hope to implement NLP into their inclusive research recruitment tool to make it even better, while Joris van Rossum will discuss some of the challenges we still face when using NLP, and how we will be able to overcome these.
The ultimate goal of NLP is to make things more efficient, and therefore more productive, whether through more inclusive gathering and better linking of research information, making research information easier to understand quickly, improving the quality of research outputs by checking for repeatable research or appropriate use of scientific language, or even checking for plagiarism. However, this is just the start. NLP is already being used as a research tool, to identify patterns and narrow down statistically likely positive results in a range of scenarios. At Digital Science, we can’t wait to learn from, nurture and support the next wave of machine learning innovations, and to share the results of the more productive research that comes from them.