PaperTrail - Powered by IBM Watson

In the final semester of my MSc program at Columbia SEAS, I was lucky enough to attend a seminar course taught by Alfio Gliozzo entitled Q&A with IBM Watson. A significant part of the course is dedicated to learning how to leverage the services and resources available on the Watson Developer Cloud. This post describes the course project my team developed, the PaperTrail application.

Project Proposal

Create an application to assist in the development of future academic papers. Based on a paper’s initial proposal, PaperTrail predicts publications to be used as references or acknowledgement of prior art and provides a trend analysis of major topics and methods.

The objective is to speed up the discovery of relevant papers early in the research process, and to allow an early assessment of the depth of prior research concerning the initial proposal.

Meet the Team

Wesley Bruning, Software Engineer, MSc. in Computer Science

Xavier Gonzalez, Industrial Engineer, MSc. in Data Science

Juliana Louback, Software Engineer, MSc. in Computer Science

Aaron Zakem, Patent Attorney, MSc. in Computer Science

Prior Art

A significant amount of attention has been given to this topic over the past few decades. The table below shows the work the team deemed most relevant due to recency, accuracy and similarity of functionality.

[Figure: table of prior art (priorArt)]

The variation in accuracy displayed is a result of experimentation with different dataset sizes and algorithm variations. More information and details can be found in the prior art report.

What sets PaperTrail apart is that it gives users access to the citation prediction and trend analysis algorithms. With the exception of the project by McNee et al., these algorithms aren’t currently available for general use. The application on researchindex.net is open to use, but its objective is to rank publications and authors for given topics.

Algorithm

Citation Prediction: PaperTrail builds on the work done by Wolski’s team in Fall 2014. This algorithm builds a reference graph used to define research communities, each with an associated vector of topic scores generated by an LDA model. The papers in each research community are then ranked by importance within the community with a custom ranking algorithm. When a target document is given to the algorithm as input, the LDA model is used to generate a vector of the topics present in the document. The communities with the most similar topic vectors are selected, and the publications within these communities with the highest rank and greatest similarity to the input document are recommended as references. A more detailed description can be found here.
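To make the selection step more concrete, here is a minimal sketch in Python. It is not Wolski's implementation: it assumes the LDA topic vectors and per-paper rank scores have already been computed, uses cosine similarity as the similarity measure, and combines rank and similarity with a simple product. The function names, dictionary keys, and sample data are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(doc_topics, communities, top_communities=1, top_papers=2):
    """Select the closest communities, then rank the papers inside them.

    Each community is a dict with hypothetical keys:
      "topics": the community's topic vector
      "papers": list of (paper_id, rank_score, paper_topic_vector)
    """
    closest = sorted(
        communities,
        key=lambda c: cosine(doc_topics, c["topics"]),
        reverse=True,
    )[:top_communities]

    scored = []
    for community in closest:
        for paper_id, rank_score, paper_topics in community["papers"]:
            # Assumption: combine in-community rank and similarity by product.
            score = rank_score * cosine(doc_topics, paper_topics)
            scored.append((score, paper_id))
    scored.sort(reverse=True)
    return [paper_id for _, paper_id in scored[:top_papers]]

# Made-up data: three topics, two communities.
communities = [
    {"topics": [0.9, 0.1, 0.0],
     "papers": [("A", 0.8, [0.8, 0.2, 0.0]),
                ("B", 0.5, [0.9, 0.1, 0.0]),
                ("C", 0.9, [0.0, 0.1, 0.9])]},
    {"topics": [0.0, 0.2, 0.8],
     "papers": [("D", 0.9, [0.0, 0.1, 0.9])]},
]
doc_topics = [1.0, 0.0, 0.0]
recommendations = recommend(doc_topics, communities)
```

With this toy data only the first community is selected, so paper D is never considered even though it has a high rank, and paper C drops out because its topics do not overlap with the input document.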

Trend Analysis: Initially, the idea was to use the AlchemyData News API to obtain statistics on the number of publications on a given topic over time. However, with the exception of buzzwords (e.g. ‘big data’), many more specialized topics appeared very infrequently in news articles, if at all. This isn’t entirely surprising given the target audience of PaperTrail. As a workaround, we use the Alchemy Language API to extract keywords from the abstracts in the dataset, along with their relevance scores. The PaperTrail database can then be queried for entry counts for a given year and keyword to provide an indication of publication trends in academia. Note that the Alchemy Language API extracts multiple-word ‘keywords’ as well as single words.
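The counting step can be sketched as follows. This is only an illustration under stated assumptions: it does not call the Alchemy Language API, it assumes each entry already carries its extracted keywords as (text, relevance) pairs, and the record layout, the `min_relevance` threshold, and the sample data are hypothetical.

```python
from collections import Counter

def keyword_counts_by_year(records, min_relevance=0.5):
    """Count, per (year, keyword), how many entries mention the keyword.

    records: iterable of dicts with hypothetical keys "year" and
    "keywords", the latter a list of (text, relevance) pairs.
    """
    counts = Counter()
    for record in records:
        # Use a set so that one entry counts at most once per keyword.
        kept = {text.lower() for text, relevance in record["keywords"]
                if relevance >= min_relevance}
        for keyword in kept:
            counts[(record["year"], keyword)] += 1
    return counts

def trend(counts, keyword, years):
    """Return the entry count for a keyword in each of the given years."""
    return [counts[(year, keyword.lower())] for year in years]

# Made-up entries for illustration.
records = [
    {"year": 2008, "keywords": [("topic model", 0.9), ("LDA", 0.7)]},
    {"year": 2009, "keywords": [("topic model", 0.8)]},
    {"year": 2009, "keywords": [("Topic Model", 0.6), ("citation graph", 0.3)]},
]
counts = keyword_counts_by_year(records)
topic_model_trend = trend(counts, "topic model", [2008, 2009, 2010])
```

Keying the counter on (year, keyword) keeps multiple-word keywords intact, and the same aggregation could be expressed as a map step that emits ((year, keyword), 1) followed by a summing reduce step.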

Data

To maintain consistency with Wolski’s project, we are using the DBLP data as made available on aminer.org. The DBLP-Citation-network V5 dataset contains 1,572,277 entries; we are limited to the use of entries that contain both abstracts and citations, bringing the dataset size down to 265,865 entries.
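The filtering step might look like the sketch below. It assumes the plain-text layout commonly used for the aminer.org citation dumps, where records are separated by blank lines and each line starts with a marker (`#*` title, `#t` year, `#index` id, `#%` a cited paper id, `#!` abstract). That layout is an assumption here and should be checked against the actual V5 file; the function names and sample text are hypothetical.

```python
def parse_records(text):
    """Parse blank-line-separated records in the assumed '#'-marker layout."""
    records = []
    for block in text.strip().split("\n\n"):
        record = {"references": [], "abstract": ""}
        for line in block.splitlines():
            if line.startswith("#index"):
                record["id"] = line[len("#index"):].strip()
            elif line.startswith("#*"):
                record["title"] = line[2:].strip()
            elif line.startswith("#t"):
                record["year"] = line[2:].strip()
            elif line.startswith("#%"):
                reference = line[2:].strip()
                if reference:  # skip empty reference markers
                    record["references"].append(reference)
            elif line.startswith("#!"):
                record["abstract"] = line[2:].strip()
        records.append(record)
    return records

def is_usable(record):
    """Keep only entries that have both an abstract and at least one citation."""
    return bool(record["abstract"]) and bool(record["references"])

# Made-up sample in the assumed layout.
SAMPLE = """#*Paper one
#t2001
#index1
#%2
#!An abstract.

#*Paper two
#t2002
#index2
#!Abstract but no citations.

#*Paper three
#t2003
#index3
#%1
"""

usable_ids = [r["id"] for r in parse_records(SAMPLE) if is_usable(r)]
```

Only the first sample entry passes the filter, since the second has no citations and the third has no abstract.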

Architecture

A high-level visualization of the project architecture is displayed below. Before launching PaperTrail, it’s necessary to train Wolski’s algorithm offline. No documentation of the algorithm’s performance is currently available, so the PaperTrail project will include an evaluation phase and report its findings.

The PaperTrail app and database will be hosted on the Bluemix Platform.

[Figure: PaperTrail architecture diagram (ptArchitecture)]

Status Report

Phases completed:

Project design
Prior art research
Data cleansing
Development and deployment of an alpha version of the PaperTrail app

Phases under development:

Algorithm training and evaluation
Keyword extraction
MapReduce of publication frequency by year and topic
Data visualization component