[A list of external news/blog posts about Paperscape can be found here.]
Paperscape is a tool to visualise the arXiv, an open, online repository for scientific research papers. The Paperscape map currently includes all (non-withdrawn) papers from the arXiv and is updated daily.
Each paper in the map is represented by a circle, with the area of the circle proportional to the number of citations that paper has. In laying out the map, an N-body algorithm is run to determine positions based on references between the papers. There are two “forces” involved in the N-body calculation: each paper is repelled from all other papers using an anti-gravity inverse-distance force, and each paper is attracted to all of its references using a spring modelled by Hooke’s law. We further demand that there is no overlap of the papers.
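The two forces described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code that produces the map: the function name `net_forces` and the parameters `repulsion` and `spring_k` are hypothetical, the spring is assumed to have zero rest length and to pull equally on both papers, the pairwise loop is the brute-force version, and the no-overlap constraint is left out entirely.

```python
import math

def net_forces(positions, references, repulsion=1.0, spring_k=0.1):
    """Return the net 2D force on each paper as a list of [fx, fy].

    positions:  list of (x, y) tuples, one per paper
    references: list of (i, j) index pairs meaning paper i references paper j
    repulsion, spring_k: hypothetical strength constants for the two forces
    """
    n = len(positions)
    forces = [[0.0, 0.0] for _ in range(n)]

    # Repulsion: every pair of papers pushes apart with magnitude
    # repulsion / distance (the inverse-distance "anti-gravity" force).
    for i in range(n):
        for j in range(i + 1, n):
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            dist = math.hypot(dx, dy)
            if dist == 0.0:
                continue  # coincident papers: direction undefined, skip
            # Unit vector (dx/dist, dy/dist) times magnitude repulsion/dist.
            fx = repulsion * dx / dist**2
            fy = repulsion * dy / dist**2
            forces[i][0] += fx
            forces[i][1] += fy
            forces[j][0] -= fx
            forces[j][1] -= fy

    # Attraction: a Hooke's law spring along each reference,
    # assumed here to have zero rest length (F = k * displacement).
    for i, j in references:
        dx = positions[j][0] - positions[i][0]
        dy = positions[j][1] - positions[i][1]
        forces[i][0] += spring_k * dx
        forces[i][1] += spring_k * dy
        forces[j][0] -= spring_k * dx
        forces[j][1] -= spring_k * dy

    return forces
```

In this sketch, two papers joined by a single reference feel no net force when their separation is the square root of `repulsion / spring_k`, which is where the inverse-distance push and the spring pull balance.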
The map is rendered simply as a solid circle for each paper. The colour of the circle denotes the arXiv category of the paper, and the brightness indicates age. Brightness is sometimes difficult to discern, and we are working on adding a heat-map overlay to indicate clearly the areas of the map which have the most recent activity.
As you zoom in on the map, labels will start to appear on individual papers. These labels are (mostly) automatically extracted by analysing word frequency in the title and abstract of the paper, and are generally indicative of the subject matter of that paper. Zooming in closer also shows the author(s) of the paper. If a paper is deemed to be a review paper, or a set of lectures, this is noted.
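To give a feel for what word-frequency labelling could look like, here is a minimal sketch. It is an assumption-laden illustration rather than the actual extraction: the function `candidate_labels`, the tiny `STOPWORDS` set, and the idea of counting title words more heavily (`title_weight`) are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a realistic one would be much longer.
STOPWORDS = {"a", "an", "and", "are", "in", "of", "on", "the", "we", "with"}

def candidate_labels(title, abstract, top_n=2, title_weight=3):
    """Return the top_n most frequent non-stopword words as label candidates.

    Words from the title count title_weight times (a hypothetical choice),
    words from the abstract count once.
    """
    counts = Counter()
    for text, weight in ((title, title_weight), (abstract, 1)):
        for word in re.findall(r"[a-z]+", text.lower()):
            # Skip stopwords and very short words.
            if word not in STOPWORDS and len(word) > 2:
                counts[word] += weight
    return [word for word, _ in counts.most_common(top_n)]


labels = candidate_labels(
    "Dark matter halos",
    "We study halos in simulations. The halos are massive and dark.",
)
print(labels)
```

A plain count like this favours words that are common everywhere, so a fuller version would presumably also compare against how often each word appears across all papers.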
References (and citation counting) are extracted by processing the TeX/LaTeX and PDF source obtained from the arXiv. This is done automatically each morning, and the map is finished updating about 3 or 4 hours after the arXiv’s new listing is announced. Some categories (notably hep-ph) have better reference extraction than others and so the map for these areas has more variation in paper size and more structure. We are working on improving the reference extraction.
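As a rough illustration of one part of this problem, the sketch below pulls arXiv identifiers out of a TeX bibliography with a regular expression. It is an assumption about how such a step might look, not a description of the real pipeline: the pattern `ARXIV_ID` and the function `extract_arxiv_ids` are hypothetical, and references that cite only a journal (with no arXiv identifier) are not handled at all.

```python
import re

# Matches old-style identifiers such as hep-ph/9901234 and new-style
# identifiers such as 1234.5678, with an optional "arXiv:" prefix and an
# optional version suffix (v2). Only the bare identifier is captured.
ARXIV_ID = re.compile(
    r"\b(?:arXiv:)?"
    r"((?:[a-z]+(?:-[a-z]+)?(?:\.[A-Z]{2})?/\d{7})|(?:\d{4}\.\d{4,5}))"
    r"(?:v\d+)?\b"
)

def extract_arxiv_ids(tex_source):
    """Return the distinct arXiv identifiers found, in order of appearance."""
    found = []
    for match in ARXIV_ID.finditer(tex_source):
        arxiv_id = match.group(1)
        if arxiv_id not in found:
            found.append(arxiv_id)
    return found


sample = r"""
\bibitem{a} A. Author, arXiv:1234.5678v2.
\bibitem{b} B. Author, hep-ph/9901234.
\bibitem{c} A. Author, arXiv:1234.5678.
"""
print(extract_arxiv_ids(sample))
```

A pattern like this can also match numbers that merely look like identifiers, which is one reason extraction quality might differ between categories with different citation habits.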
17 thoughts on “About Paperscape”
Very interesting! You have done a great job. I’d like to point out that the author search is solely based on surname. It would be great if you could do something about it. Perhaps, you could use the already existing author search on arXiv. Although, that search itself sometimes fails or gives wrong results. I guess arXiv has a solution for this: it allows authors to link their papers to their account. I look forward to seeing the more enhanced version of this. Keep up the good work!
Thanks Sedigh! We do support initials in author search, but not full first names. For example, s.d.m.white. I agree it would be good to have full first names, and we are working on this feature.
“I guess arXiv has a solution for this: it allows authors to link their papers to their account. ”
It is true: arXiv allows users to sign in and identify/own their research papers. This feature can be used to great advantage.
I really liked the idea, but I have a question. From the description, it seems that you use only the labelled category of the paper, but why don’t you come up with indicative labels for clusters using something similar to what you do for labelling individual papers? Since you represent each paper in 2D space in terms of a position and a volume, you can easily apply clustering to form groups. Then, to summarize each group, you can apply something similar to “HEADY: News headline abstraction through event pattern clustering”, which is a matter of active research by Google (and probably overkill for your project, but in the long run I think this would provide useful and interesting results).
Afterwards, the map that you get would be much more interesting, as the labels are learned from the data.
Also, as you use colour and size for representation, I would expect to have older and important papers towards the center and younger ones towards the edges.
Along with the automatic label detection, you could in the end get an idea of the trends in different branches of research, and of which branches are getting closer and relating to each other. For instance, some edge of machine learning may nowadays be getting closer to neuroscience, simply because neural networks are related to both, and things that can be done using artificial neural networks, such as Kalman filtering, can be shown to be done by the brain as well.