Content Mine

Kyle: Peter Murray-Rust holds a doctorate from Oxford University with interests in crystallography and informatics. He is currently Reader Emeritus at the University of Cambridge and formerly a Senior Research Fellow at Churchill College. In addition to his work in chemistry, he is also known for his advocacy of Open Access and open data, which led him to found ContentMine - a project that we'll discuss today which uses machines to liberate more than 100 million facts from scientific journals. Peter, welcome to Data Skeptic.

Peter: Thank you very much for inviting me.

Kyle: I'm really glad to have you. Well maybe we could jump right in and you could share the purpose and the mission of ContentMine.

Peter: Well first of all, ContentMine is about justice. The access to scientific information in the world is fundamentally unjust, with only a very small proportion of the world's population being able to access any significant amount. ContentMine is there to redress the balance as far as it can. More generally, we're creating information from the whole of the scientific literature, making it available to everybody. In this process we hope to build a community. We think communities are very important in the current century and this will be a community of not only scientists but also anyone interested - what people call "curious minds" - and they will come from all walks of life because they are keen to get scientific information and make it useful. Finally, much of the information is currently only available for sighted humans. If you cannot read it, you cannot access it properly, and we're interested in increasing the machine processability and accessibility.

Kyle: I think this is a fantastic mission, and in my opinion a necessary one in the current day and age. For anyone who might not be familiar with scientific journals and the process of publishing, can you summarize how scientific findings go from researchers and laboratories, where they are discovered, into academic publications?

Peter: Most people are either funded by a number of public, or sometimes private, funders or they're students studying for Masters or Doctorate degrees. Regardless of that, every scientist should keep a lab book where they record everything they do. Both their observations and also their conclusions. And when they feel that they have got something that the world needs to know about, they then write a draft paper. Now this is usually done in a group, so it normally includes a supervisor, some colleagues, and maybe people from other institutions. So, although there are a number of single author papers, most of them are published by a group of people. They write a draft and when they're happy with it, they choose a journal that they think is useful to publish it in. They submit it. The journal sends it out for review to a number of reviewers, somewhere between two and five, who then make comments. Now sometimes, the comments are that this paper is not acceptable for publication, in which case the authors may challenge this, and you will get an extended debate. Sometimes the reviewers suggest minor corrections and the authors correct them. Sometimes the reviewers say that there should be more experimental work done and so on. Finally, the authors will have a draft which the journal will accept and that draft is submitted to the publisher who then turns it into a production copy. Now, that's the current position.

I have serious concerns with some of it. The idea that you only publish at particular points during research is something which is increasingly out of step with the way that this century thinks. In software, we publish our software several times a day to repositories, and there is a growing movement of people, myself included, who want to do Open Notebook Science, where the notebooks are visible to the whole world as the experiment is done, and we hope to do this in ContentMine. We hope that anybody can see what we're doing at any stage, comment, and get involved. But the mainstream, 99.9% of science, is done in groups who do pieces of work, write it up, and then send it off to journals. I should also say that publication is not guaranteed. Many journals have a rejection rate of somewhere around about 50% or more. And some of them reject 95% or more. The rejection isn't always because the work is poor. It's because the work doesn't fit the excitement that this journal requires from its authors and, again, many of us feel this is a huge waste of effort to require people to resubmit papers several times until they find a journal which finds their work fit for publication.

Kyle: Interesting. The open notebooks is an especially novel concept, I feel. Although I imagine you can't speak for every scientist, but is there a concern that someone working on interesting stuff might have their work, you know, sort of, scooped out from under them by someone who's able to work a little faster because they're sharing notes?

Peter: There is certainly concern about that. It will take a long time before the majority of people come around to an open view. And indeed, it may well be that the majority never does come around. But one example which is being pursued at the moment is Matt Todd, who is a chemist in Sydney, Australia, and he's running an Open Source Malaria project, developing new chemical compounds to fight malaria (that is, fighting the parasite of malaria). This project is run completely openly and as a result he has a lot of people contributing who wouldn't normally be involved in a scientific project.

Kyle: That's a very valuable trade-off I would say. Let's, for a moment, make the obviously incorrect assumption that people have access to scientific publications, which I guess would usually mean I either have a printed version of the journal, or a PDF or PostScript file. Those file formats are pretty easy to view with free software. What else might someone want to have available that isn't readily available in that final publication version of a paper?

Peter: PDF is the normal mechanism that people use to communicate their published research. I'll note that it isn't what they use in a thesis, for example, where they use Word as an authoring tool, or in the more physics and maths communities, where they use LaTeX. But generally, the result is a PDF. PDF has one advantage, which is that it looks pretty and is immutable. It's often referred to as the "version of record". But it has many disadvantages. It's a very difficult standard to standardize, so having a PDF doesn't mean that you can cut and paste from it. It could mean that it's simply a photograph, a scanned copy of some typescript, or whatever. So it's a very, very general term indeed. The main thing that it doesn't offer is machine processability. Machines generally can't process PDF and extract useful information from it. Since it requires human eyes, it can be very slow to access the data. To give an example, we have one group we're collaborating with which has to read 30,000 papers a year for systematic reviews of trials in the literature. That means one paper every three minutes. Now, since some papers are 20 pages, you can see that this isn't effective. And so, we're building software which can read this and come up with the key points in two or three seconds. The final thing you don't get is supporting data. PDF is not a good way of publishing data; even if the data is in the PDF, you don't know where it is from a machine point of view. It can't be cut and pasted. Some journals, not all, support this with supporting data or supplemental information. We are able to download this systematically, and that's often very valuable.

Kyle: Absolutely. From what I understand, there are also tools... with a table, I guess if I had infinite time, I could copy... sort of transpose a table, but figures leave me often wanting the data that is there. Are there tools available that can help me liberate the data that is behind the plot that appears in the document?

Peter: This depends on the format that it's in. We should say that documents can range all the way from handwritten or typed manuscripts through to text in PDF through to machine-processable files. It depends very much what they are. The most general ones that we deal with, from the last 15 years of publication, contain PDF text which is extractable, but here you have to remember that PDF has no sense of order. So if I take the word "table", you read this as t-a-b-l-e. Right? And in Word or HTML, those letters are produced in that order. In PDF, they can be in any order. It's where they are on the page that matters. Which means that in the first instance, you have to create a tool which works out where they are on the page and whether there's enough space between them to separate words or not. So it can be very difficult. When you come to tables, it's even more difficult because you've got things in columns, but there's nothing in the PDF that says "this is a column". Sometimes you have some lines between columns and that helps a lot, but often they're just justified with white space running down the columns. Many people have thrown themselves at extracting tables. I would estimate that probably several hundred person-years have been spent in the world on trying to extract tables from PDFs.
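[Editor's note: a minimal sketch, in Python, of the positional reconstruction Peter describes - grouping PDF-style glyphs into lines and words by their page coordinates. This is an illustration written for these notes, not ContentMine's actual code, and the tolerance values are arbitrary assumptions.]

```python
# Sketch: recover reading order from glyphs that only know their page position.
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float   # horizontal position on the page
    y: float   # vertical position on the page

def glyphs_to_text(glyphs, line_tol=2.0, space_gap=3.0):
    """Group glyphs into lines by y-coordinate, then into words by x-gaps."""
    lines = {}
    for g in glyphs:
        # Bucket glyphs whose baselines lie within line_tol of each other.
        key = round(g.y / line_tol)
        lines.setdefault(key, []).append(g)

    out = []
    for key in sorted(lines):                         # one bucket per text line
        row = sorted(lines[key], key=lambda g: g.x)   # left-to-right
        text, prev = "", None
        for g in row:
            if prev is not None and g.x - prev.x > space_gap:
                text += " "                           # wide gap => word break
            text += g.char
            prev = g
        out.append(text)
    return "\n".join(out)

# The glyphs may arrive in any order; position, not order, decides the reading.
glyphs = [Glyph("b", 13, 100), Glyph("t", 10, 100), Glyph("a", 11.5, 100),
          Glyph("l", 14.5, 100), Glyph("e", 16, 100)]
print(glyphs_to_text(glyphs))   # -> "table"
```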

Kyle: Yeah, I've contributed some of that time myself.

Peter: Yep, exactly!

Kyle: Can you share a little bit about what tools are available if someone is struggling with that at the moment and they might be able to leverage some of the stuff you guys have developed?

Peter: It's probably a good idea to come and find our tools because we deal with most of the cases. The problem is that some PDFs are quite easy to deal with and some are extremely difficult. Until you're familiar with that, you might flounder. So what you will find out by visiting us is:

  • Is it only available as a bitmap or raster?
  • Does it have characters inside it?
  • Does it have any form of organized order and spacing or characters in the words?
  • Are the figures present as a bitmap or as a vector diagram?

...and the latter is really valuable. If you actually have vector diagrams, which you do if you create the documents in Word and in LaTeX, then you can get a huge amount of data out. But unfortunately the publication process often turns these into rasters because the publishers think that's a better way and that destroys all the vector information, making it much less valuable.

Kyle: Interesting. So while we're on the topic, is contentmine.org the best place to go for people to learn more about the tools and resources available?

Peter: Absolutely. We started off by running workshops - and we still do - saying "to find out how to do ContentMine you need to come to a workshop and we will show you the tools" and so on. However, our tools have progressed. Our documentation has progressed. We've been delighted with how quickly people pick this up. So we're getting an increasing number of people who are coming directly to the site and finding out how to do things. As a result, we are frantically writing documentation to cover this new section of the community who come straight to us and use just the site, rather than the workshops.

Kyle: So I believe you coined the phrase "the right to read is the right to mine". Personally, I agree with this statement very strongly. Can you share some of your thoughts on why you think this is the case?

Peter: There are ethical and moral aspects to this. First of all, I actually think that everybody should have a right to the published scientific literature. There was a case earlier this year where the Ministry of Health in Liberia said that they'd found a paper 35 years old which, if they had known about it, could have been used to prevent or limit the Ebola outbreak. And they simply hadn't been able to find it because it was behind a paywall. That's the most important thing I would say. There is a right to read the scientific literature. But having said that, if you have a right to read it - and by that I mean a legal right - then we believe you should have a legal right to mine it. Unfortunately a lot of content owners - as I would call them - are contesting that and saying if you have a legal right to read it, then you have to pay us more money for permission to mine it, even though they hadn't even realized five years ago that this was a valuable thing to do. So they want to create a new market out of simply sitting on their established information. It's rather like people who make new movies out of clips of old movies, or mash up music. That is, the original owners will say, okay, you have got to pay us more money because you're doing more exciting things with it. So that's why we challenged it. The phrasing of this comes from an important protest movement in the 1930s in the UK where people were not able to walk on the mountains of Scotland and England, so they went on a mass trespass, and their phrase was "the right to roam". That became embedded in law in the UK and I'm very pleased to see that "the right to mine" is becoming used in the discussions of legislators in copyright and content mining.

Kyle: I was very disappointed when I left university. Unbeknownst to me, I had benefitted from all these payments that my university was taking care of. As an independent person, I no longer had all the access I used to. Now I don't want to put you in the awkward position of being asked to defend a point of view you don't agree with, but why don't we have more scientific access? Is there any legitimate reason someone might think that paywalls need to stay in place?

Peter: The only reason that anybody cites is that we need a well-funded publishing industry to communicate science properly. Nobody except a very few argue that paywalls limit it to those people who ought to know. So we've had it argued that "only doctors should be able to read the medical literature" and that paywalls are a good way of limiting it. Paywalls are not set up to do this. They were put there to raise revenue for the publishers and they've never been used - and I don't think they ever should be used - as a way of controlling access. I should say that there are certain types of information where it's legitimate to control access, but not through paywalls. Those are things like: science with human data; science with rare breeding animals; possibly things related to certain types of security problems and so on. But generally, there is no excuse for not making it available to anyone. If, for example, you're a patient with a rare disease, you may very well know more about that disease by reading the literature than 99.9% of the medical profession.

Kyle: Yeah, and the Ebola example you gave is an especially striking one. It's sad to hear that there is human suffering as a result of these paywalls. In a paper you were an author on, which I'll link to in the show notes, titled "Responsible Content Mining", you outlined a number of really good best practices for people who plan to crawl or otherwise assemble large datasets. Can you share some of that advice?

Peter: Content mining is meant to be a universally accepted, legal operation. There are some places where it isn't yet legal or where it's a grey area. But in the UK, it is now legal to carry out content mining for non-commercial research purposes. We want to stay within the law. This is not a war against publishers, but it is a challenge to those people who wish to limit us carrying out our legal rights.

So the first thing is that we should have good web server etiquette. In other words, if we're going to read something from a publisher website, then we should give it a relatively low load. We shouldn't hit it with something that could be seen as a denial of service attack. In practice, an interval of a few seconds is more than enough and of course, if you're doing this from a number of publishers, you can alternate between publishers and so forth. In practice, the amount of load on publishers' servers is minute compared with, say, a single mention on Reddit or Slashdot, which can bring a publisher's site down. There's no basis for the claim that it will burn their servers out.
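[Editor's note: a minimal sketch of the crawling etiquette described above - a delay of a few seconds per request and alternation between publishers. The URLs, delay, and contact address are illustrative assumptions, not ContentMine's actual settings.]

```python
# Sketch: a polite, rate-limited fetcher that alternates between publishers.
import time
from itertools import cycle
from urllib.request import urlopen, Request

DELAY_SECONDS = 5  # far below anything that could look like a denial of service

def polite_fetch(url):
    # Identify yourself so the site operator can get in touch (placeholder address).
    req = Request(url, headers={"User-Agent": "content-mining-demo (contact: you@example.org)"})
    with urlopen(req) as resp:
        return resp.read()

def crawl(queues):
    """queues: dict mapping publisher name -> list of URLs still to fetch."""
    publishers = cycle(list(queues))
    while any(queues.values()):
        pub = next(publishers)
        if not queues[pub]:
            continue                      # this publisher's queue is empty
        url = queues[pub].pop(0)
        try:
            data = polite_fetch(url)
            print(f"fetched {len(data)} bytes from {pub}")
        except Exception as exc:
            print(f"skipping {url}: {exc}")
        time.sleep(DELAY_SECONDS)         # low, steady load on each server

# Hypothetical queues; substitute real article URLs.
crawl({
    "publisher-a": ["https://example.org/paper/1", "https://example.org/paper/2"],
    "publisher-b": ["https://example.com/article/9"],
})
```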

The second is that we should have good scientific practice and good manners. If we mine something from a site, we should say where it came from, and if there's a danger that this doesn't represent the purpose of the site properly, then we should go out of our way to make that clear. So for example, if we mine something from a site and let us say there are a few errors in what we do, particularly if we are doing OCR, then we should make it clear that this is not the definitive version. That it may contain up to, let us say, 1% incorrect characters, something of that sort. In which case, the readers know that we are not claiming that this is a precise article. That if you needed to go into court to justify something, copyright, plagiarism, or something, you wouldn't do it with our copy, and so on. With collaborating publishers, we have no problems, and I think good manners and scientific practice will ensure that.

The final thing is observing copyright. This is very hard to be 100% compliant with because copyright is so complex. It differs in every country. It is unclear for many documents whether they are under copyright, and if so what. We take the view that there are toll-access publishers where copyright is non-permissive and there are Open Access publishers where copyright is permissive - normally CC-BY. And if we have CC-BY, then we can download and do more or less whatever we like with the paper so long as we respect the scientific process. When we are downloading something, we are potentially infringing copyright because we are copying it, even if we never make it available to anybody else. In the UK, we have the right to do this. Now we have to be careful that copyrighted material - in large amounts or indeed any at all - doesn't get published by mistake, because we're not in the business of being "pirates", although some publishers believe that mining is a threat in this direction. If the pirates wanted to copy the whole of this, they almost certainly have already done it. So it's a red herring to introduce piracy. We would try and make sure that anything we do was either held on one site or, more likely, was actually only copied transiently, processed, and then discarded.

Kyle: I'm glad we're talking about some of the legal challenges, because in my opinion it's somewhat ambiguous what is or is not permitted, and maybe some of that is because technology is changing so fast. But there's also a precedent, I would say; at least here in the United States there was the case of Aaron Swartz, which I would imagine you're familiar with.

Peter: Indeed

Kyle: A university student who was just trying to download and liberate - if I recall correctly, it was JSTOR papers.

Peter: Correct

Kyle: And very strict and stringent criminal charges were brought against him. So that's kind of a... I don't know if that was done deliberately to say this is how we're going to handle things, but that's sort of the precedent. Do you think that people are trying to scare individuals away from data mining?

Peter: I hesitate to speak authoritatively on American law and justice. My own view is that this was deliberately over the top. That it is unclear whether Swartz had committed any offense, in fact. He was never brought to trial. He was arrested by the Federal authorities and not by the civil court or civil processes. There's some suggestion that the prosecutor wanted to make a thing of this either as an example or for their political career. I can't comment on that. I'm simply relying on what other people have said. But it is certainly unclear whether Swartz committed a crime or not, or even whether he broke copyright. So I'll go on to the things we have to be careful about.

First of all, I don't think any of this constitutes a criminal offense. There are things that happen with DRM [Digital Rights Management] which I believe are potentially criminal offenses. So if, for example, we try and break DRM on somebody's site, then it may be that we can be pursued by the government and authorities, or not. I'm also speaking here generally, because laws in different countries are different and you have to realize that this is incredibly complicated, because there are probably 100 different countries and jurisdictions to deal with and some of them are incredibly arcane. But generally, there are three pieces of law that you have to worry about. One is copyright. The second is contract law. And the third, which only holds in Europe, is the so-called sui generis database right.

Let's start with copyright. Copyright says that you do not have permission to copy a document unless you obtain the permission of the copyright owner. In some countries there is a doctrine of "fair use" which says that you can copy bits of it and so on. But even in those cases, it can be difficult to know whether you have the right to copy it. We in ContentMine are only copying it for the purpose of mining it for research purposes and we're only copying it temporarily. We're not republishing copyrightable parts. Of course, nobody knows what is a copyrightable part of a document. We have the general doctrine that facts are not copyrightable. That's in the Berne Convention. If I say "the temperature outside is 22 degrees Celsius", that is not a copyrightable statement because there's no other way of expressing it. If on the other hand I say "Oh what a lovely morning", I might be infringing Rodgers and Hammerstein. So you have to realize that we're sticking with factual material which is not copyrightable. It's unclear, for example, whether an abstract of a paper is copyrighted. Some publishers probably will claim it is copyrighted. So we do not reproduce abstracts by default. It's that sort of thing. Because it's unclear, what generally happens is that if a publisher challenges something then they can issue a DMCA take-down request. And this is granted automatically, and the alleged offender has to take it down and has to argue their case as to why this is not in fact an infringement of the law. Now this in my view is vastly weighted towards the copyright owner. It's guilty until proven innocent. In some countries like France, if you offend three times, as alleged by the owner, then you can be banned from the internet - the so-called HADOPI law.

Kyle: Wow

Peter: Oh yes. It is a very difficult area. However, in the UK, we have a legal right to do it. The only other country which has this right is Japan. In the US there are, I would say, de facto rights, but they're not necessarily legal rights.

Fair use is something that you may well be able to argue but Larry Lessig has said that it's "the right to call a lawyer in your defense". It's no stronger than that. It might very well mitigate your sentence or whatever.

The second is contracts. Most universities have signed contracts with publishers which forbid text and data mining, and they have a phrase, something like "notwithstanding X you may not crawl, spider, index, download, etc, etc, etc", and the universities have by and large signed this. Now first of all, I think they've been highly irresponsible in doing this. They haven't brought this to public attention. It's against natural justice. Secondly, in the UK, the new legislation expressly says that this has no legal force. So in other words, we are going to mine stuff from the University of Cambridge, regardless of what has been signed with the publishers, and we have the University library on our side with this. We cannot be held to be breaking contracts because Hargreaves explicitly says that the new legislation overrides any contract. Exactly what it does with pseudo-DRM material, we don't know.

The final thing is "sui generis" which is only applicable in the European Union. Sui generis was passed about 15 years ago - the database right - and it's says that a database and its contents are protectable by this law, effectively copyrighting it. This might mean that something like a collection of telephone numbers was copyrightable in Europe, but not necessarily copyrightable in the US. Again, I think it's a highly skewed law. I don't think it brings any benefit. What is does is holds back progress. We argue, and its not been test[ed] in court to my knowledge, that a journal is not a database. It is a collection of documentary material for other purposes and is therefore not protectable by sui generis.

Kyle: I would suspect, just me guessing, that the reason for any protection like that would be if let's say I spent a year of my time and went out and did field research and built up that data, that I should have, maybe some oversight on the data... not just that authors have submitted their papers to a publisher who is building a wall around them. Do you think that's a fair perspective on things?

Peter: It's a very commonly held one. I think that the conventional way of doing science is that if you collect data, you have a right to use that data until you've extracted anything useful out of it, and then republish it. That view is under great threat at the moment, and rightly so. As one example: in the UK, a scientist at Queen's University Belfast had collected data on tree rings (dendrochronology) over 30 years. He had retired and was asked for this data through Freedom of Information requests, and the University declined to release it. The Freedom of Information Commissioner then overruled this and said that in fact the data belonged to the University - and not to the scientist - and that the University had a duty to release this data. You can see that the balance is changing. It's also true that funders are extremely keen that their data should be made available.

Kyle: In my opinion, the accessibility of research results should be dictated by the researchers themselves and perhaps to some degree by those that fund them. Yet it seems like the publishers have the most control in this situation. Do you find that to be correct? Also, what's your personal opinion about who should have control over access to scientific literature and data?

Peter: I am not as strong on saying that control of access to research should be done by the researchers themselves, particularly if it's publicly funded.

Kyle: Ah, true.

Peter: As you probably know, in this country, we had a big public storm called Climategate where people wanted climate data from the University of East Anglia to reanalyze and the university declined to let them have it and there was a great deal of bad feeling, emails which had unfortunate sentiments in them, and so forth. Now my view would be if that project were being funded now, the funders would probably put much more stringent explicit requirements that that data be made available.

Kyle: That makes sense to me. I think maybe as the Open Access movement takes off, we'll see projects being funded from the start that way. Perhaps that will enable better collaboration and community efforts. One such project that's caught my attention is the text2genome project, which annotates the human genome with, as I understand it, papers relevant to specific areas. So if one researcher is kind of looking at one part of the genome, they might learn who else has published about that area, and I could see how this would be tremendously valuable to them. I think that can only be made available through data mining and that sort of annotation. Do you see that as a success in the same way I do? Are there other similar success stories you're aware of?

Peter: I know the people involved in that - Max Haussler, who is now at San Diego, and Casey Bergman, who is at Manchester. And yes, it's a very useful project. There are a lot of people who are looking for textual annotation of biomedical literature, split roughly 50/50 between the new genome stuff and medicine in general. Lots of people want to analyze the literature to annotate genomes, because the sequence of the genome is known but not necessarily all the functions of it - what all the regions of it do and so on. This is often described in free text in the literature, so that you get something like 'this region regulates the expression of some protein' or whatever, and that's in textual format, so you want to be able to tie that protein to that region of the genome. That's one thing. At the other end, people want to look through medical records and clinical trials to come up with patterns of disease or treatment or whatever. So we're working with the clinical trials group Open Trials to look at how we can help, and also with the Cochrane Collaboration to see how we can help with systematic reviews of the medical literature - to pull out those pieces of papers on trials which are sufficiently valuable to be systematized into a resource.

Kyle: So I've noticed there are a lot of things, in particular the arXiv coming out of Cornell, that have been a great resource to me personally, and I think to a lot of other people. My sense is that the sentiments that ContentMine has are starting to become more popular and that we're seeing perhaps the beginning of an Open Access movement. That doesn't mean we don't have a long way to go still liberating data from paywalls and walled garden communities, but do you think the scientific community is starting to be on the right track for Open Access?

Peter: Some days I think yes; some days I think no. I think it's probably true to say none of the main toll-access publishers is interested in having all of their literature Open Access and having this as the mainstream approach. And they will find ways, I believe, to keep closed access as a critically important part of what they are selling. It's a model which they're familiar with, which they know how to operate. At the moment what they're doing is generally making their "glamour journals" closed access. So in bioscience, this is Cell, Nature, and Science. I am quite sure those will remain closed for a long period. Most publishers have Open Access offerings which are competent. Nothing wrong with that. But they're not where most scientists will aim to publish their important results. So, I don't actually think that we're going to see universal Open Access anytime soon. Having said that, the meme is out there. The funders are very keen on Open Access and they want people to publish in Open Access - make their stuff available. So there's a conflict here. It's one where money is one of the important things. The publishers have now got a market of about 15 billion (with a "B") dollars a year. That means that if we were to go Open Access, we have to switch that amount of money from the closed subscription mechanism to author-side funding, and that's going to be incredibly complicated, and I don't see anybody stepping up to cut that Gordian knot. So whether it will slowly change, I don't know. It's an awful lot of money to shift. Also, with the subscription model, the publishers have the say in who gets what and how things are rated. The publishers are selling reputation. This is not based on any intrinsic measure of reputation. It's based on counting citations, which is about as valuable as counting the number of notes in a piece of music to tell you how good it is. But that's how it's done, and they will want to keep that because it's cheap to operate and very lucrative. On the other hand, I think arXiv is wonderful. To give you an example, it costs a total of $7 to publish a paper in arXiv. For many purposes that's sufficient to communicate results to the community. It needs community comment. Call it peer review. But I would call it post-publication peer review. And in my view that's the ideal way to publish, where you've got a publication and then the world adds on what it thinks about it, and the cost of publication is trivial. Compare that with Nature, where they say it costs them 40,000 dollars to publish a paper. So that's 40,000 against 7. Something is wrong there.

Kyle: Yeah. So I want to get back a little bit more to ContentMine. I can really appreciate some of the challenges you guys have. I've tried to do some liberation of data myself and I know how difficult it can be, and that's only in the particular fields I know and am interested in. I can't imagine how difficult it must be to scale out to lots of different academic pursuits. So I'm curious about how the variety of data might affect the challenges you guys face, especially when you want to extract things that vary by field.

Peter: Very good question indeed. We're concentrating on the published scholarly literature, which is about 1.5 million articles a year - somewhere around about a few thousand a day. And they're published by a huge variety of publishers - let's say one thousand, with probably 15 major publishers publishing the bulk of that. So that's the mechanism. I would also say we mustn't forget an incredibly valuable resource, which is student theses. Students put a lot of work into their thesis. They're heavily peer reviewed, and the examiners know that. Many of them are not reused. Now they often contain huge amounts of unpublished data or other data which is not published elsewhere, because in many cases, the student leaves, the supervisor moves elsewhere, and they just don't write up all the work that the student has done. So the thesis is often the primary record of that. So I'm very keen on doing theses and I'm particularly keen on countries like the Netherlands and France, which have got all their theses in one place.

First of all, discovery is one of the challenges. It's remarkable that even though we spend 15 billion dollars a year on publishing, we don't have an index of the published literature. We don't have an open one. We have Thomson Reuters' Web of Science, which comes out of Current Contents - many years ago, ISI. But it is selective. It doesn't cover much of the global south. What we desperately need, and it seems to me almost trivial to create if we have the will, is actually a record of what has been published. The best that we have at the moment is CrossRef, probably, but CrossRef is a publisher-funded organization and although I get on very well with the CrossRef people, they're always subject to the fact that they're dependent upon publisher funding. I really do believe that the world scholarly library community has a duty to make an index - a believable index - of the world's scientific literature, not just from the rich North. So that's the first thing: discovery. When you've discovered it, then you want to be able to search it. In time, ContentMine results will be used to help with that, but what we're doing in the first instance is something called "getpapers", which is a tool by Richard Smith-Unna from our group, which goes to a collection of papers and allows you to ask a query through their API. For example, we support arXiv, we support Europe PubMed Central or NIH. We support repositories such as CORE. If you have a repository which has a lot of stuff which is valuable to you, then "getpapers" is the place to start. And that means, for example, if you go to Europe PubMed Central and ask for dinosaurs, you'll get a few hundred papers on dinosaurs. But the "getpapers" tool will wrap them up in a way which is ideal for the next phase of the process.
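[Editor's note: a rough sketch of the kind of query "getpapers" makes against Europe PubMed Central. The REST endpoint and JSON field names here are recalled from memory rather than taken from the interview, so treat them as assumptions to verify against the API documentation.]

```python
# Sketch: a getpapers-style search against the Europe PMC REST API.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUROPE_PMC = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def search(query, page_size=25):
    params = urlencode({"query": query, "format": "json", "pageSize": page_size})
    with urlopen(f"{EUROPE_PMC}?{params}") as resp:
        payload = json.load(resp)
    # Each hit carries enough metadata (IDs, title, and so on) to feed the
    # next phase of the pipeline, much as getpapers wraps up its results.
    return payload.get("resultList", {}).get("result", [])

for hit in search("dinosaur"):
    print(hit.get("id"), hit.get("title"))
```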

Kyle: Excellent.

Peter: The next tool... I'll go through the four tools, no, the five tools. The next tool is called "quickscrape", again by Richard. "quickscrape" allows you to put a DOI - or more normally a URL - into the system and download everything associated with it. So you'll remember I talked about supplemental data. You can go to a paper - I'm going to take PLOS ONE as an example, because it's got the largest number of Open Access papers. You can go to PLOS ONE with a URL and ask it to download the PDF, the HTML, the XML, the figures, and the supplemental data all in one go without putting any more effort into it. And you could download, for example, all the papers published today, which would come to somewhat over 100. It will wrap them all up, again, in exactly the same form that we need for the next phase of the process. The next tool is called Norma. Now, Norma shouldn't be necessary. What Norma does is normalize the publishers' formats into a common semantic form. By semantic, I mean machine processable, so that a machine can read it without having to be told how to do it. This shouldn't be necessary, because authors create something close to semantic information, but commercial publishers - whether they are toll-access or Open Access - convert it into something that they feel is right for their purposes, so they'll turn the Word into PDF. They'll turn the images into bitmaps.

Kyle: (Grumble)

Peter: They'll spray the page with things about "how wonderful we are", "discover other papers", "find out what your rating is on Twitter" and so on. Nothing to do with science at all. And we strip all that stuff off. But what we also have to do in Norma is turn PDFs into HTML, which is hard and lossy; turn the bitmap images into SVG - Scalable Vector Graphics; and normalize the text so that we have sections that we understand. Now this is not 100% lossless, but in many cases it's pretty good. When we finish Norma, we've actually got something which is fit for purpose. I wouldn't suggest that anybody do content mining unless they've got something which does the same function as Norma, because otherwise they have to write different tools for every journal. Norma, like all our tools, is Open Source (Apache2), so anybody can download it. They can use it in commercial programs. They can use it for any lawful purpose.

Before I go on to the next ones, I should say that you raised the question of "isn't every journal different?" and unfortunately it is. We've had to build this into "quickscrape" and Norma. "quickscrape" has a per-journal scraper. This is not quite as bad as it sounds, because many journals are owned by the same publisher, so if we've done a scraper for one BMC paper, we've done it for the lot, so by writing a scraper for one journal, you often get a few hundred that will use exactly the same format and therefore be scrapable. However, there are journals which do their own thing, and we have to write scrapers for them, and this is where the community comes in. We see this as being done through community activity - people who know the journal and are interested in mining it, so they have an interest in building a mining tool.
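[Editor's note: a simplified sketch of what a quickscrape-style scraper does for a single article - fetch the landing page and collect links that look like the PDF, XML, figures, and supplementary files. Real quickscrape scrapers are per-journal and far more precise; the URL and the matching rules below are illustrative assumptions.]

```python
# Sketch: collect article assets from a landing page by crude link matching.
import re
from urllib.parse import urljoin
from urllib.request import urlopen

ASSET_PATTERNS = {
    "pdf":           re.compile(r"\.pdf($|\?)", re.I),
    "xml":           re.compile(r"\.xml($|\?)", re.I),
    "figure":        re.compile(r"(figure|fig\d+|\.tif|\.png)", re.I),
    "supplementary": re.compile(r"(supplement|supporting|\.s\d{3})", re.I),
}

def scrape_article(url):
    html = urlopen(url).read().decode("utf-8", errors="replace")
    links = re.findall(r'href="([^"]+)"', html)     # crude link extraction
    assets = {kind: [] for kind in ASSET_PATTERNS}
    for link in links:
        full = urljoin(url, link)                   # resolve relative links
        for kind, pattern in ASSET_PATTERNS.items():
            if pattern.search(full):
                assets[kind].append(full)
    return assets

# Hypothetical usage; substitute a real open-access article landing-page URL.
for kind, urls in scrape_article("https://example.org/article/123").items():
    print(kind, len(urls))
```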

The fourth is called AMI. AMI is basically a workflow for discipline-specific plugins. This provides an easy way of mining. There are lots of different things you might want to mine, and at the moment, we've got a whole list so we can do:

  • a bag of words - a word cloud on the paper which tells you what it's about,
  • regular expressions, which will find out where words are in a piece of text,
  • chemistry,
  • phylogenetics,
  • species,
  • sequences,
  • identifiers,
  • genes.
There's a whole lot of things we can do. There's a lot of things we can't do, but we've built a platform where the only thing you need to worry about is your discipline - how you turn text into your material - and you don't have to worry about the housekeeping. That's the advantage of a framework. So that's AMI. The results of AMI go into "Cat", which is a catalog based on Elasticsearch, and there we have several million datasets stored and you can search this using a variety of tools based on faceted search, which is what Elasticsearch provides. It also makes it very easy to start looking for concepts which co-occur - so, which author is connected with this journal and with which subject. That's the sort of thing which you can ask relatively straightforwardly in Elasticsearch. That's Cat.
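[Editor's note: a toy illustration of the kind of regular-expression plugin AMI supports, scanning normalized text for binomial species names. The pattern is deliberately crude and is the editor's own sketch, not ContentMine's actual species plugin.]

```python
# Sketch: count candidate binomial species names in a piece of normalized text.
import re
from collections import Counter

# Capitalized genus followed by a lowercase species epithet, e.g. "Homo sapiens".
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s([a-z]{3,})\b")

def extract_species(text):
    hits = Counter()
    for genus, species in BINOMIAL.findall(text):
        hits[f"{genus} {species}"] += 1
    return hits

text = ("Specimens of Tyrannosaurus rex were compared with Allosaurus fragilis; "
        "no Tyrannosaurus rex material was complete.")
for name, count in extract_species(text).most_common():
    print(count, name)
```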

At that stage, we have a tie-up with Wikidata. We're going to offer our data to Wikidata in case they haven't got it. But we're also going to search Wikidata for data to give us another chance of validating whether what we've extracted is correct, and we're going to start doing that on species. So that's basically our framework. There's a sixth tool called Canary, which is a graphical user interface to run the whole process, so you can put in your own URLs and say 'please run Norma and AMI and put the result into Cat'. That sort of thing.
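[Editor's note: a small sketch of the kind of cross-check against Wikidata that Peter mentions - asking the public SPARQL endpoint whether an extracted name is a known taxon. The endpoint URL and the property P225 ("taxon name") are the editor's assumptions from memory, not details given in the interview.]

```python
# Sketch: validate an extracted species name against Wikidata via SPARQL.
import json
from urllib.parse import urlencode
from urllib.request import urlopen, Request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

def is_known_taxon(name):
    query = 'SELECT ?item WHERE { ?item wdt:P225 "%s" } LIMIT 1' % name
    url = f"{SPARQL_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"
    req = Request(url, headers={"User-Agent": "contentmine-validation-demo"})
    with urlopen(req) as resp:
        bindings = json.load(resp)["results"]["bindings"]
    return len(bindings) > 0          # any match means Wikidata knows the taxon

print(is_known_taxon("Tyrannosaurus rex"))   # expected: True
print(is_known_taxon("Notarealus species"))  # expected: False (made-up name)
```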

Kyle: In terms of community, what opportunities exist for people to volunteer or make contributions?

Peter: We've just set up a community process and platform. Our community manager is Graham Steele, who has been incredibly active in patient support organizations, in the Open Access arena, and so forth. The way we're doing it is inviting people to set up CMunities [ContentMine communities]. We're extremely keen on setting up CMunities which will have a large or even complete degree of autonomy over how they use the ContentMine tools to mine the literature. We will provide general support for them in terms of developing the next level of tools and providing material for documentation. We'll also be running workshops. We've currently got the following groups who are interested in having a CMunity:

  • clinical trials,
  • animal tests,
  • chemistry,
  • high energy physics,
  • phylogenetics,
  • taxonomy,
  • plants,
  • neuroscience,
  • and crystallography.

The requirement for a CMunity is that there should be some person who is going to lead it with energy, so that we know that it's going to be there in a month, in two months, whatever. That they've got the ability to pull other members of the community around them, and that they will keep up the interest by, let's say, running some sort of heartbeat process of regular mailings, possibly stand-ups, and other things which people use to develop this. Obviously they won't all progress at the same rate, but that's the general way. We will support it with catch-ups, probably every two months, with Graham, myself, and other technical people, helping them get over technical and other social problems. We're keen that some of them will apply for funding to do this. Because there's a lot of interest in content mining, and a lot of interest in open data, in making data available, we think in some CMunities it will be possible to have grants which help support this, and ContentMine is very keen to partner with people, usually as a minor partner, to help develop this type of process.

Kyle: So when you've successfully liberated a lot of new data, how do you go about making that available to other people?

Peter: What we're not going to do is create a huge dump of the data with literally a billion facts in it. First of all, we don't have the resources to do it. Secondly, we might be challenged by people who have the right... who think they've got the right... to have that data. What we're actually going to do is download and process the papers every day and then publish the factual metadata that comes out of it on our website. Now that data may not hang around for more than a week, but other people can access it and build their own resources. For example, we're pulling out species, and one of the things that we want to investigate is endangered species, so we might make a daily list of all mentions of endangered species and the facts associated with them. Then people interested in this, conservationists, might very well scrape those off our site every day or every week into their own database. That's actually an ideal way of doing it. It means that we don't have ongoing maintenance problems in lots of domains. The data don't get lost, of course, because these are in the primary literature and so we can in principle always go back and do it again. It's not like an experiment where you've lost a log book. Making it available on a daily basis is the primary tool. We will be doing some of the things that we are particularly interested in ourselves and making small resources available, and we are also going to work very closely with Wikidata on the one hand and, in Berne, with a group called Plazi, which is doing the taxonomy. That data is going to be stored in Zenodo, which is a free database run by CERN, the high energy physics community. So we're looking for all sorts of ways that we can make the data reliably available for a reasonable amount of time without costing the community anything other than marginal costs.

Kyle: That makes a lot of sense, and I really like the approach that there'll be other hopefully satellite organizations that see the value there and start to mirror that. Where it makes sense for them, those can become tertiary resources that pick up and maintain some of those datasets.

Peter: Absolutely. The other question you asked was about volunteers. Yes, we're very keen to have volunteers. There are two types of volunteers. One is concerned with domain-specific tasks. We would see people coming from the taxonomy community who are interested in building the semantic resources for extracting taxonomic data from the literature. On the other hand, we'd see people who are interested in general information extraction - people who are interested in natural language processing, tools to extract tables, analyzing diagrams in images, people who want to port this thing to different types of architecture. It doesn't have to run on Node.js and/or Java, which is what we use at the moment. If people want to port it to Python or something like that, that would be great, and so forth.

Kyle: Excellent. Peter, this has been a really enriching conversation. I want to thank you so much for coming on the show. I'll encourage listeners to go and check out contentmine.org. That and many other things will be linked to in the show notes. There I'm sure you can learn more; you can find out about volunteering or joining one of those communities we talked about, or otherwise learn about the benefits of the tools and facts that ContentMine has extracted and cataloged. Until next week I want to remind everyone to keep thinking skeptically of and with data.