Google is a great search tool but a terrible one for data mining. Data mining the scientific literature is a challenge in itself: the publishing industry sits on a mountain of data and will resist any sharing. ContentMine.org is an initiative aimed at doing away with Google and making data mining of the literature a reality; for now it is based on open-access content but, Sci-Hub willing, it will eventually be limited by nothing except the lawyers. The challenge for this blog: get data mining running, for now on Windows 10.
Step 1. The package manager
This step requires installation of node.js. Instructions here. Chocolatey (another play on Java?) is the actual package manager; many exist, but this one is geared towards Windows. Open Windows PowerShell as an administrator and enter:
iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))
Check for a successful installation: enter “choco” and check the output: “Chocolatey v0.9.9.12”
Step 2. Data mining tools
With choco in hand, the Norma tool can be installed:
choco install norma -s https://www.myget.org/F/contentmine/api/v2 -y
This tool converts PDF files to HTML or other formats.
Next up is getpapers, a tool for fetching open-access articles from Europe PMC, IEEE and arXiv. This time the package manager is npm:
npm install --global getpapers
Next up would be AMI, which is a plugin library, but all efforts at installation failed.
choco install ami -s https://www.myget.org/F/contentmine/api/v2 -
results in “The remote server returned an error: (404) Not Found.” The AMI library is not a requirement for what follows.
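A quick way to confirm that the tools from steps 1 and 2 ended up on the PATH is a small Python check. This is only a sketch: it assumes Python 3 is installed (not covered in this post) and that the executables are named choco, node, npm, getpapers and norma, as in the commands above. The helper name missing_tools is made up for illustration.

```python
import shutil


def missing_tools(names):
    """Return the tool names that cannot be found on the PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names, taken from the commands used in this post.
missing = missing_tools(["choco", "node", "npm", "getpapers", "norma"])
print("Missing:", ", ".join(missing) if missing else "none")
```

If a tool shows up as missing right after installation, opening a new PowerShell window so that the PATH is reloaded may be all that is needed.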
Step 3. Data mining
From the command line it is now possible to get an idea of how the data mining process works. Enter:
getpapers -q ABSTRACT:"cubane and synthesis" -n -o cubane
info: Searching using eupmc API
info: Running in no-execute mode, so nothing will be downloaded
info: Found 21 open access results
For retrieving the metadata enter:
getpapers -q ABSTRACT:"cubane and synthesis" -o cubane
info: Searching using eupmc API
info: Found 21 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
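To get a first look at the metadata, the eupmc_results.json file can be read with a few lines of Python. This is a minimal sketch under assumptions, not a description of the getpapers format: it assumes the file sits in the cubane output folder, that it holds a JSON list of records, and that each record has a "title" field that is either a string or a one-element list. The function name extract_titles is hypothetical.

```python
import json
from pathlib import Path


def extract_titles(records):
    """Return one title string per metadata record.

    Assumption: each record is a dict whose "title" value is either a
    string or a list of strings; missing titles get a placeholder.
    """
    titles = []
    for record in records:
        title = record.get("title", "(no title)")
        if isinstance(title, list):
            # Unwrap a list-wrapped value, falling back if it is empty.
            title = title[0] if title else "(no title)"
        titles.append(str(title).strip())
    return titles


# Assumed location: the folder passed to -o in the command above.
results_file = Path("cubane") / "eupmc_results.json"
if results_file.exists():
    with results_file.open(encoding="utf-8") as handle:
        for title in extract_titles(json.load(handle)):
            print(title)
```

If the titles come out wrong, opening the JSON file in a text editor should show quickly which of the assumptions about its layout does not hold.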
For retrieving the actual papers enter:
getpapers -q ABSTRACT:"cubane and synthesis" -o cubane -x
and all papers are collected as XML files, each in a separate folder.
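As a starting point for the actual mining, the downloaded files can be listed from Python. The sketch below assumes the output folder is called cubane (the -o value used above) and simply collects every file ending in .xml beneath it; the function name find_xml_files is made up for illustration and the exact file names that getpapers writes are not assumed.

```python
from pathlib import Path


def find_xml_files(output_dir):
    """Return all .xml files below output_dir, sorted by path."""
    root = Path(output_dir)
    if not root.is_dir():
        # Nothing downloaded yet, or a wrong folder name.
        return []
    return sorted(root.rglob("*.xml"))


# Assumed output folder: the value passed to -o in the commands above.
for xml_file in find_xml_files("cubane"):
    # The name of the containing folder identifies the paper.
    print(xml_file.parent.name, xml_file.name)
```

Each path returned can then be handed to an XML parser of choice for the next step.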
There you have it: all open-access published research on cubane synthesis ready for datamining!