Google is a great search tool but terrible for data mining. Data mining the scientific literature is a challenge in itself: the publishing industry sits on a mountain of data and will resist any sharing. ContentMine.org is an initiative aimed at doing away with Google and making data mining of the literature a reality. For now it is based on open-access content but, Sci-Hub willing, it will eventually be limited by nothing except the lawyers. The challenge for this blog: get data mining running, for now on Windows 10.

Step 1. The package manager

This step requires installation of node.js. Instructions here. Chocolatey (another play on Java?) is the actual package manager; many exist, but this one is geared towards Windows. Open Windows PowerShell as an administrator and enter:

iex ((new-object net.webclient).DownloadString('https://chocolatey.org/install.ps1'))

Check for a successful installation: enter “choco” and check that the output reads “Chocolatey v0.9.9.12”.

Step 2. Datamining tools

With choco in hand the Norma tool can be installed:

choco install norma -s https://www.myget.org/F/contentmine/api/v2 -y

This tool converts PDF files to HTML or other formats.

Next up is getpapers, a tool for fetching open-access articles from EuropePMC, IEEE and ArXiv. This time the package manager is npm:

npm install --global getpapers

Next up would be AMI, which is a plugin library, but all efforts at installation failed.

choco install ami -s https://www.myget.org/F/contentmine/api/v2 -

results in “The remote server returned an error: (404) Not Found.” The AMI library is not a requirement.
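
Before moving on it can help to confirm that the installed commands are actually reachable from the command line. The following is a minimal sketch in Python (not part of the ContentMine tooling; it assumes Python is available on the machine) that looks up each command name on the PATH. The helper name find_tools is made up for this illustration.

```python
import shutil

# Command names taken from the steps above; "ami" is left out because
# its installation failed.
TOOLS = ["choco", "norma", "getpapers"]


def find_tools(names):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools(TOOLS).items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

A command reported as NOT FOUND usually means the shell has to be reopened after installation, or that the install step did not complete.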

Step 3. Data mine

From the command line it is now possible to get an idea of how the data mining process works. Enter:

getpapers -q ABSTRACT:"cubane and synthesis" -n -o cubane

Outputs:

info: Searching using eupmc API
 info: Running in no-execute mode, so nothing will be downloaded
 info: Found 21 open access results
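
When trying out several queries, the hit count is the only number of interest in this dry-run output. As a rough sketch, assuming the output has been captured as text and that the "Found N open access results" wording shown above stays the same, it can be pulled out with a regular expression. The function name parse_hit_count is hypothetical.

```python
import re


def parse_hit_count(log_text):
    """Return N from a 'Found N open access results' line, or None if absent."""
    match = re.search(r"Found (\d+) open access results", log_text)
    return int(match.group(1)) if match else None


# The dry-run output quoted above, used here as sample input.
sample = """info: Searching using eupmc API
info: Running in no-execute mode, so nothing will be downloaded
info: Found 21 open access results"""

print(parse_hit_count(sample))  # prints 21
```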

To retrieve the metadata, enter:

getpapers -q ABSTRACT:"cubane and synthesis"  -o cubane

Outputs:

info: Searching using eupmc API
 info: Found 21 open access results
 Retrieving results [==============================] 100% (eta 0.0s)
 info: Done collecting results
 info: Saving result metadata
 info: Full EUPMC result metadata written to eupmc_results.json
 info: Individual EUPMC result metadata records written
 info: Extracting fulltext HTML URL list (may not be available for all articles)
 info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
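
The metadata file can be inspected with a few lines of Python. This is only a sketch under assumptions that are not confirmed by the output above: that eupmc_results.json sits in the cubane output folder, that it holds a JSON array of records, that each record has "pmcid" and "title" fields, and that a field value may be wrapped in a one-element list. The helper names first and list_titles are made up for the example.

```python
import json
from pathlib import Path


def first(value, default=""):
    """Unwrap a one-element list; pass plain values through unchanged."""
    if isinstance(value, list):
        return value[0] if value else default
    return default if value is None else value


def list_titles(results_path):
    """Return (pmcid, title) pairs from a results file (assumed to be a JSON array)."""
    records = json.loads(Path(results_path).read_text(encoding="utf-8"))
    return [(first(r.get("pmcid")), first(r.get("title"))) for r in records]


if __name__ == "__main__":
    # Assumed location: the folder passed to getpapers with -o.
    path = Path("cubane") / "eupmc_results.json"
    if path.exists():
        for pmcid, title in list_titles(path):
            print(pmcid, title)
```

If the field names turn out to be different, opening the JSON file in a text editor will show which keys to use instead.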

To retrieve the actual papers, enter:

getpapers -q ABSTRACT:"cubane and synthesis"  -o cubane -x

and all papers are collected as XML files, each in a separate folder.
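
To see which folders actually received a paper, the output directory can be scanned. A minimal sketch in Python follows; the file name fulltext.xml is an assumption about how getpapers names the downloaded files and should be adjusted to whatever appears in the folders. The function name papers_with_fulltext is hypothetical.

```python
from pathlib import Path


def papers_with_fulltext(output_dir, filename="fulltext.xml"):
    """Return the sorted names of subfolders of output_dir that contain `filename`.

    The default filename is an assumption, not something confirmed above.
    """
    root = Path(output_dir)
    if not root.is_dir():
        return []
    return sorted(p.parent.name for p in root.glob(f"*/{filename}"))


if __name__ == "__main__":
    found = papers_with_fulltext("cubane")
    print(f"{len(found)} papers with full text")
    for name in found:
        print(name)
```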

There you have it: all open-access published research on cubane synthesis, ready for data mining!
