Objective: download the Wikipedia database (Windows 10/Xampp)

The basic instructions can be found here. The basic download is a compressed XML file of about 13 GB (enwiki-20170401-pages-articles.xml.bz2). Wikimedia advises the use of a BitTorrent client. A straight download did not work at all, and the BitTyrant client, admirable as it is (your download speed democratically equals your upload speed), proved impossible to get working (complaints about FTP ports). Another GPL client, Deluge, did the job as advertised (a steady 5 MB/s).

Decompression (7-Zip) resulted in a 60 GB enwiki-20170401-pages-articles.xml file. Next up was conversion to a MySQL dump file using the mwdumper Java program:

java -jar mwdumper.jar enwiki-20170401-pages-articles.xml --format=sql:1.5 > enwiki-20170401-pages-articles.xml.sql
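As an aside, the decompression can also be done from the command line, and mwdumper is documented to read the compressed dump directly, which would skip the 60 GB intermediate file. A sketch, assuming 7-Zip's 7z.exe is on the PATH and that the documented .bz2 support behaves as advertised:

7z x enwiki-20170401-pages-articles.xml.bz2
java -jar mwdumper.jar enwiki-20170401-pages-articles.xml.bz2 --format=sql:1.5 > enwiki-20170401-pages-articles.xml.sql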

The mwdumper conversion failed after roughly 5.1 million pages, while the official number of pages is 5,395,624.

5.119.000 pages (7.018,02/sec), 5.119.000 revs (7.018,02/sec)
5.120.000 pages (7.018,631/sec), 5.120.000 revs (7.018,631/sec)
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 2048
        at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.readDump(XmlDumpReader.java:88)
        at org.mediawiki.dumper.Dumper.main(Dumper.java:142)

As for the database, the schema can be found here. The problem is that the schema does not load without errors in the latest version of Xampp. This may have to do with a mismatch between Wikipedia's MariaDB version and Xampp's MariaDB version. And by the way, why does MariaDB cloak itself as MySQL?

# mysql --version
mysql  Ver 15.1 Distrib 10.1.19-MariaDB, for Win32 (AMD64)
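The client banner above already shows the masquerade; the server reports the same thing when asked directly (assuming any account that can log in):

mysql -p -u me -e "SELECT VERSION();"

which on this installation should print something like 10.1.19-MariaDB.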

Importing the database schema eventually did work in an older Xampp installation with MySQL. As for the actual database import, Git Bash does not return any output for this command:

mysql -p -u me wiki < enwiki-20170401-pages-articles.xml.sql

This is a known issue, but the usual winpty alternative merely returns:

 winpty mysql -p -u me wiki < enwiki-20170401-pages-articles.xml.sql
stdin is not a tty
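One possible workaround, not tried here, is to skip the stdin redirection altogether and let the mysql client read the file itself via its source command:

winpty mysql -p -u me wiki -e "source enwiki-20170401-pages-articles.xml.sql"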

Alternatively, MediaWiki itself ships an import script in its maintenance directory (note that importDump.php expects the XML dump rather than an SQL file, though the command never got that far). After installing MediaWiki, this approach gave the same result:

cd /c/xampp/htdocs/mediawiki/maintenance
php importDump.php < ../../wiki/enwiki-20170401-pages-articles.xml/enwiki-20170401-pages-articles.xml.sql
stdin is not a tty
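For completeness: importDump.php also accepts the dump file as a command-line argument instead of reading stdin, which sidesteps the tty problem. A sketch, assuming the decompressed XML dump (not the SQL file) sits in the wiki directory:

php importDump.php ../../wiki/enwiki-20170401-pages-articles.xml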

With PowerShell then:

C:\xampp\htdocs\wiki> mysql -p -u me wiki < enwiki-20170401-pages-articles.xml.sql
At line:1 char:23
+ mysql -p -u me wiki < enwiki-20170401-pages-articles.xml.sql
The '<' operator is reserved for future use.

This is an unresolved PowerShell issue; the reverse direction:

mysql -p -u me wiki > wiki_db.sql

is unproblematic.
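A common workaround, untested here, is to hand the redirection to cmd.exe, or to pipe the file contents in from PowerShell itself (the latter will be painfully slow for a 20 GB file):

cmd /c "mysql -p -u me wiki < enwiki-20170401-pages-articles.xml.sql"
Get-Content .\enwiki-20170401-pages-articles.xml.sql | mysql -p -u me wiki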

As another alternative, Git CMD:

mysql -p -u me wiki < enwiki-20170401-pages-articles.xml.sql
Enter password:
ERROR 1054 (42S22) at line 93: Unknown column 'page_counter' in 'field list'

We are getting somewhere: the sql:1.5 output still contains the page_counter column, which was dropped from the MediaWiki schema in version 1.25, so it has to be added back:

ALTER TABLE `page` ADD `page_counter` INT(11) NOT NULL AFTER `page_lang`;

And retry:

mysql -p -u root wiki < enwiki-20170401-pages-articles.xml.sql
Enter password:
ERROR 2006 (HY000) at line 597: MySQL server has gone away

The page table record count is now exactly 10,000, but then the process dies. A tricky one: max_execution_time in php.ini is already set to zero for the CLI SAPI, and the max_updates value in the mysql.user table is also already zero. The only remaining suspect, the max_allowed_packet setting in the my.ini MySQL configuration file, was then raised from 4 MB to 10 MB.
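For reference, the setting lives in the [mysqld] section of my.ini; roughly like this (the exact value is a guess, bigger does no harm):

[mysqld]
max_allowed_packet = 10M

A server restart is needed for the change to take effect.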

Final result: 5,120,000 records in around two hours, at roughly 20 GB on disk. The process again exits with an error:

ERROR 1064 (42000) at line 59916: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near Category:Syriac Christianity work group articles by quali at line

so the original mwdumper error must have been a filtering problem in one of the articles.
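A quick sanity check on how far the import got, assuming the target database is still called wiki:

mysql -p -u me wiki -e "SELECT COUNT(*) FROM page;"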
