Bulk Import Wiktionary data to SQL Server 2016

Posted By: siteadmin on 31/05/2020 15:44:00

Today's task was to bulk import Wiktionary data into SQL Server 2016.

Surprisingly, there is very little evidence on the Internet of other people attempting this.

First I tried a node-by-node approach, parsing the XML with an XmlReader.

Imports System.Xml

Dim fileRoot = "D:\wiktionary"
Dim reader As XmlReader = XmlReader.Create(fileRoot & "/enwiktionary-latest-pages-articles.xml")

But this was going to take far too long - the current Wiktionary English file from https://dumps.wikimedia.org/enwiktionary/latest/ is over 6GB decompressed, and I estimate it would have taken over a month to iterate slowly through the XML, importing pages one by one. If the routine failed for any reason it would have to be restarted from scratch - hardly an optimal solution!

So the recommended method is to use OPENROWSET with the BULK option.

First, create the destination table in SQL Server. I didn't import all of the available information, but you should be able to add new columns if needed.

CREATE TABLE [dbo].[wiktionaryArticles](
    [title] [nvarchar](450) NOT NULL,
    [pageId] [int] NOT NULL,
    [text] [nvarchar](max) NOT NULL,
    [revisionId] [int] NOT NULL,
    [timestamp] [varchar](20) NOT NULL,
    [userIP] [varchar](35) NULL,
    [userUsername] [varchar](255) NULL,
    [userId] [int] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
 
Then download the enwiktionary-latest-pages-articles.xml.bz2 file from the dump folder at https://dumps.wikimedia.org/
 
Extract and save it somewhere on your local disk.
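 
If you would rather script this step than extract it by hand, a minimal sketch along these lines should do the job (it assumes the SharpZipLib NuGet package and the same D:\wiktionary folder used below - any bzip2-capable archiver such as 7-Zip works just as well):
 
Imports System.IO
Imports ICSharpCode.SharpZipLib.BZip2

Module ExtractDump

    Sub Main()
        ' Assumed working folder - adjust to wherever you saved the download
        Dim fileRoot = "D:\wiktionary"
        ' Stream-decompress the .bz2 dump out to plain XML on the local disk
        Using inStream As Stream = File.OpenRead(fileRoot & "\enwiktionary-latest-pages-articles.xml.bz2"),
              outStream As Stream = File.Create(fileRoot & "\enwiktionary-latest-pages-articles.xml")
            BZip2.Decompress(inStream, outStream, False)
        End Using
    End Sub

End Module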
 
The structure of the XML looks something like this:
 
<?xml version="1.0"?>
<mediawiki xml:lang="en" version="0.10" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo>
    <sitename>Wiktionary</sitename>
    <dbname>enwiktionary</dbname>
    <base>https://en.wiktionary.org/wiki/Wiktionary:Main_Page</base>
    <generator>MediaWiki 1.35.0-wmf.32</generator>
    <case>case-sensitive</case>
    <namespaces> ... </namespaces>
  </siteinfo>
  <page>
    <title>ordnet unter</title>
    <ns>0</ns>
    <id>7624631</id>
    <revision>
      <id>59392147</id>
      <timestamp>2020-05-20T11:55:01Z</timestamp>
      <contributor>
        <ip>2003:CF:3F15:FC91:FD:D50D:CB6C:F0CE</ip>
      </contributor>
      <comment>Created page with "==German== ===Verb=== {{head|de|verb form}} # {{de-verb form of|unterordnen|2|p|g}} # {{de-verb form of|unterordnen|2|p|k1}} # {{de-verb form of|unterordnen|3|s|g}} # {{de-v..."</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve" bytes="204">==German== ===Verb=== {{head|de|verb form}} # {{de-verb form of|unterordnen|2|p|g}} # {{de-verb form of|unterordnen|2|p|k1}} # {{de-verb form of|unterordnen|3|s|g}} # {{de-verb form of|unterordnen|i|p}}</text>
      <sha1>amm8ajs32jfrc8b7qdmk4cebv9sr2o7</sha1>
    </revision>
  </page>
</mediawiki>
 
 
The following SQL script imports the data.
 
DECLARE @input XML

-- Load the whole XML file into a single XML variable
SELECT @input = CAST(x AS XML)
FROM OPENROWSET
     (BULK 'D:\wiktionary\enwiktionary-latest-pages-articles.xml',
      SINGLE_BLOB) AS T(x)

-- Shred the XML into one row per page/revision/contributor
INSERT INTO [dbo].[wiktionaryArticles]
           ([title]
           ,[pageId]
           ,[text]
           ,[revisionId]
           ,[timestamp]
           ,[userIP]
           ,[userUsername]
           ,[userId])
    SELECT
        Pages.value('(*:title)[1]', 'NVARCHAR(450)'),
        Pages.value('(*:id)[1]', 'INT'),
        Revisions.value('(*:text)[1]', 'NVARCHAR(MAX)'),
        Revisions.value('(*:id)[1]', 'INT'),
        Revisions.value('(*:timestamp)[1]', 'VARCHAR(20)'),
        Contributors.value('(*:ip)[1]', 'VARCHAR(35)'),
        Contributors.value('(*:username)[1]', 'VARCHAR(255)'),
        Contributors.value('(*:id)[1]', 'INT')
    FROM
        @input.nodes('*:mediawiki/*:page') AS Tbl(Pages)
    CROSS APPLY
        Pages.nodes('*:revision') AS TblC(Revisions)
    CROSS APPLY
        Revisions.nodes('*:contributor') AS TblR(Contributors)

There is a 2GB limit on importing an XML file this way (the SQL Server xml data type holds at most 2GB), so I had to split the XML file into 4 parts, each under 2GB. As this was a one-off import I did the split manually with the amazing EmEditor - if it needed to become a regular occurrence then using an XmlReader first and saving out temporary files of under 2GB each might be the way forward.

Let me know if you write a file splitter to do this!
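 
For what it's worth, a rough (and untested) sketch of such a splitter might look something like the code below. The part file names, the 1.9GB threshold and the minimal <mediawiki> wrapper element are assumptions rather than anything the import above depends on:
 
Imports System.IO
Imports System.Xml

Module WiktionarySplitter

    Sub Main()
        Dim fileRoot = "D:\wiktionary"
        Dim maxBytes As Long = 1900L * 1024 * 1024   ' start a new part just under the 2GB limit
        Dim part As Integer = 0
        Dim outStream As FileStream = Nothing
        Dim writer As XmlWriter = Nothing

        Using reader As XmlReader = XmlReader.Create(fileRoot & "\enwiktionary-latest-pages-articles.xml")
            reader.MoveToContent()
            While Not reader.EOF
                If reader.NodeType = XmlNodeType.Element AndAlso reader.LocalName = "page" Then
                    ' Open the next part file if none is currently open
                    If writer Is Nothing Then
                        part += 1
                        outStream = File.Create(fileRoot & "\part" & part & ".xml")
                        writer = XmlWriter.Create(outStream)
                        ' Minimal root element, in the dump's namespace, so each part is well-formed on its own
                        writer.WriteStartElement("mediawiki", "http://www.mediawiki.org/xml/export-0.10/")
                    End If

                    ' Copy the whole <page> subtree; this also advances the reader past it
                    writer.WriteNode(reader, False)

                    ' Close the current part once it passes the size threshold
                    writer.Flush()
                    If outStream.Length > maxBytes Then
                        writer.WriteEndElement()   ' </mediawiki>
                        writer.Close()
                        outStream.Close()
                        writer = Nothing
                    End If
                Else
                    reader.Read()
                End If
            End While
        End Using

        ' Close the last part file
        If writer IsNot Nothing Then
            writer.WriteEndElement()
            writer.Close()
            outStream.Close()
        End If
    End Sub

End Module

Each part file produced this way should be a well-formed XML document in its own right, so the OPENROWSET script above can simply be run once per part.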

 

