Bulk Import Wiktionary data to SQL Server 2016

Posted By: siteadmin on 31/05/2020 15:44:00

Today's task was to bulk import Wiktionary data into SQL Server 2016.

Surprisingly, there is little evidence on the Internet of other people attempting this.

First I tried a node-by-node approach, parsing the XML with an XmlReader.

Dim fileRoot = "D:\wiktionary"
Dim reader As XmlReader = XmlReader.Create(fileRoot & "\enwiktionary-latest-pages-articles.xml")

But this was going to take far too long - the current Wiktionary English file from https://dumps.wikimedia.org/enwiktionary/latest/ is over 6 GB decompressed, and I estimated it would take over a month to iterate slowly through the XML, importing rows one by one.  If the routine failed for any reason it would have to be restarted from scratch - hardly an optimal solution!

So the recommended method is OPENROWSET with the BULK option.

First, create the destination SQL table.  I didn't import all of the available fields, but you should be able to add new columns if needed.

CREATE TABLE [dbo].[wiktionaryArticles](
    [title] [nvarchar](450) NOT NULL,
    [pageId] [int] NOT NULL,
    [text] [nvarchar](max) NOT NULL,
    [revisionId] [int] NOT NULL,
    [timestamp] [varchar](20) NOT NULL,
    [userIP] [varchar](35) NULL,
    [userUsername] [varchar](255) NULL,
    [userId] [int] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO
 
Then download the enwiktionary-latest-pages-articles.xml.bz2 file from the dump folder at https://dumps.wikimedia.org/
 
Extract and save it somewhere on your local disk.
 
The structure looks something like this:
 
<?xml version="1.0"?>
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.10/ http://www.mediawiki.org/xml/export-0.10.xsd" version="0.10" xml:lang="en">
  <siteinfo>
    <sitename>Wiktionary</sitename>
    <dbname>enwiktionary</dbname>
    <base>https://en.wiktionary.org/wiki/Wiktionary:Main_Page</base>
    <generator>MediaWiki 1.35.0-wmf.32</generator>
    <case>case-sensitive</case>
    <namespaces>...</namespaces>
  </siteinfo>
  <page>
    <title>ordnet unter</title>
    <ns>0</ns>
    <id>7624631</id>
    <revision>
      <id>59392147</id>
      <timestamp>2020-05-20T11:55:01Z</timestamp>
      <contributor>
        <ip>2003:CF:3F15:FC91:FD:D50D:CB6C:F0CE</ip>
      </contributor>
      <comment>Created page with "==German== ===Verb=== {{head|de|verb form}} # {{de-verb form of|unterordnen|2|p|g}} # {{de-verb form of|unterordnen|2|p|k1}} # {{de-verb form of|unterordnen|3|s|g}} # {{de-v..."</comment>
      <model>wikitext</model>
      <format>text/x-wiki</format>
      <text xml:space="preserve" bytes="204">==German== ===Verb=== {{head|de|verb form}} # {{de-verb form of|unterordnen|2|p|g}} # {{de-verb form of|unterordnen|2|p|k1}} # {{de-verb form of|unterordnen|3|s|g}} # {{de-verb form of|unterordnen|i|p}}</text>
      <sha1>amm8ajs32jfrc8b7qdmk4cebv9sr2o7</sha1>
    </revision>
  </page>
</mediawiki>
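For reference, the same structure can be walked with a streaming parser. This is a minimal Python sketch (not part of the import itself) showing how each page's title, id, and text sit under the export-0.10 namespace; the function name and path handling are my own:

```python
import xml.etree.ElementTree as ET

# Namespace URI from the dump's root element
NS = "{http://www.mediawiki.org/xml/export-0.10/}"

def iter_pages(xml_path):
    """Yield (title, page_id, text) for each <page>, freeing memory as we go."""
    for event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            page_id = elem.findtext(NS + "id")
            text = elem.findtext(NS + "revision/" + NS + "text")
            yield title, page_id, text
            elem.clear()  # discard the parsed subtree to keep memory flat
```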
 
 
The following SQL script imports the data.
 
DECLARE @input XML

SELECT @input = CAST(x AS XML)
FROM OPENROWSET
     (BULK 'D:\wiktionary\enwiktionary-latest-pages-articles.xml',
      SINGLE_BLOB) AS T(x)

INSERT INTO [dbo].[wiktionaryArticles]
           ([title]
           ,[pageId]
           ,[text]
           ,[revisionId]
           ,[timestamp]
           ,[userIP]
           ,[userUsername]
           ,[userId])
    SELECT
        Pages.value('(*:title)[1]', 'NVARCHAR(450)'),
        Pages.value('(*:id)[1]', 'INT'),
        Revisions.value('(*:text)[1]', 'NVARCHAR(MAX)'),
        Revisions.value('(*:id)[1]', 'INT'),
        Revisions.value('(*:timestamp)[1]', 'VARCHAR(20)'),
        Contributor.value('(*:ip)[1]', 'VARCHAR(35)'),
        Contributor.value('(*:username)[1]', 'VARCHAR(255)'),
        Contributor.value('(*:id)[1]', 'INT')
    FROM
        @input.nodes('*:mediawiki/*:page') AS Tbl(Pages)
    CROSS APPLY
        Pages.nodes('*:revision') AS TblC(Revisions)
    CROSS APPLY
        Revisions.nodes('*:contributor') AS TblR(Contributor)
 

There is a 2 GB limit on the SQL Server XML data type, so I had to split the XML file into 4 parts, each under 2 GB.  As this was a one-off import I did the split manually with the excellent EmEditor - if it needs to become a regular occurrence, then using an XmlReader first and saving out temporary files under 2 GB each might be the way forward.

Let me know if you write a file splitter to do this!
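As a starting point, here is a hedged sketch of such a splitter in Python.  It assumes (as in the real dumps) that the header ends with </siteinfo>, that pages close with </page> at the end of a line, and that wikitext content has < escaped as &lt; so the markers can't appear inside page text; the function name, part naming, and chunk size are all my own choices:

```python
def split_dump(src_path, dst_prefix, max_bytes=1_900_000_000):
    """Split a MediaWiki XML dump into parts under max_bytes,
    cutting only at </page> boundaries so each part is valid XML.
    Each part repeats the dump header and gets its own </mediawiki> close."""
    header = []           # everything up to and including </siteinfo>
    footer = "</mediawiki>\n"
    part, size, out = 1, 0, None

    def open_part(n):
        f = open(f"{dst_prefix}-{n}.xml", "w", encoding="utf-8")
        f.writelines(header)
        return f

    def header_size():
        return sum(len(h.encode("utf-8")) for h in header)

    with open(src_path, encoding="utf-8") as src:
        for line in src:
            if out is None:                       # still reading the header
                header.append(line)
                if "</siteinfo>" in line:
                    out = open_part(part)
                    size = header_size()
                continue
            if "</mediawiki>" in line:            # end of the source dump
                break
            out.write(line)
            size += len(line.encode("utf-8"))
            if size >= max_bytes and "</page>" in line:
                out.write(footer)                 # close this part cleanly
                out.close()
                part += 1
                out = open_part(part)
                size = header_size()
    out.write(footer)
    out.close()
    return part
```

Each part then loads with the OPENROWSET script above unchanged, apart from the file name.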
