Hi Alex. I added some notes below based on my experience. (I'm the developer of XOWA (http://gnosygnu.github.io/xowa/), which generates offline wikis from the Wikimedia XML dumps.) Feel free to follow up on-list or off-list if you are interested. Thanks.
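
By the way, since your import script is in Python, here is a rough sketch of the title handling I describe under #2 and #4 below (stripping the namespace prefix and replacing spaces with underscores). It only assumes the standard library's xml.etree; the function names are made up for illustration, not from any existing tool, so adapt as needed:

import xml.etree.ElementTree as ET

def load_namespaces(dump_path):
    # Build {"Template": 10, "Module": 828, ...} from the <siteinfo><namespaces>
    # node near the top of the dump (dump_path is the decompressed XML file).
    namespaces = {}
    for _, elem in ET.iterparse(dump_path):
        if elem.tag.endswith('}namespaces'):
            for ns in elem:
                if ns.text:  # the main namespace (key="0") has no name
                    namespaces[ns.text] = int(ns.get('key'))
            break
    return namespaces

def split_title(full_title, namespaces):
    # Return (page_namespace, page_title) the way the page table expects them.
    ns_id, title = 0, full_title
    if ':' in full_title:
        prefix, rest = full_title.split(':', 1)
        if prefix in namespaces:  # only strip real namespace prefixes
            ns_id, title = namespaces[prefix], rest
    return ns_id, title.replace(' ', '_')  # page_title uses underscores, not spaces

# e.g. split_title('Template:Date', namespaces) -> (10, 'Date')

If you feed those two values into the page_namespace and page_title columns of your "insert into page" statement, "Template:Date" should land as namespace 10 with title "Date" instead of as a main-namespace page called "Template:Date", which is likely why the parser can't find it right now.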
On Mon, Sep 21, 2015 at 3:09 PM, v0id null <[email protected]> wrote:

> #1: mwdumper has not been updated in a very long time. I did try to use it,
> but it did not seem to work properly. I don't entirely remember what the
> problem was but I believe it was related to schema incompatibility. xml2sql
> comes with a warning about having to rebuild links. Considering that I'm
> just in a command line and passing in page IDs manually, do I really need
> to worry about it? I'd be thrilled not to have to reinvent the wheel here.
>
> #2: Is there some way to figure it out? as I showed in a previous reply,
> the template that it can't find, is there in the page table.

As Brion indicated, you need to strip the namespace name. The XML dump also
has a "namespaces" node near the beginning, which lists every namespace in
the wiki with its name and ID. You can use a rule like "if the title starts
with a namespace name and a colon, strip it". Hence, a title like
"Template:Date" starts with "Template:" and goes into the page table with a
title of just "Date" and a namespace of 10 (the namespace ID for "Template").
The sketch near the top of this mail shows one way to do this in your script.

> #3: Those lua modules, are they stock modules included with the mediawiki
> software, or something much more custom? If the latter, are they available
> to download somewhere?

Yes, these are articles with a title starting with "Module:". They will be in
the pages-articles.xml.bz2 dump. You should make sure you have Scribunto set
up on your wiki, or else it won't use them. See:
https://www.mediawiki.org/wiki/Extension:Scribunto

> #4: I'm not any expert on mediawiki, but it seems when that the titles in
> the xml dump need to be formatted, mainly replacing spaces with
> underscores.

Yes, surprisingly, the only change you'll need to make is to replace spaces
with underscores (also handled in the sketch above). Hope this helps.

> thanks for the response
> --alex
>
> On Mon, Sep 21, 2015 at 3:00 PM, Brion Vibber <[email protected]>
> wrote:
>
> > A few notes:
> >
> > 1) It sounds like you're recreating all the logic of importing a dump
> > into a SQL database, which may be introducing problems if you have bugs
> > in your code. For instance you may be mistakenly treating namespaces as
> > text strings instead of numbers, or failing to escape things, or missing
> > something else. I would recommend using one of the many existing tools
> > for importing a dump, such as mwdumper or xml2sql:
> >
> > https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
> >
> > 2) Make sure you've got a dump that includes the templates and lua
> > modules etc. It sounds like either you don't have the Template: pages or
> > your import process does not handle namespaces correctly.
> >
> > 3) Make sure you've got all the necessary extensions to replicate the
> > wiki you're using a dump from, such as Lua. Many templates on Wikipedia
> > call Lua modules, and won't work without them.
> >
> > 4) Not sure what "not web friendly" means regarding titles?
> >
> > -- brion
> >
> >
> > On Mon, Sep 21, 2015 at 11:50 AM, v0id null <[email protected]> wrote:
> >
> > > Hello Everyone,
> > >
> > > I've been trying to write a python script that will take an XML dump,
> > > and generate all HTML, using Mediawiki itself to handle all the
> > > parsing/processing, but I've run into a problem where all the parsed
> > > output have warnings that templates couldn't be found. I'm not sure
> > > what I'm doing wrong.
> > >
> > > So I'll explain my steps:
> > >
> > > First I execute the SQL script maintenance/table.sql
> > >
> > > Then I remove some indexes from the tables to speed up insertion.
> > >
> > > Finally I go through the XML which will execute the following insert
> > > statements:
> > >
> > > 'insert into page
> > >     (page_id, page_namespace, page_title, page_is_redirect, page_is_new,
> > >     page_random, page_latest, page_len, page_content_model)
> > >     values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
> > >
> > > 'insert into text (old_id, old_text) values (%s, %s)'
> > >
> > > 'insert into recentchanges
> > >     (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor,
> > >     rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source,
> > >     rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid)
> > >     values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
> > >     %s, %s, %s)'
> > >
> > > 'insert into revision
> > >     (rev_id, rev_page, rev_text_id, rev_user, rev_user_text,
> > >     rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id,
> > >     rev_sha1)
> > >     values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
> > >
> > > All IDs from the XML dump are kept. I noticed that the titles are not
> > > web friendly. Thinking this was the problem I ran the
> > > maintenance/cleanupTitles.php script but it didn't seem to fix any thing.
> > >
> > > Doing this, I can now run the following PHP script:
> > >
> > > $id = 'some revision id'
> > > $rev = Revision::newFromId( $id );
> > > $titleObj = $rev->getTitle();
> > > $pageObj = WikiPage::factory( $titleObj );
> > >
> > > $context = RequestContext::newExtraneousContext($titleObj);
> > >
> > > $popts = ParserOptions::newFromContext($context);
> > > $pout = $pageObj->getParserOutput($popts);
> > >
> > > var_dump($pout);
> > >
> > > The mText property of $pout contains the parsed output, but it is full
> > > of stuff like this:
> > >
> > > <a href="/index.php?title=Template:Date&action=edit&redlink=1"
> > > class="new" title="Template:Date (page does not exist)">Template:Date</a>
> > >
> > > I feel like I'm missing a step here. I tried importing the templatelinks
> > > SQL dump, but it also did not fix anything. It also did not include any
> > > header or footer which would be useful.
> > >
> > > Any insight or help is much appreciated, thank you.
> > >
> > > --alex

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
