#1: mwdumper has not been updated in a very long time. I did try to use it,
but it did not seem to work properly; I don't remember exactly what the
problem was, but I believe it was related to a schema incompatibility.
xml2sql comes with a warning about having to rebuild links. Considering
that I'm just working at a command line and passing in page IDs manually,
do I really need to worry about that? I'd be thrilled not to have to
reinvent the wheel here.

#2: Is there some way to figure it out? As I showed in a previous reply,
the template that it can't find is there in the page table.
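In case it's useful, here's a sketch of how I've been checking (simulated with sqlite3 purely for illustration; the real schema is MySQL, from maintenance/tables.sql). My understanding is that MediaWiki looks templates up by namespace number (10 for Template:) with the title stored without the prefix and with underscores instead of spaces, so a row imported with the prefix baked into the title wouldn't be found:

```python
import sqlite3

# Minimal stand-in for the MediaWiki `page` table; sqlite3 in-memory
# is used here only for illustration of the lookup logic.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (page_id INTEGER, page_namespace INTEGER, page_title TEXT)"
)

# A row imported with the namespace prefix baked into the title (wrong) ...
conn.execute("INSERT INTO page VALUES (1, 0, 'Template:Date')")
# ... versus the form MediaWiki actually looks up: namespace 10 = Template,
# title without the prefix, underscores in place of spaces.
conn.execute("INSERT INTO page VALUES (2, 10, 'Date')")

def find_template(title):
    """Look up a template the way (I believe) MediaWiki does."""
    row = conn.execute(
        "SELECT page_id FROM page WHERE page_namespace = 10 AND page_title = ?",
        (title.replace(" ", "_"),),
    ).fetchone()
    return row[0] if row else None

print(find_template("Date"))           # found: 2
print(find_template("Template:Date"))  # not found: None
```

If the rows in my page table look like the first insert rather than the second, that would explain the red links.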

#3: Are those Lua modules stock modules included with the MediaWiki
software, or something much more custom? If the latter, are they available
to download somewhere?

#4: I'm not an expert on MediaWiki, but it seems that the titles in the
XML dump need to be formatted, mainly by replacing spaces with
underscores.
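To show what I mean by "formatted", here's a rough sketch of my guess at the normalization (an assumption on my part, not MediaWiki's actual implementation, which I gather lives in its Title class):

```python
def to_db_key(title):
    """Rough guess at turning a display title into a page_title DB key:
    spaces become underscores and (on default-configured wikis) the first
    letter is uppercased. The real normalization (namespace stripping,
    Unicode rules, etc.) is more involved -- see MediaWiki's Title class."""
    key = title.strip().replace(" ", "_")
    return key[:1].upper() + key[1:] if key else key

print(to_db_key("foo bar baz"))  # Foo_bar_baz
```

Is something along these lines what the import tools already take care of?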

Thanks for the response.
--alex

On Mon, Sep 21, 2015 at 3:00 PM, Brion Vibber <[email protected]> wrote:

> A few notes:
>
> 1) It sounds like you're recreating all the logic of importing a dump into
> a SQL database, which may be introducing problems if you have bugs in your
> code. For instance you may be mistakenly treating namespaces as text
> strings instead of numbers, or failing to escape things, or missing
> something else. I would recommend using one of the many existing tools for
> importing a dump, such as mwdumper or xml2sql:
>
> https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
>
> 2) Make sure you've got a dump that includes the templates and lua modules
> etc. It sounds like either you don't have the Template: pages or your
> import process does not handle namespaces correctly.
>
> 3) Make sure you've got all the necessary extensions to replicate the wiki
> you're using a dump from, such as Lua. Many templates on Wikipedia call Lua
> modules, and won't work without them.
>
> 4) Not sure what "not web friendly" means regarding titles?
>
> -- brion
>
>
> On Mon, Sep 21, 2015 at 11:50 AM, v0id null <[email protected]> wrote:
>
> > Hello Everyone,
> >
> > I've been trying to write a python script that will take an XML dump and
> > generate all HTML, using MediaWiki itself to handle all the
> > parsing/processing, but I've run into a problem where all the parsed
> > output has warnings that templates couldn't be found. I'm not sure what
> > I'm doing wrong.
> >
> > So I'll explain my steps:
> >
> > First I execute the SQL script maintenance/table.sql
> >
> > Then I remove some indexes from the tables to speed up insertion.
> >
> > Finally I go through the XML which will execute the following insert
> > statements:
> >
> >  'insert into page
> >     (page_id, page_namespace, page_title, page_is_redirect, page_is_new,
> >      page_random, page_latest, page_len, page_content_model)
> >     values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
> >
> > 'insert into text (old_id, old_text) values (%s, %s)'
> >
> > 'insert into recentchanges (rc_id, rc_timestamp, rc_user, rc_user_text,
> >    rc_title, rc_minor, rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid,
> >    rc_type, rc_source, rc_patrolled, rc_ip, rc_old_len, rc_new_len,
> >    rc_deleted, rc_logid)
> >    values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
> >    %s, %s, %s)'
> >
> > 'insert into revision
> >     (rev_id, rev_page, rev_text_id, rev_user, rev_user_text,
> >      rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id,
> >      rev_sha1)
> >     values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
> >
> > All IDs from the XML dump are kept. I noticed that the titles are not web
> > friendly. Thinking this was the problem, I ran the
> > maintenance/cleanupTitles.php script, but it didn't seem to fix anything.
> >
> > Doing this, I can now run the following PHP script:
> >     $id = 'some revision id';
> >     $rev = Revision::newFromId( $id );
> >     $titleObj = $rev->getTitle();
> >     $pageObj = WikiPage::factory( $titleObj );
> >
> >     $context = RequestContext::newExtraneousContext($titleObj);
> >
> >     $popts = ParserOptions::newFromContext($context);
> >     $pout = $pageObj->getParserOutput($popts);
> >
> >     var_dump($pout);
> >
> > The mText property of $pout contains the parsed output, but it is full of
> > stuff like this:
> >
> > <a href="/index.php?title=Template:Date&action=edit&redlink=1"
> >    class="new" title="Template:Date (page does not exist)">Template:Date</a>
> >
> >
> > I feel like I'm missing a step here. I tried importing the templatelinks
> > SQL dump, but it also did not fix anything. It also did not include any
> > header or footer which would be useful.
> >
> > Any insight or help is much appreciated, thank you.
> >
> > --alex
> > _______________________________________________
> > Wikitech-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l