Note that Kiwix's "MWoffliner" script (
http://www.openzim.org/wiki/Build_your_ZIM_file#MWoffliner ) does a
pretty good job of converting a bunch of wiki pages to HTML, although
it starts from a live wiki instance (and a properly configured Parsoid
pointed at it) rather than from an XML dump.  ZIM-format dumps (for
example, from https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/ )
can also be unpacked into a directory tree of HTML files.
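For example, with the zimdump tool from zim-tools (a sketch; the exact
flag syntax varies between zim-tools versions, and the filename here is
just a placeholder):

    zimdump dump --dir=html_tree wikipedia_en_all.zim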

There are also the "HTML dumps" that the service team is involved
with.  The following links have more information:
https://phabricator.wikimedia.org/T88728
https://phabricator.wikimedia.org/T93396

Perhaps your use case could inform the ongoing design of that service.
 --scott

On Mon, Sep 21, 2015 at 3:12 PM, Brion Vibber <[email protected]> wrote:
> On Mon, Sep 21, 2015 at 12:09 PM, v0id null <[email protected]> wrote:
>
>> #1: mwdumper has not been updated in a very long time. I did try to use
>> it, but it did not seem to work properly. I don't entirely remember what
>> the problem was, but I believe it was related to schema incompatibility.
>> xml2sql comes with a warning about having to rebuild links. Considering
>> that I'm just working from the command line and passing in page IDs
>> manually, do I really need to worry about it? I'd be thrilled not to
>> have to reinvent the wheel here.
>>
>
> With either mwdumper or xml2sql, you would need to rebuild the link
> tables if you need them. For your case, it doesn't sound like you do.
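If you later find that you do need them, MediaWiki ships maintenance
scripts that can regenerate the link tables from the imported pages:

    php maintenance/refreshLinks.php   # rebuild just the link tables
    php maintenance/rebuildall.php     # links plus search index; slower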
>
>
>> #2: Is there some way to figure it out? As I showed in a previous
>> reply, the template that it can't find is there in the page table.
>>
>
> As noted in a previous reply, your import process is buggy and the page
> record's page_title field is incorrect, so the template cannot be found.
> You need to parse the incoming title into its namespace and base-title
> portions and store them correctly: the numeric namespace ID in
> page_namespace and the title text in page_title.
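A minimal sketch of that split (the namespace names and IDs below are a
hand-picked subset of MediaWiki's defaults; the authoritative mapping is
in the <siteinfo>/<namespaces> block at the top of the XML dump):

    # Sketch: split a dump <title> into (page_namespace, page_title).
    DEFAULT_NAMESPACES = {
        "Talk": 1, "User": 2, "User talk": 3,
        "Template": 10, "Template talk": 11,
        "Module": 828, "Module talk": 829,
    }

    def parse_title(full_title):
        ns_id, text = 0, full_title  # main namespace by default
        if ":" in full_title:
            prefix, rest = full_title.split(":", 1)
            if prefix in DEFAULT_NAMESPACES:
                ns_id, text = DEFAULT_NAMESPACES[prefix], rest
        # page_title is stored with underscores, not spaces
        return ns_id, text.replace(" ", "_")

    assert parse_title("Template:Infobox person") == (10, "Infobox_person")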
>
>
>
>> #3: Those Lua modules: are they stock modules included with the
>> MediaWiki software, or something much more custom? If the latter, are
>> they available to download somewhere?
>>
>
> They are on the wiki, in the 'Module' namespace, and should be included
> with a complete dump. I have no idea about the 'articles' dump, but I
> would assume it *should* include them.
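A quick way to check whether a particular dump includes them (a sketch;
the filename is a placeholder):

    # Scan an XML dump for pages in the Module: namespace.
    import re
    with open("enwiki-latest-pages-articles.xml", encoding="utf-8") as f:
        for line in f:
            m = re.search(r"<title>Module:([^<]+)</title>", line)
            if m:
                print(m.group(1))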
>
>
>>
>> #4: I'm not an expert on MediaWiki, but it seems that the titles in
>> the xml dump need to be formatted, mainly replacing spaces with
>> underscores.
>>
>
> That's another thing your import process needs to do. I recommend using
> existing code that already has all this logic. :)
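For reference, the basic normalization looks something like this (a
sketch covering only the common cases; real MediaWiki title handling
also deals with namespace aliases, Unicode normalization, and more):

    def dbkey(title):
        # page_title is stored with underscores, and on most wikis
        # ($wgCapitalLinks, the default) an uppercase first letter.
        t = title.strip().replace(" ", "_")
        return t[:1].upper() + t[1:]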
>
> -- brion



-- 
(http://cscott.net)
