Note that Kiwix's "mw-offliner" script ( http://www.openzim.org/wiki/Build_your_ZIM_file#MWoffliner ) does a pretty good job of converting a bunch of wiki pages to HTML, although it starts from a live wiki instance (and a properly-configured Parsoid pointed at it) rather than an XML dump. Zim-format dumps (for example, from https://dumps.wikimedia.org/other/kiwix/zim/wikipedia/ ) can also be unpacked into a directory tree of HTML files.
There are also the "HTML dumps" that the services team is involved with. The following links have more information:
https://phabricator.wikimedia.org/T88728
https://phabricator.wikimedia.org/T93396
Perhaps your use case could inform the ongoing design of that service.
 --scott

On Mon, Sep 21, 2015 at 3:12 PM, Brion Vibber <[email protected]> wrote:

> On Mon, Sep 21, 2015 at 12:09 PM, v0id null <[email protected]> wrote:
>
>> #1: mwdumper has not been updated in a very long time. I did try to use
>> it, but it did not seem to work properly. I don't entirely remember what
>> the problem was, but I believe it was related to a schema
>> incompatibility. xml2sql comes with a warning about having to rebuild
>> links. Considering that I'm just in a command line and passing in page
>> IDs manually, do I really need to worry about it? I'd be thrilled not to
>> have to reinvent the wheel here.
>
> You would need to rebuild the link tables for either mwdumper or
> xml2sql if you need them. For your case it doesn't sound like you'd
> need them.
>
>> #2: Is there some way to figure it out? As I showed in a previous reply,
>> the template that it can't find is there in the page table.
>
> As noted in a previous reply, your import process is buggy and the page
> record's page_title field is incorrect, so the template cannot be found.
> You need to correctly parse the incoming title into its namespace and
> base title portions, and store them correctly as the page_namespace
> numeric ID and the page_title text portion.
>
>> #3: Those Lua modules, are they stock modules included with the
>> MediaWiki software, or something much more custom? If the latter, are
>> they available to download somewhere?
>
> They are on the wiki, in the 'Module' namespace, so they should be
> included with a complete dump. I have no idea about the 'articles' dump,
> but I would assume it *should* include them.
>
>> #4: I'm not an expert on MediaWiki, but it seems that the titles in the
>> XML dump need to be reformatted, mainly by replacing spaces with
>> underscores.
>
> That's another thing your import process needs to do. I recommend using
> existing code that already has all this logic. :)
>
> -- brion
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

--
(http://cscott.net)
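[The title handling Brion describes above can be sketched roughly as follows. This is a minimal illustration, not MediaWiki's actual code: the namespace-name-to-ID table below hardcodes a few MediaWiki default IDs for the example, whereas a robust importer should read the namespace map from the dump's <siteinfo> section.]

```python
# Split a <title> from the XML dump into the (page_namespace, page_title)
# pair MediaWiki stores: numeric namespace ID, plus the "DB key" title
# form with the namespace prefix stripped and spaces as underscores.

# Default MediaWiki namespace IDs, hardcoded here only for illustration.
CANONICAL_NAMESPACES = {
    "Talk": 1,
    "User": 2,
    "User talk": 3,
    "Template": 10,
    "Template talk": 11,
    "Category": 14,
    "Module": 828,
}

def split_title(full_title):
    """Return (page_namespace, page_title) for a title from the dump."""
    ns_id, rest = 0, full_title  # namespace 0 is the main article space
    if ":" in full_title:
        prefix, remainder = full_title.split(":", 1)
        if prefix in CANONICAL_NAMESPACES:
            ns_id, rest = CANONICAL_NAMESPACES[prefix], remainder
    # DB-key form: underscores for spaces, first letter uppercased
    # (the default $wgCapitalLinks behaviour).
    key = rest.replace(" ", "_")
    return ns_id, key[:1].upper() + key[1:]

print(split_title("Template:Infobox person"))  # (10, 'Infobox_person')
print(split_title("Module:Citation/CS1"))      # (828, 'Citation/CS1')
print(split_title("Main Page"))                # (0, 'Main_Page')
```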
