You might consider pointing a Parsoid instance at your "simple PHP
server".  Using the Parsoid-format HTML DOM has several benefits over
using the output of the PHP parser directly.  Categories are much
easier to extract, for instance.
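
As a quick illustration, Parsoid emits each category as a <link
rel="mw:PageProp/Category"> element in the page HTML, so a few lines of
Python suffice to pull them out. A rough sketch (the localhost URL assumes
a locally running Parsoid instance and will vary with your setup):

    # Extract category names from a Parsoid-rendered page.  Parsoid marks
    # each category as:
    #   <link rel="mw:PageProp/Category" href="./Category:Some_name">
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class CategoryExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.categories = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == 'link' and a.get('rel') == 'mw:PageProp/Category':
                # href looks like "./Category:Some_name"
                self.categories.append(a.get('href', '').split(':', 1)[-1])

    html = urlopen('http://localhost:8000/localhost/v3/page/html/Main_Page')
    extractor = CategoryExtractor()
    extractor.feed(html.read().decode('utf-8'))
    print(extractor.categories)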

See 
https://commons.wikimedia.org/wiki/File%3ADoing_Cool_Things_with_Wiki_Content_(Parsoid_Power!).pdf
(recording at https://youtu.be/3WJID_WC7BQ) and
https://doc.wikimedia.org/Parsoid/master/#!/guide/jsapi for some more
hints on running queries over the Parsoid DOM.
 --scott

On Wed, Sep 23, 2015 at 2:25 PM, v0id null <[email protected]> wrote:
> Thanks for the input, everyone. I was not aware that importing the XML
> dumps was so involved.
>
> In the end I used xml2sql, but it required two patches and a bit more
> work on my end to get it working. I also had to strip out the
> <DiscussionThreading> tag from the xml dump. Nevertheless, it is very
> fast.
>
> For those wondering, I'm toying around with an automated news categorizer
> and wanted to use Wikinews as a corpus. Not perfect, but this is just
> hobbyist-level stuff here. I'm using nltk, so I wanted to keep things
> python-centric, but I've written a PHP script that runs as a simple tcp
> server that my python script can connect to and ask for the HTML output.
> My python script first downloads mediawiki and the right xml dump, unzips
> everything, sets up LocalSettings.php, compiles xml2sql, runs it, then
> imports the sql into the database. So it essentially automates making an
> offline installation of (I assume) any mediawiki xml dump. Then it starts
> that simple PHP server (using plain sockets) and just sends it page IDs;
> the server responds with the fully rendered HTML, including headers and
> footers.
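>
> To give a flavor of the client side, here is a minimal sketch of the
> python end of that loop (the port and the newline-delimited framing are
> made up; substitute whatever protocol your PHP server actually speaks):
>
>     import socket
>
>     def render_page(page_id, host='127.0.0.1', port=8765):
>         # Send a page ID; read HTML until the server closes the socket.
>         with socket.create_connection((host, port)) as sock:
>             sock.sendall(('%d\n' % page_id).encode('utf-8'))
>             chunks = []
>             while True:
>                 data = sock.recv(65536)
>                 if not data:
>                     break
>                 chunks.append(data)
>         return b''.join(chunks).decode('utf-8')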
>
> I figure that with this approach, I can run a few forks on the python and
> php sides to speed up the process.
>
> Then I use python to parse through the HTML to get whatever I need from
> the page, which for now is the categories and the article content; I can
> then use those to train classifiers from nltk.
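>
> As a rough sketch of that last step (labeled_articles below is a
> hypothetical list of (article_text, category) pairs already scraped from
> the HTML), the nltk side can be as simple as:
>
>     import nltk  # may need nltk.download('punkt') once for the tokenizer
>
>     def features(text):
>         # Crude bag-of-words features; good enough for a first pass.
>         return {word: True for word in nltk.word_tokenize(text.lower())}
>
>     # labeled_articles: hypothetical [(article_text, category), ...]
>     train_set = [(features(text), cat) for text, cat in labeled_articles]
>     classifier = nltk.NaiveBayesClassifier.train(train_set)
>     print(classifier.classify(features('some unseen article text')))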
>
> Maybe not the easiest approach, but it does make things easy to use. I've
> looked at the python parsers, but none of them seems like it will be as
> successful or as correct as using Mediawiki itself.
>
> ---alex
>
> On Tue, Sep 22, 2015 at 11:09 PM, gnosygnu <[email protected]> wrote:
>
>> Hi alex. I added some notes below based on my experience. (I'm the
>> developer of XOWA (http://gnosygnu.github.io/xowa/), which generates
>> offline wikis from the Wikimedia XML dumps.) Feel free to follow up
>> on-list or off-list if you are interested. Thanks.
>>
>> On Mon, Sep 21, 2015 at 3:09 PM, v0id null <[email protected]> wrote:
>>
>> > #1: mwdumper has not been updated in a very long time. I did try to
>> > use it, but it did not seem to work properly. I don't entirely remember
>> > what the problem was, but I believe it was related to schema
>> > incompatibility. xml2sql comes with a warning about having to rebuild
>> > links. Considering that I'm just at a command line and passing in page
>> > IDs manually, do I really need to worry about it? I'd be thrilled not
>> > to have to reinvent the wheel here.
>> >
>>
>>
>> > #2: Is there some way to figure it out? as I showed in a previous reply,
>> > the template that it can't find, is there in the page table.
>> >
>> As brion indicated, you need to strip the namespace name. The XML dump
>> also has a "namespaces" node near the beginning. It lists every namespace
>> in the wiki with "name" and "ID". You can use a rule like "if the title
>> starts with a namespace and a colon, strip it". Hence, a title like
>> "Template:Date" starts with "Template:" and goes into the page table with a
>> title of just "Date" and a namespace of "10" (the namespace id for
>> "Template").
>>
>>
>> > #3: Those lua modules, are they stock modules included with the mediawiki
>> > software, or something much more custom? If the latter, are they
>> > available to download somewhere?
>> >
>> Yes, these are articles with a title starting with "Module:". They will be
>> in the pages-articles.xml.bz2 dump. You should make sure you have Scribunto
>> set up on your wiki, or else it won't use them. See:
>> https://www.mediawiki.org/wiki/Extension:Scribunto
>>
>>
>> > #4: I'm not any expert on mediawiki, but it seems that the titles in
>> > the xml dump need to be formatted, mainly replacing spaces with
>> > underscores.
>> >
>> Yes, surprisingly, the only change you'll need to make is to replace
>> spaces with underscores.
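>>
>> Combined with the split_title sketch above, that's just:
>>
>>     ns_id, text = split_title(title)
>>     page_title = text.replace(' ', '_')  # "Some page" -> "Some_page"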
>>
>> Hope this helps.
>>
>>
>> > thanks for the response
>> > --alex
>> >
>> > On Mon, Sep 21, 2015 at 3:00 PM, Brion Vibber <[email protected]>
>> > wrote:
>> >
>> > > A few notes:
>> > >
>> > > 1) It sounds like you're recreating all the logic of importing a dump
>> > > into a SQL database, which may be introducing problems if you have bugs
>> > > in your code. For instance you may be mistakenly treating namespaces as
>> > > text strings instead of numbers, or failing to escape things, or
>> > > missing something else. I would recommend using one of the many
>> > > existing tools for importing a dump, such as mwdumper or xml2sql:
>> > >
>> > > https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
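>> > >
>> > > For reference, the usual mwdumper invocation pipes straight into
>> > > mysql, along these lines (the jar name, user, and database name are
>> > > placeholders):
>> > >
>> > >     java -jar mwdumper.jar --format=sql:1.5 pages-articles.xml.bz2 \
>> > >         | mysql -u wikiuser -p wikidb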
>> > >
>> > > 2) Make sure you've got a dump that includes the templates and lua
>> > > modules etc. It sounds like either you don't have the Template: pages
>> > > or your import process does not handle namespaces correctly.
>> > >
>> > > 3) Make sure you've got all the necessary extensions to replicate the
>> > > wiki you're using a dump from, such as Lua. Many templates on Wikipedia
>> > > call Lua modules, and won't work without them.
>> > >
>> > > 4) Not sure what "not web friendly" means regarding titles?
>> > >
>> > > -- brion
>> > >
>> > >
>> > > On Mon, Sep 21, 2015 at 11:50 AM, v0id null <[email protected]> wrote:
>> > >
>> > > > Hello Everyone,
>> > > >
>> > > > I've been trying to write a python script that will take an XML
>> > > > dump and generate all HTML, using Mediawiki itself to handle all the
>> > > > parsing/processing, but I've run into a problem where all of the
>> > > > parsed output has warnings that templates couldn't be found. I'm not
>> > > > sure what I'm doing wrong.
>> > > >
>> > > > So I'll explain my steps:
>> > > >
>> > > > First I execute the SQL script maintenance/tables.sql
>> > > >
>> > > > Then I remove some indexes from the tables to speed up insertion.
>> > > >
>> > > > Finally I go through the XML, executing the following insert
>> > > > statements:
>> > > >
>> > > > 'insert into page
>> > > >    (page_id, page_namespace, page_title, page_is_redirect, page_is_new,
>> > > >     page_random, page_latest, page_len, page_content_model)
>> > > >    values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
>> > > >
>> > > > 'insert into text (old_id, old_text) values (%s, %s)'
>> > > >
>> > > > 'insert into recentchanges
>> > > >    (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor,
>> > > >     rc_bot, rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source,
>> > > >     rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid)
>> > > >    values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
>> > > >            %s, %s, %s)'
>> > > >
>> > > > 'insert into revision
>> > > >    (rev_id, rev_page, rev_text_id, rev_user, rev_user_text,
>> > > >     rev_timestamp, rev_minor_edit, rev_deleted, rev_len, rev_parent_id,
>> > > >     rev_sha1)
>> > > >    values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
>> > > >
>> > > > All IDs from the XML dump are kept. I noticed that the titles are
>> > > > not web friendly. Thinking this was the problem, I ran the
>> > > > maintenance/cleanupTitles.php script, but it didn't seem to fix
>> > > > anything.
>> > > >
>> > > > Doing this, I can now run the following PHP script:
>> > > >     $id = 'some revision id';
>> > > >     $rev = Revision::newFromId( $id );
>> > > >     $titleObj = $rev->getTitle();
>> > > >     $pageObj = WikiPage::factory( $titleObj );
>> > > >
>> > > >     $context = RequestContext::newExtraneousContext( $titleObj );
>> > > >
>> > > >     $popts = ParserOptions::newFromContext( $context );
>> > > >     $pout = $pageObj->getParserOutput( $popts );
>> > > >
>> > > >     var_dump( $pout );
>> > > >
>> > > > The mText property of $pout contains the parsed output, but it is
>> > > > full of stuff like this:
>> > > >
>> > > > <a href="/index.php?title=Template:Date&action=edit&redlink=1"
>> > > > class="new" title="Template:Date (page does not exist)">Template:Date</a>
>> > > >
>> > > >
>> > > > I feel like I'm missing a step here. I tried importing the
>> > > > templatelinks SQL dump, but it also did not fix anything. The output
>> > > > also did not include any header or footer, which would be useful.
>> > > >
>> > > > Any insight or help is much appreciated, thank you.
>> > > >
>> > > > --alex



-- 
(http://cscott.net)
