Looking at https://www.mediawiki.org/wiki/Parsoid/Setup, it seems that I need
a web server set up for MediaWiki, plus Node.js, and I'd have to go through
the Parsoid API, which I guess goes through MediaWiki anyhow.
Right now I use XPath to find everything I need. Getting categories, for
example, is as simple as:
// $dom is a DOMDocument loaded from the rendered page HTML
$xpath = new DOMXPath($dom);
$contents = $xpath->query("//div[@id='mw-normal-catlinks']//li/a");
$categories = [];
foreach ($contents as $el) {
    $categories[] = $el->textContent;
}
Is there information that Parsoid makes available that isn't available from
the MediaWiki output directly?
thanks,
-alex
On Wed, Sep 23, 2015 at 2:49 PM, C. Scott Ananian <[email protected]>
wrote:
> You might consider pointing a Parsoid instance at your "simple PHP
> server". Using the Parsoid-format HTML DOM has several benefits over
> using the output of the PHP parser directly. Categories are much
> easier to extract, for instance.
>
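> Something along these lines should work against the Parsoid HTML (an
> untested sketch; it assumes the Parsoid body marks category links with
> <link rel="mw:PageProp/Category"> elements, and $parsoidHtml is whatever
> HTML string you fetched from Parsoid):
>
> $dom = new DOMDocument();
> // Suppress libxml warnings about HTML5 elements it doesn't know.
> @$dom->loadHTML($parsoidHtml);
> $xpath = new DOMXPath($dom);
> // Category links look like <link rel="mw:PageProp/Category" href="./Category:Foo"/>.
> $links = $xpath->query("//link[@rel='mw:PageProp/Category']");
> $categories = [];
> foreach ($links as $link) {
>     $href = $link->getAttribute('href');
>     $categories[] = urldecode(preg_replace('#^\./Category:#', '', $href));
> }
>
> The rel value is part of the Parsoid DOM spec, so it's a more stable thing
> to target than the skin's catlinks markup.
>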
> See
> https://commons.wikimedia.org/wiki/File%3ADoing_Cool_Things_with_Wiki_Content_(Parsoid_Power!).pdf
> (recording at https://youtu.be/3WJID_WC7BQ) and
> https://doc.wikimedia.org/Parsoid/master/#!/guide/jsapi for some more
> hints on running queries over the Parsoid DOM.
> --scott
>
> On Wed, Sep 23, 2015 at 2:25 PM, v0id null <[email protected]> wrote:
> > Thanks for the input everyone. I was not aware that importing the XML dumps
> > was so involved.
> >
> > In the end I used xml2sql, but it required two patches, and a bit more work
> > on my end, to get it to work. I also had to strip out the
> > <DiscussionThreading> tag from the xml dump. But nevertheless it is very
> > fast.
> >
> > For those wondering, I'm toying around with an automated news categorizer
> > and wanted to use Wikinews as a corpus. Not perfect, but this is just
> > hobbyist-level stuff here. I'm using nltk so I wanted to keep things
> > python-centric, but I've written up a PHP script that runs as a simple TCP
> > server that my python script can connect to and ask for the HTML output. My
> > python script first downloads MediaWiki, the right xml dump, unzips
> > everything, sets up LocalSettings.php, compiles xml2sql, runs it, then
> > imports the SQL into the database. So it essentially automates making an
> > offline installation of what I assume is any mediawiki xml dump. Then it
> > starts that simple PHP server (using plain sockets), and just sends it page
> > IDs and it responds with the fully rendered HTML including headers and
> > footers.
> >
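> > (The server side is tiny; very roughly it's just a loop like this, where
> > render_page() stands in for the Revision/WikiPage glue quoted further down
> > the thread, and the port is arbitrary:)
> >
> > $sock = socket_create( AF_INET, SOCK_STREAM, SOL_TCP );
> > socket_bind( $sock, '127.0.0.1', 8765 );
> > socket_listen( $sock );
> > while ( $conn = socket_accept( $sock ) ) {
> >     $pageId = (int) trim( socket_read( $conn, 64 ) ); // client sends one page ID
> >     $html = render_page( $pageId );                   // hypothetical MediaWiki glue
> >     socket_write( $conn, $html, strlen( $html ) );    // reply with rendered HTML
> >     socket_close( $conn );
> > }
> >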
> > I figure with this approach, I can run a few forks on the python and php side
> > to speed up the process.
> >
> > Then I use python to parse through the HTML to get whatever I need from the
> > page, which for now are the categories and the article content, which I can
> > then use to train classifiers from nltk.
> >
> > maybe not the easiest approach, but it does make it easy to use. I've
> > looked at the python parsers but none of them seem like they will be as
> > successful or as correct as using Mediawiki itself.
> >
> > ---alex
> >
> > On Tue, Sep 22, 2015 at 11:09 PM, gnosygnu <[email protected]> wrote:
> >
> >> Hi alex. I added some notes below based on my experience. (I'm the
> >> developer for XOWA (http://gnosygnu.github.io/xowa/), which generates
> >> offline wikis from the Wikimedia XML dumps.) Feel free to follow up on-list
> >> or off-list if you are interested. Thanks.
> >>
> >> On Mon, Sep 21, 2015 at 3:09 PM, v0id null <[email protected]> wrote:
> >>
> >> > #1: mwdumper has not been updated in a very long time. I did try to use it,
> >> > but it did not seem to work properly. I don't entirely remember what the
> >> > problem was but I believe it was related to schema incompatibility. xml2sql
> >> > comes with a warning about having to rebuild links. Considering that I'm
> >> > just in a command line and passing in page IDs manually, do I really need
> >> > to worry about it? I'd be thrilled not to have to reinvent the wheel here.
> >> >
> >>
> >>
> >> > #2: Is there some way to figure it out? As I showed in a previous reply,
> >> > the template that it can't find is there in the page table.
> >> >
> >> As brion indicated, you need to strip the namespace name. The XML dump
> >> also has a "namespaces" node near the beginning. It lists every namespace
> >> in the wiki with "name" and "ID". You can use a rule like "if the title
> >> starts with a namespace and a colon, strip it". Hence, a title like
> >> "Template:Date" starts with "Template:" and goes into the page table with a
> >> title of just "Date" and a namespace of "10" (the namespace id for
> >> "Template").
> >>
> >>
> >> > #3: Those lua modules, are they stock modules included with the mediawiki
> >> > software, or something much more custom? If the latter, are they available
> >> > to download somewhere?
> >> >
> >> Yes, these are articles with a title starting with "Module:". They will be
> >> in the pages-articles.xml.bz2 dump. You should make sure you have Scribunto
> >> set up on your wiki, or else it won't use them. See:
> >> https://www.mediawiki.org/wiki/Extension:Scribunto
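> >>
> >> (Setting it up is basically dropping the extension under extensions/ and
> >> adding something like this to LocalSettings.php; older MediaWiki/Scribunto
> >> versions use require_once "$IP/extensions/Scribunto/Scribunto.php" instead
> >> of wfLoadExtension:)
> >>
> >> wfLoadExtension( 'Scribunto' );
> >> $wgScribuntoDefaultEngine = 'luastandalone';
> >>
> >> The 'luastandalone' engine runs Lua as a separate process; there's also a
> >> 'luasandbox' engine if you have that PHP extension installed.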
> >>
> >>
> >> > #4: I'm not any expert on mediawiki, but it seems that the titles in
> >> > the xml dump need to be formatted, mainly replacing spaces with
> >> > underscores.
> >> >
> >> Yes, surprisingly, the only change you'll need to make is to replace
> >> spaces with underscores.
> >>
> >> Hope this helps.
> >>
> >>
> >> > thanks for the response
> >> > --alex
> >> >
> >> > On Mon, Sep 21, 2015 at 3:00 PM, Brion Vibber <[email protected]>
> >> > wrote:
> >> >
> >> > > A few notes:
> >> > >
> >> > > 1) It sounds like you're recreating all the logic of importing a dump into
> >> > > a SQL database, which may be introducing problems if you have bugs in your
> >> > > code. For instance you may be mistakenly treating namespaces as text
> >> > > strings instead of numbers, or failing to escape things, or missing
> >> > > something else. I would recommend using one of the many existing tools for
> >> > > importing a dump, such as mwdumper or xml2sql:
> >> > >
> >> > > https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps#Using_mwdumper
> >> > >
> >> > > 2) Make sure you've got a dump that includes the templates and lua modules
> >> > > etc. It sounds like either you don't have the Template: pages or your
> >> > > import process does not handle namespaces correctly.
> >> > >
> >> > > 3) Make sure you've got all the necessary extensions to replicate the wiki
> >> > > you're using a dump from, such as Lua. Many templates on Wikipedia call Lua
> >> > > modules, and won't work without them.
> >> > >
> >> > > 4) Not sure what "not web friendly" means regarding titles?
> >> > >
> >> > > -- brion
> >> > >
> >> > >
> >> > > On Mon, Sep 21, 2015 at 11:50 AM, v0id null <[email protected]> wrote:
> >> > >
> >> > > > Hello Everyone,
> >> > > >
> >> > > > I've been trying to write a python script that will take an XML dump, and
> >> > > > generate all HTML, using Mediawiki itself to handle all the
> >> > > > parsing/processing, but I've run into a problem where all the parsed output
> >> > > > has warnings that templates couldn't be found. I'm not sure what I'm doing
> >> > > > wrong.
> >> > > >
> >> > > > So I'll explain my steps:
> >> > > >
> >> > > > First I execute the SQL script maintenance/tables.sql
> >> > > >
> >> > > > Then I remove some indexes from the tables to speed up insertion.
> >> > > >
> >> > > > Finally I go through the XML which will execute the following insert
> >> > > > statements:
> >> > > >
> >> > > > 'insert into page
> >> > > >     (page_id, page_namespace, page_title, page_is_redirect, page_is_new,
> >> > > >      page_random, page_latest, page_len, page_content_model)
> >> > > >     values (%s, %s, %s, %s, %s, %s, %s, %s, %s)'
> >> > > >
> >> > > > 'insert into text (old_id, old_text) values (%s, %s)'
> >> > > >
> >> > > > 'insert into recentchanges
> >> > > >     (rc_id, rc_timestamp, rc_user, rc_user_text, rc_title, rc_minor, rc_bot,
> >> > > >      rc_cur_id, rc_this_oldid, rc_last_oldid, rc_type, rc_source,
> >> > > >      rc_patrolled, rc_ip, rc_old_len, rc_new_len, rc_deleted, rc_logid)
> >> > > >     values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s,
> >> > > >             %s, %s)'
> >> > > >
> >> > > > 'insert into revision
> >> > > >     (rev_id, rev_page, rev_text_id, rev_user, rev_user_text, rev_timestamp,
> >> > > >      rev_minor_edit, rev_deleted, rev_len, rev_parent_id, rev_sha1)
> >> > > >     values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)'
> >> > > >
> >> > > > All IDs from the XML dump are kept. I noticed that the titles are not web
> >> > > > friendly. Thinking this was the problem, I ran the
> >> > > > maintenance/cleanupTitles.php script but it didn't seem to fix anything.
> >> > > >
> >> > > > Doing this, I can now run the following PHP script:
> >> > > > $id = 'some revision id';
> >> > > > $rev = Revision::newFromId( $id );
> >> > > > $titleObj = $rev->getTitle();
> >> > > > $pageObj = WikiPage::factory( $titleObj );
> >> > > >
> >> > > > $context = RequestContext::newExtraneousContext($titleObj);
> >> > > >
> >> > > > $popts = ParserOptions::newFromContext($context);
> >> > > > $pout = $pageObj->getParserOutput($popts);
> >> > > >
> >> > > > var_dump($pout);
> >> > > >
> >> > > > The mText property of $pout contains the parsed output, but it is full of
> >> > > > stuff like this:
> >> > > >
> >> > > > <a href="/index.php?title=Template:Date&action=edit&redlink=1" class="new"
> >> > > > title="Template:Date (page does not exist)">Template:Date</a>
> >> > > >
> >> > > >
> >> > > > I feel like I'm missing a step here. I tried importing the templatelinks
> >> > > > SQL dump, but it also did not fix anything. It also did not include any
> >> > > > header or footer which would be useful.
> >> > > >
> >> > > > Any insight or help is much appreciated, thank you.
> >> > > >
> >> > > > --alex
> >> > >
> >> >
> >>
>
>
>
> --
> (http://cscott.net)
>
>
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l