Danny B. wrote:
> I'm looking for any kind of tool which would take the XML dump (most probably
> the pages-meta-current.xml.bz2, at least the pages-articles.xml.bz2) and
> would return the list of page titles (or alternatively/configurably page ids)
> of pages containing given string.
>
> Does anybody have such (kind of) tool and is willing to share? Both command
> line or webpage interface are OK.

If you're only interested in page titles, why not just download
all-titles-in-ns0.gz and grep it?

Alternatively, if you want titles in other namespaces too, I have a small
Perl script I once wrote that can extract such a list from the page.sql.gz
dump -- I can clean it up and put it online somewhere if you're interested.

--
Ilmari Karonen
_______________________________________________
Toolserver-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/toolserver-l
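
For anyone without grep at hand, the first suggestion above (searching the titles list) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `grep_titles` and its parameters are made up for this example, the file is assumed to be gzip-compressed UTF-8 text with one title per line, and it is assumed that the file may begin with a `page_title` header row. Note that it matches against titles only, not against page text.

```python
import gzip
from typing import Iterator


def grep_titles(path: str, needle: str, ignore_case: bool = False) -> Iterator[str]:
    """Yield each title from a gzipped one-title-per-line file that contains `needle`.

    `grep_titles` is a hypothetical helper written for this example; it is not
    part of any Wikimedia tool.
    """
    if ignore_case:
        needle = needle.lower()

    # Stream the file line by line so the whole list never has to fit in memory.
    with gzip.open(path, mode="rt", encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle):
            title = line.rstrip("\n")

            # Assumption: the dump may start with a "page_title" header row,
            # which is not a real title and is skipped.
            if line_number == 0 and title == "page_title":
                continue

            haystack = title.lower() if ignore_case else title
            if needle in haystack:
                yield title
```

Looping over `grep_titles("all-titles-in-ns0.gz", "Prague")` and printing each result would then behave much like zcat piped into a fixed-string grep. Titles in such lists are normally stored with underscores in place of spaces, so a search string containing spaces would likely need the same substitution.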
