Hi,

On Sat, Dec 12, 2009 at 4:35 PM, David Gerard <[email protected]> wrote:

> 2009/12/11 Behrang Saeedzadeh <[email protected]>:
> > Hi,
> >
> > I have downloaded enwiki-latest-all-titles-in-ns0.gz and I want to extract
> > main titles and store them in another file. For example, some titles have
> > meta information (e.g. disambiguation etc.) and I want these to be removed.
> > Can I remove all the text between parentheses from the titles to achieve
> > this?
>
> You have to parse it by hand.
>
> > Also some titles start with the "!" character. and some are enclosed between
> > two or three of them such as !Adiso_Amigos!. What is the purpose of "!" in
> > such cases?

It's part of the topic's name (in case of
<http://en.wikipedia.org/wiki/%C2%A1Adios_Amigos!>, the band's name). The
reverse exclamation mark is part of the Spanish language.

> > Also why some titles are enclosed between two double quotes such
> > as "400_Years_of_Telescope"?

Same case: The " are part of the topic's name (e.g.
<http://en.wikipedia.org/wiki/%22Weird_Al%22_Yankovic>).

Marco

PS: Next time, please do correct copy&paste so people have a chance to see
what you want. Both your supplied examples had to be corrected, the second
one was missing a "the":
<http://en.wikipedia.org/wiki/"400_Years_of_the_Telescope">

-- 
VMSoft GbR
Nabburger Str. 15
81737 München
Geschäftsführer: Marco Schuster, Volker Hemmert
http://vmsoft-gbr.de

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
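
For what it's worth, here is a rough Python sketch of what "parsing it by hand" might look like for the parentheses question. It is only an illustration under assumptions, not an official tool: the names strip_qualifier and main_titles are made up for this example, and it assumes the dump has one title per line, with underscores instead of spaces and any disambiguation qualifier as a trailing "_(...)". Characters such as ¡, ! and " are left alone because they belong to the title.

```python
import re

# Hypothetical pattern: one trailing "_(...)" qualifier, e.g. "_(planet)".
# Parentheses elsewhere in the title are treated as part of the name.
_TRAILING_QUALIFIER = re.compile(r"_\([^()]*\)$")


def strip_qualifier(title: str) -> str:
    """Return the title without a trailing parenthesised qualifier."""
    return _TRAILING_QUALIFIER.sub("", title)


def main_titles(lines):
    """Yield each distinct base title once, in order of first appearance."""
    seen = set()
    for line in lines:
        title = line.rstrip("\n")
        if not title:
            continue  # skip blank lines
        base = strip_qualifier(title)
        if base not in seen:
            seen.add(base)
            yield base


if __name__ == "__main__":
    sample = ["Mercury_(planet)\n", "Mercury_(element)\n", '"Weird_Al"_Yankovic\n']
    for t in main_titles(sample):
        print(t)
```

To run it on the real file, something like gzip.open("enwiki-latest-all-titles-in-ns0.gz", "rt", encoding="utf-8") can be passed to main_titles in place of the sample list. Note that some titles legitimately end in parentheses, so the output would still need a manual check; that is the reason a blanket "remove everything in parentheses" is not safe.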
