https://bugzilla.wikimedia.org/show_bug.cgi?id=18694

           Summary: Spanish wikipedia XML dump problems
           Product: Wikimedia
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: Normal
         Component: Downloads
        AssignedTo: [email protected]
        ReportedBy: [email protected]


I downloaded two consecutive Spanish Wikipedia XML dump files
(eswiki-20090504-pages-articles.xml.bz2 and, before that,
eswiki-20090421-pages-articles.xml.bz2). When I imported the file into WikiTaxi,
a large number of pages showed a strange error: the titles and the content of
the pages were mixed up. That is, the title would be one thing while the text
itself was obviously from a different page (or a combination of two pages). So
I looked into the original XML file itself, and this is what I found, for
example:

  <page>
    <title>Gómez Plata</title>
    <id>454035</id>
    <revision>
      <id>25156038</id>
      <timestamp>2009-03-28T06:38:04Z</timestamp>
      <contributor>
        <username>SajoR</username>
        <id>130444</id>
      </contributor>
      <minor />
      <comment>leve mejora</comment>
      <text xml:space="preserve">'''Montserrat Domínguez''' ([[Madrid]],
[[1963]]) es una [[periodismo|periodista]] [[España|española]].

Considera que la primera obligación de un periodista es ser crítico con el
poder y es optimista respecto a la situación actual del periodismo. Su trabajo
le ofrece, en su opinión, &quot;un motor de vida&quot;.

Es aficionada a la [[lectura]] y a los viajes.

== Biografía ==

Estudió [[Ciencias de la Información]] por la [[Universidad Complutense de
Madrid]]. Posteriormente cursó un Master en Periodismo por la [[Universidad de
Columbia]].


So the title of the page is Gómez Plata (a municipality in Colombia), but the
page is about a Spanish journalist.
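
To get a rough idea of how many pages are affected, something like the
following could be run against the dump. It is only a heuristic sketch: it
assumes that an article's first bold phrase ('''...''') normally repeats the
page title, which is not true for every article, so it will report some false
positives. The names suspicious_pages and scan_dump are made up for this
illustration and are not part of any existing tool.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# First bold phrase in wikitext, e.g. '''Montserrat Domínguez'''
BOLD = re.compile(r"'''(.+?)'''")


def local(tag):
    """Strip an XML namespace prefix: '{ns}title' -> 'title'."""
    return tag.rsplit("}", 1)[-1]


def suspicious_pages(stream):
    """Yield (title, first_bold) when the first bold phrase differs from the title.

    Heuristic only (an assumption of this sketch): a mismatch is a hint,
    not proof, that title and text belong to different pages.
    """
    title = None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            match = BOLD.search(elem.text or "")
            if title is not None and match:
                bold = match.group(1).strip()
                if bold.lower() != title.strip().lower():
                    yield title, bold
        elif name == "page":
            title = None
            elem.clear()  # drop the page's children to limit memory use


def scan_dump(path, limit=20):
    """Print up to `limit` suspicious pages from a .xml.bz2 dump file."""
    with bz2.open(path, "rb") as stream:
        for count, (title, bold) in enumerate(suspicious_pages(stream)):
            if count >= limit:
                break
            print(f"{title!r}: text starts with {bold!r}")
```

For example, scan_dump("eswiki-20090504-pages-articles.xml.bz2") would list
the first 20 candidates. On a healthy dump a few mismatches are still expected.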

This didn't happen when I downloaded other Wikipedia dumps (en, de, nl, sv).
Could someone please look into this problem? Thank you.


-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.
_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
