Great, if you remove ETL from the way, the task is pretty simple: * Get a CSV dump of your table(s). * Use google refine to draw rules for cleaning up the data - I can provide you some, based on my very old database dump, and we might work that out together through Skype. * Reload the table(s) with the exported (fixed) CSV.
It's pretty easy! The great thing about it is that you get a record of "cleaning actions", which you can replay later - e.g., if you get a more updated database dump which didn't get cleaned up. -- Nuno Tavares Wikimedia Portugal http://www.wikimedia.pt Imagine um mundo onde é dada a qualquer pessoa a possibilidade de ter livre acesso ao somatório de todo o conhecimento humano. É isso o que estamos a fazer. Participe também: http://www.wikimedia.pt Vicenç Riullop escreveu: > Ok, thanks, I think I've got it. It will be better to clean the > municipality field moving any comment or multiple municipalities to > another field more suitable for notes. The pipped links and sorting keys > can be generated with an auxiliary template-table that can be useful > also for uniformity in other types of lists. > > Obtaining the source has been a PITA, so regeneration is not an issue. > We will think later how to check for updates. > > Vicenç > >> Date: Thu, 25 Aug 2011 12:33:35 +0100 >> From: [email protected] >> To: [email protected] >> Subject: Re: [Wiki Loves Monuments] Spanish database [Was: European > stats and other cool tools] >> >> Dear Vicenç [I believe all others should ignore this message], >> >> >> Yes, I've just noticed, e.g., "[[Ogassa]] i [[Ribes de Freser]]". My >> (professional) opinion is that this is a huge mistake, since that field >> represents 2 values, which is a clear violation of standard database >> rules. Having this in mind, I'd suggest splitting the object in two. >> >> I found a pattern for "multiple municipalities in the same object", >> which seems to be "]] i [[", and there seems to be only 6 of them. You >> can just wipe out one of them from the field, or manually duplicate the >> record them for splitting - the latter being more complex, think of a >> good ID disambiguation though. Regarding link disambiguations, you can >> just add a column for that - in a "working table", as they are of no use >> for a listing like this (alas, that kind of data is only relevant if you >> plan to regenerate the lists). >> >> >> We had similar problems - we even had descriptions names in wiki syntax >> already, our lists date back to 2006. As I wanted to preserve the links >> and original "names" (designations), I've done some transformations on >> the data before it gets on the list - this kind of approach was actually >> the reason to have our tools/plist [1] - notice multiple links near the >> end of each description, and notice that, if you export to wiki format, >> you'll find municipalities disambiguated as well. >> >> [1] http://www.wikilovesmonuments.org.pt/tools/plist/?region=Braga >> >> That's why you currently can "regenerate" the lists with their original >> descriptions and links. >> >> This is, however, a more complex approach which I won't have time to >> assist you, but I may point you directions if you know ETL. >> >> >> -- >> Nuno Tavares >> Wikimedia Portugal >> http://www.wikimedia.pt >> >> Imagine um mundo onde é dada a qualquer pessoa a possibilidade de ter >> livre acesso ao somatório de todo o conhecimento humano. É isso o que >> estamos a fazer. >> >> Participe também: http://www.wikimedia.pt >> >> >> >> Vicenç Riullop escreveu: >> > A list may include different municipalities so links, disambiguations >> > and sorting keys are included in the municipality field. Morever, some >> > large monuments may be in the border of two municipalities. Any >> > suggestion how to handdle this? >> > >> > Vicenç >> > >> >> Date: Thu, 25 Aug 2011 03:40:12 +0100 >> >> From: [email protected] >> >> To: [email protected] >> >> Subject: Re: [Wiki Loves Monuments] Spanish database [Was: European >> > stats and other cool tools] >> >> >> >> Dear spanish fellows: >> >> >> >> I've notice some quality problems in the municipality names of the >> >> objects on es/es-ct/es-vc lists. Sometimes they are [[like this]], >> >> others "like this" and also "{{ordre|like this}}". In my opinion, this >> >> is half-way for problems in managing the lists. >> >> >> >> >> > > http://www.wikilovesmonuments.org.pt/tools/stats-eu/index2.php?format=table&item[]=total&item[]=name&item[]=name_pct&item[]=address&item[]=address_pct&item[]=municipality&item[]=municipality_pct&item[]=coordinates&item[]=coordinates_pct&item[]=image&item[]=image_pct&width=1000&height=700&filter[]=es-vc&filter[]=es&filter[]=es-ct&breakdown=municipality&groupby=item&color=heatpct >> >> >> >> >> >> It's pretty easy to clean that up, but I'm still not sure if Maarten is >> >> loading the database from wiki lists (thus overriding any fixes we >> >> perform). Get in touch with me if you need help. >> >> >> >> PS: I don't have much opportunities to hang around IRC, but feel > free to >> >> ping me in Skype. >> >> >> >> PS2: Andorra and France seem to suffer the same symptoms. >> >> >> >> -- >> >> Nuno Tavares >> >> Wikimedia Portugal >> >> http://www.wikimedia.pt >> >> >> >> Imagine um mundo onde é dada a qualquer pessoa a possibilidade de ter >> >> livre acesso ao somatório de todo o conhecimento humano. É isso o que >> >> estamos a fazer. >> >> >> >> Participe também: http://www.wikimedia.pt >> >> >> >> Em 25-08-2011 02:49, Nuno Tavares escreveu: >> >> > Hey again, >> >> > >> >> > Didn't I say I love statistics? :-) >> >> > >> >> > You can now have your "heat" list adding the option "color=heatpct": >> >> > >> >> > >> > > http://www.wikilovesmonuments.org.pt/tools/stats-eu/index2.php?format=table&item[]=total&item[]=name&item[]=name_pct&item[]=address&item[]=address_pct&item[]=municipality&item[]=municipality_pct&item[]=coordinates&item[]=coordinates_pct&item[]=image&item[]=image_pct&width=1000&height=700&filter[]=pt&breakdown=municipality&groupby=item&color=heatpct >> >> > >> >> > >> >> > >> >> >> >> _______________________________________________ >> >> Wiki Loves Monuments mailing list >> >> [email protected] >> >> https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments >> >> http://www.wikilovesmonuments.eu >> > >> > >> > ------------------------------------------------------------------------ >> > >> > _______________________________________________ >> > Wiki Loves Monuments mailing list >> > [email protected] >> > https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments >> > http://www.wikilovesmonuments.eu >> >> _______________________________________________ >> Wiki Loves Monuments mailing list >> [email protected] >> https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments >> http://www.wikilovesmonuments.eu > > > ------------------------------------------------------------------------ > > _______________________________________________ > Wiki Loves Monuments mailing list > [email protected] > https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments > http://www.wikilovesmonuments.eu _______________________________________________ Wiki Loves Monuments mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikilovesmonuments http://www.wikilovesmonuments.eu
