Fae

This is very interesting and useful: I have never come across "fileUrl()" 
before. 

In the example below, you are using source=page.fileUrl(). Is there a similar 
call that will get the full-size version of a file on Commons?

So far as I can see, getting the file via the API requires knowing the 
URL, which in turn means calculating an MD5 hash of the image name. In 
AppleScript, this seems to work as long as there are no odd characters, but 
I'm hoping there is a simpler call to use in Python:

    set imageName to findReplStr's findReplStr(imageName, " ", "_") # replace spaces with underscores

    set hash to do shell script "md5 -q -s " & quoted form of imageName

    set sub2 to text 1 thru 2 of hash # AppleScript uses 1-based strings
    set sub1 to text 1 of sub2

    set imageURL1 to "http://upload.wikimedia.org/wikipedia/commons/"

    set imageURL to imageURL1 & sub1 & "/" & sub2 & "/" & imageName
    set imageURL to findReplStr's findReplStr(imageURL, " ", "_")
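
For what it's worth, here is a sketch of the same construction in Python, 
assuming the shard layout the AppleScript above relies on (first one and 
first two hex characters of the MD5 digest of the underscored name); 
commons_file_url is just an illustrative name, not a pywikipediabot call:

```python
import hashlib

def commons_file_url(image_name):
    # Spaces become underscores in Commons file names.
    name = image_name.replace(" ", "_")
    # The shard directories are the first one and first two hex
    # characters of the md5 digest of the underscored name.
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return ("http://upload.wikimedia.org/wikipedia/commons/%s/%s/%s"
            % (digest[0], digest[:2], name))
```

As with the AppleScript version, odd characters are not handled: a careful 
version would still need to percent-encode the file name.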

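Alternatively, the hash can be avoided entirely: a standard MediaWiki API 
query with prop=imageinfo and iiprop=url returns the full-size file URL in 
its JSON response. A minimal sketch that only builds the query URL (the 
fetch and JSON parsing are left to the caller; imageinfo_query_url is an 
illustrative name):

```python
import urllib.parse  # Python 3; Python 2 has urllib.urlencode instead

def imageinfo_query_url(image_name):
    # Build a Commons API query whose JSON response carries the
    # full-size file URL under query -> pages -> ... -> imageinfo -> url.
    params = urllib.parse.urlencode({
        "action": "query",
        "titles": "File:" + image_name,
        "prop": "imageinfo",
        "iiprop": "url",
        "format": "json",
    })
    return "http://commons.wikimedia.org/w/api.php?" + params
```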
I am looking forward to your proposed workshops.

Michael


On 16 Oct 2013, at 15:10, Fæ wrote:

> I suggest that anyone with topics they would like to see covered in a 
> Python/pywikipediabot workshop consider adding them to the discussion on the 
> event registration talk page, so that Jonathan can pull ideas and expected outcomes 
> together. He's trying to agree a new date for a workshop and I'm thinking of 
> the value of splitting it into a basics session for, say, 2 hours one evening 
> and a more advanced practical session one afternoon (you can then choose to 
> come to one rather than both). I would be happy for this to be either a 
> weekday or a weekend depending on what most people can make.
> 
> Go to 
> <https://wiki.wikimedia.org.uk/wiki/Python_and_Wikimedia_bots_workshop_Oct_2013>
>  to add your ideas on dates and content of a workshop(s).
> 
> I have pasted the code for a recursive dump of Wikipedia Takes Chester that I 
> cobbled together before breakfast below, but it's not all that helpful 
> without getting the basics of python modules, pywikipediabot and the 
> Wikimedia API (which it is built on) in your head first. It is badly written, 
> but works, and I can tweak it into a general multi-cat dump routine with a 
> couple of minutes' work. The idea of having a couple of workshops is to give a 
> group of contributors the basic "bot" writing skills and an effective kit-bag 
> of methods to write anything they can imagine, from clever analytical reports 
> to daily house-keeping bots, even if they use fairly poor code to do so ;-)
> 
> The main problem we have with pywikipediabot is that the documentation is poor. 
> For example, I don't think the function "fileUrl()" is documented anywhere; for 
> several months I was using the API directly to do what this function nicely 
> does, simply because I didn't know it was available. It would also probably fail 
> in mysterious ways if used on the wrong class of object, such as a category 
> rather than an image page, something that a manual ought to help the user 
> understand. It would be great if those interested in improving the manuals 
> could play around with the various commands and illustrate with example 
> working code (and highlight common errors!). I would hope that the outcome of 
> the workshop would be to achieve some of this, perhaps even laying down a few 
> short demonstration screen-capture videos of what these tools can do, and how 
> to go about setting yourself up to use them.
> 
> BTW the "unidecode" bit below was hacked on after the dump fell over trying 
> to write "façade" in a local file name. It is a clever module for handling 
> non-ASCII international characters of all sorts, and neatly transcribes the 
> name as "facade".
> 
> Fae
> ----
> # The main part of 'batchCatDump.py'; treat as CC-BY-SA.
> # This takes all images recursively under Commons category 'catname' and saves
> # the full-size image along with the current text of its associated image page
> # in a local directory. In this case it generated 468 image files and the same
> # number of matching html files, taking just under 2GB on a USB stick.
> 
> # Uses the old pywikipedia framework.
> import os
> import urllib
> import wikipedia, catlib, pagegenerators
> from unidecode import unidecode
> 
> site = wikipedia.getSite('commons', 'commons')
> 
> catname = "Wikipedia Takes Chester"
> cat = catlib.Category(site, catname)
> gen = pagegenerators.CategorizedPageGenerator(cat, recurse=True)
> count = 0
> 
> savedir = "/Volumes/Fae_32GB/Wiki/" + catname + "/"
> if not os.path.exists(savedir):
>     os.makedirs(savedir)
> 
> for page in gen:
>     title = page.title()
>     if not title.startswith("File:"):
>         continue
>     count += 1
>     utitle = unidecode(title[5:])
>     saveas = savedir + utitle
>     if os.path.exists(saveas):
>         continue
>     if utitle != title[5:]:
>         print "Transcribing title as", utitle
>     html = page.get()
>     source = page.fileUrl()
>     urllib.urlretrieve(source, saveas)
>     f = open(saveas + ".html", "w")
>     f.write(unidecode(html))
>     f.close()
> 
> -- 
> [email protected] http://j.mp/faewm
> 
> _______________________________________________
> Wikimedia UK mailing list
> [email protected]
> http://mail.wikimedia.org/mailman/listinfo/wikimediauk-l
> WMUK: http://uk.wikimedia.org
