Hi,

I did some "testing" on Domas' pagecounts log files:

Original file: pagecounts-20100910-040000.gz, downloaded from
http://dammit.lt/wikistats/

The original file "pagecounts-20100910-040000.gz" was parsed to remove all
lines except those beginning with "en File". This shows what files were
requested in that hour, mostly images, but further parsing is needed to
remove non-image files (e.g. *.ogg audio).

Example parsed line from pagecounts-20100910-040000.gz:

en File:Alexander_Karelin.jpg 1 9238

The 1 indicates the file was requested once that hour, and 9238 is the
number of bytes transferred, which depends on what image scaling was used.

The file is located at http://en.wikipedia.org/wiki/File:Alexander_Karelin.jpg
and is linked from the page http://en.wikipedia.org/wiki/Aleksandr_Karelin
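
For illustration, here is a rough Python sketch of that filtering step. The
IMAGE_EXTENSIONS set and the parse_image_lines name are my own, and the
extension list is only a guess at what counts as an image:

    import gzip

    # Extensions we treat as images; anything else (e.g. .ogg audio) is
    # skipped. This list is an assumption and will need extending.
    IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.svg', '.tif', '.tiff')

    def parse_image_lines(path):
        # Each pagecounts line looks like:
        #   en File:Alexander_Karelin.jpg 1 9238
        # i.e. project, page title, request count, bytes transferred.
        with gzip.open(path, 'rt', encoding='utf-8', errors='replace') as f:
            for line in f:
                fields = line.split()
                if len(fields) != 4:
                    continue
                project, title, views, nbytes = fields
                if project != 'en' or not title.startswith('File:'):
                    continue
                if not title.lower().endswith(IMAGE_EXTENSIONS):
                    continue
                yield title, int(views), int(nbytes)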

We may also want to parse out the lines that begin with "commons.m File"
and "commons.m Image" from the pagecounts file, as they also contain image
links.
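
If we keep those too, the project/prefix check in the sketch above just
broadens to a small table; something like the following, where the exact
set of project codes worth keeping is a guess:

    # (project, title prefix) pairs we keep; an assumed set, adjust as needed.
    WANTED = {
        ('en', 'File:'),
        ('commons.m', 'File:'),
        ('commons.m', 'Image:'),
    }

    def wanted(project, title):
        return any(project == p and title.startswith(prefix)
                   for p, prefix in WANTED)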

After we parse the pagecounts files down to image lines only, we can merge
them together. The more files we merge, the better our image view data will
be for sorting the image list generated by wikix by view frequency.
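
Merging is then just summing per-title counts across files. A minimal
sketch, assuming each parsed file holds one "title count" pair per line
(the byte column dropped):

    from collections import Counter

    def merge_counts(paths):
        # Sum per-image view counts across several parsed pagecounts files.
        totals = Counter()
        for path in paths:
            with open(path, encoding='utf-8') as f:
                for line in f:
                    title, views = line.rsplit(' ', 1)
                    totals[title] += int(views)
        return totals

    # Most-viewed first, e.g.:
    # for title, views in merge_counts(['h1.txt', 'h2.txt']).most_common():
    #     print(views, title)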

Wikix has the complete list of images for the wiki we are creating an image
dump for. Any extra images in these pagecounts files that aren't in wikix's
image list won't be added to the image dump, and images that are in wikix's
list but not in the pagecounts files will still be added to the image dump,
but can be put into a separate tar file marking them as infrequently
accessed.
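
A sketch of that split, assuming (hypothetically) that wikix's list holds
one image title per line in the same "File:..." form, and that totals is
the merged Counter from above:

    def split_by_frequency(wikix_list_path, totals, threshold=1):
        # Titles seen at least `threshold` times go into the main dump list;
        # the rest go into the "infrequently accessed" tar file list. Titles
        # in the pagecounts data but not in wikix's list are ignored, since
        # wikix's list alone defines what enters the dump.
        frequent, infrequent = [], []
        with open(wikix_list_path, encoding='utf-8') as f:
            for line in f:
                title = line.strip()
                if not title:
                    continue
                if totals.get(title, 0) >= threshold:
                    frequent.append(title)
                else:
                    infrequent.append(title)
        return frequent, infrequent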

I did the parsing manually with a text editor, but for the next step of
merging the pagecounts files we will need to write some scripts.

I think in the end we will not use wikix, as it doesn't create a simple
image list from the wiki's XML file.

cheers,
Jamie



