Re: [Wikitech-l] list of things to do for image dumps
On Thu, Sep 9, 2010 at 10:54 PM, Jamie Morken <jmor...@shaw.ca> wrote:
> Hi all,
>
> This is a preliminary list of what needs to be done to generate image
> dumps. If anyone can help with #2 to provide the access log of image
> usage stats, please send me an email!
>
> 1. run wikix to generate a list of images for a given wiki, i.e. enwiki
> 2. sort the image list based on usage frequency from access log files

Hi,

It will be great to have these image dumps! I wonder if a different dump
may be worth it for a different scenario:

* The user only wants to get the photos for a small set of ids, i.e. 1000 pages.

What would be the proper way to get these photos without downloading
large dumps?

a. Parse the actual html pages and get the actual image urls (plus
license info), and then download the images?
b. Try to find the actual image urls using the commons wikitext dump
(and parse license info, ...)?

Both approaches seem complicated, so maybe a different dump would be
helpful:

Page id -- List of [ Image id | real url | type (original | dim_xy | thumb) | license ]

regards
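To make the proposal concrete, here is a purely illustrative Python sketch of how such a per-page record might be serialized and parsed. The format is only a proposal in this thread, so the field layout follows the sketch above and all values are made up:

# Hypothetical dump line: page id, then "|"-separated image fields,
# with multiple images separated by ";". Everything here is invented.
record = "9028\tImage:Ant_head.jpg|http://example.org/Ant_head.jpg|original|cc-by-sa-3.0"

page_id, _, images = record.partition("\t")
for entry in images.split(";"):
    image_id, url, kind, license_ = entry.split("|")
    print(page_id, image_id, url, kind, license_)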
Re: [Wikitech-l] list of things to do for image dumps
2010/9/10 Jose <jmal...@gmail.com>:
> Both approaches seem complicated, so maybe a different dump would be
> helpful:
>
> Page id -- List of [ Image id | real url | type (original | dim_xy | thumb) | license ]

http://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=url|dimensions&iiurlwidth=200&pageids=3786405|8801158|4120827|1478233

Returns the image URL, width, height and thumbnail URL for a 200px
thumbnail.

Roan Kattouw (Catrope)
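For reference, a minimal Python sketch of batching that query (an illustration, not code from the thread; it assumes the `requests` library is available):

import requests

API = "http://commons.wikimedia.org/w/api.php"

def image_info(pageids, thumb_width=200):
    """Fetch url/dimension/thumbnail info for a batch of file page ids."""
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "url|dimensions",
        "iiurlwidth": thumb_width,
        "pageids": "|".join(str(p) for p in pageids),
        "format": "json",
    }
    return requests.get(API, params=params).json()["query"]["pages"]

for page in image_info([3786405, 8801158, 4120827, 1478233]).values():
    for info in page.get("imageinfo", []):
        print(page["title"], info["width"], info["height"], info["thumburl"])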
Re: [Wikitech-l] list of things to do for image dumps
On Fri, Sep 10, 2010 at 2:44 PM, Roan Kattouw <roan.katt...@gmail.com> wrote:
>> Both approaches seem complicated, so maybe a different dump would be
>> helpful:
>>
>> Page id -- List of [ Image id | real url | type (original | dim_xy | thumb) | license ]
>
> http://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=url|dimensions&iiurlwidth=200&pageids=3786405|8801158|4120827|1478233
>
> Returns the image URL, width, height and thumbnail URL for a 200px
> thumbnail.

Thanks, this may be useful. So let's say I want to get all images for
the Ant page; the steps would be:

1. Parse the Ant page wikitext and get all Image: links.
2. For every image link, get its commons page id. (Can I issue the
above query using the titles instead of the numeric ids? If not, then
use the commons repository to map each image title to a numeric id.)
3. Issue a query like the one you detail above (but the results don't
show license info!).

Still, I think having a small dump with metadata is better than
sending a lot of api queries.

thanks
Re: [Wikitech-l] list of things to do for image dumps
On Fri, Sep 10, 2010 at 3:09 PM, Jose <jmal...@gmail.com> wrote:
> Thanks, this may be useful. So let's say I want to get all images for
> the Ant page; the steps would be:

Just use prop=images as a generator on en.wikipedia.org. This will
yield the thumb urls as well as the urls of the commons pages, which
can then be fetched separately.

Bryan
Re: [Wikitech-l] list of things to do for image dumps
2010/9/10 Bryan Tong Minh <bryan.tongm...@gmail.com>:
> Just use prop=images as a generator on en.wikipedia.org. This will
> yield the thumb urls as well as the urls of the commons pages, which
> can then be fetched separately.

Concrete example:

http://en.wikipedia.org/w/api.php?action=query&generator=images&gimlimit=max&titles=Albert_Einstein&prop=imageinfo&iiprop=url|dimensions&iiurlwidth=200

Licensing info is not available through the API because it's just some
text or template on the image description page; it has no meaning to
the MediaWiki software.

Roan Kattouw (Catrope)
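A short Python sketch of Roan's generator query (again an illustration, assuming `requests`): it lists every image used on the Albert Einstein article together with its original URL, 200px thumbnail URL, and description page URL.

import requests

API = "http://en.wikipedia.org/w/api.php"

params = {
    "action": "query",
    "generator": "images",   # the page's images become the result set
    "gimlimit": "max",
    "titles": "Albert_Einstein",
    "prop": "imageinfo",
    "iiprop": "url|dimensions",
    "iiurlwidth": 200,
    "format": "json",
}
pages = requests.get(API, params=params).json()["query"]["pages"]

for page in pages.values():
    for info in page.get("imageinfo", []):
        # url = the original file, thumburl = the 200px rendering,
        # descriptionurl = the file description page (where license text lives)
        print(page["title"], info["url"], info["thumburl"], info["descriptionurl"])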
Re: [Wikitech-l] CellsExtension feasibility?
Having ported an entire spreadsheet's cells into mediawiki templates, I
can report a basic problem: a substantial percentage of the templates
required the addition of 'subst:' to the template references in order to
get the references to work. This, of course, trashes the original
{{#expr:...}}, which is highly undesirable.

On Tue, Jul 13, 2010 at 10:25 AM, Roan Kattouw <roan.katt...@gmail.com> wrote:
> 2010/7/13 James Bowery <jabow...@gmail.com>:
>> What I would like is something very similar called CellsExtension
>> which provides only the keyword #cell, as in:
>> ---
>> {{#expr:{{#cell:pi}}+1}}
>> ---
>> However, it gets the value of pi from:
>> http://somedomain.org/mediawiki/index.php?title=Pi
>
> Just putting 3.141592653589 (as opposed to 3.14159265418, whose last
> three digits differ from their counterparts in pi) in [[Template:Pi]]
> and using {{#expr:{{pi}}+1}} should have the same effect AFAIK. If you
> want [[Template:Pi]] to look more interesting than just the number,
> you could use <noinclude> and <includeonly>.
>
>> Ideally, whenever a mediawiki rendered page is cached, dependency
>> pointers are created from all pages from which cells fetched values
>> during rendering of the page (implying the evaluation of #expr's).
>> That way, when the mediawiki source for one of the cached pages is
>> edited, not only is its cached rendering deleted, but so are all
>> cached renderings that depend on it directly or indirectly. This is
>> so that the next time those pages are accessed, they are rendered --
>> and cached -- again, freshly evaluating the formulas in the #expr's
>> (which, of course, will contain #cell references such as
>> {{#cell:pi}}).
>
> With the template transclusion method I described above, all of this
> is already handled by MediaWiki.
>
> Roan Kattouw (Catrope)
Re: [Wikitech-l] list of things to do for image dumps
Hi Lars, are you going to upload more logs to the Internet Archive?
Domas' website only shows the last 3 (?) months. I think that there are
many of these files at Toolserver, but we must preserve this raw data in
another place that is secure (for posterity).

2010/9/10 Lars Aronsson <l...@aronsson.se>:
> On 09/09/2010 10:54 PM, Jamie Morken wrote:
>> Hi all,
>>
>> If anyone can help with #2 to provide the access log of image usage
>> stats please send me an email!
>>
>> 2. sort the image list based on usage frequency from access log files
>
> The raw data is one file per hour, containing a list of page names and
> visit counts. From just one such file, you get statistics on what the
> most visited pages were during that particular hour. By combining more
> files, you can get statistics for a whole day, a week, a month, a
> year, all Mondays, all 7am hours around the year, the 3rd Sunday after
> Easter, or whatever. The combinations are almost endless. How do we
> boil this down to a few datasets that are most useful? Is that the
> total visit count per month? Or what?
>
> Are these visitor stats already in a database on the toolserver? If
> so, how are they organized?
>
> I wrote some documentation on the access log format here:
> http://www.archive.org/details/wikipedia_visitor_stats_200712
>
> --
> Lars Aronsson (l...@aronsson.se)
> Aronsson Datateknik - http://aronsson.se
[Wikitech-l] image dump status update1
Hi,

I did some testing on Domas' pagecounts log files:

original file: pagecounts-20100910-04.gz
downloaded from: http://dammit.lt/wikistats/

The original file pagecounts-20100910-04.gz was parsed to remove all
lines except those beginning with "en File:". This shows what files were
downloaded in that hour, mostly images, but further parsing is needed to
remove non-image files (e.g. *.ogg audio etc).

Example parsed line from pagecounts-20100910-04.gz:

en File:Alexander_Karelin.jpg 1 9238

The 1 indicates the file was downloaded once this hour, and the 9238 is
the bytes transferred, which depends on what image scaling was used. The
file is located at http://en.wikipedia.org/wiki/File:Alexander_Karelin.jpg
and linked from the page http://en.wikipedia.org/wiki/Aleksandr_Karelin.

We also may want to parse out the lines that begin with "commons.m File"
and "commons.m Image" from the pagecounts file, as they also contain
image links.

After we parse the pagecounts files down to image links only, we can
merge them together; the more we merge, the better our image view data
will be for sorting the image list generated by wikix by view frequency.
Wikix has the complete list of images for the wiki we are creating an
image dump for, so any extra images from these pagecounts files that
aren't in wikix's image list won't be added to the image dump, and
images that are in wikix's list but not in the pagecounts files will
still be added to the image dump, but can be put into a tar file showing
they are infrequently accessed.

I did the parsing manually with a text editor, but for the next step of
merging the pagecounts files we will need to make some scripts. I think
in the end we will not use wikix, as it doesn't create a simple image
list from the wiki's xml file.

cheers,
Jamie
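For the scripting step, here is a rough Python sketch of what such a filter-and-merge script could look like (my own illustration; the "en File:" prefix and the line format are taken from the description above, while the image-extension filter is an assumption):

import gzip
import sys
from collections import Counter

IMAGE_EXTS = (".jpg", ".jpeg", ".png", ".gif", ".svg")  # skip *.ogg audio etc.

def merge_pagecounts(paths):
    """Sum per-image view counts across hourly pagecounts-*.gz files."""
    views = Counter()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                fields = line.split()  # project, title, count, bytes
                if len(fields) != 4:
                    continue
                project, title, count, _nbytes = fields
                if project == "en" and title.startswith("File:") \
                        and title.lower().endswith(IMAGE_EXTS):
                    views[title] += int(count)
    return views

if __name__ == "__main__":
    # usage: python merge_pagecounts.py pagecounts-20100910-*.gz
    for title, count in merge_pagecounts(sys.argv[1:]).most_common(20):
        print(count, title)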
[Wikitech-l] Keeping record of imported licensed text
We are currently attempting to refactor some specific modifications to
the standard MW code we use (1.13.2) into an extension so we can upgrade
to a more recent maintained version. One modification we have keeps a
flag in the revisions table specifying that article text was imported
from WP. This flag generates an attribution statement at the bottom of
the article that acknowledges the import.

I don't want to start a discussion about the various legal issues
surrounding text licensing. However, assuming we must acknowledge use of
licensed text, a legitimate technical issue is how to associate state
with an article in a way that records the import of licensed text. I
bring this up here because I assume we are not the only site that faces
this issue.

Some of our users want to encode the attribution information in a
template. The problem with this approach is that anyone can come along
and remove it. That would mean the organization legally responsible for
the site would entrust the integrity of site content to any arbitrary
author. We may go this route, but for the sake of this discussion I
assume such a strategy is not viable. So, the remainder of this post
assumes we need to keep such licensing state in the db.

After asking around, one suggestion was to keep the licensing state in
the page_props table. This seems very reasonable and I would be
interested in comments by this community on the idea. Of course, there
has to be a way to get this state set, but it seems likely that could be
achieved using an extension triggered when an article is edited.

Since this post is already getting long, let me close by asking whether
support for associating licensing information with articles might be
useful to a large number of sites. If so, then perhaps it belongs in the
core.

--
-- Dan Nessett
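As a hedged sketch of how external tools might read such a flag back out of page_props (assuming a hypothetical wiki recent enough to expose the pageprops query module; the wiki URL and the property name are invented for illustration):

import requests

API = "http://wiki.example.org/w/api.php"  # hypothetical wiki

def import_flags(titles):
    """Return the hypothetical 'imported_from_wp' page prop per title."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "titles": "|".join(titles),
        "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    return {
        p["title"]: p.get("pageprops", {}).get("imported_from_wp")
        for p in pages.values()
        if "title" in p
    }

print(import_flags(["Some_Imported_Article"]))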
Re: [Wikitech-l] image dump status update1
On 9/10/2010 6:14 PM, Jamie Morken wrote:
> After we parse the pagecounts files down to image links only, we can
> merge them together; the more we merge, the better our image view data
> will be for sorting the image list generated by wikix by view
> frequency.

That won't really give you the stats you want. That only gives you
pageviews for the file description page itself, and not articles that
use the image. I don't think there are any publicly available stats for
the latter, though you could estimate it rather well using the dumps for
the imagelinks and page database tables, then correlating hits for
articles with the images that they contain.

--
Alex (wikipedia:en:User:Mr.Z-man)
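A back-of-the-envelope sketch of that estimate (my own illustration): credit each image with the view counts of the articles that embed it, given mappings assumed to have been loaded from the pagecounts files and the page/imagelinks table dumps.

from collections import Counter

def estimate_image_views(article_views, page_id_by_title, images_by_page_id):
    """
    article_views:     {article title: view count}, from merged pagecounts files
    page_id_by_title:  {article title: page id}, from the `page` table dump
    images_by_page_id: {page id: [image names]}, from the `imagelinks` table dump
    """
    image_views = Counter()
    for title, views in article_views.items():
        page_id = page_id_by_title.get(title)
        if page_id is None:
            continue
        for image in images_by_page_id.get(page_id, []):
            image_views[image] += views  # credit the article's views to the image
    return image_views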
Re: [Wikitech-l] Keeping record of imported licensed text
Dan Nessett wrote:
> (...)
> After asking around, one suggestion was to keep the licensing state in
> the page_props table. This seems very reasonable and I would be
> interested in comments by this community on the idea. Of course, there
> has to be a way to get this state set, but it seems likely that could
> be achieved using an extension triggered when an article is edited.

Seems a good approach.

> Since this post is already getting long, let me close by asking
> whether support for associating licensing information with articles
> might be useful to a large number of sites. If so, then perhaps it
> belongs in the core.

Many sites could benefit, but I'd place it into an extension for now,
preferably on our svn. Note that not everything that many people use
belongs in core (e.g. ParserFunctions).
Re: [Wikitech-l] Keeping record of imported licensed text
Support for licenses in the database would be a huge boon to Wikimedia
Commons, for all the reasons you state. Commons' licensing is not
uniform, and making it easy to search and sort would be better for
everyone. Currently we display licenses in templates, which has many
drawbacks.

I'd like it to be more concrete than just a page_prop -- for instance,
you also want to associate properties with the licenses themselves, such
as "requires attribution". So that would mean another table.

On 9/10/10 4:11 PM, Dan Nessett wrote:
> After asking around, one suggestion was to keep the licensing state in
> the page_props table. This seems very reasonable and I would be
> interested in comments by this community on the idea.

--
Neil Kandalgaonkar <ne...@wikimedia.org>