Re: [Wikitech-l] list of things to do for image dumps

2010-09-10 Thread Jose
On Thu, Sep 9, 2010 at 10:54 PM, Jamie Morken jmor...@shaw.ca wrote:
 Hi all,

 This is a preliminary list of what needs to be done to generate images 
 dumps.  If anyone can help with #2 to provide the access log of image usage 
 stats please send me an email!

 1. run wikix to generate list of images for a given wiki ie. enwiki

 2. sort the image list based on usage frequency from access log files

Hi,

It would be great to have these image dumps! I wonder if a different kind of
dump may be worth it for a different scenario:

* A user only wants to get the images for a small set of page ids, e.g. 1000 pages

What would be the proper way to get these images without downloading
large dumps?

a. Parse the rendered HTML pages to get the actual image URLs (plus
license info) and then download the images?

b. Find the actual image URLs using the Commons wikitext
dump (and parse the license info from it)?

Both approaches seem complicated so maybe a different dump would be helpful:

Page id  --  List of [ Image id | real url | type (original | dim_xy | thumb) | license ]
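
For instance, one row of such a dump (the id, URL and license here are
completely made up, just to illustrate the shape) might look like:

    9228 | Ant_worker.jpg | http://upload.wikimedia.org/.../Ant_worker.jpg | thumb(200px) | CC-BY-SA-3.0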

regards



Re: [Wikitech-l] list of things to do for image dumps

2010-09-10 Thread Roan Kattouw
2010/9/10 Jose jmal...@gmail.com:
 Both approaches seem complicated so maybe a different dump would be helpful:

 Page id  --  List of [ Image id | real url |   type (original |
 dim_xy | thumb) | license ]

http://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=url|dimensions&iiurlwidth=200&pageids=3786405|8801158|4120827|1478233

Returns image URL, width, height and thumbnail URL for a 200px thumbnail.
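
If you want to script it, a minimal sketch in Python (standard library
only, untested) that runs the same query and prints the URLs could look
like:

    import json, urllib.parse, urllib.request

    # Query imageinfo for a few Commons page ids, asking for a 200px thumbnail.
    API = "http://commons.wikimedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "imageinfo",
        "iiprop": "url|dimensions",
        "iiurlwidth": "200",
        "pageids": "3786405|8801158|4120827|1478233",
        "format": "json",
    }
    with urllib.request.urlopen(API + "?" + urllib.parse.urlencode(params)) as resp:
        data = json.load(resp)

    # The result is keyed by page id; imageinfo is a list whose first entry
    # describes the current file version.
    for pageid, page in data["query"]["pages"].items():
        info = page.get("imageinfo", [{}])[0]
        print(pageid, page.get("title"), info.get("url"), info.get("thumburl"))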

Roan Kattouw (Catrope)



Re: [Wikitech-l] list of things to do for image dumps

2010-09-10 Thread Jose
On Fri, Sep 10, 2010 at 2:44 PM, Roan Kattouw roan.katt...@gmail.com wrote:
 Both approaches seem complicated so maybe a different dump would be helpful:

 Page id  --  List of [ Image id | real url |   type (original |
 dim_xy | thumb) | license ]

 http://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=url|dimensions&iiurlwidth=200&pageids=3786405|8801158|4120827|1478233

 Returns image URL, width, height and thumbnail URL for a 200px thumbnail.

Thanks, this may be useful. So let's say I want to get all images for
the Ant page; the steps would be:

1. Parse the Ant page wikitext and get all Image: links.

2. For every image link, get its Commons page id. (Can I issue the
above query using titles instead of numeric page ids? If not, I would
use the Commons repository to map each image title to a numeric id.)

3. Issue a query like the one you detail above (but the results don't
show license info!).

Still, I think having a small dump with this metadata would be better than
sending a lot of API queries.

thanks



Re: [Wikitech-l] list of things to do for image dumps

2010-09-10 Thread Bryan Tong Minh
On Fri, Sep 10, 2010 at 3:09 PM, Jose jmal...@gmail.com wrote:
 On Fri, Sep 10, 2010 at 2:44 PM, Roan Kattouw roan.katt...@gmail.com wrote:
 Both approaches seem complicated so maybe a different dump would be helpful:

 Page id  --  List of [ Image id | real url |   type (original |
 dim_xy | thumb) | license ]

 http://commons.wikimedia.org/w/api.php?action=query&prop=imageinfo&iiprop=url|dimensions&iiurlwidth=200&pageids=3786405|8801158|4120827|1478233

 Returns image URL, width, height and thumbnail URL for a 200px thumbnail.

 Thanks, this may be useful. So let's say I want to get all images for
 the Ant page, the steps will be:

Just use prop=images as a generator on en.wikipedia.org. This will
yield the thumb urls as well as the urls of the commons pages, which
can then be fetched separately.


Bryan



Re: [Wikitech-l] list of things to do for image dumps

2010-09-10 Thread Roan Kattouw
2010/9/10 Bryan Tong Minh bryan.tongm...@gmail.com:
 Just use prop=images as a generator on en.wikipedia.org. This will
 yield the thumb urls as well as the urls of the commons pages, which
 can then be fetched separately.

Concrete example:

http://en.wikipedia.org/w/api.php?action=query&generator=images&gimlimit=max&titles=Albert_Einstein&prop=imageinfo&iiprop=url|dimensions&iiurlwidth=200

Licensing info is not available through the API because it's just some
text or template on the image description page; it has no meaning to
the MediaWiki software.
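
A quick sketch of consuming that query from Python (untested; standard
library only). It also prints the description page URL, which is what you
would have to fetch and scan for license templates yourself:

    import json, urllib.parse, urllib.request

    # List every image used on [[Albert Einstein]] with its full-size URL,
    # a 200px thumbnail URL and its description page URL.
    # (Continuation handling is omitted for brevity.)
    API = "http://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "generator": "images",
        "gimlimit": "max",
        "titles": "Albert_Einstein",
        "prop": "imageinfo",
        "iiprop": "url|dimensions",
        "iiurlwidth": "200",
        "format": "json",
    }
    with urllib.request.urlopen(API + "?" + urllib.parse.urlencode(params)) as resp:
        data = json.load(resp)

    for page in data.get("query", {}).get("pages", {}).values():
        info = page.get("imageinfo", [{}])[0]
        print(page["title"], info.get("url"), info.get("thumburl"),
              info.get("descriptionurl"))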

Roan Kattouw (Catrope)



Re: [Wikitech-l] CellsExtension feasibility?

2010-09-10 Thread James Bowery
Having ported an entire spreadsheet's cells into MediaWiki templates, I can
report a basic problem:

A substantial percentage of the templates required the addition of 'subst:'
to the template references in order to get the references to work.  This, of
course, trashes the original {{#expr:...}}, which is highly undesirable.

On Tue, Jul 13, 2010 at 10:25 AM, Roan Kattouw roan.katt...@gmail.comwrote:

 2010/7/13 James Bowery jabow...@gmail.com:
  What I would like is something very similar called CellsExtension
  which provides only the keyword #cell as in:
  ---
  {{#expr:{{#cell:pi}}+1}}
  ---
 
  However, it gets the value of pi from:
  http://somedomain.org/mediawiki/index.php?title=Pi
 
 Just putting 3.141592653589 (as opposed to 3.14159265418, whose last
 three digits differ from their counterparts in pi) in [[Template:Pi]]
 and using {{#expr:{{pi}}+1}} should have the same effect AFAIK. If you
 want [[Template:Pi]] to look more interesting than just the number,
 you could use <noinclude> and <includeonly> tags.

  Ideally, whenever a mediawiki rendered page is cached, dependency
  pointers are created from all pages from which cells fetched values
  during rendering of the page (implying the evaluation of #expr's. That
  way, when the mediawiki source for one of the cached pages is edited,
  not only is its cached rendering deleted, but so are all cached
  renderings that depend on it directly or indirectly.  This is so that
  the next time those pages are accessed, they are rendered -- and
  cached -- again, freshly evaluating the formulas in the #expr's
  (which, of course, will contain #cell references such as {{#cell:pi}}).
 
 With the template transclusion method I described above, all of this
 is already handled by MediaWiki.

 Roan Kattouw (Catrope)




Re: [Wikitech-l] list of things to do for image dumps

2010-09-10 Thread emijrp
Hi Lars, are you going to upload more logs to the Internet Archive? Domas'
website only shows the last 3 (?) months. I think that there are many of
these files at the Toolserver, but we should preserve this raw data in
another secure place (for posterity).

2010/9/10 Lars Aronsson l...@aronsson.se

 On 09/09/2010 10:54 PM, Jamie Morken wrote:
  Hi all,
 
  If anyone can help with #2 to provide the access log of image usage stats
 please send me an email!
  2. sort the image list based on usage frequency from access log files

 The raw data is one file per hour, containing a list of page names
 and visit counts. From just one such file, you get statistics on what's
 the most visited pages during that particular hour. By combining
 more files, you can get statistics for a whole day, a week, a month,
 a year, all Mondays, all 7am hours around the year, the 3rd Sunday
 after Easter, or whatever. The combinations are almost endless.

 How do we boil this down to a few datasets that are most useful?
 Is that the total visit count per month? Or what?

 Are these visitor stats already in a database on the toolserver?
 If so, how are they organized?

 I wrote some documentation on the access log format here,
 http://www.archive.org/details/wikipedia_visitor_stats_200712


 --
   Lars Aronsson (l...@aronsson.se)
   Aronsson Datateknik - http://aronsson.se






[Wikitech-l] image dump status update1

2010-09-10 Thread Jamie Morken
Hi,

I did some testing on Domas' pagecounts log files:

original file: pagecounts-20100910-04.gz downloaded from: 
http://dammit.lt/wikistats/

The original file pagecounts-20100910-04.gz was parsed to remove all lines
except those beginning with "en File:". This shows what files were downloaded
in that hour, mostly images, but further parsing is needed to remove
non-image files (e.g. *.ogg audio etc.).

example parsed line from pagecounts-20100910-04.gz:

en File:Alexander_Karelin.jpg 1 9238

The "1" indicates the file was downloaded once in this hour, and the "9238"
is the number of bytes transferred, which depends on what image scaling was
used.

The file is located at http://en.wikipedia.org/wiki/File:Alexander_Karelin.jpg
and is linked from the page http://en.wikipedia.org/wiki/Aleksandr_Karelin.

We may also want to parse out the lines that begin with "commons.m File" and
"commons.m Image" from the pagecounts file, as they also contain image links.
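
For reference, a rough sketch (untested Python) of how that filtering step
could be scripted, keeping the "en" and "commons.m" file lines and dropping
obvious non-image extensions:

    import gzip

    # Keep only file-description-page lines that look like images.
    IMAGE_EXT = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".tif", ".tiff")
    KEEP_PROJECTS = ("en", "commons.m")

    with gzip.open("pagecounts-20100910-04.gz", "rt", encoding="utf-8",
                   errors="replace") as src, \
         open("pagecounts-20100910-04.images.txt", "w", encoding="utf-8") as dst:
        for line in src:
            parts = line.split(" ")
            if len(parts) != 4:
                continue
            project, page, count, size = parts
            if project in KEEP_PROJECTS \
                    and page.startswith(("File:", "Image:")) \
                    and page.lower().endswith(IMAGE_EXT):
                dst.write(line)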

After we parse the pagecounts files down to image links only, we can merge
them together; the more files we merge, the better our image view data will be
for sorting the image list generated by wikix by view frequency.

Wikix has the complete list of images for the wiki we are creating an image
dump for, so any extra images in these pagecounts files that aren't in wikix's
image list won't be added to the image dump. Images that are in wikix's list
but not in the pagecounts files will still be added to the image dump, but can
be put into a separate tar file showing they are infrequently accessed.

I did the parsing manually with a text editor, but for the next step of
merging the pagecounts files we will need to write some scripts.
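
A first rough sketch of such a merge script (untested Python; the file names
are made up) that sums the hourly counts and writes the list sorted by total
views:

    import glob
    import gzip
    from collections import Counter

    # Sum per-hour view counts for "en File:" lines across all hourly dumps
    # in the current directory, most viewed first.
    totals = Counter()
    for path in glob.glob("pagecounts-*.gz"):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                parts = line.split(" ")
                if len(parts) == 4 and parts[0] == "en" and parts[1].startswith("File:"):
                    totals[parts[1]] += int(parts[2])

    with open("image-views-merged.txt", "w", encoding="utf-8") as out:
        for title, views in totals.most_common():
            out.write("%d %s\n" % (views, title))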

I think in the end we will not use wikix, as it doesn't create a simple image
list from the wiki's XML file.

cheers,
Jamie






[Wikitech-l] Keeping record of imported licensed text

2010-09-10 Thread Dan Nessett
We are currently attempting to refactor some specific modifications to 
the standard MW code we use (1.13.2) into an extension so we can upgrade 
to a more recent maintained version. One modification we have keeps a 
flag in the revisions table specifying that article text was imported 
from WP. This flag generates an attribution statement at the bottom of 
the article that acknowledges the import.

I don't want to start a discussion about the various legal issues 
surrounding text licensing. However, assuming we must acknowledge use of 
licensed text, a legitimate technical issue is how to associate state 
with an article in a way that records the import of licensed text. I 
bring this up here because I assume we are not the only site that faces 
this issue.

Some of our users want to encode the attribution information in a 
template. The problem with this approach is anyone can come along and 
remove it. That would mean the organization legally responsible for the 
site would entrust the integrity of site content to any arbitrary author. 
We may go this route, but for the sake of this discussion I assume such a 
strategy is not viable. So, the remainder of this post assumes we need to 
keep such licensing state in the db.

After asking around, one suggestion was to keep the licensing state in 
the page_props table. This seems very reasonable and I would be 
interested in comments by this community on the idea. Of course, there 
has to be a way to get this state set, but it seems likely that could be 
achieved using an extension triggered when an article is edited.

Since this post is already getting long, let me close by asking whether 
support for associating licensing information with articles might be 
useful to a large number of sites. If so, then perhaps it belongs in the 
core.

-- 
-- Dan Nessett




Re: [Wikitech-l] image dump status update1

2010-09-10 Thread Alex
On 9/10/2010 6:14 PM, Jamie Morken wrote:
 Hi,
 
 I did some testing on Domas' pagecounts log files:
 
 original file: pagecounts-20100910-04.gz downloaded from: 
 http://dammit.lt/wikistats/
 
 the original file pagecounts-20100910-04.gz was parsed to remove all 
 lines except those 
 beginning with en File.  This shows what files were downloaded in that 
 hour, mostly images but further
 parsing is needed to remove non-image files (ie. *.ogg audio etc)
 
 example parsed line from pagecounts-20100910-04.gz:
 
 en File:Alexander_Karelin.jpg 1 9238
 
 the 1 indicates the file was downloaded once this hour, and the 9238 is the 
 bytes transferred, which
 depends on what image scaling was used
 
 it is located at: http://en.wikipedia.org/wiki/File:Alexander_Karelin.jpg; 
 and linked from the page: 
 http://en.wikipedia.org/wiki/Aleksandr_Karelin
 
 We also may want to parse out the lines that begin with commons.m File and 
 commons.m Image from
 the pagecounts file as they also contain image links
 
 after we parse the pagecounts files down to image links only, then we can 
 merge them together, the more 
 we merge the better our image view data will be for sorting the image list 
 generated by wikix by view 
 frequency.
 
 Wikix has the complete list of images for the wiki we are creating an image 
 dump for, so any extra 
 images from these pagecounts files that aren't in wikix's image list won't be 
 added to the image dump, 
 and also images that are in wikix's list but not in the pagecounts files will 
 still be added to the image dump,
 but can be put into a tar file showing they are infrequently accessed.
 
 I did the parsing manually with a txt editor, but for the next step of 
 merging the pagecounts files we will 
 need to make some scripts.
 
 I think in the end we will not use wikix as it doesn't create a simple image 
 list from the wiki's xml file.
 

That won't really give you the stats you want. It only gives you
pageviews for the file description page itself, and not for articles that
use the image. I don't think there are any publicly available stats for
the latter, though you could estimate it rather well using the dumps of
the imagelinks and page database tables, then correlating hits for
articles with the images that they contain.
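
A very rough sketch of that estimate (untested Python; it assumes the
imagelinks and page table dumps have already been flattened into
tab-separated files, and all three file names here are made up):

    from collections import Counter, defaultdict

    # page-id-title.tsv:   page_id <TAB> article_title
    # imagelinks.tsv:      page_id <TAB> image_name
    # article-views.tsv:   views   <TAB> article_title  (from merged pagecounts)
    title_by_id = {}
    with open("page-id-title.tsv", encoding="utf-8") as f:
        for line in f:
            page_id, title = line.rstrip("\n").split("\t")
            title_by_id[page_id] = title

    images_by_article = defaultdict(list)
    with open("imagelinks.tsv", encoding="utf-8") as f:
        for line in f:
            page_id, image = line.rstrip("\n").split("\t")
            if page_id in title_by_id:
                images_by_article[title_by_id[page_id]].append(image)

    # Credit every view of an article to each image it contains.
    image_views = Counter()
    with open("article-views.tsv", encoding="utf-8") as f:
        for line in f:
            views, title = line.rstrip("\n").split("\t")
            for image in images_by_article.get(title, ()):
                image_views[image] += int(views)

    for image, views in image_views.most_common(20):
        print(views, image)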

-- 
Alex (wikipedia:en:User:Mr.Z-man)



Re: [Wikitech-l] Keeping record of imported licensed text

2010-09-10 Thread Platonides
Dan Nessett wrote:
(...)
 
 After asking around, one suggestion was to keep the licensing state in 
 the page_props table. This seems very reasonable and I would be 
 interested in comments by this community on the idea. Of course, there 
 has to be a way to get this state set, but it seems likely that could be 
 achieved using an extension triggered when an article is edited.

Seems a good approach.

 Since this post is already getting long, let me close by asking whether 
 support for associating licensing information with articles might be 
 useful to a large number of sites. If so, the perhaps it belongs in the 
 core.

Many sites could benefit, but I'd place it in an extension for now,
preferably on our SVN. Note that not everything that many people use
belongs in core (e.g. ParserFunctions).





Re: [Wikitech-l] Keeping record of imported licensed text

2010-09-10 Thread Neil Kandalgaonkar
Support for licenses in the database would be a huge boon to Wikimedia 
Commons, for all the reasons you state. Commons' licensing is not 
uniform and making it easy to search and sort would be better for everyone.

Currently we display licenses in templates, which has many drawbacks.

I'd like it to be more concrete than just a page_prop -- for instance, 
you also want to associate properties with the licenses themselves, such 
as "requires attribution". So that would mean another table.



On 9/10/10 4:11 PM, Dan Nessett wrote:
 We are currently attempting to refactor some specific modifications to
 the standard MW code we use (1.13.2) into an extension so we can upgrade
 to a more recent maintained version. One modification we have keeps a
 flag in the revisions table specifying that article text was imported
 from WP. This flag generates an attribution statement at the bottom of
 the article that acknowledges the import.

 I don't want to start a discussion about the various legal issues
 surrounding text licensing. However, assuming we must acknowledge use of
 licensed text, a legitimate technical issue is how to associate state
 with an article in a way that records the import of licensed text. I
 bring this up here because I assume we are not the only site that faces
 this issue.

 Some of our users want to encode the attribution information in a
 template. The problem with this approach is anyone can come along and
 remove it. That would mean the organization legally responsible for the
 site would entrust the integrity of site content to any arbitrary author.
 We may go this route, but for the sake of this discussion I assume such a
 strategy is not viable. So, the remainder of this post assumes we need to
 keep such licensing state in the db.

 After asking around, one suggestion was to keep the licensing state in
 the page_props table. This seems very reasonable and I would be
 interested in comments by this community on the idea. Of course, there
 has to be a way to get this state set, but it seems likely that could be
 achieved using an extension triggered when an article is edited.

 Since this post is already getting long, let me close by asking whether
 support for associating licensing information with articles might be
 useful to a large number of sites. If so, the perhaps it belongs in the
 core.


-- 
Neil Kandalgaonkar   ) ne...@wikimedia.org
