Re: [Wikitech-l] 100% open source stack (was Re: Bugzilla Vs other trackers.)

2010-01-08 Thread Domas Mituzas
 
 What was wrong with LVM snapshots?  Performance?

In ZFS every write is copy-on-write, so snapshots have essentially zero cost and 
multiple snapshots can share the same data. In LVM every snapshot is standalone 
and carries all the information it needs. LVM also doesn't have snapshot-based 
replication, and DRBD+ wasn't open source/free at that time either ;-)
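
For illustration, a minimal sketch of the difference (assuming the zfs and lvm2
command-line tools are installed; the pool, volume group and dataset names are
placeholders): a ZFS snapshot reserves nothing up front, while each LVM snapshot
needs its own preallocated copy-on-write area.

import subprocess

def zfs_snapshot(dataset, name):
    # Near-free: blocks are shared with the live dataset until they diverge.
    subprocess.run(["zfs", "snapshot", "%s@%s" % (dataset, name)], check=True)

def lvm_snapshot(vg, lv, name, cow_size="10G"):
    # Each snapshot carries its own copy-on-write area, sized up front.
    subprocess.run(["lvcreate", "--snapshot", "--name", name,
                    "--size", cow_size, "%s/%s" % (vg, lv)], check=True)

zfs_snapshot("tank/db", "before-upgrade")    # placeholder dataset name
lvm_snapshot("vg0", "db", "before-upgrade")  # placeholder VG/LV names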

The reasons behind OpenSolaris vs. Solaris are a different matter, though.

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Jamie Morken

Hello,

 Is the bandwidth used really a big problem? Bandwidth is pretty cheap
 these days, and given Wikipedia's total draw, I suspect the occasional
 dump download isn't much of a problem.

I am not sure about the cost of the bandwidth, but the Wikipedia image dumps 
are no longer available on the dump site anyway.  I am guessing they were 
removed partly because of the bandwidth cost, or perhaps because of image 
licensing issues.

from:
http://en.wikipedia.org/wiki/Wikipedia_database#Images_and_uploaded_files

Currently Wikipedia does not allow or provide facilities to download all 
images. As of 17 May 2007, Wikipedia disabled or neglected all viable bulk 
downloads of images, including torrent trackers. Therefore, there is no way to 
download image dumps other than scraping Wikipedia pages or using Wikix, which 
converts a database dump into a series of scripts to fetch the images.

Unlike most article text, images are not necessarily licensed under the GFDL 
and CC-BY-SA-3.0. They may be under one of many free licenses, in the public 
domain, believed to be fair use, or even copyright infringements (which should 
be deleted). In particular, use of fair use images outside the context of 
Wikipedia or similar works may be illegal. Images under most licenses require a 
credit, and possibly other attached copyright information. This information is 
included in image description pages, which are part of the text dumps available 
from download.wikimedia.org. In conclusion, download these images at your own 
risk (Legal).
 
 Bittorrent's real strength is when a lot of people want to download the
 same thing at once. E.g., when a new Ubuntu release comes out. Since
 Bittorrent requires all downloaders to be uploaders, it turns the flood
 of users into a benefit. But unless somebody has stats otherwise, I'd
 guess that isn't the problem here.

BitTorrent is simply a more efficient way to distribute files, especially if 
the much larger Wikipedia image files were made available again.  The last dump 
of the English Wikipedia that included images is over 200 GB, but it is 
understandably not available for download. Even if only 10 people per month 
download these large files, BitTorrent should be able to reduce the bandwidth 
cost to Wikipedia significantly.  I also think that setting up BitTorrent for 
this would cost Wikipedia very little and might save money in the long run, as 
well as encourage people to experiment with offline encyclopedia usage etc.  
Making people crawl Wikipedia with Wikix if they want to download the images is 
a bad solution, as it means that the images are downloaded inefficiently.  Also, 
one Wikix user reported that his download connection was cut off by a Wikipedia 
admin for remote downloading.

Unless there are legal reasons for not allowing images to be downloaded, I 
think the Wikipedia image files should be made available for efficient download 
again.  Since Wikix can in theory already be used to download the images, I 
think it would also be legal to allow the image dump itself to be downloaded. 
Thoughts?

cheers,
Jamie



 
 William
 
 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l
 
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Bugzilla Vs other trackers.

2010-01-08 Thread Ilmari Karonen
Roan Kattouw wrote:
 2010/1/7 Trevor Parscal tpars...@wikimedia.org:
 Hmmm... Not being able to distinguish the difference between a bug
 tracker and a wiki based on the skins being similar is a point of view I
 have a hard time understanding.
 
 Having read quite a few bug reports written in wikitext (which mostly
 doesn't work in Bugzilla, except for [[links]]), I would encourage a
 clearer distinction between the wikis and the bug tracker. I don't
 want to give people the impression that what they're reporting bugs on
 is really a quirky wiki variant: the bug tracker not only uses
 different syntax, but also has different policies, procedures and
 protocols.

It occurs to me that one option would be going the other way: the 
CodeReview extension already seems to have about 50% of the features a 
basic but functional bug tracker would need, including a couple of nice 
ones that our Bugzilla currently lacks (like, you know, comment preview, 
ability to use wiki markup and, well, code review).  Yes, turning it 
into a full-featured issue tracker and project management tool would 
take some substantial work, but then, switching to a new project 
management tool and customizing it to fit our needs isn't quite a 15 
minute job either.  Just something to consider... :-)

-- 
Ilmari Karonen

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Bryan Tong Minh
On Fri, Jan 8, 2010 at 4:31 PM, Jamie Morken jmor...@shaw.ca wrote:

 Bittorrent is simply a more efficient method to distribute files, especially 
 if the much larger wikipedia image files were made available again.  The last 
 dump from english wikipedia including images is over 200GB but is 
 understandably not available for download. Even if there are only 10 people 
 per month who download these large files, bittorrent should be able to reduce 
 the bandwidth cost to wikipedia significantly.  Also I think that having 
 bittorrent setup for this would cost wikipedia a small amount, and may save 
 money in the long run, as well as encourage people to experiment with offline 
 encyclopedia usage etc.  To make people have to crawl wikipedia with Wikix if 
 they want to download the images is a bad solution, as it means that the 
 images are downloaded inefficiently.  Also one wikix user reported that his 
 download connection was cutoff by a wikipedia admin for remote downloading.


The problem with BitTorrent is that it is unsuitable for rapidly
changing data sets, such as images. If you want to add a single file
to the torrent, the entire torrent hash changes, meaning that you end
up with separate peer pools for every different data set, although
they mostly contain the same files.
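
A minimal, hypothetical sketch of why (the file entries and piece hashes below
are made up): the swarm is identified by the SHA-1 of the bencoded "info"
dictionary, so adding a single file produces a new info-hash and therefore a
new, disjoint peer pool.

import hashlib

def bencode(obj):
    if isinstance(obj, int):
        return b"i%de" % obj
    if isinstance(obj, bytes):
        return b"%d:%s" % (len(obj), obj)
    if isinstance(obj, str):
        return bencode(obj.encode())
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        out = b"d"
        for key in sorted(obj):               # bencoded dicts are key-sorted
            out += bencode(key) + bencode(obj[key])
        return out + b"e"
    raise TypeError(type(obj))

info = {
    "name": "commons-image-dump",             # invented example torrent
    "piece length": 2 ** 20,
    "pieces": b"\x00" * 20,                   # fake piece hash
    "files": [{"length": 123, "path": ["a.jpg"]}],
}
print(hashlib.sha1(bencode(info)).hexdigest())

info["files"].append({"length": 456, "path": ["b.jpg"]})
print(hashlib.sha1(bencode(info)).hexdigest())  # different hash, different swarm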

That said, it could of course be beneficial for an initial dump
download, and it is better than the current situation where there is
nothing available at all.


Bryan

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Aryeh Gregor
On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmor...@shaw.ca wrote:
 I am not sure about the cost of the bandwidth, but the wikipedia image dumps 
 are no longer available on the wikipedia dump anyway.  I am guessing they 
 were removed partly because of the bandwidth cost, or else image licensing 
 issues perhaps.

I think we just don't have the infrastructure set up to dump images.  I'm
very sure bandwidth is not an issue -- the number of people with a spare
terabyte (or is it more?) to download a Wikipedia image dump onto will be
vanishingly small compared to normal users.
Licensing wouldn't be an issue for Commons, at least, as long as it's
easy to link the images up to their license pages.  (I imagine it
would technically violate some licenses, but probably no one would
worry about it.)

 Bittorrent is simply a more efficient method to distribute files, especially 
 if the much larger wikipedia image files were made available again.  The last 
 dump from english wikipedia including images is over 200GB but is 
 understandably not available for download. Even if there are only 10 people 
 per month who download these large files, bittorrent should be able to reduce 
 the bandwidth cost to wikipedia significantly.

Wikipedia uses an average of multiple gigabits per second of
bandwidth, as I recall.  One gigabit per second adds up to about 10.5
terabytes per day, so say 300 terabytes per month.  I'm pretty sure
the average figure is more like five or ten Gbps than one, so let's
say a petabyte a month at least.  Ten people per month downloading an
extra terabyte is not a big issue.  And I really doubt we'd see that
many people downloading a full image dump every month.
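
The arithmetic above, spelled out (decimal units; the Gbps figures are the
rough estimates from this thread, not measured values):

BYTES_PER_SEC = 1e9 / 8          # 1 Gbit/s in bytes per second
per_day = BYTES_PER_SEC * 86400  # ~1.08e13 bytes, i.e. roughly 10-11 TB/day
per_month = per_day * 30         # ~320 TB/month for one sustained Gbit/s
print(per_day / 1e12, "TB/day;", per_month / 1e12, "TB/month")
# At a 5-10 Gbit/s average that is indeed on the order of a petabyte per month,
# so ten extra terabytes of dump downloads is around 1% of total traffic.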

The sensible bandwidth-saving way to do it would be to set up an rsync
daemon on the image servers, and let people use that.  Then you could
get an old copy of the files from anywhere (including Bittorrent, if
you like) and only have to download the changes.  Plus, you could get
up-to-the-minute copies if you like, although probably some throttling
should be put into place to stop dozens of people from all running
rsync in a loop to make sure they have the absolute latest version.  I
believe rsync 2 doesn't handle such huge numbers of files acceptably,
but I heard rsync 3 is supposed to be much better.  That sounds like a
better direction to look in than Bittorrent -- nobody's going to want
to redownload the same files constantly to get an up-to-date set.
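
A sketch of what the client side could look like, assuming a hypothetical
rsync daemon module on the image servers (the module path below is invented,
no such public module is implied to exist): rsync compares the remote tree
against the local copy and only transfers new or changed files.

import subprocess

subprocess.run(
    [
        "rsync", "-rt", "--partial",
        "--bwlimit=10000",  # be polite: cap the transfer at ~10 MB/s
        # Hypothetical daemon module name, for illustration only.
        "rsync://upload.wikimedia.org/images/wikipedia/commons/",
        "/data/commons-mirror/",
    ],
    check=True,
)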

 Unless there are legal reasons for not allowing images to be downloaded, I 
 think the wikipedia image files should be made available for efficient 
 download again.

I'm pretty sure the reason there's no image dump is purely because not
enough resources have been devoted to getting it working acceptably.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Gregory Maxwell
On Fri, Jan 8, 2010 at 10:56 AM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:
 On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmor...@shaw.ca wrote:
 I am not sure about the cost of the bandwidth, but the wikipedia image dumps 
 are no longer available on the wikipedia dump anyway.  I am guessing they 
 were removed partly because of the bandwidth cost, or else image licensing 
 issues perhaps.

 I think we just don't have infrastructure set up to dump images.  I'm
 very sure bandwidth is not an issue -- the number of people with a

Correct. The space wasn't available for the required intermediate cop(y|ies).

 terabyte (or is it more?) handy that they want to download a Wikipedia
 image dump to will be vanishingly small compared to normal users.

s/terabyte/several terabytes/  My copy is not up to date, but it's not
smaller than 4.

 Licensing wouldn't be an issue for Commons, at least, as long as it's
 easy to link the images up to their license pages.  (I imagine it
 would technically violate some licenses, but no one would probably
 worry about it.)

We also dump the licensing information. If we can lawfully put the
images on the website then we can also distribute them in dump form. There
is and can be no licensing problem.

 Wikipedia uses an average of multiple gigabits per second of
 bandwidth, as I recall.

http://www.nedworks.org/~mark/reqstats/trafficstats-daily.png

Though only this part is paid for:
http://www.nedworks.org/~mark/reqstats/transitstats-daily.png

The rest is peering, etc. which is only paid for in the form of
equipment, port fees, and operational costs.

 The sensible bandwidth-saving way to do it would be to set up an rsync
 daemon on the image servers, and let people use that.

This was how I maintained a running mirror for a considerable time.

Unfortunately the process broke when WMF ran out of space and needed
to switch servers.

On Fri, Jan 8, 2010 at 10:31 AM, Jamie Morken jmor...@shaw.ca wrote:
 Bittorrent is simply a more efficient method to distribute files,

No. In a very real, absolute sense BitTorrent is considerably less
efficient than other means.

BitTorrent moves more of the outbound traffic to the edges of the
network, where the real cost per Gbit/s is much greater than at major
datacenters: a megabit on a low-speed link is more costly than a megabit
on a high-speed link, and a megabit on 1 mile of fiber is more expensive
than a megabit on 10 feet of fiber.

Moreover, BitTorrent is topology-unaware, so the path length tends to
approach the internet's mean path length. Datacenters tend to be more
centrally located topology-wise, and topology-aware distribution is easily
applied to centralized stores. (E.g. WMF satisfies requests from Europe in
Europe, though not for the dump downloads, as there simply isn't enough
traffic to justify it.)

BitTorrent is also a more complicated, higher-overhead service which
requires more memory and more disk IO than traditional transfer
mechanisms.

There are certainly cases where bittorrent is valuable, such as the
flash mob case of a new OS release. This really isn't one of those
cases.

On Thu, Jan 7, 2010 at 11:52 AM, William Pietri will...@scissor.com wrote:
 On 01/07/2010 01:40 AM, Jamie Morken wrote:
 I have a
 suggestion for wikipedia!!  I think that the database dumps including
 the image files should be made available by a wikipedia bittorrent
 tracker so that people would be able to download the wikipedia backups
 including the images (which currently they can't do) and also so that
 wikipedia's bandwidth costs would be reduced. [...]


 Is the bandwidth used really a big problem? Bandwidth is pretty cheap
 these days, and given Wikipedia's total draw, I suspect the occasional
 dump download isn't much of a problem.

 Bittorrent's real strength is when a lot of people want to download the
 same thing at once. E.g., when a new Ubuntu release comes out. Since
 Bittorrent requires all downloaders to be uploaders, it turns the flood
 of users into a benefit. But unless somebody has stats otherwise, I'd
 guess that isn't the problem here.

We tried BitTorrent for the Commons POTY archive once while I was watching,
and we never had a downloader stay connected long enough to help another
downloader... and that was only 500 MB, much easier to seed.

BitTorrent also makes the server costs a lot higher: it has more CPU/memory
overhead, and creates a lot of random disk IO.  For low-volume large
files it's often not much of a win.

I haven't seen the numbers for a long time, but when I last looked
download.wikimedia.org was producing fairly little traffic... and much
of what it was producing was outside of the peak busy hour for the
sites.  Since the transit is paid for on the 95th percentile and the
WMF still has a decent day/night swing, out-of-peak traffic is
effectively free.  The bandwidth is nothing to worry about.
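
To illustrate the 95th-percentile point with made-up numbers: the top 5% of
five-minute samples are thrown away before billing, so extra transfers that
fit inside the existing night-time trough don't move the billed rate at all.

import math

def billed_rate(samples_mbps):
    ordered = sorted(samples_mbps)
    return ordered[math.ceil(len(ordered) * 0.95) - 1]  # 95th-percentile sample

# Fake day: 144 peak samples at 900 Mbit/s, 144 off-peak samples at 300 Mbit/s.
day = [900] * 144 + [300] * 144
print(billed_rate(day))  # 900

# Add 200 Mbit/s of dump traffic during the off-peak half only.
bulkier_night = [m + 200 if m == 300 else m for m in day]
print(billed_rate(bulkier_night))  # still 900 -- the bill doesn't change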

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org

[Wikitech-l] CSS/javascript injection for AJAX requests

2010-01-08 Thread Carl (CBM)
I noticed today that livepreview does not pick up the
dynamically-generated CSS from the SyntaxHighlight_Geshi extension.
The same problem occurs in liquidthreads: when you add a comment with
a Geshi call in it, the CSS will not be picked up when the comment is
initially saved. The first full reload of the page will pick up the
css correctly in either case.

After some investigation, this is really an issue in core and will
apply to any extension that needs to add CSS and/or javascript to the
output HTML.  To fix the bugs with livepreview, we would need some
mechanism where AJAX calls receive not only new HTML, but also new CSS
and/or javascript, and can add that CSS and javascript to the current
page without a reload.  Adding the CSS and javascript dynamically may
be tricky from a compatibility standpoint, but having library
functions in our site javascript would help with that.
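
For concreteness, a hypothetical sketch of the kind of payload such an AJAX
preview call would need to return so the client can attach extension CSS/JS
without a reload -- the field names here are invented, not an existing
MediaWiki format:

import json

preview_response = {
    # Rendered preview HTML, as today.
    "html": "<div class=\"source-php\">...</div>",
    # Head items the parser collected for this text (e.g. GeSHi styles),
    # which a client-side helper would append to <head> before swapping
    # in the new HTML.
    "headitems": [
        "<style type=\"text/css\">/* GeSHi rules would go here */</style>",
    ],
}
print(json.dumps(preview_response, indent=2))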

I have not investigated the cause of the problem in liquidthreads.

The code in EditPage.php shows scars from similar problems, in a
commented-out call to send a list of categories back to an AJAX
preview request.

- Carl

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Boing Boing applauds stats.grok.se!

2010-01-08 Thread David Gerard
http://www.boingboing.net/2010/01/07/wikibumps.html


- d.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Foundation-l] Boing Boing applauds stats.grok.se!

2010-01-08 Thread William Pietri
On 01/08/2010 09:02 AM, David Gerard wrote:
 http://www.boingboing.net/2010/01/07/wikibumps.html


And the poster, who is a Boing Boing guest editor, is one of our own, an 
English Wikipedia contributor since 2004:

http://en.wikipedia.org/wiki/User:Jokestress

William


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] [Foundation-l] Boing Boing applauds stats.grok.se!

2010-01-08 Thread Domas Mituzas

On Jan 8, 2010, at 7:02 PM, David Gerard wrote:
 http://www.boingboing.net/2010/01/07/wikibumps.html


Currently we're in talks with WM-DE, so they will provision some storage for 
long-term archives of raw data, and we will probably add image view statistics 
then. Good stuff, right? 

Domas


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Tomasz Finc
William Pietri wrote:
 On 01/07/2010 01:40 AM, Jamie Morken wrote:
 I have a
 suggestion for wikipedia!!  I think that the database dumps including
 the image files should be made available by a wikipedia bittorrent
 tracker so that people would be able to download the wikipedia backups
 including the images (which currently they can't do) and also so that
 wikipedia's bandwidth costs would be reduced. [...]

 
 Is the bandwidth used really a big problem? Bandwidth is pretty cheap 
 these days, and given Wikipedia's total draw, I suspect the occasional 
 dump download isn't much of a problem.

No, bandwidth is not really the problem here. I think the core issue is 
having bulk access to images.

There have been a number of these requests in the past, and after talking 
back and forth it has usually been the case that a smaller subset of 
the data works just as well.

A good example of this was the Deutsche Fotothek archive made late last 
year.

http://download.wikipedia.org/images/Deutsche_Fotothek.tar ( 11GB )

This provided an easily retrievable high quality subset of our image 
data which researchers could use.

Now if we were to snapshot image data and store it per project, the amount 
of duplicate image data would become significant. That's because we re-use 
a ton of image data between projects, and rightfully so.

If instead we package all of Commons into a tarball, then we get roughly 
6 TB of image data, which after numerous conversations has been a bit more 
than most people want to process.

So what does everyone think of going down the collections route?

If we provide enough different and up-to-date collections, then we could 
easily give people a large but manageable amount of data to work with.

If there is already a page for this, then please feel free to point me to 
it; otherwise I'll create one.

--tomasz


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Robert Rohde
On Fri, Jan 8, 2010 at 8:24 AM, Gregory Maxwell gmaxw...@gmail.com wrote:
 s/terabyte/several terabytes/  My copy is not up to date, but it's not
 smaller than 4.

The top (most recent) versions of Commons files total about 4.9 TB; files
on enwiki but not Commons add another 200 GB or so.

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Bugzilla Vs other trackers.

2010-01-08 Thread Tei
On Thu, Jan 7, 2010 at 8:39 AM, Peter Gervai grin...@gmail.com wrote:
..
 Wouldn't be nice. First, it's an attitude thing: we want (and have to)
 promote open stuff.
 Second, it isn't nice to show something to the users they cannot use
 themselves. It's kind of against or basic principle of you can do
 what we do, you're free to do it, we just do it better :-)


It would be a good idea to pass that memo to the folks who design the
notability rules.

http://ioquake3.org/2009/02/20/ioquake3-entry-deleted-from-wikipedia/

Since most (all?) open source projects are web-only and don't get into the
press, they sit in some obscure corner of the web where something can
be wildly popular among those in the know and invisible to those who
edit and delete articles.

I mean, I could write a bot to nominate *all* open source project
articles on Wikipedia for speedy deletion, and only a few (maybe 6) would
survive that.

http://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion/Ioquake3


Keep no matter how loud people and guidlines scream for reliable
sources, many, many people use it and work on it and that makes it
notable. If the press is not able to reliably represent this reality
it's not a fault of the project and reality is a higher standard than
reliable press. What do you need press for an Open Source project?
Just looking at the SVN log proves more than any article could ever
do. -- ioquake3 maintainer for the FreeBSD project







-- 
--
ℱin del ℳensaje.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Bugzilla Vs other trackers.

2010-01-08 Thread Max Semenik
On 08.01.2010, 22:42 Tei wrote:

 It will be a good idea to pass the memo to the guys that design the
 notability rules.

 http://ioquake3.org/2009/02/20/ioquake3-entry-deleted-from-wikipedia/

 Since most (all?) opensource proyects are webonly, and don't get in
 the press, are on some obscure area of the web where something can
 be wildly popular for these in-the-know, and invisible for these that
 edit and delete articles.

 I mean, I can write a bot to nominate *all* opensource projects
 articles on wikipedia for speedy deletion, and few ones (maybe 6) will
 survive that.

<offtopic severity="Will not engage in further flamewar on-list">
FFS, how can one maintain an article without reliable sources? What would
such an article look like? Enough article-count stacking; emphasis on
quality, even if that means systemic bias. Wikipedia is not a registry of
open-source projects. And those projects that an average user might search
for tend to have some sources -- guess why?

As for counter-examples of fancruft, there's one 100% effective recipe: remove
all in-universe crap and slap {{db-empty}} on it if there's nothing left.
</offtopic>

-- 
Best regards,
  Max Semenik ([[User:MaxSem]])


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Bugzilla Vs other trackers.

2010-01-08 Thread Bryan Tong Minh
On Fri, Jan 8, 2010 at 8:42 PM, Tei oscar.vi...@gmail.com wrote:
 On Thu, Jan 7, 2010 at 8:39 AM, Peter Gervai grin...@gmail.com wrote:
 ..
 Wouldn't be nice. First, it's an attitude thing: we want (and have to)
 promote open stuff.
 Second, it isn't nice to show something to the users they cannot use
 themselves. It's kind of against or basic principle of you can do
 what we do, you're free to do it, we just do it better :-)


 It will be a good idea to pass the memo to the guys that design the
 notability rules.

Right. Notability guidelines do not apply to the Wikimedia servers, the
MediaWiki software, or the choice of bug tracker, so please take
complaints about that somewhere else.


Bryan

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Bugzilla Vs other trackers.

2010-01-08 Thread Aryeh Gregor
On Fri, Jan 8, 2010 at 2:42 PM, Tei oscar.vi...@gmail.com wrote:
 It will be a good idea to pass the memo to the guys that design the
 notability rules.

 http://ioquake3.org/2009/02/20/ioquake3-entry-deleted-from-wikipedia/

Notability is decided by each wiki individually.  The policies of the
English Wikipedia are irrelevant to this list, which is about
Wikimedia server administration and MediaWiki development.  The
correct list for this sort of comment would be wikien-l, or possibly
foundation-l.  Devs/sysadmins can't override wiki policies on things
like notability, so there's no point in telling wikitech-l.  Thanks.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Bilal Abdul Kader
I think having access to them on the Commons repository is much easier to
handle. A subset should be good enough.

Handling 11 TB of images, and working with all of them, requires huge
research capabilities.

Maybe a special API or advanced API functions would give people enough
access and at the same time save the bandwidth and the hassle of handling
this behemoth collection.

bilal
--
Verily, with hardship comes ease.


On Fri, Jan 8, 2010 at 1:57 PM, Tomasz Finc tf...@wikimedia.org wrote:

 William Pietri wrote:
  On 01/07/2010 01:40 AM, Jamie Morken wrote:
  I have a
  suggestion for wikipedia!!  I think that the database dumps including
  the image files should be made available by a wikipedia bittorrent
  tracker so that people would be able to download the wikipedia backups
  including the images (which currently they can't do) and also so that
  wikipedia's bandwidth costs would be reduced. [...]
 
 
  Is the bandwidth used really a big problem? Bandwidth is pretty cheap
  these days, and given Wikipedia's total draw, I suspect the occasional
  dump download isn't much of a problem.

 No, bandwidth is not really the problem here. I think the core issue is
 to have bulk access to images.

 There have been a number of these requests in the past and after talking
  back and forth, it has usually been the case that a smaller subset of
 the data works just as well.

 A good example of this was the Deutsche Fotokek archive made late last
 year.

 http://download.wikipedia.org/images/Deutsche_Fotothek.tar ( 11GB )

 This provided an easily retrievable high quality subset of our image
 data which researchers could use.

 Now if we were to snapshot image data and store them for a particular
 project the amount of duplicate image data would become significant.
 That's because we re-use a ton of image data between projects and
 rightfully so.

 If instead we package all of commons into a tarball then we get roughly
 6T's of image data which after numerous conversation has been a bit more
 then most people want to process.

 So what does everyone think of going down the collections route?

 If we provide enough different and up to date ones then we could easily
 give people a large but manageable amount of data to work with.

 If there is a page already for this then please feel free to point me to
 it otherwise I'll create one.

 --tomasz


 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Boing Boing applauds stats.grok.se!

2010-01-08 Thread Lars Aronsson
David Gerard wrote:
 http://www.boingboing.net/2010/01/07/wikibumps.html

On sv.wikipedia there is a gadget that creates a stats tab
on each page. That's very useful. Why don't more languages
of Wikipedia have that gadget installed?


-- 
  Lars Aronsson (l...@aronsson.se)
  Aronsson Datateknik - http://aronsson.se



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Aryeh Gregor
On Fri, Jan 8, 2010 at 3:28 PM, Bilal Abdul Kader bila...@gmail.com wrote:
 I think having access to them on Commons repository is much easier to
 handle. A subset should be good enough.

 Having 11 TB of images needs huge research capabilities in order to handle
 all of them and work with all of them.

 Maybe a special API or advanced API functions would allow people enough
 access and at the same time save the bandwidth and the hassle to handle this
 behemoth collection.

Well, if there were an rsyncd you could just fetch the ones you wanted
arbitrarily.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Domas Mituzas
 Well, if there were an rsyncd you could just fetch the ones you wanted
 arbitrarily.

rsyncd is fail for large file mass delivery, and it is fail when exposed to 
masses. 

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Robert Rohde
Can someone articulate what the use case is?

Is there someone out there who could use a 5 TB image archive but is
disappointed it doesn't exist?  Seems rather implausible.

If not, then I assume that everyone is really after only some subset
of the files.  If that's the case we should try to figure out what
kinds of subsets and the best way to handle them.

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Boing Boing applauds stats.grok.se!

2010-01-08 Thread Robert Rohde
On Fri, Jan 8, 2010 at 12:38 PM, Lars Aronsson l...@aronsson.se wrote:
 David Gerard wrote:
 http://www.boingboing.net/2010/01/07/wikibumps.html

 On sv.wikipedia there is a gadget that creates a stats tab
 on each page. That's very useful. Why don't more languages
 of Wikipedia have that gadget installed?

Local admins control the installation of gadgets.  On Enwiki the process is at:

http://en.wikipedia.org/wiki/Wikipedia:Gadget

-Robert Rohde

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Unified gadgets (was: stats.grok.se)

2010-01-08 Thread Lars Aronsson
Robert Rohde wrote:
 Local admins control the installation of gadgets.  On Enwiki the process is 
 at:

 http://en.wikipedia.org/wiki/Wikipedia:Gadget

Exactly! This is poor design. I have an account (through SUL)
on the Ukrainian Wikipedia because I sometimes add interwiki
links there. I want the same gadgets there, but I don't speak
Ukrainian, and I can't go around bothering local admins in
every language about this. Gadgets should follow the user, just
like the account name and password do. There must be a better
way than the current one.


-- 
  Lars Aronsson (l...@aronsson.se)
  Aronsson Datateknik - http://aronsson.se



___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Unified gadgets (was: stats.grok.se)

2010-01-08 Thread Casey Brown
On Fri, Jan 8, 2010 at 4:14 PM, Lars Aronsson l...@aronsson.se wrote:
 Exactly! This is poor design. I have an account (through SUL)
 on the Ukrainian Wikipedia because I sometimes add interwiki
 links there. I want the same gadgets there, but I don't speak
 Ukrainian and I can't go around bothering local admins on
 every language with this. Gadgets should follow the user, just
 like the account name and password do. There must be a better
 way than the current one.


We should also make it possible to have global gadgets controlled on
Meta-Wiki.  This would be especially useful for hiding the Fundraising
banner. ;-)

-- 
Casey Brown
Cbrown1023

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] CSS/javascript injection for AJAX requests

2010-01-08 Thread Paul Copperman
The styles and JS are already available in the parser output, in
->mHeadItems. It should be trivial to expose them through the API via
action=parse.
So I've filed this on Bugzilla, see
https://bugzilla.wikimedia.org/show_bug.cgi?id=22061
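
A sketch of what a live-preview client could then do, assuming the
prop=headitems output proposed in that bug is available (the parameter name
comes from the proposal, not from a released API):

import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "action": "parse",
    "format": "json",
    "prop": "text|headitems",
    "text": "<source lang=\"php\">echo 'hello';</source>",
})
req = urllib.request.Request(
    "https://en.wikipedia.org/w/api.php?" + params,
    headers={"User-Agent": "headitems-preview-sketch/0.1 (example)"},
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)

# data["parse"]["text"] is the preview HTML; data["parse"]["headitems"] would
# carry the extension CSS/JS to inject into the page alongside it.
print(data["parse"].keys())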

P.Copp

On Fri, Jan 8, 2010 at 5:42 PM, Carl (CBM) cbm.wikipe...@gmail.com wrote:
 I noticed today that livepreview does not pick up the
 dynamically-generated CSS from the SyntaxHighlight_Geshi extension.
 The same problem occurs in liquidthreads: when you add a comment with
 a Geshi call in it, the CSS will not be picked up when the comment is
 initially saved. The first full reload of the page will pick up the
 css correctly in either case.

 After some investigation, this is really an issue in core and will
 apply to any extension that needs to add CSS and/or javascript to the
 output HTML.  To fix the bugs with livepreview, we would need some
 mechanism where AJAX calls receive not only new HTML, but also new CSS
 and/or javascript, and can add that CSS and javascript to the current
 page without a reload.  Adding the CSS and javascript dynamically may
 be tricky from a compatibility standpoint, but having library
 functions in our site javascript would help with that.

 I have not investigated the cause of the problem in liquidthreads.

 The code in EditPage.php shows scars from similar problems, in a
 commented-out call to send a list of categories back to an AJAX
 preview request.

 - Carl

 ___
 Wikitech-l mailing list
 Wikitech-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Bugzilla Vs other trackers.

2010-01-08 Thread Platonides
Tei wrote:
 It will be a good idea to pass the memo to the guys that design the
 notability rules.
 
 http://ioquake3.org/2009/02/20/ioquake3-entry-deleted-from-wikipedia/
 
 Since most (all?) opensource proyects are webonly, and don't get in
 the press, are on some obscure area of the web where something can
 be wildly popular for these in-the-know, and invisible for these that
 edit and delete articles.
 
 I mean, I can write a bot to nominate *all* opensource projects
 articles on wikipedia for speedy deletion, and few ones (maybe 6) will
 survive that.

*Many* open source projects are relevant, to cite a few: Apache, PHP,
Python, Perl, Ruby, PostgreSQL, Subversion, Mercurial, Git, Bazaar...
Those are more than 6... :)
They are widely known technologies, and there are books written about them...
By contrast, this is the first time I have heard of ioquake3. It may be
relevant, it may not.

Being on the web and free is not enough to warrant notability.

Even if script kiddies making their own Linux distro don't like it :)


 http://en.wikipedia.org/wiki/Wikipedia:Articles_for_deletion/Ioquake3
 
 
 Keep no matter how loud people and guidlines scream for reliable
 sources, many, many people use it and work on it and that makes it
 notable. If the press is not able to reliably represent this reality
 it's not a fault of the project and reality is a higher standard than
 reliable press. What do you need press for an Open Source project?
 Just looking at the SVN log proves more than any article could ever
 do. -- ioquake3 maintainer for the FreeBSD project

If they are relevant, why does it matter whether Wikipedia acknowledges it?
Suppose Wikipedia didn't have an article about FreeBSD; would that make
it a worse OS?


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Platonides
Gregory Maxwell wrote:
 Er. I've maintained a non-WMF disaster recovery archive for a long
 time, though its no longer completely current since the rsync went
 away and web fetching is lossy.

And the box ran out of disk space. We could try until it fills up again,
though.

A sysadmin fixing images with wrong hashes would also be nice
https://bugzilla.wikimedia.org/show_bug.cgi?id=17057#c3

 It saved our rear a number of times, saving thousands of images from
 irreparable loss. Moreover it allowed things like image hashing before
 we had that in the database, and it would allow perceptual lossy hash
 matching if I ever got around to implementing tools to access the
 output.

IMHO the problem is not accessing it, but hashing those terabytes of images.
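
Roughly what that offline pass looks like -- stream every file once and record
a hash to compare against the database later (the mirror path is a placeholder,
and MediaWiki stores its image hashes in a different encoding, which this
sketch ignores):

import hashlib
import os

def sha1_of(path, chunk_size=1 << 20):
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Placeholder mirror root; walking terabytes this way is IO-bound and slow,
# which is exactly the cost being pointed out above.
for dirpath, _, filenames in os.walk("/data/commons-mirror"):
    for name in filenames:
        path = os.path.join(dirpath, name)
        print(sha1_of(path), path)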


 There really are use cases.  Moreover, making complete copies of the
 public data available as dumps to the public is a WMF board supported
 initiative.


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] 100% open source stack (was Re: Bugzilla Vs other trackers.)

2010-01-08 Thread Tim Starling
Platonides wrote:
 What were the reasons for replacing lighttpd with Sun Java System Web
 Server ?

Probably the same reason that the toolserver uses Confluence instead
of MediaWiki.

-- Tim Starling


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] 100% open source stack (was Re: Bugzilla Vs other trackers.)

2010-01-08 Thread John Vandenberg
On Sat, Jan 9, 2010 at 12:10 PM, Tim Starling tstarl...@wikimedia.org wrote:
 Platonides wrote:
 What were the reasons for replacing lighttpd with Sun Java System Web
 Server ?

 Probably the same reason that the toolserver uses Confluence instead
 of MediaWiki.

It only contains one page, which points to the MediaWiki wiki.

https://confluence.toolserver.org/pages/listpages-dirview.action?key=main

Are there plans to make greater use of the Confluence wiki?

https://wiki.toolserver.org/view/Domains#confluence.toolserver.org

--
John Vandenberg

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Anthony
On Fri, Jan 8, 2010 at 10:56 AM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:

 The sensible bandwidth-saving way to do it would be to set up an rsync
 daemon on the image servers, and let people use that.


The bandwidth-saving way to do things would be to just allow mirrors to use
hotlinking.  Requiring a middle man to temporarily store images (many, and
possibly even most of which will never even be downloaded by end users) just
wastes bandwidth.
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] downloading wikipedia database dumps

2010-01-08 Thread Anthony
On Fri, Jan 8, 2010 at 9:06 PM, Gregory Maxwell gmaxw...@gmail.com wrote:

 Yea, well, you can't easily eliminate all the internal points of
 failure. someone with root loses control of their access and someone
 nasty wipes everything is really hard to protect against with online
 systems.


Isn't that what the system immutable flag is for?

It's easy, as long as you're willing to put up with a bit of whining from
the person with root access.
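
A minimal sketch of that approach on a Linux archive box (chattr from
e2fsprogs; on *BSD the system immutable flag would be set with chflags and,
with the securelevel raised, even root can't clear it without dropping to
single-user mode). Paths are placeholders:

import subprocess

def lock_down(path):
    # Set the immutable flag: the file can't be modified, renamed or deleted
    # until the flag is explicitly cleared again.
    subprocess.run(["chattr", "+i", path], check=True)

def release(path):
    # Clearing the flag is a separate, deliberate (and auditable) step.
    subprocess.run(["chattr", "-i", path], check=True)

lock_down("/backups/commons-2010-01.tar")  # placeholder dump file
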
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] 100% open source stack (was Re: Bugzilla Vs other trackers.)

2010-01-08 Thread Tim Starling
John Vandenberg wrote:
 On Sat, Jan 9, 2010 at 12:10 PM, Tim Starling tstarl...@wikimedia.org wrote:
 Platonides wrote:
 What were the reasons for replacing lighttpd with Sun Java System Web
 Server ?
 Probably the same reason that the toolserver uses Confluence instead
 of MediaWiki.
 
 It only contains one page, which points to the MediaWiki wiki.
 
 https://confluence.toolserver.org/pages/listpages-dirview.action?key=main

I count 65 pages.

https://confluence.toolserver.org/pages/listpages-dirview.action?key=tech

Maybe you were confused by the unfamiliar UI.

 Are there plans to make greater use of the Confluence wiki?

Certainly not.

The reason for using SJWS on ms* was the same reason the toolserver
uses Confluence: River installed them both. River's contribution is
very much appreciated, but he does have his own way of doing things.

-- Tim Starling


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l