Re: [Wikitech-l] Firesheep

2010-10-26 Thread George Herbert
On Mon, Oct 25, 2010 at 11:23 PM, Ashar Voultoiz hashar+...@free.fr wrote:
 On 25/10/10 23:26, George Herbert wrote:
 I for one only use secure.wikimedia.org; I would like to urge as a
 general course that the Foundation switch to an "HTTPS by default"
 strategy...

 HTTPS means full encryption, that is either:
   - a ton of CPU cycles: those are wasted cycles for something else.
   - SSL ASIC: costly, especially given our GETs / bandwidth levels

 Meanwhile, use secure.wikimedia.org :-)

I don't want to be rude, but I'm a professional large website
infrastructure architect for my paying day job.

The current WMF situation is becoming quaint - pros use
secure.wikimedia.org, amateurs don't realize what they're exposing.
By professional standards, we're not keeping up with professional
industry expectations.  It's not nuclear bomb secrets (cough) or
missile designs (cough) but our internal function (in terms of keeping
more sensitive accounts private and not hacked) and our ability to
reassure people that they're using a modern and reliable site are
falling slowly.

It's just CPU cycles.  Those, of all the things today, are the
cheapest by far...  Please, hand me a tough problem, like needing
database storage bandwidth that only SSD can match and yet will last
for 5+ years reliably, or an N^2 or N^M or N! problem in the core
logic, or even using a database to store all the file-like objects and
not being able to clean up the database indexes.  Those are hard.  CPU
time, raw cycles?  Easy.


-- 
-george william herbert
george.herb...@gmail.com



Re: [Wikitech-l] Firesheep

2010-10-26 Thread MZMcBride
George Herbert wrote:
 The current WMF situation is becoming quaint - pros use
 secure.wikimedia.org, amateurs don't realize what they're exposing.
 By professional standards, we're not keeping up with professional
 industry expectations.  It's not nuclear bomb secrets (cough) or
 missile designs (cough) but our internal function (in terms of keeping
 more sensitive accounts private and not hacked) and our ability to
 reassure people that they're using a modern and reliable site are
 falling slowly.

I don't understand what you're saying here. Most Wikimedia content is
intended to be distributed openly and widely. Certainly serving every page
view over HTTPS makes no sense given the cost vs. benefit currently.

As Aryeh notes, even those who act in an editing role (rather than in simply
a reader role) don't generally have valuable accounts. The pros you're
talking about are free to use secure.wikimedia.org (which is already set up
and has been for quite some time). If there were no secure site alternative,
I think you'd have a point. As it stands, I don't see what's very quaint
about this situation.

It'd be great to one day have http://en.wikipedia.org be the same as
https://en.wikipedia.org with the only noticeable difference being the
little lock icon in your browser. But there is a finite amount of resources
and this really isn't and shouldn't be a high priority.

If the goal is to reassure people that they're using a modern and reliable
site, there are a lot of other features that could and should be implemented
first in my view, though the goal itself seems a bit dubious in any case.

MZMcBride





Re: [Wikitech-l] Firesheep

2010-10-26 Thread George Herbert
On Mon, Oct 25, 2010 at 11:59 PM, MZMcBride z...@mzmcbride.com wrote:
 George Herbert wrote:
 The current WMF situation is becoming quaint - pros use
 secure.wikimedia.org, amateurs don't realize what they're exposing.
 By professional standards, we're not keeping up with professional
 industry expectations.  It's not nuclear bomb secrets (cough) or
 missile designs (cough) but our internal function (in terms of keeping
 more sensitive accounts private and not hacked) and our ability to
 reassure people that they're using a modern and reliable site are
 falling slowly.

 I don't understand what you're saying here. Most Wikimedia content is
 intended to be distributed openly and widely. Certainly serving every page
 view over HTTPS makes no sense given the cost vs. benefit currently.

 As Aryeh notes, even those who act in an editing role (rather than in simply
 a reader role) don't generally have valuable accounts. The pros you're
 talking about are free to use secure.wikimedia.org (which is already set up
 and has been for quite some time). If there were no secure site alternative,
 I think you'd have a point. As it stands, I don't see what's very quaint
 about this situation.

 It'd be great to one day have http://en.wikipedia.org be the same as
 https://en.wikipedia.org with the only noticeable difference being the
 little lock icon in your browser. But there is a finite amount of resources
 and this really isn't and shouldn't be a high priority.

 If the goal is to reassure people that they're using a modern and reliable
 site, there are a lot of other features that could and should be implemented
 first in my view, though the goal itself seems a bit dubious in any case.

 MZMcBride

I have no objection to us serving http traffic, especially as default
to logged-out users.  There's security sensitivity, and then there's
paranoia.

But I would prefer to move towards a "logged-in users by default go to a
secure connection" model.  That would include making "secure" a
multi-system, fully redundantly supported part of the environment, or
alternately just making https work on all the front ends.

Any login should be protected.  The casual "eh" attitude here is
unprofessional, as it were.  The nature of the site means that this
isn't something I would rush a crash program for and redirect major
resources to fix immediately, but it's not something to think of as
desirable and continue propagating for more years.


-- 
-george william herbert
george.herb...@gmail.com



Re: [Wikitech-l] Firesheep

2010-10-26 Thread Nikola Smolenski
On 10/26/2010 08:59 AM, MZMcBride wrote:
 As Aryeh notes, even those who act in an editing role (rather than in simply
 a reader role) don't generally have valuable accounts. The pros you're
 talking about are free to use secure.wikimedia.org (which is already set up
 and has been for quite some time). If there were no secure site alternative,
 I think you'd have a point. As it stands, I don't see what's very quaint
 about this situation.

For maximum security and minimal overhead, let the login always be
over https. If a logged-in user is an admin or higher, use https for 
everything. Expand to all editors if easily possible.



Re: [Wikitech-l] Firesheep

2010-10-26 Thread John Vandenberg
On Tue, Oct 26, 2010 at 6:24 PM, George Herbert
george.herb...@gmail.com wrote:
..
 But I would prefer to move towards a "logged-in users by default go to a
 secure connection" model.  That would include making "secure" a
 multi-system, fully redundantly supported part of the environment, or
 alternately just making https work on all the front ends.

 Any login should be protected.  The casual "eh" attitude here is
 unprofessional, as it were.  The nature of the site means that this
 isn't something I would rush a crash program for and redirect major
 resources to fix immediately, but it's not something to think of as
 desirable and continue propagating for more years.

I agree.  Even if we still do drop users back to http after
authentication, and the cookies can be sniffed, that is preferable to
having authentication over http.

People often use the same password for many sites.

Their password may not have much value on WMF projects ('at worst they
access admin functions'), but it could be used to access their gmail
or similar.

--
John Vandenberg



Re: [Wikitech-l] Firesheep

2010-10-26 Thread Daniel Kinzler
On 26.10.2010 09:36, Nikola Smolenski wrote:
 On 10/26/2010 08:59 AM, MZMcBride wrote:
 As Aryeh notes, even those who act in an editing role (rather than in simply
 a reader role) don't generally have valuable accounts. The pros you're
 talking about are free to use secure.wikimedia.org (which is already set up
 and has been for quite some time). If there were no secure site alternative,
 I think you'd have a point. As it stands, I don't see what's very quaint
 about this situation.
 
 For maximum security and minimal overhead, let the login always be
 over https. If a logged-in user is an admin or higher, use https for 
 everything. Expand to all editors if easily possible.

This sounds like a sensible compromise. It protects the sensitive bits, and
doesn't cause massive load on https handling. I would very much like to see this
on the official roadmap.

By the way... where's the official road map?

-- daniel



Re: [Wikitech-l] Firesheep

2010-10-26 Thread Conrad Irwin
There is no real massive load caused by https at runtime.  There is, however,
a significant chunk of developer and sysadmin time needed to implement this
and make it work.

For now, at least, the only optimisations that should be considered are
those that make it easier all round.

Conrad

On 26 Oct 2010 08:44, Daniel Kinzler dan...@brightbyte.de wrote:

On 26.10.2010 09:36, Nikola Smolenski wrote:
 On 10/26/2010 08:59 AM, MZMcBride wrote:
 As Aryeh ...
This sounds like a sensible compromise. It protects the sensitive bits, and
doesn't cause massive load on https handling. I would very much like to see
this
on the official roadmap.

By the way... where's the official road map?

-- daniel




Re: [Wikitech-l] InlineEditor new version (previously Sentence-Level Editing)

2010-10-26 Thread Alex Brollo
2010/10/25 Jan Paul Posma jp.po...@gmail.com

 Hi all,

 As presented last Saturday at the Hack-A-Ton, I've committed a new version
 of the InlineEditor extension. [1] This is an implementation of the
 sentence-level editing demo posted a few months ago.


Very interesting! Obviously I'll not see your work till it is implemented
into Wikipedia and all the other Wikimedia Foundation projects.
Please also consider the specific needs of the sister projects, e.g. the Poem
extension (http://www.mediawiki.org/wiki/Extension:Poem) used by
Wikisource and its <poem>...</poem> tags; I guess that every sister
project has something particular to be considered from the beginning of any
work on a new editor.

Alex


Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Platonides
Robert Rohde wrote:
 Many of the things done for the statistical analysis of database dumps
 should be suitable for parallelization (e.g. break the dump into
 chunks, process the chunks in parallel and sum the results).  You
 could talk to Erik Zachte.  I don't know if his code has already been
 designed for parallel processing though.

I don't think it's a good candidate since you are presumably using
compressed files, and its decompression linearises it (and is most
likely the bottleneck, too).


 Another option might be to look at the methods for compressing old
 revisions (is [1] still current?).
 
 I make heavy use of parallel processing in my professional work (not
 related to wikis), but I can't really think of any projects I have at
 hand that would be accessible and completable in a month.
 
 -Robert Rohde
 
 [1] http://www.mediawiki.org/wiki/Manual:CompressOld.php

It can be used; I am unsure if it is used by WMF.

Another thing that would be nice to have parallelised is the parser
tests. That would need adding something like cotasks to PHP. The most
similar extension I know of is runkit, which is the other way around:
several PHP scopes instead of several threads in one scope.




Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Jyothis Edathoot
Develop a new bot framework (maybe interwiki processing to start with) for a
high-performance GPU cluster (NVIDIA or AMD), similar to what BOINC-based
projects do.  NVIDIA is more popular, while AMD has more cores for the same
price.

 :)


Regards,
Jyothis.

http://www.Jyothis.net

http://ml.wikipedia.org/wiki/User:Jyothis
http://meta.wikimedia.org/wiki/User:Jyothis
I am the first customer of http://www.netdotnet.com

woods are lovely dark and deep,
but i have promises to keep and
miles to go before i sleep and
lines to go before I press sleep

completion date = (start date + ((estimated effort x 3.1415926) / resources)
+ ((total coffee breaks x 0.25) / 24)) + Effort in meetings



On Sun, Oct 24, 2010 at 8:42 PM, Aryeh Gregor
simetrical+wikil...@gmail.com wrote:

 This term I'm taking a course in high-performance computing
 http://cs.nyu.edu/courses/fall10/G22.2945-001/index.html, and I have
 to pick a topic for a final project.  According to the assignment
 http://cs.nyu.edu/courses/fall10/G22.2945-001/final-project.pdf,
 "The only real requirement is that it be something in parallel."  In
 the class, we covered

 * Microoptimization of single-threaded code (efficient use of CPU cache,
 etc.)
 * Multithreaded programming using OpenMP
 * GPU programming using OpenCL

 and will probably briefly cover distributed computing over multiple
 machines with MPI.  I will have access to a high-performance cluster
 at NYU, including lots of CPU nodes and some high-end GPUs.  Unlike
 most of the other people in the class, I don't have any interesting
 science projects I'm working on, so something useful to
 MediaWiki/Wikimedia/Wikipedia is my first thought.  If anyone has any
 suggestions, please share.  (If you have non-Wikimedia-related ones,
 I'd also be interested in hearing about them offlist.)  They shouldn't
 be too ambitious, since I have to finish them in about a month, while
 doing work for three other courses and a bunch of other stuff.

 My first thought was to write a GPU program to crack MediaWiki
 password hashes as quickly as possible, then use what we've studied in
 class about GPU architecture to design a hash function that would be
 as slow as possible to crack on a GPU relative to its PHP execution
 speed, as Tim suggested a while back.  However, maybe there's
 something more interesting I could do.



Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Ariel T. Glenn
On Tue, 26-10-2010 at 16:25 +0200, Platonides wrote:
 Robert Rohde wrote:
  Many of the things done for the statistical analysis of database dumps
  should be suitable for parallelization (e.g. break the dump into
  chunks, process the chunks in parallel and sum the results).  You
  could talk to Erik Zachte.  I don't know if his code has already been
  designed for parallel processing though.
 
 I don't think it's a good candidate since you are presumably using
 compressed files, and its decompression linearises it (and is most
 likely the bottleneck, too).

If one were clever (and I have some code that would enable one to be
clever), one could seek to some point in the (bzip2-compressed) file and
uncompress from there before processing.  Running a bunch of jobs each
decompressing only their small piece then becomes feasible.  I don't
have code that does this for gz or 7z; afaik these do not do compression
in discrete blocks.

Ariel




Re: [Wikitech-l] Firesheep

2010-10-26 Thread Aryeh Gregor
On Tue, Oct 26, 2010 at 2:23 AM, Ashar Voultoiz hashar+...@free.fr wrote:
 HTTPS means full encryption, that is either:
   - a ton of CPU cycles: those are wasted cycles for something else.
   - SSL ASIC: costly, especially given our GETs / bandwidth levels

HTTPS uses very few CPU cycles by today's standards.  See here:


"In January this year (2010), Gmail switched to using HTTPS for
everything by default. Previously it had been introduced as an option,
but now all of our users use HTTPS to secure their email between their
browsers and Google, all the time. In order to do this we had to
deploy no additional machines and no special hardware. On our
production frontend machines, SSL/TLS accounts for less than 1% of the
CPU load, less than 10KB of memory per connection and less than 2% of
network overhead. Many people believe that SSL takes a lot of CPU time
and we hope the above numbers (public for the first time) will help to
dispel that."

http://www.imperialviolet.org/2010/06/25/overclocking-ssl.html

On Tue, Oct 26, 2010 at 3:24 AM, George Herbert
george.herb...@gmail.com wrote:
 Any login should be protected.  The casual "eh" attitude here is
 unprofessional, as it were.  The nature of the site means that this
 isn't something I would rush a crash program for and redirect major
 resources to fix immediately, but it's not something to think of as
 desirable and continue propagating for more years.

It's not desirable, but given limited resources, undesirable things
are inevitable.  I don't know what the sysadmins are spending their
time on, but presumably it's something that they feel takes precedence
over this.  (None has commented so far here . . .)

On Tue, Oct 26, 2010 at 3:36 AM, Nikola Smolenski smole...@eunet.rs wrote:
 For maximum security and minimal overhead, let the login always be
 over https. If a logged-in user is an admin or higher, use https for
 everything. Expand to all editors if easily possible.

This is an improvement, but not an ideal solution, because a MITM
could just change the HTTPS login link to be HTTP instead, and
translate the request to HTTPS themselves so Wikimedia doesn't see the
difference.  HTTPS for everything makes more sense, ideally with
Strict-Transport-Security.
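
To make that concrete, here is a minimal sketch (plain WSGI, purely
illustrative; nothing to do with how Wikimedia's frontends are actually
configured) of what "HTTPS for everything, with Strict-Transport-Security"
amounts to at the application layer:

    # Minimal WSGI sketch: redirect plain HTTP to HTTPS and advertise HSTS.
    # Hypothetical example only; not Wikimedia's actual setup.
    def require_https(app):
        def middleware(environ, start_response):
            if environ.get('wsgi.url_scheme') != 'https':
                # Send plain-HTTP requests to the HTTPS equivalent.
                host = environ.get('HTTP_HOST', 'localhost')
                location = 'https://' + host + environ.get('PATH_INFO', '/')
                start_response('301 Moved Permanently', [('Location', location)])
                return [b'']

            def add_hsts(status, headers, exc_info=None):
                # Tell the browser to keep using HTTPS for the next year, so a
                # MITM cannot silently downgrade the login link to plain HTTP.
                headers = headers + [('Strict-Transport-Security',
                                      'max-age=31536000')]
                return start_response(status, headers, exc_info)

            return app(environ, add_hsts)
        return middleware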


[Wikitech-l] New installer is here

2010-10-26 Thread Chad
Good afternoon,

In r75437, r75438[0][1] I moved the old installer to old-index.php
and moved the new to index.php. At this stage in the process,
I don't see us backing this out before we branch 1.17. I really
want people to test it out and report any major breakages [2].

This has been a long development process for almost 2 years
now, and I'd like to thank Max, Mark H., Jure, Jeroen, Roan
and Siebrand for their invaluable help in working on this. And
especially thanks to Tim for starting the project and providing
feedback, as always. There is a *lot* of code in includes/installer,
and I'd like to highlight some of the major changes that you'll
need to know.

Database updaters: They have been moved from the gigantic
file in maintenance/updaters.inc (patchfiles still go in the same
place though). Each supported DB type has a class that needs
to subclass DatabaseUpdater. The format's very similar, only
it's operating on methods in the classes instead of global functions.
The globals $wgExtNewTables, etc. are retained for back compat
and will be for quite some time. However, you can pass more
advanced callbacks since the LoadExtensionSchemaUpdates
hook now passes the DatabaseUpdater subclass as a param.

DB2 and MSSQL have been dropped from the installer. The
implementations are far from complete and I'm not comfortable
advertising their use yet.

Other known issues:
- Some UI quirks still exist, but work is coming here
- Postgres and Oracle are *almost* done
- Stuff listed on mw.org[2]

-Chad

[0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/75437
[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/75438
[2] http://www.mediawiki.org/wiki/New-installer_issues



Re: [Wikitech-l] New installer is here

2010-10-26 Thread Erik Moeller
2010/10/26 Erik Moeller e...@wikimedia.org:
 A few quick notes:

And, sorry for duplicating stuff from the known issues list.
-- 
Erik Möller
Deputy Director, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate



Re: [Wikitech-l] New installer is here

2010-10-26 Thread Brandon Harris

I am on ALL of these things, actually.  I have fixes for most of them 
pending.


On 10/26/10 10:41 AM, Erik Moeller wrote:
 2010/10/26 Chad innocentkil...@gmail.com:
 Good afternoon,

 In r75437, r75438[0][1] I moved the old installer to old-index.php
 and moved the new to index.php. At this stage in the process,
 I don't see us backing this out before we branch 1.17. I really
 want people to test it out and report any major breakages [2].

 Congratulations. :-) It looks great.

 A few quick notes:

 1) On the admin/site name screen at least, when both aren't supplied,
 it only shows the error messages, not the form below. This may be a
 general issue with the form validation.
 Screenshot: http://tinypic.com/r/2po9vh0/7

 2) Checkbox alignment in general is a bit off, at least in Chrome, e.g.:
 http://tinypic.com/r/655n5x/7

 3) for the Extensions section, I would suggest adding a more visible
 warning: "Warning: Most extensions require additional configuration
 beyond this step. Installing unreviewed extensions may expose your
 wiki to security vulnerabilities."  I know the Help already explains
 the first point, but the simple installer may suggest to the user that
 ticking a checkbox is all that's required.

 4) It'd be great if we could change the design to Vector :-). In
 general it could use a bit more UI love -- perhaps Brandon will have
 time to take a quick look.




Re: [Wikitech-l] New installer is here

2010-10-26 Thread Erik Moeller
2010/10/26 Brandon Harris bhar...@wikimedia.org:

        I am on ALL of these things, actually.  I have fixes for most of them
 pending.

Awesome :-)


-- 
Erik Möller
Deputy Director, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate



Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Tisza Gergő
Aryeh Gregor Simetrical+wikilist at gmail.com writes:

 To clarify, the subject needs to 1) be reasonably doable in a short
 timeframe, 2) not build on top of something that's already too
 optimized.  It should probably either be a new project; or an effort
 to parallelize something that already exists, isn't parallel yet, and
 isn't too complicated.  So far I have the password-cracking thing,
 maybe dbzip2, and maybe some unspecified thing involving dumps.

Some PageRank-like metric to approximate Wikipedia article importance/quality?
Parallelizing eigenvalue calculations has a rich literature.
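
For what it's worth, a toy sketch of the parallel kernel involved (dense
random matrix and a process pool purely for illustration; a real wiki link
graph would be sparse and need rather more care):

    # Toy sketch: power iteration for a PageRank-style score, with the
    # matrix-vector product split across worker processes by row block.
    from multiprocessing import Pool
    import numpy as np

    def block_matvec(args):
        block, v = args                      # one horizontal slice of the matrix
        return block @ v

    def pagerank(M, damping=0.85, iters=50, workers=4):
        n = M.shape[0]
        v = np.full(n, 1.0 / n)
        blocks = np.array_split(M, workers)  # row blocks, one per worker
        with Pool(workers) as pool:
            for _ in range(iters):
                parts = pool.map(block_matvec, [(b, v) for b in blocks])
                v = damping * np.concatenate(parts) + (1 - damping) / n
                v /= v.sum()
        return v

    if __name__ == '__main__':
        rng = np.random.default_rng(0)
        A = rng.random((400, 400))
        M = A / A.sum(axis=0)                # column-stochastic link matrix
        print(pagerank(M)[:5])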




Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Tim Starling
On 24/10/10 17:42, Aryeh Gregor wrote:
 This term I'm taking a course in high-performance computing
 http://cs.nyu.edu/courses/fall10/G22.2945-001/index.html, and I have
 to pick a topic for a final project.  According to the assignment
 http://cs.nyu.edu/courses/fall10/G22.2945-001/final-project.pdf,
 "The only real requirement is that it be something in parallel."  In
 the class, we covered
 
 * Microoptimization of single-threaded code (efficient use of CPU cache, etc.)
 * Multithreaded programming using OpenMP
 * GPU programming using OpenCL

I've occasionally wondered how hard it would be to parallelize a
parser. It's generally not done, despite the fact that
parsers are so slow and useful.

Some file formats can certainly be parsed in parallel, if you
partition them in the right way. For example, if you were parsing a
CSV file, you could partition on the line breaks. You can't find those
break points by scanning the whole file, since that O(N) pass would
defeat the purpose, but you can seek ahead to a suitable byte position
and then scan forwards for the next line break to partition at.
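
A rough sketch of that partitioning trick for a plain line-oriented file
(assuming records never contain embedded line breaks, which simple CSV
satisfies):

    # Sketch: split a line-oriented file into N byte ranges that start and
    # end on line boundaries, so each range can be parsed by its own worker.
    import os

    def partition_lines(path, parts):
        size = os.path.getsize(path)
        boundaries = [0]
        with open(path, 'rb') as f:
            for i in range(1, parts):
                f.seek(i * size // parts)    # jump near the i-th cut point...
                f.readline()                 # ...then snap to the next line break
                boundaries.append(f.tell())
        boundaries.append(size)
        # Drop degenerate cut points (tiny files, very long lines).
        return [(a, b) for a, b in zip(boundaries, boundaries[1:]) if a < b]

    def count_fields(path, start, end):
        # Example per-worker task: count comma-separated fields in one range.
        total = 0
        with open(path, 'rb') as f:
            f.seek(start)
            while f.tell() < end:
                line = f.readline()
                if not line:
                    break
                total += line.count(b',') + 1
        return total

Each (start, end) pair can then go to its own worker process, and the
per-range results are summed at the end.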

For more complex file formats, there are various approaches. Googling
tells me that this is a well-studied problem for XML.

Obviously for an assessable project, you don't want to dig yourself
into a hole too big to get out of. If you chose XML you could just
follow the previous work. JavaScript might be tractable. Attempting to
parse wikitext would be insane.

-- Tim Starling




Re: [Wikitech-l] New installer is here

2010-10-26 Thread Brion Vibber
On Tue, Oct 26, 2010 at 10:00 AM, Chad innocentkil...@gmail.com wrote:

 This has been a long development process for almost 2 years
 now, and I'd like to thank Max, Mark H., Jure, Jeroen, Roan
 and Siebrand for their invaluable help in working on this. And
 especially thanks to Tim for starting the project and providing
 feedback, as always. There is a *lot* of code in includes/installer,
 and I'd like to highlight some of the major changes that you'll
 need to know.


My hat is off to you, sirs! You guys have put a lot of great work into this
-- absolutely blows away the old installer, that's for dang sure! Looks like
1.17 is going to be an awesome release... I feel like a proud grandpappy
getting the chance to see you guys' work shine... :)

-- brion


Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Robert Rohde
On Tue, Oct 26, 2010 at 8:25 AM, Ariel T. Glenn ar...@wikimedia.org wrote:
 On Tue, 26-10-2010 at 16:25 +0200, Platonides wrote:
 Robert Rohde wrote:
  Many of the things done for the statistical analysis of database dumps
  should be suitable for parallelization (e.g. break the dump into
  chunks, process the chunks in parallel and sum the results).  You
  could talk to Erik Zachte.  I don't know if his code has already been
  designed for parallel processing though.

 I don't think it's a good candidate since you are presumably using
 compressed files, and its decompression linearises it (and is most
 likely the bottleneck, too).

 If one were clever (and I have some code that would enable one to be
 clever), one could seek to some point in the (bzip2-compressed) file and
 uncompress from there before processing.  Running a bunch of jobs each
 decompressing only their small piece then becomes feasible.  I don't
 have code that does this for gz or 7z; afaik these do not do compression
 in discrete blocks.

Actually the LZMA compression used by default in 7z can be partially
parallelized, with some strong limitations:

1) Block N can generally only be located by finding the end of block
N-1, so files have to be read serially.
2) The ability to decompress block N may or may not depend on already
having decompressed blocks N-1, N-2, N-3, etc., depending on the
details of the data stream.

Point 2 in particular tends to lead to a lot of conflicts that
prevents parallelization.  If block N happens to be independent of
block N-1 then they can be done in parallel, but in general this will
not be the case.  The frequency of such conflicts depends a lot on the
data stream and options given to the compressor.

Last year LZMA2 was introduced in 7z with the primary intent of
improving parallelization.  It actually produces slightly worse
compression in general, but can be operated to guarantee that block N
is independent of blocks N-1 ... N-k for a specified k, meaning that
k+1 blocks can always be considered in parallel.

I believe that gzip has similar constraints to LZMA that make
parallelization problematic, but I'm not sure about that.


Getting back to Wikimedia, it appears correct that the Wikistats code
is designed to run from the compressed files (source linked from [1]).
 As you suggest, one could use the properties of the .bz2 format to
parallelize that.  I would also observe that parsers tend to be
relatively slow, while decompressors tend to be relatively fast.  I
wouldn't necessarily assume that the decompressing is the only
bottleneck.  I've run analyses on dumps that took longer to execute
than it took to decompress the files.  However, they probably didn't
take that many times longer (i.e. if the process were parallelized in
2 to 4 simultaneous chunks, then the decompression would be the
primary bottleneck again).

So it is probably true that if one wants to see a large increase in
the speed of stats processing one needs to consider parallelizing both
the decompression and the stats gathering.
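
As a rough sketch of that shape of solution (Python's bz2 and
multiprocessing purely for illustration, not anything Wikistats actually
does): a single reader streams and decompresses the dump while a pool of
workers does the heavier per-batch analysis.

    # Sketch: overlap decompression with analysis.  One process streams the
    # bzip2 dump; a pool of workers handles the (slower) per-batch processing.
    # The "analysis" here is a trivial stand-in that counts <page> elements.
    import bz2
    from multiprocessing import Pool

    BATCH = 10000  # lines per work unit

    def analyze(lines):
        return sum(1 for line in lines if b'<page>' in line)

    def process_dump(path, workers=4):
        pending = []
        with Pool(workers) as pool, bz2.open(path, 'rb') as dump:
            batch = []
            for line in dump:
                batch.append(line)
                if len(batch) == BATCH:
                    # A real version would cap the number of in-flight batches
                    # so the reader cannot run arbitrarily far ahead.
                    pending.append(pool.apply_async(analyze, (batch,)))
                    batch = []
            if batch:
                pending.append(pool.apply_async(analyze, (batch,)))
            return sum(job.get() for job in pending)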

-Robert Rohde

[1] http://stats.wikimedia.org/index_tabbed_new.html#fragment-14


Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Ángel González
Ariel T. Glenn wrote:
 If one were clever (and I have some code that would enable one to be
 clever), one could seek to some point in the (bzip2-compressed) file and
 uncompress from there before processing.  Running a bunch of jobs each
 decompressing only their small piece then becomes feasible.  I don't
 have code that does this for gz or 7z; afaik these do not do compression
 in discrete blocks.
 
 Ariel

The bzip2recover approach?
I am not sure how much the gain will be after so much bit moving.
Also, I was unable to continue from a flushed point; it may not be so easy.
OTOH, if you already have an index and the blocks end at page boundaries
(which is what I was doing), it becomes trivial.
Remember that the previous block MUST continue up to the point where the
next reader started processing inside the next block. And unlike what
ttsiod said, you do encounter tags split between blocks in a normal
compression.



Re: [Wikitech-l] Parallel computing project

2010-10-26 Thread Ariel T. Glenn
On Wed, 27-10-2010 at 00:05 +0200, Ángel González wrote:
 Ariel T. Glenn wrote:
  If one were clever (and I have some code that would enable one to be
  clever), one could seek to some point in the (bzip2-compressed) file and
  uncompress from there before processing.  Running a bunch of jobs each
  decompressing only their small piece then becomes feasible.  I don't
  have code that does this for gz or 7z; afaik these do not do compression
  in discrete blocks.
  
  Ariel
 
 The bzip2recover approach?
 I am not sure how much the gain will be after so much bit moving.
 Also, I was unable to continue from a flushed point; it may not be so easy.
 OTOH, if you already have an index and the blocks end at page boundaries
 (which is what I was doing), it becomes trivial.
 Remember that the previous block MUST continue up to the point where the
 next reader started processing inside the next block. And unlike what
 ttsiod said, you do encounter tags split between blocks in a normal
 compression.

I am able (using python bindings to the bzip2 library and some fiddling)
to seek to an arbitrary point, find the first block after the seek
point, and uncompress it and the following blocks in sequence.  That is
sufficient for our work, when we are talking about 250 GB compressed
files.

We process everything by pages, so we ensure that any reader reads only
specified page ranges from the file.  This avoids overlaps.

We don't build an index; we're only talking about parallelizing 10-20
jobs at once, not all 21 million pages, so building an index would not
be worth it.
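
A sketch of the fan-out that sits on top of that (iter_pages() below is a
hypothetical stand-in for the seek-into-bzip2-and-parse routine described
above, not a real library call):

    # Sketch: hand fixed page-id ranges of a dump to a pool of workers.
    # iter_pages() is a HYPOTHETICAL placeholder for the python/bzip2 code
    # described above (seek, resync on a block, yield (page_id, page_xml)
    # pairs within the requested range).
    from multiprocessing import Pool

    DUMP = 'enwiki-pages-meta-history.xml.bz2'   # example filename

    def iter_pages(path, first_id, last_id):
        raise NotImplementedError('stand-in for the bzip2 seek-and-parse code')

    def process_range(page_range):
        first_id, last_id = page_range
        revisions = 0
        for page_id, page_xml in iter_pages(DUMP, first_id, last_id):
            revisions += page_xml.count('<revision>')   # toy per-page statistic
        return revisions

    def run(page_ranges, workers=12):
        # e.g. page_ranges = [(1, 2000000), (2000001, 4000000), ...]
        with Pool(workers) as pool:
            return sum(pool.map(process_range, page_ranges))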

Ariel




Re: [Wikitech-l] Commons ZIP file upload for admins

2010-10-26 Thread Maciej Jaros
@2010-10-26 03:45, Erik Moeller:
 2010/10/25 Brion Vibber br...@pobox.com:
 In all cases we have the worry that if we allow uploading those funky
 formats, we'll either a) end up with malicious files or b) end up with lazy
 people using and uploading non-free editing formats when we'd prefer them to
 use freely editable formats. I'm not sure I like the idea of using admin
 powers to control being able to upload those, though; bottlenecking content
 reviews as a strict requirement can be problematic on its own.
 Yeah, I don't like the bottleneck approach either, but in the absence
 of better systems, it may be the best way to go as an immediate
 solution. We could do it for a list of whitelisted open formats that
 are requested by the community. And we'd see from usage which file
 types we need to prioritize proper support/security checks for.

 What I'd probably like to see is a more wide-open allowal of arbitrary
 'source files' which can be uploaded as attachments to standalone files. We
 could give them more limited access: download only, no inline viewing, only
 allowed if DLs are on separate safe domain, etc.
 It seems fairly straightforward to me to say: These free file formats
 are permitted to be uploaded. We haven't developed fully sophisticated
 security checks for them yet, so we're asking trusted users to do
 basic sanity checks until we've developed automatic checks. We can
 then prod people to convert any proprietary formats into free ones
 that are on that whitelist. And if they're free formats, I'm not sure
 why they shouldn't be first-class citizens -- as Michael mentioned,
 that makes it possible to plop in custom handlers at a later time. A
 COLLADA handler for 3D files may seem like a remote possibility, but
 it's certainly within the realm of sanity. ZIP files would have to be
 specially treated so they're only allowed if they contain only files
 in permitted formats.

 So, consistent with Michael's suggestion, we could define a
 'restricted-upload' right, initially given to admins only but possibly
 expanded to other users, which would allow files from the potentially
 insecure list of extensions to be uploaded, and for ZIP files, would
 ensure that only accepted file types are contained within the archive.
 The resultant review bottleneck would simply be a reflection that we
 haven't gotten around to adding proper support for these file types
 yet. On the plus side, we could add restricted upload support for new
 open formats as soon as there's consensus to do so.

 The main downside I would see is that users might end up being
 confused why these files get uploaded. To mitigate this, we could add
 a "This file has a restricted filetype. Files of this type can
 currently only be uploaded by administrators for security reasons"
 note on file description pages.

ODS, ODT and such should be fairly easy to check, at least on a basic
level. A very basic check would be whether the archive contains a "Basic"
or "Scripts" folder. A bit more advanced would be to check whether
manifest.xml contains "application/binary" (in case anyone tried to change
the default naming) and whether any file contains "script:module" (for the
same reason).
If any of these checks matches, a warning should be shown.
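
In other words, something along these lines (a rough sketch of the
heuristics above; the string patterns are my assumptions about typical ODF
packaging, not a formal check):

    # Sketch: naive macro heuristic for ODF uploads (ODT/ODS), following the
    # checks described above.  The patterns are assumptions, not an official
    # ODF validation.
    import zipfile

    SUSPICIOUS_DIRS = ('Basic/', 'Scripts/')

    def odf_looks_scripted(path):
        with zipfile.ZipFile(path) as odf:
            names = odf.namelist()
            if any(n.startswith(SUSPICIOUS_DIRS) for n in names):
                return True
            if 'META-INF/manifest.xml' in names:
                manifest = odf.read('META-INF/manifest.xml')
                if b'application/binary' in manifest or b'script:module' in manifest:
                    return True
            # Also scan the XML streams themselves for embedded script modules.
            return any(n.endswith('.xml') and b'script:module' in odf.read(n)
                       for n in names)

    # Usage: if odf_looks_scripted('upload.odt'), flag the file for review.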

I think we should also support Dia for diagrams and XCF for layered 
bitmaps. Don't know much about XCF, but Dia is a simple XML file (which 
might be zipped) and so shouldn't be dangerous at all. I guess it could 
even be unzipped upon loading because Dia supports both zipped and 
unzipped versions alike. There is/was also Extension:Dia which generates 
thumbnails... It seems to work fine even with 1.16 from the trunk and 
the latest Dia version. It doesn't work with zipped Dia files but this 
would be manageable.

Regards,
Nux.


[Wikitech-l] RT

2010-10-26 Thread a b
After the recent discussions on openness and clarity, several people have
asked what is contained within RT, and have been given answers like "it's
staff stuff".

So what is stored in it that couldn't live either on the staff or internal
wiki, where it must be private, or in Bugzilla for other matters?


Re: [Wikitech-l] Commons ZIP file upload for admins

2010-10-26 Thread John Vandenberg
On Tue, Oct 26, 2010 at 6:50 AM, Max Semenik maxsem.w...@gmail.com wrote:
 

 Instead of amassing social constructs around technical deficiency, I
 propose to fix bug 24230 [1] by implementing proper checking for JAR
 format. Also, we need to check all contents with antivirus and
 disallow certain types of files inside archives (such as .exe). Once
 we've taken all these precautions, I see no need to restrict ZIPs to any
 special group. Of course, this doesn't mean that we should allow all the
 safe ZIPs, just several open ZIP-based file formats.

If we only want zips for several formats, we should check that they
are of the expected type, _and_ that the files within the zip are all
open file formats.

e.g. Office Open XML (the MS format) can include binary files for OLE
objects and fonts (I think).

see "Table 2. Content types in a ZIP container" at

http://msdn.microsoft.com/en-us/library/aa338205(office.12).aspx

OOXML can also include files of any other mimetype, which are registered
_within_ the zip and linked into the main content file.

afaics, allowing only safe zips to be uploaded isn't difficult.

Expand the zip, and reject any zip which contains files that are on
$wgFileBlacklist or that are not on $wgFileExtensions + $wgZipFileExtensions.

$wgZipFileExtensions would consist of array('xml').

Then check the mimetypes of the files in the zip against
$wgMimeTypeBlacklist (with 'application/zip' removed), again allowing
desired XML mimetypes through.
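
As a rough sketch of that logic (Python purely for illustration; the sets
below stand in for $wgFileBlacklist, $wgFileExtensions and the proposed
$wgZipFileExtensions, and a real check would also do the mimetype pass
described above):

    # Sketch of the proposed check: open the archive and reject it if any
    # member's extension is blacklisted or missing from the combined whitelist.
    # The lists are illustrative stand-ins for the MediaWiki globals above.
    import zipfile

    FILE_BLACKLIST = {'exe', 'dll', 'php', 'js', 'html'}
    FILE_EXTENSIONS = {'png', 'jpg', 'svg', 'ogg'}
    ZIP_FILE_EXTENSIONS = {'xml'}

    def zip_is_acceptable(path):
        allowed = FILE_EXTENSIONS | ZIP_FILE_EXTENSIONS
        with zipfile.ZipFile(path) as archive:
            for name in archive.namelist():
                if name.endswith('/'):          # skip directory entries
                    continue
                ext = name.rsplit('.', 1)[-1].lower() if '.' in name else ''
                if ext in FILE_BLACKLIST or ext not in allowed:
                    return False
        return True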

--
John Vandenberg
