Re: [Wikitech-l] Firesheep
On Mon, Oct 25, 2010 at 11:23 PM, Ashar Voultoiz hashar+...@free.fr wrote:
> On 25/10/10 23:26, George Herbert wrote:
>> I for one only use secure.wikimedia.org; I would like to urge as a general course that the Foundation switch to an HTTPS-by-default strategy...
>
> HTTPS means full encryption, that is either:
> - a ton of CPU cycles: those are wasted cycles for something else.
> - SSL ASICs: costly, especially given our gets/bandwidth levels
>
> Meanwhile, use secure.wikimedia.org :-)

I don't want to be rude, but I'm a professional large-website infrastructure architect for my paying day job.

The current WMF situation is becoming quaint - pros use secure.wikimedia.org, amateurs don't realize what they're exposing. We're not keeping up with professional industry expectations. It's not nuclear bomb secrets (cough) or missile designs (cough), but our internal function (in terms of keeping more sensitive accounts private and not hacked) and our ability to reassure people that they're using a modern and reliable site are slowly falling behind.

It's just CPU cycles. Those, of all the things today, are the cheapest by far... Please, hand me a tough problem, like needing database storage bandwidth that only SSD can match and yet will last for 5+ years reliably, or an N^2 or N^M or N! problem in the core logic, or even using a database to store all the file-like objects and not being able to clean up the database indexes. Those are hard. CPU time, raw cycles? Easy.

--
-george william herbert
george.herb...@gmail.com
Re: [Wikitech-l] Firesheep
George Herbert wrote:
> The current WMF situation is becoming quaint - pros use secure.wikimedia.org, amateurs don't realize what they're exposing. We're not keeping up with professional industry expectations. It's not nuclear bomb secrets (cough) or missile designs (cough), but our internal function (in terms of keeping more sensitive accounts private and not hacked) and our ability to reassure people that they're using a modern and reliable site are slowly falling behind.

I don't understand what you're saying here. Most Wikimedia content is intended to be distributed openly and widely. Certainly serving every page view over HTTPS makes no sense given the cost vs. benefit currently. As Aryeh notes, even those who act in an editing role (rather than simply in a reader role) don't generally have valuable accounts. The pros you're talking about are free to use secure.wikimedia.org (which is already set up and has been for quite some time). If there were no secure site alternative, I think you'd have a point. As it stands, I don't see what's very quaint about this situation.

It'd be great to one day have http://en.wikipedia.org be the same as https://en.wikipedia.org, with the only noticeable difference being the little lock icon in your browser. But there is a finite amount of resources, and this really isn't and shouldn't be a high priority. If the goal is to reassure people that they're using a modern and reliable site, there are a lot of other features that could and should be implemented first in my view, though the goal itself seems a bit dubious in any case.

MZMcBride
Re: [Wikitech-l] Firesheep
On Mon, Oct 25, 2010 at 11:59 PM, MZMcBride z...@mzmcbride.com wrote:
> George Herbert wrote:
>> The current WMF situation is becoming quaint - pros use secure.wikimedia.org, amateurs don't realize what they're exposing. We're not keeping up with professional industry expectations. It's not nuclear bomb secrets (cough) or missile designs (cough), but our internal function (in terms of keeping more sensitive accounts private and not hacked) and our ability to reassure people that they're using a modern and reliable site are slowly falling behind.
>
> I don't understand what you're saying here. Most Wikimedia content is intended to be distributed openly and widely. Certainly serving every page view over HTTPS makes no sense given the cost vs. benefit currently. As Aryeh notes, even those who act in an editing role (rather than simply in a reader role) don't generally have valuable accounts. The pros you're talking about are free to use secure.wikimedia.org (which is already set up and has been for quite some time). If there were no secure site alternative, I think you'd have a point. As it stands, I don't see what's very quaint about this situation.
>
> It'd be great to one day have http://en.wikipedia.org be the same as https://en.wikipedia.org, with the only noticeable difference being the little lock icon in your browser. But there is a finite amount of resources, and this really isn't and shouldn't be a high priority. If the goal is to reassure people that they're using a modern and reliable site, there are a lot of other features that could and should be implemented first in my view, though the goal itself seems a bit dubious in any case.
>
> MZMcBride

I have no objection to us serving http traffic, especially as the default for logged-out users. There's security sensitivity, and then there's paranoia.

But I would prefer to move towards a "logged-in users go to a secure connection by default" model. That would include making secure.wikimedia.org a multi-system, fully redundantly supported part of the environment, or alternately just making https work on all the front ends.

Any login should be protected. The casual "eh" attitude here is unprofessional, as it were. The nature of the site means that this isn't something I would rush a crash program and redirect major resources to fix immediately, but it's not something to think of as desirable and continue propagating for more years.

--
-george william herbert
george.herb...@gmail.com
Re: [Wikitech-l] Firesheep
On 10/26/2010 08:59 AM, MZMcBride wrote:
> As Aryeh notes, even those who act in an editing role (rather than simply in a reader role) don't generally have valuable accounts. The pros you're talking about are free to use secure.wikimedia.org (which is already set up and has been for quite some time). If there were no secure site alternative, I think you'd have a point. As it stands, I don't see what's very quaint about this situation.

For maximum security and minimal overhead, let the login always be over https. If a logged-in user is an admin or higher, use https for everything. Expand to all editors if easily possible.
Re: [Wikitech-l] Firesheep
On Tue, Oct 26, 2010 at 6:24 PM, George Herbert george.herb...@gmail.com wrote:
> ...
> But I would prefer to move towards a "logged-in users go to a secure connection by default" model. That would include making secure.wikimedia.org a multi-system, fully redundantly supported part of the environment, or alternately just making https work on all the front ends.
>
> Any login should be protected. The casual "eh" attitude here is unprofessional, as it were. The nature of the site means that this isn't something I would rush a crash program and redirect major resources to fix immediately, but it's not something to think of as desirable and continue propagating for more years.

I agree.

Even if we still do drop users back to http after authentication, and the cookies can be sniffed, that is preferable to having authentication over http. People often use the same password for many sites. Their password may not have much value on WMF projects ("at worst they access admin functions"), but it could be used to access their gmail or similar.

--
John Vandenberg
Re: [Wikitech-l] Firesheep
On 26.10.2010 09:36, Nikola Smolenski wrote:
> On 10/26/2010 08:59 AM, MZMcBride wrote:
>> As Aryeh notes, even those who act in an editing role (rather than simply in a reader role) don't generally have valuable accounts. The pros you're talking about are free to use secure.wikimedia.org (which is already set up and has been for quite some time). If there were no secure site alternative, I think you'd have a point. As it stands, I don't see what's very quaint about this situation.
>
> For maximum security and minimal overhead, let the login always be over https. If a logged-in user is an admin or higher, use https for everything. Expand to all editors if easily possible.

This sounds like a sensible compromise. It protects the sensitive bits, and doesn't cause massive load on https handling. I would very much like to see this on the official roadmap.

By the way... where's the official road map?

-- daniel
Re: [Wikitech-l] Firesheep
There is no real massive load caused by https at runtime. There is, however, a significant chunk of developer and sysadmin time needed to implement this and make it work. For now, at least, the only optimisations that should be considered are those that make it easier all round.

Conrad

On 26 Oct 2010 08:44, Daniel Kinzler dan...@brightbyte.de wrote:
> On 26.10.2010 09:36, Nikola Smolenski wrote:
>> On 10/26/2010 08:59 AM, MZMcBride wrote:
>>> As Aryeh ...
>
> This sounds like a sensible compromise. It protects the sensitive bits, and doesn't cause massive load on https handling. I would very much like to see this on the official roadmap.
>
> By the way... where's the official road map?
>
> -- daniel
Re: [Wikitech-l] InlineEditor new version (previously Sentence-Level Editing)
2010/10/25 Jan Paul Posma jp.po...@gmail.com
> Hi all,
>
> As presented last Saturday at the Hack-A-Ton, I've committed a new version of the InlineEditor extension. [1] This is an implementation of the sentence-level editing demo posted a few months ago.

Very interesting! Obviously I won't see your work until it is implemented in Wikipedia and all the other Wikimedia Foundation projects.

Please also consider the specific needs of the sister projects, e.g. the Poem extension (http://www.mediawiki.org/wiki/Extension:Poem) used by Wikisource and its <poem>...</poem> tags; I guess that any sister project has something particular to be considered from the beginning of any work on a new editor.

Alex
Re: [Wikitech-l] Parallel computing project
Robert Rohde wrote:
> Many of the things done for the statistical analysis of database dumps should be suitable for parallelization (e.g. break the dump into chunks, process the chunks in parallel and sum the results). You could talk to Erik Zachte. I don't know if his code has already been designed for parallel processing though.

I don't think it's a good candidate since you are presumably using compressed files, and decompression linearises it (and is most likely the bottleneck, too).

> Another option might be to look at the methods for compressing old revisions (is [1] still current?).
>
> I make heavy use of parallel processing in my professional work (not related to wikis), but I can't really think of any projects I have at hand that would be accessible and completable in a month.
>
> -Robert Rohde
>
> [1] http://www.mediawiki.org/wiki/Manual:CompressOld.php

It can be used; I am unsure if it is used by WMF.

Another thing that would be nice to have parallelised would be things like parser tests. That would need adding cotasks to PHP or so. The most similar extension I know of is runkit, which is the other way around: several PHP scopes instead of several threads in one scope.
Re: [Wikitech-l] Parallel computing project
Develop a new bot framework (maybe interwiki processing to start with) for a high-performance GPU cluster (nvidia or AMD), similar to what BOINC-based projects do. nvidia is more popular, while AMD has more cores for the same price :)

Regards,
Jyothis.

http://www.Jyothis.net
http://ml.wikipedia.org/wiki/User:Jyothis
http://meta.wikimedia.org/wiki/User:Jyothis

I am the first customer of http://www.netdotnet.com

woods are lovely dark and deep, but i have promises to keep and miles to go before i sleep and lines to go before I press sleep

completion date = (start date + ((estimated effort x 3.1415926) / resources) + ((total coffee breaks x 0.25) / 24)) + Effort in meetings

On Sun, Oct 24, 2010 at 8:42 PM, Aryeh Gregor simetrical+wikil...@gmail.com wrote:
> This term I'm taking a course in high-performance computing (http://cs.nyu.edu/courses/fall10/G22.2945-001/index.html), and I have to pick a topic for a final project. According to the assignment (http://cs.nyu.edu/courses/fall10/G22.2945-001/final-project.pdf), "The only real requirement is that it be something in parallel."
>
> In the class, we covered
> * Microoptimization of single-threaded code (efficient use of CPU cache, etc.)
> * Multithreaded programming using OpenMP
> * GPU programming using OpenCL
> and will probably briefly cover distributed computing over multiple machines with MPI. I will have access to a high-performance cluster at NYU, including lots of CPU nodes and some high-end GPUs.
>
> Unlike most of the other people in the class, I don't have any interesting science projects I'm working on, so something useful to MediaWiki/Wikimedia/Wikipedia is my first thought. If anyone has any suggestions, please share. (If you have non-Wikimedia-related ones, I'd also be interested in hearing about them offlist.) They shouldn't be too ambitious, since I have to finish them in about a month, while doing work for three other courses and a bunch of other stuff.
>
> My first thought was to write a GPU program to crack MediaWiki password hashes as quickly as possible, then use what we've studied in class about GPU architecture to design a hash function that would be as slow as possible to crack on a GPU relative to its PHP execution speed, as Tim suggested a while back. However, maybe there's something more interesting I could do.
Re: [Wikitech-l] Parallel computing project
On Tuesday, 26-10-2010, at 16:25 +0200, Platonides wrote:
> Robert Rohde wrote:
>> Many of the things done for the statistical analysis of database dumps should be suitable for parallelization (e.g. break the dump into chunks, process the chunks in parallel and sum the results). You could talk to Erik Zachte. I don't know if his code has already been designed for parallel processing though.
>
> I don't think it's a good candidate since you are presumably using compressed files, and decompression linearises it (and is most likely the bottleneck, too).

If one were clever (and I have some code that would enable one to be clever), one could seek to some point in the (bzip2-compressed) file and uncompress from there before processing. Running a bunch of jobs, each decompressing only its small piece, then becomes feasible.

I don't have code that does this for gz or 7z; afaik these do not do compression in discrete blocks.

Ariel
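[The trick described above works because each bzip2 block begins with a fixed 48-bit magic number (0x314159265359), which is generally not byte-aligned. A rough Python sketch of the boundary search follows; this is not Ariel's actual code, and the file name, window size, and the step of decompressing from the returned bit offset (which still needs library-level fiddling) are assumptions.]

```python
# Scan, bit by bit, for the bzip2 start-of-block magic at or after a byte
# offset. Returns an absolute bit offset, from which a block-aware reader
# can start decompressing.
BLOCK_MAGIC = 0x314159265359  # 48-bit start-of-block marker in the bzip2 format

def find_block_after(path, byte_offset, window=8 * 1024 * 1024):
    """Return the absolute bit offset of the first block magic found at or
    after byte_offset, or None if none appears within the window."""
    with open(path, 'rb') as f:
        f.seek(byte_offset)
        data = f.read(window)
    reg = 0
    mask = (1 << 48) - 1
    for i, byte in enumerate(data):
        for bit in range(7, -1, -1):            # feed bits in, MSB first
            reg = ((reg << 1) | ((byte >> bit) & 1)) & mask
            if reg == BLOCK_MAGIC:
                bits_consumed = i * 8 + (8 - bit)
                return byte_offset * 8 + bits_consumed - 48
    return None

if __name__ == '__main__':
    # Hypothetical dump file name; real history dumps run to ~250 GB compressed.
    print(find_block_after('enwiki-pages-meta-history.xml.bz2', 10 ** 9))
```

[Turning the returned bit offset into decompressed pages is the fiddly part: the standard bz2 module only accepts whole streams, so one either re-packs the bits behind a synthetic stream header or uses lower-level bindings.]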
Re: [Wikitech-l] Firesheep
On Tue, Oct 26, 2010 at 2:23 AM, Ashar Voultoiz hashar+...@free.fr wrote:
> HTTPS means full encryption, that is either:
> - a ton of CPU cycles: those are wasted cycles for something else.
> - SSL ASICs: costly, especially given our gets/bandwidth levels

HTTPS uses very few CPU cycles by today's standards. See here:

"In January this year (2010), Gmail switched to using HTTPS for everything by default. Previously it had been introduced as an option, but now all of our users use HTTPS to secure their email between their browsers and Google, all the time. In order to do this we had to deploy no additional machines and no special hardware. On our production frontend machines, SSL/TLS accounts for less than 1% of the CPU load, less than 10KB of memory per connection and less than 2% of network overhead. Many people believe that SSL takes a lot of CPU time and we hope the above numbers (public for the first time) will help to dispel that."

http://www.imperialviolet.org/2010/06/25/overclocking-ssl.html

On Tue, Oct 26, 2010 at 3:24 AM, George Herbert george.herb...@gmail.com wrote:
> Any login should be protected. The casual "eh" attitude here is unprofessional, as it were. The nature of the site means that this isn't something I would rush a crash program and redirect major resources to fix immediately, but it's not something to think of as desirable and continue propagating for more years.

It's not desirable, but given limited resources, undesirable things are inevitable. I don't know what the sysadmins are spending their time on, but presumably it's something that they feel takes precedence over this. (None has commented here so far . . .)

On Tue, Oct 26, 2010 at 3:36 AM, Nikola Smolenski smole...@eunet.rs wrote:
> For maximum security and minimal overhead, let the login always be over https. If a logged-in user is an admin or higher, use https for everything. Expand to all editors if easily possible.

This is an improvement, but not an ideal solution, because a MITM could just change the HTTPS login link to be HTTP instead, and translate the request to HTTPS themselves so Wikimedia doesn't see the difference. HTTPS for everything makes more sense, ideally with Strict-Transport-Security.
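[For reference, Strict-Transport-Security is just a response header. A minimal illustration in Python's wsgiref follows; the toy app and the one-year max-age are assumptions for the sketch, not Wikimedia configuration.]

```python
# Minimal WSGI app emitting the Strict-Transport-Security header mentioned
# above. A browser that has seen this header over HTTPS refuses plain-HTTP
# connections to the host for max-age seconds, so a MITM can no longer
# silently downgrade the login link.
from wsgiref.simple_server import make_server

def app(environ, start_response):
    start_response('200 OK', [
        ('Content-Type', 'text/plain; charset=utf-8'),
        ('Strict-Transport-Security', 'max-age=31536000; includeSubDomains'),
    ])
    return [b'served over HTTPS only\n']

if __name__ == '__main__':
    # Toy server; in a real deployment the header would be added wherever TLS
    # is terminated, in front of the application servers.
    make_server('localhost', 8080, app).serve_forever()
```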
[Wikitech-l] New installer is here
Good afternoon,

In r75437, r75438 [0][1] I moved the old installer to old-index.php and moved the new one to index.php. At this stage in the process, I don't see us backing this out before we branch 1.17. I really want people to test it out and report any major breakages [2].

This has been a long development process for almost 2 years now, and I'd like to thank Max, Mark H., Jure, Jeroen, Roan and Siebrand for their invaluable help in working on this. And especially thanks to Tim for starting the project and providing feedback, as always.

There is a *lot* of code in includes/installer, and I'd like to highlight some of the major changes that you'll need to know.

Database updaters: They have been moved from the gigantic file in maintenance/updaters.inc (patchfiles still go in the same place though). Each supported DB type has a class that needs to subclass DatabaseUpdater. The format's very similar, only it's operating on methods in the classes instead of global functions. The globals $wgExtNewTables, etc. are retained for back compat and will be for quite some time. However, you can pass more advanced callbacks since the LoadExtensionSchemaUpdates hook now passes the DatabaseUpdater subclass as a param.

DB2 and MSSQL have been dropped from the installer. The implementations are far from complete and I'm not comfortable advertising their use yet.

Other known issues:
- Some UI quirks still exist, but work is coming here
- Postgres and Oracle are *almost* done
- Stuff listed on mw.org [2]

-Chad

[0] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/75437
[1] http://www.mediawiki.org/wiki/Special:Code/MediaWiki/75438
[2] http://www.mediawiki.org/wiki/New-installer_issues
Re: [Wikitech-l] New installer is here
2010/10/26 Erik Moeller e...@wikimedia.org:
> A few quick notes:

And, sorry for duplicating stuff from the known issues list.

--
Erik Möller
Deputy Director, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
Re: [Wikitech-l] New installer is here
I am on ALL of these things, actually. I have fixes for most of them pending.

On 10/26/10 10:41 AM, Erik Moeller wrote:
> 2010/10/26 Chad innocentkil...@gmail.com:
>> Good afternoon,
>>
>> In r75437, r75438 [0][1] I moved the old installer to old-index.php and moved the new one to index.php. At this stage in the process, I don't see us backing this out before we branch 1.17. I really want people to test it out and report any major breakages [2].
>
> Congratulations. :-) It looks great. A few quick notes:
>
> 1) On the admin/site name screen at least, when both aren't supplied, it only shows the error messages, not the form below. This may be a general issue with the form validation. Screenshot: http://tinypic.com/r/2po9vh0/7
>
> 2) Checkbox alignment in general is a bit off, at least in Chrome, e.g.: http://tinypic.com/r/655n5x/7
>
> 3) For the Extensions section, I would suggest adding a more visible warning: "Warning: Most extensions require additional configuration beyond this step. Installing unreviewed extensions may expose your wiki to security vulnerabilities." I know the Help already explains the first point, but the simple installer may suggest to the user that ticking a checkbox is all that's required.
>
> 4) It'd be great if we could change the design to Vector :-). In general it could use a bit more UI love -- perhaps Brandon will have time to take a quick look.
Re: [Wikitech-l] New installer is here
2010/10/26 Brandon Harris bhar...@wikimedia.org:
> I am on ALL of these things, actually. I have fixes for most of them pending.

Awesome :-)

--
Erik Möller
Deputy Director, Wikimedia Foundation

Support Free Knowledge: http://wikimediafoundation.org/wiki/Donate
Re: [Wikitech-l] Parallel computing project
Aryeh Gregor <Simetrical+wikilist at gmail.com> writes:
> To clarify, the subject needs to 1) be reasonably doable in a short timeframe, 2) not build on top of something that's already too optimized. It should probably either be a new project; or an effort to parallelize something that already exists, isn't parallel yet, and isn't too complicated. So far I have the password-cracking thing, maybe dbzip2, and maybe some unspecified thing involving dumps.

Some PageRank-like metric to approximate Wikipedia article importance/quality? Parallelizing eigenvalue calculations has a rich literature.
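[For a sense of how that parallelizes: the dominant eigenvector is usually found by power iteration, and each iteration splits naturally over chunks of the link matrix. A toy Python sketch follows; the four-page link graph is made up, and a real run would stream the pagelinks data rather than hold it in a dict.]

```python
# Toy parallel PageRank-style power iteration: each worker computes the
# contribution of its chunk of source pages, and the partial vectors are
# summed to form the next rank vector.
from multiprocessing import Pool

DAMPING = 0.85

# Made-up link graph: {page id: [page ids it links to]}. Every page here has
# outlinks; dangling pages would need the usual redistribution fix.
LINKS = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
N = len(LINKS)

def partial_rank(args):
    """One worker: contribution of its chunk of source pages to every target."""
    chunk, rank = args
    out = [0.0] * N
    for src in chunk:
        share = rank[src] / len(LINKS[src])
        for dst in LINKS[src]:
            out[dst] += share
    return out

def pagerank(iterations=50, workers=2):
    rank = [1.0 / N] * N
    pages = list(LINKS)
    chunks = [pages[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        for _ in range(iterations):
            partials = pool.map(partial_rank, [(c, rank) for c in chunks])
            rank = [(1 - DAMPING) / N + DAMPING * sum(p[i] for p in partials)
                    for i in range(N)]
    return rank

if __name__ == '__main__':
    print(pagerank())
```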
Re: [Wikitech-l] Parallel computing project
On 24/10/10 17:42, Aryeh Gregor wrote:
> This term I'm taking a course in high-performance computing (http://cs.nyu.edu/courses/fall10/G22.2945-001/index.html), and I have to pick a topic for a final project. According to the assignment (http://cs.nyu.edu/courses/fall10/G22.2945-001/final-project.pdf), "The only real requirement is that it be something in parallel."
>
> In the class, we covered
> * Microoptimization of single-threaded code (efficient use of CPU cache, etc.)
> * Multithreaded programming using OpenMP
> * GPU programming using OpenCL

I've occasionally wondered how hard it would be to parallelize a parser. It's generally not done, despite the fact that parsers are so slow and useful.

Some file formats can certainly be parsed in a parallel way, if you partition them in the right way. For example, if you were parsing a CSV file, you could partition on the line breaks. You can't do that by scanning the whole file in O(N), since that would defeat the purpose, but you can seek ahead to a suitable byte position and then scan forwards for the next line break to partition at.

For more complex file formats, there are various approaches. Googling tells me that this is a well-studied problem for XML. Obviously for an assessable project, you don't want to dig yourself into a hole too big to get out of. If you chose XML you could just follow the previous work. JavaScript might be tractable. Attempting to parse wikitext would be insane.

-- Tim Starling
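[The CSV partitioning Tim describes looks roughly like this in Python. A sketch only: the file name and the per-record work are placeholders, and quoted fields containing embedded newlines would need a smarter boundary scan than a plain readline().]

```python
# Partition a CSV file by seeking to roughly even byte offsets and nudging
# each cut forward to the next line break, then parse the chunks in parallel.
import csv
import io
import os
from multiprocessing import Pool

def chunk_boundaries(path, nchunks):
    """Byte ranges whose starts all fall on record boundaries."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, 'rb') as f:
        for i in range(1, nchunks):
            f.seek(i * size // nchunks)
            f.readline()                                   # finish the line we landed inside
            bounds.append(min(max(f.tell(), bounds[-1]), size))
    bounds.append(size)
    return [(path, bounds[i], bounds[i + 1]) for i in range(nchunks)]

def process_chunk(args):
    path, start, end = args
    with open(path, 'rb') as f:
        f.seek(start)
        raw = f.read(end - start)
    reader = csv.reader(io.StringIO(raw.decode('utf-8')))
    return sum(len(row) for row in reader)                 # placeholder per-record work

if __name__ == '__main__':
    # 'data.csv' is a stand-in file name for the sketch.
    with Pool(4) as pool:
        counts = pool.map(process_chunk, chunk_boundaries('data.csv', 4))
    print(sum(counts))
```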
Re: [Wikitech-l] New installer is here
On Tue, Oct 26, 2010 at 10:00 AM, Chad innocentkil...@gmail.com wrote:
> This has been a long development process for almost 2 years now, and I'd like to thank Max, Mark H., Jure, Jeroen, Roan and Siebrand for their invaluable help in working on this. And especially thanks to Tim for starting the project and providing feedback, as always.
>
> There is a *lot* of code in includes/installer, and I'd like to highlight some of the major changes that you'll need to know.

My hat is off to you, sirs! You guys have put a lot of great work into this -- absolutely blows away the old installer, that's for dang sure! Looks like 1.17 is going to be an awesome release...

I feel like a proud grandpappy getting the chance to see you guys' work shine... :)

-- brion
Re: [Wikitech-l] Parallel computing project
On Tue, Oct 26, 2010 at 8:25 AM, Ariel T. Glenn ar...@wikimedia.org wrote:
> On Tuesday, 26-10-2010, at 16:25 +0200, Platonides wrote:
>> Robert Rohde wrote:
>>> Many of the things done for the statistical analysis of database dumps should be suitable for parallelization (e.g. break the dump into chunks, process the chunks in parallel and sum the results). You could talk to Erik Zachte. I don't know if his code has already been designed for parallel processing though.
>>
>> I don't think it's a good candidate since you are presumably using compressed files, and decompression linearises it (and is most likely the bottleneck, too).
>
> If one were clever (and I have some code that would enable one to be clever), one could seek to some point in the (bzip2-compressed) file and uncompress from there before processing. Running a bunch of jobs, each decompressing only its small piece, then becomes feasible.
>
> I don't have code that does this for gz or 7z; afaik these do not do compression in discrete blocks.

Actually the LZMA used by default in 7z can be partially parallelized, with some strong limitations:

1) The location of block N is generally only found by finding the end of block N-1, so files have to be read serially.

2) The ability to decompress block N may or may not depend on already having decompressed blocks N-1, N-2, N-3, etc., depending on the details of the data stream.

Point 2 in particular tends to lead to a lot of conflicts that prevent parallelization. If block N happens to be independent of block N-1 then they can be done in parallel, but in general this will not be the case. The frequency of such conflicts depends a lot on the data stream and the options given to the compressor.

Last year LZMA2 was introduced in 7z with the primary intent of improving parallelization. It actually produces slightly worse compression in general, but can be operated to guarantee that block N is independent of blocks N-1 ... N-k for a specified k, meaning that k+1 blocks can always be considered in parallel.

I believe that gzip has similar constraints to LZMA that make parallelization problematic, but I'm not sure about that.

Getting back to Wikimedia, it appears correct that the Wikistats code is designed to run from the compressed files (source linked from [1]). As you suggest, one could use the properties of the .bz2 format to parallelize that.

I would also observe that parsers tend to be relatively slow, while decompressors tend to be relatively fast. I wouldn't necessarily assume that decompressing is the only bottleneck. I've run analyses on dumps that took longer to execute than it took to decompress the files. However, they probably didn't take that many times longer (i.e. if the process were parallelized in 2 to 4 simultaneous chunks, then the decompression would be the primary bottleneck again). So it is probably true that if one wants to see a large increase in the speed of stats processing, one needs to consider parallelizing both the decompression and the stats gathering.

-Robert Rohde

[1] http://stats.wikimedia.org/index_tabbed_new.html#fragment-14
Re: [Wikitech-l] Parallel computing project
Ariel T. Glenn wrote:
> If one were clever (and I have some code that would enable one to be clever), one could seek to some point in the (bzip2-compressed) file and uncompress from there before processing. Running a bunch of jobs, each decompressing only its small piece, then becomes feasible.
>
> I don't have code that does this for gz or 7z; afaik these do not do compression in discrete blocks.
>
> Ariel

The bzip2recover approach? I am not sure how much the gain will be after so much bit moving. Also, I was unable to continue from a flushed point, so it may not be so easy. OTOH, if you already have an index and the blocks end at page boundaries (which is what I was doing), it becomes trivial.

Remember that the previous block MUST continue up to the point where the next reader started processing inside the next block. And unlike what ttsiod said, you do encounter tags split between blocks in a normal compression.
Re: [Wikitech-l] Parallel computing project
On Wednesday, 27-10-2010, at 00:05 +0200, Ángel González wrote:
> Ariel T. Glenn wrote:
>> If one were clever (and I have some code that would enable one to be clever), one could seek to some point in the (bzip2-compressed) file and uncompress from there before processing. Running a bunch of jobs, each decompressing only its small piece, then becomes feasible.
>>
>> I don't have code that does this for gz or 7z; afaik these do not do compression in discrete blocks.
>>
>> Ariel
>
> The bzip2recover approach? I am not sure how much the gain will be after so much bit moving. Also, I was unable to continue from a flushed point, so it may not be so easy. OTOH, if you already have an index and the blocks end at page boundaries (which is what I was doing), it becomes trivial.
>
> Remember that the previous block MUST continue up to the point where the next reader started processing inside the next block. And unlike what ttsiod said, you do encounter tags split between blocks in a normal compression.

I am able (using Python bindings to the bzip2 library and some fiddling) to seek to an arbitrary point, find the first block after the seek point, and uncompress it and the following blocks in sequence. That is sufficient for our work, when we are talking about 250 GB compressed files.

We process everything by pages, so we ensure that any reader reads only specified page ranges from the file. This avoids overlaps. We don't build an index; we're only talking about parallelizing 10-20 jobs at once, not all 21 million pages, so building an index would not be worth it.

Ariel
Re: [Wikitech-l] Commons ZIP file upload for admins
@2010-10-26 03:45, Erik Moeller:
> 2010/10/25 Brion Vibber br...@pobox.com:
>> In all cases we have the worry that if we allow uploading those funky formats, we'll either a) end up with malicious files or b) end up with lazy people using and uploading non-free editing formats when we'd prefer them to use freely editable formats.
>>
>> I'm not sure I like the idea of using admin powers to control being able to upload those, though; bottlenecking content reviews as a strict requirement can be problematic on its own.
>
> Yeah, I don't like the bottleneck approach either, but in the absence of better systems, it may be the best way to go as an immediate solution. We could do it for a list of whitelisted open formats that are requested by the community. And we'd see from usage which file types we need to prioritize proper support/security checks for.
>
>> What I'd probably like to see is a more wide-open allowal of arbitrary 'source files' which can be uploaded as attachments to standalone files. We could give them more limited access: download only, no inline viewing, only allowed if DLs are on separate safe domain, etc.
>
> It seems fairly straightforward to me to say: "These free file formats are permitted to be uploaded. We haven't developed fully sophisticated security checks for them yet, so we're asking trusted users to do basic sanity checks until we've developed automatic checks." We can then prod people to convert any proprietary formats into free ones that are on that whitelist. And if they're free formats, I'm not sure why they shouldn't be first-class citizens -- as Michael mentioned, that makes it possible to plop in custom handlers at a later time. A COLLADA handler for 3D files may seem like a remote possibility, but it's certainly within the realm of sanity. ZIP files would have to be specially treated so they're only allowed if they contain only files in permitted formats.
>
> So, consistent with Michael's suggestion, we could define a 'restricted-upload' right, initially given to admins only but possibly expanded to other users, which would allow files from the potentially insecure list of extensions to be uploaded, and for ZIP files, would ensure that only accepted file types are contained within the archive. The resultant review bottleneck would simply be a reflection that we haven't gotten around to adding proper support for these file types yet. On the plus side, we could add restricted upload support for new open formats as soon as there's consensus to do so.
>
> The main downside I would see is that users might end up being confused why these files get uploaded. To mitigate this, we could add a "This file has a restricted filetype. Files of this type can currently only be uploaded by administrators for security reasons" note on file description pages.

ODS, ODT and such should be fairly easy to check, at least on a basic level. A very basic check would be to see whether the archive contains a Basic or Scripts folder. A bit more advanced would be to check whether manifest.xml contains application/binary (to catch anyone who tried to change the default naming) and whether any file contains script:module (for the same reason). If any of these is true, then there should be a warning.

I think we should also support Dia for diagrams and XCF for layered bitmaps. I don't know much about XCF, but Dia is a simple XML file (which might be zipped) and so shouldn't be dangerous at all. I guess it could even be unzipped upon loading, because Dia supports both zipped and unzipped versions alike.

There is/was also Extension:Dia which generates thumbnails... It seems to work fine even with 1.16 from the trunk and the latest Dia version. It doesn't work with zipped Dia files, but this would be manageable.

Regards,
Nux.
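[The ODS/ODT checks outlined above are easy to express. Here is a rough Python illustration; the real check would live in MediaWiki's PHP upload verification, and the file name is hypothetical.]

```python
# Flag an ODF document if it carries any of the macro/script indicators Nux
# lists: a Basic/ or Scripts/ directory, an application/binary entry in the
# manifest, or a script:module element in any XML member.
import zipfile

def odf_looks_scripted(path):
    """Return a list of human-readable warnings (empty if nothing suspicious)."""
    warnings = []
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        if any(n.startswith(('Basic/', 'Scripts/')) for n in names):
            warnings.append('contains a Basic/ or Scripts/ folder')
        if 'META-INF/manifest.xml' in names:
            if b'application/binary' in zf.read('META-INF/manifest.xml'):
                warnings.append('manifest declares application/binary')
        for name in names:
            if name.endswith('.xml') and b'script:module' in zf.read(name):
                warnings.append('%s contains script:module' % name)
    return warnings

if __name__ == '__main__':
    print(odf_looks_scripted('example.ods'))   # hypothetical file name
```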
[Wikitech-l] RT
After the recent discussions on openness and clarity, several people have asked what is contained within RT and have been given answers like "it's staff stuff". So what is stored in it that can't be in either the staff or internal wiki, where it must be private, or in Bugzilla for other matters?
Re: [Wikitech-l] Commons ZIP file upload for admins
On Tue, Oct 26, 2010 at 6:50 AM, Max Semenik maxsem.w...@gmail.com wrote:
> Instead of amassing social constructs around technical deficiency, I propose to fix bug 24230 [1] by implementing proper checking for JAR format. Also, we need to check all contents with antivirus and disallow certain types of files inside archives (such as .exe). Once we've taken all these precautions, I see no need to restrict ZIPs to any special group. Of course, this doesn't mean that we should allow all the safe ZIPs, just several open ZIP-based file formats.

If we only want zips for several formats, we should check that they are of the expected type, _and_ that they consist of open file formats within the zip.

e.g. Office Open XML (the MS format) can include binary files for OLE objects and fonts (I think); see "Table 2. Content types in a ZIP container":
http://msdn.microsoft.com/en-us/library/aa338205(office.12).aspx

OOXML can also include any other mimetype, which is registered _within_ the zip and linked into the main content file.

afaics, allowing only safe zips to be uploaded isn't difficult. Expand the zip, and reject any zip which contains files on $wgFileBlacklist, and not on $wgFileExtensions + $wgZipFileExtensions. $wgZipFileExtensions would consist of array('xml').

Then check the mimetypes of the files in the zip against $wgMimeTypeBlacklist (with 'application/zip' removed), again allowing desired XML mimetypes through.

--
John Vandenberg
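[A rough sketch of that whitelist logic, in Python for brevity; the real version would be PHP in MediaWiki's upload path, driven by the configuration variables named above, and the extension and MIME lists here are stand-ins.]

```python
# Accept a ZIP only if every member has a whitelisted extension and a
# non-blacklisted MIME type. MIME is guessed from the file name here; a real
# check would also sniff the member's actual content.
import mimetypes
import zipfile

# Stand-ins for $wgFileExtensions + $wgZipFileExtensions and for
# $wgMimeTypeBlacklist (with 'application/zip' removed).
ALLOWED_EXTENSIONS = {'xml', 'png', 'jpg', 'svg', 'ogg'}
BLACKLISTED_MIME = {'text/html', 'application/x-msdownload'}

def zip_is_acceptable(path):
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.endswith('/'):              # directory entry
                continue
            ext = name.rsplit('.', 1)[-1].lower() if '.' in name else ''
            if ext not in ALLOWED_EXTENSIONS:
                return False, 'disallowed extension: %s' % name
            mime, _ = mimetypes.guess_type(name)
            if mime in BLACKLISTED_MIME:
                return False, 'blacklisted MIME type: %s' % name
    return True, 'ok'

if __name__ == '__main__':
    print(zip_is_acceptable('upload.odt'))      # hypothetical upload
```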