Re: [Wikitech-l] What's the best place to do post-upload processing on a file? Etc.
Thanks for the pointer, Michael -- Timed Media Handler seems like a good example of job queue use. The way TMH does job queuing seems like a feasible option for the post-upload processing I'm doing. Rather than enqueuing jobs when the file's UploadComplete event fires, TMH seems to put video transcoding jobs into the queue the first time they're requested -- i.e. the first time a page containing the video is loaded. My initial impression was that it'd be faster from the user's perspective if the job were enqueued as soon as possible, which I assume would be onUploadComplete; a rough sketch of what I have in mind is below the quoted messages. Maybe the difference is negligible, or maybe I'm misunderstanding something -- any thoughts?

On a separate note, I've found a way to speed up the post-upload processing needed for my extension. The ray-tracing can be divided among multiple CPU cores. (I've tried both ray-tracing libraries supported by the molecular visualization package I'm using, and they only support multi-core, not multi-threaded, distribution of the task.) The time needed for the post-upload processing seems to decrease in proportion to the number of cores used. Given that, would it be possible to use multiple cores for this post-upload processing? If so, how many cores could be used for a given ray-tracing task? And if that distribution got the processing time down to something deemed reasonable for the user, would that make it unnecessary to enqueue jobs at all?

Thanks, Eric

On Fri, May 4, 2012 at 7:19 PM, Michael Dale md...@wikimedia.org wrote:

You will want to put it into a jobQueue; you can take a look at the Timed Media Handler extension for how post-upload, processor-intensive transformations can be handled.

--michael

On 05/04/2012 04:58 AM, emw wrote:

Hi all,

For a MediaWiki extension I'm working on (see http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/060254.html), an effectively plain-text file will need to be converted into a static image. I've got a set of scripts that does that, but it takes my medium-grade consumer laptop about 30 seconds to convert the plain-text file into a ray-traced static image. Since ray-tracing substantially improves the visual quality of the images being created here, my impression is that a moderately expensive transformation like this is worth it, but only if the operation is done once. Given that, I assume it'd be best to do this transformation immediately after the plain-text file has finished uploading. Is that right? If not, what's a better time/way to do that processing?

I've looked into MediaWiki's 'UploadComplete' event hook to accomplish this. That handler gives a way to access information about the upload and the local file. However, I haven't been able to find a way to get the uploaded file's path on the local file system, which I would need to do the transformation. Looking around related files I see references to $srcPath, which seems like what I need. Am I just missing some getter method for file system path data in UploadBase.php or LocalFile.php? How can I get information about an uploaded file's location on the file system from an onUploadComplete-like method in my extension?
Thanks, Eric
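For concreteness, here is a minimal sketch of the enqueue-on-upload approach described above. The class and job names (PdbRenderJob, PdbRenderHooks, 'pdbRender') are placeholders rather than anything TMH actually ships, and the older-style Job::batchInsert() call is only my guess at the right insertion API for MediaWiki 1.18:

$wgJobClasses['pdbRender'] = 'PdbRenderJob';
$wgHooks['UploadComplete'][] = 'PdbRenderHooks::onUploadComplete';

class PdbRenderJob extends Job {
    public function __construct( $title, $params ) {
        parent::__construct( 'pdbRender', $title, $params );
    }

    public function run() {
        // Run the ray-tracing pipeline (PyMOL render plus post-processing)
        // against the file named in $this->params['filename'].
        return true;
    }
}

class PdbRenderHooks {
    public static function onUploadComplete( &$uploadBase ) {
        $file = $uploadBase->getLocalFile();
        $job = new PdbRenderJob(
            $file->getTitle(),
            array( 'filename' => $file->getName() )
        );
        Job::batchInsert( array( $job ) ); // enqueue immediately, not on first view
        return true;
    }
}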
[Wikitech-l] What's the best place to do post-upload processing on a file? Etc.
Hi all,

For a MediaWiki extension I'm working on (see http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/060254.html), an effectively plain-text file will need to be converted into a static image. I've got a set of scripts that does that, but it takes my medium-grade consumer laptop about 30 seconds to convert the plain-text file into a ray-traced static image. Since ray-tracing substantially improves the visual quality of the images being created here, my impression is that a moderately expensive transformation like this is worth it, but only if the operation is done once. Given that, I assume it'd be best to do this transformation immediately after the plain-text file has finished uploading. Is that right? If not, what's a better time/way to do that processing?

I've looked into MediaWiki's 'UploadComplete' event hook to accomplish this. That handler gives a way to access information about the upload and the local file. However, I haven't been able to find a way to get the uploaded file's path on the local file system, which I would need to do the transformation. Looking around related files I see references to $srcPath, which seems like what I need. Am I just missing some getter method for file system path data in UploadBase.php or LocalFile.php? How can I get information about an uploaded file's location on the file system from an onUploadComplete-like method in my extension?

Thanks, Eric
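In case it helps frame the question, this is roughly the shape of the handler I've been trying to write; the class name is a placeholder, and whether LocalFile::getPath() is the right getter (or whether there's something closer to the $srcPath I keep seeing) is exactly what I'm unsure about:

$wgHooks['UploadComplete'][] = 'PdbHooks::onUploadComplete';

class PdbHooks {
    public static function onUploadComplete( &$uploadBase ) {
        $localFile = $uploadBase->getLocalFile();
        // Is this the right way to get the uploaded file's location on disk,
        // or is there a dedicated getter I'm missing?
        $path = $localFile->getPath();
        wfDebugLog( 'pdb', "Uploaded file appears to live at $path" );
        return true;
    }
}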
Re: [Wikitech-l] Developing MediaWiki extension for WebGL-enabled interactive 3D models
Thanks for the feedback, Platonides.

Would requiring Python, GIMP and PyMOL to be installed on the server be workable for a WMF MediaWiki deployment?

Not ideal, but is probably workable. Still much better than relying on (and potentially DDoSing) a third party. If you could drop the GIMP requirement, that'd be even better (why is it needed?).

A small script that hooks into GIMP API methods is used to tidy up the PNGs output by the PyMOL molecular visualization program. The PyMOL images are originally output with a lot of extraneous whitespace. Specifically, the script takes in a PNG and outputs an autocropped image with 50 pixels of whitespace around the subject -- e.g. http://en.wikipedia.org/wiki/File:Protein_FOXP2_PDB_2a07.png. The script: https://code.google.com/p/pdbbot/source/browse/trunk/crop-and-pad-pdb.scm.

ImageMagick seems like it might also be able to programmatically autocrop an image and add a set amount of padding around the subject. I'll look into that and substitute an ImageMagick command for the GIMP script if possible; a rough sketch of that substitution is below. If anyone can think of a better option, please let me know.

- Eric
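Here is the kind of ImageMagick substitution I have in mind, shelling out from PHP with MediaWiki's existing $wgImageMagickConvertCommand setting. $srcPath and $dstPath are placeholders for wherever the PyMOL output and the final image live, and I haven't verified that -trim plus -border reproduces the GIMP script's output exactly:

// Trim the PyMOL output to its bounding box, then pad with a uniform
// 50-pixel white margin (approximating what crop-and-pad-pdb.scm does).
$cmd = wfEscapeShellArg( $wgImageMagickConvertCommand ) . ' ' .
    wfEscapeShellArg( $srcPath ) .
    ' -trim +repage -bordercolor white -border 50 ' .
    wfEscapeShellArg( $dstPath );
$retval = 0;
wfShellExec( $cmd, $retval );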
Re: [Wikitech-l] Developing MediaWiki extension for WebGL-enabled interactive 3D models
That page is precisely the kind of information I'm looking for, thanks! Per the instructions there, I'll talk with Howie Fung about the extension. And I'll update here with further questions or significant notes as they come along.

- Eric

On Sat, Apr 21, 2012 at 4:12 PM, Sumana Harihareswara suma...@wikimedia.org wrote:

On 04/21/2012 08:07 AM, emw wrote:

Hi all,

I'm in the process of developing a media handling extension for MediaWiki that will allow users with WebGL-enabled browsers to manipulate 3D models of large biological molecules, like proteins and DNA. I'm new to MediaWiki development, and I've got some questions about how I should go forward with development of this extension if I want to ultimately get it into official Wikimedia MediaWiki deployments.

Eric, thank you for this contribution, and, like, wow, cool! Let me point you to https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment , which answers several of your questions, I think. Please keep sharing your progress here.

--
Sumana Harihareswara
Volunteer Development Coordinator
Wikimedia Foundation
[Wikitech-l] Developing MediaWiki extension for WebGL-enabled interactive 3D models
Hi all,

I'm in the process of developing a media handling extension for MediaWiki that will allow users with WebGL-enabled browsers to manipulate 3D models of large biological molecules, like proteins and DNA. I'm new to MediaWiki development, and I've got some questions about how I should go forward with development of this extension if I want to ultimately get it into official Wikimedia MediaWiki deployments.

My initial goal is to put the kind of interactive model available at http://webglmol.sourceforge.jp/glmol/viewer.html into infoboxes like the one in http://en.wikipedia.org/wiki/FOXP2. The library enabling this interactivity is called GLmol -- it's licensed under LGPL and described at http://webglmol.sourceforge.jp/index-en.html. There is some more background discussion on the extension at http://en.wikipedia.org/wiki/Portal:Gene_Wiki/Discussion#Enabling_molecular_structure_manipulation_with_WebGL.

I have a prototype of the extension working on a local deployment of MediaWiki 1.18.1. I've tried to organize the extension's code roughly along the lines of http://www.mediawiki.org/wiki/Extension:OggHandler. The user workflow to get an interactive protein model into an article is to:

1) Upload a PDB file (e.g. http://www.rcsb.org/pdb/files/2A07.pdb) representing the protein structure through MediaWiki's standard file upload UI.

2) Add a wikilink to the resulting file, very similar to what's done with images. For example, [[File:2A07.pdb]].

If the user's browser has WebGL enabled, an interactive model of the macromolecule similar to the one in the linked GLmol demo is then loaded onto the page via an asynchronous request for the 3D model's atomic coordinate data. I've done work to decrease the time needed to render the 3D model and the size of the 3D model data (much beyond gzipping), so my prototype loads faster than the linked demo.

A main element of this extension -- which I haven't yet developed -- is how it will gracefully degrade for users without WebGL enabled. IE8 and IE9 don't support WebGL, and IE10 probably won't either. Safari 5.1.5 supports WebGL, but not by default. WebGL is also not supported on many smartphones. One idea is to fall back to a 2D canvas representation of the model, perhaps like the 3D-to-2D examples at https://github.com/mrdoob/three.js/. I see several drawbacks to this. First, it would not be a fall-back for clients with JavaScript disabled. Second, the GLmol molecular viewer library doesn't currently support 2D canvas fall-back, and it would probably take substantial time and effort to add that feature. Third, there are browser plug-ins for IE that enable WebGL, e.g. http://iewebgl.com/.

Given that, my initial plan for handling browsers without WebGL enabled is to fall back to a static image of the corresponding protein/DNA structure. A few years ago I wrote a program to take in a PDB file and output a high-quality static image of the corresponding structure. This resulted in PDBbot (http://commons.wikimedia.org/wiki/User:PDBbot, http://code.google.com/p/pdbbot/). That code could likely be repurposed in this media handling extension to generate a static image upon the upload of a PDB file. The PDBbot code is mostly Python 3, and it interacts with GIMP (via scripts in Scheme) and PyMOL (http://en.wikipedia.org/wiki/PyMOL, freely licensed: http://pymol.svn.sourceforge.net/viewvc/pymol/trunk/pymol/LICENSE?revision=3882&view=markup). Would requiring Python, GIMP and PyMOL to be installed on the server be workable for a WMF MediaWiki deployment?
If not, then there is a free web service developed for Wikipedia (via Gene Wiki) and available from the European Bioinformatics Institute, which points to their pre-rendered static images for macromolecules. The static images could thus be retrieved from a remote server if it wouldn't be feasible to generate them locally on the upload server. I see a couple of disadvantages to this approach -- e.g. relying on a remote third-party web service -- but I thought I'd put the idea out for consideration. If generating static images on the upload server wouldn't be possible, would this be a workable alternative?

After I get an answer to the questions above, I can begin working on that next major part of the extension. This is a fairly blocking issue, so feedback would definitely be appreciated. Beyond that, and assuming this extension seems viable so far, I've got some more questions:

1. Once I get the prototype more fully developed, what would be the best next step for presenting it and getting it code reviewed? Should I set up a demo on a random domain/third-party VPN, or maybe something like http://deployment.wikimedia.beta.wmflabs.org/wiki/Main_Page? Or maybe the former would come before the latter?

2. PDB (.pdb) is a niche file type that has a non-standard MIME type of chemical/x-pdb. See http://en.wikipedia.org/wiki/Protein_Data_Bank_%28file_format%29 for more. To upload files with this MIME
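For what it's worth, the upload-side configuration I expect the extension to need looks roughly like the lines below. 'PdbHandler' is a placeholder class name, and how best to teach MediaWiki's MIME detection about chemical/x-pdb is one of the things I still need to confirm:

$wgFileExtensions[] = 'pdb';                       // allow .pdb uploads
$wgMediaHandlers['chemical/x-pdb'] = 'PdbHandler'; // route rendering of PDB files to the handler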
Re: [Wikitech-l] State of page view stats
Anyway, I don't say that the project is impossible or unnecessary, but there are lots of trade-offs to be made - what kind of real-time querying workloads are to be expected, what kind of pre-filtering do people expect, etc.

I could be biased here, but I think the canonical use case for someone seeking page view information is viewing page view counts for a set of articles -- usually a single article, but sometimes several -- over an arbitrary time range. Narrowing that down, I'm not sure whether the demand for real-time data (say, for the previous hour) would be higher than the demand for fast query results over more historical data. Would those two workloads imply the kind of trade-off you were referring to? If not, could you give some examples of the kinds of expected workloads/use cases that would entail such trade-offs? If ordering pages by page view count for a given time period would imply such a trade-off, then I think it would make sense to deprioritize page ordering.

I'd be really interested to know your thoughts on an efficient schema for organizing the raw page view data in the archives at http://dammit.lt/wikistats/ (a rough sketch of what I mean is below).

Thanks, Eric
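The per-line processing I've been assuming looks something like this, based on the space-separated "project page_title count bytes_transferred" format of the hourly pagecounts files; the filename and the table hinted at in the comments are only illustrative, not a proposed schema:

$fh = gzopen( 'pagecounts-20120501-000000.gz', 'r' );
while ( ( $line = gzgets( $fh, 4096 ) ) !== false ) {
    $parts = explode( ' ', trim( $line ) );
    if ( count( $parts ) !== 4 ) {
        continue; // skip malformed lines
    }
    list( $project, $title, $count, $bytes ) = $parts;
    if ( $project !== 'en' ) {
        continue; // only load English Wikipedia rows in this example
    }
    // e.g. accumulate into an hourly table:
    // INSERT INTO page_views (project, title, view_hour, views) VALUES (...)
}
gzclose( $fh );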
Re: [Wikitech-l] State of page view stats
I'd be willing to work on this on a volunteer basis. I developed http://toolserver.org/~emw/wikistats/, a page view analysis tool that incorporates lots of features that have been requested of Henrik's tool. The main bottleneck has been that, as MZMcBride mentions, an underlying database of page view data is unavailable. Henrik's JSON API has limitations that are probably tied to the underlying data model. The fact that there aren't any other such APIs is arguably the bigger problem.

I wrote down some initial thoughts on how the reliability of this data, and WMF's page view data services generally, could be improved at http://en.wikipedia.org/w/index.php?title=User_talk:Emw&oldid=442596566#wikistats_on_toolserver. I've also drafted more specific implementation plans. These plans assume that I would be working with the basic data in Domas's archives. There is still a lot of untapped information in that data -- e.g. hourly views -- and potential for mashups with categories, automated inference of trend causes, etc. If more detailed (but still anonymized) OWA data were available, however, that would obviously open up the potential for much richer APIs and analysis.

Getting the archived page view data into a database seems very doable. This data seems like it would be useful even if OWA data were available, since the OWA data wouldn't cover 12/2007 through 2009. As I see it, the main thing needed from WMF would be storage space on a publicly available server. Then, optionally, maybe some funds to cover the cost of cloud services to process and compress the data and load it into a database. Input and advice would be invaluable, too.

Eric