Re: [Wikitech-l] What's the best place to do post-upload processing on a file? Etc.

2012-05-09 Thread emw
Thanks for the pointer Michael -- Timed Media Handler seems like a good
example of job queuing use.

The way TMH does job queuing seems like a feasible option for the
post-upload processing I'm doing.  Rather than enqueuing jobs when the
file's UploadComplete event fires, TMH seems to put video transcoding jobs
into the queue the first time they're requested -- i.e. the first time a
page containing the video is loaded.  My initial impression was that it'd
be faster from the user's perspective if the job were enqueued as soon as
possible, which I assume would be in an UploadComplete handler.

Maybe there's a negligible difference here, or maybe I don't understand
something -- any thoughts?
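
For concreteness, here's the sort of thing I had in mind for the
enqueue-on-upload approach.  This is only a sketch: RayTraceJob is a
hypothetical Job subclass I'd still need to write, and I believe
Job::batchInsert() is the right way to queue it, but I haven't verified that.

    // Hook handler registered for 'UploadComplete' (sketch only)
    public static function onUploadComplete( &$uploadBase ) {
        $file = $uploadBase->getLocalFile();
        // RayTraceJob is hypothetical -- a Job subclass that would run the
        // ray-tracing scripts against the uploaded file.
        $job = new RayTraceJob( $file->getTitle(),
            array( 'filename' => $file->getName() ) );
        Job::batchInsert( array( $job ) );  // leave the heavy work to a job runner
        return true;
    }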

On a separate note, I've found a way to speed up the post-upload processing
needed for my extension.  The ray-tracing can be divided among multiple CPU
cores. (I've tried both ray-tracing libraries supported by the molecular
visualization package I'm using, and they only support multi-core, not
multithreaded, distribution of the task.)  The time needed to do the
post-upload processing seems to decrease in proportion to the number of
cores used.

Given that, would it be possible to use multiple cores for this post-upload
processing?  If so, how many cores could be used for any one of these
ray-tracing tasks?  If that distribution brought the processing time down to
something deemed reasonable for the user, would that make it unnecessary to
enqueue jobs for this operation at all?

Thanks,
Eric

On Fri, May 4, 2012 at 7:19 PM, Michael Dale md...@wikimedia.org wrote:

 You will want to put it into the job queue.  You can take a look at the Timed
 Media Handler extension for how post-upload, processor-intensive
 transformations can be handled.

 --michael


 On 05/04/2012 04:58 AM, emw wrote:

 Hi all,

 For a MediaWiki extension I'm working on (see
 http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/060254.html), an
 effectively plain-text file will need to be converted into a static image.
 I've got a set of scripts that does that, but it takes my medium-grade
 consumer laptop about 30 seconds to convert the plain-text file into a
 ray-traced static image.  Since ray-tracing the images being created here
 substantially improves their visual quality, my impression is that it's
 worth a moderately expensive transformation operation like this, but only
 if the operation is done once.

 Given that, I assume it'd be best to do this transformation immediately
 after the plain-text file has completed uploading.  Is that right?  If not,
 what's a better time/way to do that processing?

 I've looked into MediaWiki's 'UploadComplete' event hook to accomplish
 this. That handler gives a way to access information about the upload and
 the local file.  However, I haven't been able to find a way to get the
 uploaded file's path on the local file system, which I would need to do
 the
 transformation.  Looking around related files I see references to
 $srcPath,
 which seems like what I would need.  Am I just missing some getter method
 for file system path data in UploadBase.php or LocalFile.php?  How can I
 get the information about an uploaded file's location on the file system
 while in an onUploadComplete-like object method in my extension?

 Thanks,
 Eric


[Wikitech-l] What's the best place to do post-upload processing on a file? Etc.

2012-05-04 Thread emw
Hi all,

For a MediaWiki extension I'm working on (see
http://lists.wikimedia.org/pipermail/wikitech-l/2012-April/060254.html), an
effectively plain-text file will need to be converted into a static image.
I've got a set of scripts that does that, but it takes my medium-grade
consumer laptop about 30 seconds to convert the plain-text file into a
ray-traced static image.  Since ray-tracing the images being created here
substantially improves their visual quality, my impression is that it's
worth a moderately expensive transformation operation like this, but only
if the operation is done once.

Given that, I assume it'd be best to do this transformation immediately
after the plain-text file has completed uploading.  Is that right?  If not,
what's a better time/way to do that processing?

I've looked into MediaWiki's 'UploadComplete' event hook to accomplish
this. That handler gives a way to access information about the upload and
the local file.  However, I haven't been able to find a way to get the
uploaded file's path on the local file system, which I would need to do the
transformation.  Looking around related files I see references to $srcPath,
which seems like what I would need.  Am I just missing some getter method
for file system path data in UploadBase.php or LocalFile.php?  How can I
get the information about an uploaded file's location on the file system
while in an onUploadComplete-like object method in my extension?
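
In case it helps clarify what I'm after, here is roughly what my hook
handler looks like, with the part I'm unsure about marked.  The getPath()
call is a guess on my part, not something I've confirmed works:

    // Simplified sketch of my 'UploadComplete' handler
    public static function onUploadComplete( &$uploadBase ) {
        $localFile = $uploadBase->getLocalFile();  // LocalFile object
        $srcPath = $localFile->getPath();          // <-- is this the right getter?
        // ...pass $srcPath to the scripts that generate the ray-traced image...
        return true;
    }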

Thanks,
Eric


Re: [Wikitech-l] Developing MediaWiki extension for WebGL-enabled interactive 3D models

2012-04-22 Thread emw
Thanks for the feedback Platonides.

 Would requiring Python, GIMP and PyMOL to be installed on the server be
 workable for a WMF MediaWiki deployment?

 Not ideal, but is probably workable. Still much better than relying (and
 potentially DDOSing) on a third party.
 If you could drop GIMP requirement, that'd be even better (why is it
 needed?).

A small script that hooks into GIMP API methods is used to tidy up PNGs
output by the PyMOL molecular visualization program.  The PyMOL images are
originally output with a lot of extraneous whitespace.  Specifically, the
script takes in a PNG and outputs an autocropped image with 50 pixels of
whitespace around the subject -- e.g.
http://en.wikipedia.org/wiki/File:Protein_FOXP2_PDB_2a07.png.  The script:
https://code.google.com/p/pdbbot/source/browse/trunk/crop-and-pad-pdb.scm.

ImageMagick seems like it might also have the ability to programmatically
autocrop an image and add a certain padding around the subject.  I'll look
into that and substitute an ImageMagick script for the GIMP one if
possible.
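
If I do go that route, I'd expect it to boil down to something like the
following.  This is untested, it assumes the extraneous background is plain
white, and it uses MediaWiki's wfShellExec() since the command would run
from inside the extension:

    // Possible ImageMagick replacement for the GIMP autocrop script (untested).
    // -trim crops away the uniform background, +repage resets the canvas
    // geometry, and -border re-adds 50 px of padding around the subject.
    $srcPng = '/tmp/pymol-raw.png';      // placeholder paths
    $dstPng = '/tmp/pymol-cropped.png';
    $cmd = 'convert ' . escapeshellarg( $srcPng ) .
        ' -trim +repage -bordercolor white -border 50 ' .
        escapeshellarg( $dstPng );
    wfShellExec( $cmd, $retval );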

If anyone can think of a better option for that, please let me know.

- Eric


Re: [Wikitech-l] Developing MediaWiki extension for WebGL-enabled interactive 3D models

2012-04-22 Thread emw
That page is precisely the kind of information I'm looking for, thanks!
Per the instructions there, I'll talk with Howie Fung about the extension.
And I'll update here with further questions or significant notes as they
come along.

- Eric

On Sat, Apr 21, 2012 at 4:12 PM, Sumana Harihareswara suma...@wikimedia.org
 wrote:

 On 04/21/2012 08:07 AM, emw wrote:
  Hi all,
 
  I'm in the process of developing a media handling extension for MediaWiki
  that will allow users with WebGL-enabled browsers to manipulate 3D models
  of large biological molecules, like proteins and DNA.  I'm new to MediaWiki
  development, and I've got some questions about how I should go forward with
  development of this extension if I want to ultimately get it into official
  Wikimedia MediaWiki deployments.

 Eric, thank you for this contribution, and, like, wow, cool!  Let me
 point you to
 https://www.mediawiki.org/wiki/Writing_an_extension_for_deployment ,
 which answers several of your questions, I think.

 Please keep sharing your progress here.

 --
 Sumana Harihareswara
 Volunteer Development Coordinator
 Wikimedia Foundation



[Wikitech-l] Developing MediaWiki extension for WebGL-enabled interactive 3D models

2012-04-21 Thread emw
Hi all,

I'm in the process of developing a media handling extension for MediaWiki
that will allow users with WebGL-enabled browsers to manipulate 3D models
of large biological molecules, like proteins and DNA.  I'm new to MediaWiki
development, and I've got some questions about how I should go forward with
development of this extension if I want to ultimately get it into official
Wikimedia MediaWiki deployments.

My initial goal is to put the kind of interactive model available at
http://webglmol.sourceforge.jp/glmol/viewer.html into infoboxes like the
one in http://en.wikipedia.org/wiki/FOXP2.  The library enabling this
interactivity is called GLmol -- it's licensed under LGPL and described at
http://webglmol.sourceforge.jp/index-en.html.  There is some more
background discussion on the extension at
http://en.wikipedia.org/wiki/Portal:Gene_Wiki/Discussion#Enabling_molecular_structure_manipulation_with_WebGL.


I have a prototype of the extension working on a local deployment of
MediaWiki 1.18.1.  I've tried to organize the extension's code roughly
along the lines of http://www.mediawiki.org/wiki/Extension:OggHandler.  The
user workflow to get an interactive protein model into an article is to:

  1) Upload a PDB file (e.g. http://www.rcsb.org/pdb/files/2A07.pdb)
representing the protein structure through MediaWiki's standard file upload
UI.
  2) Add a wikilink to the resulting file, very similar to what's done with
images.  For example, [[File:2A07.pdb]].

If the user's browser has WebGL enabled, an interactive model of the
macromolecule similar to one in the linked GLmol demo is then loaded onto
the page via an asynchronous request to get the 3D model's atomic
coordinate data.  I've done work to decrease the time needed to render the
3D model and the size of the 3D model data (much beyond gzipping), so my
prototype loads faster than the linked demo.
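
For anyone curious about the wiring, the registration side is small --
roughly the following in LocalSettings.php and the extension setup file.
The class and file names here are placeholders rather than my actual
prototype code:

    // In LocalSettings.php (placeholder path)
    require_once "$IP/extensions/PdbHandler/PdbHandler.php";

    // In the extension setup file: allow .pdb uploads and map the PDB MIME
    // type to a MediaHandler subclass, much as OggHandler does for Ogg types.
    $wgFileExtensions[] = 'pdb';
    $wgMediaHandlers['chemical/x-pdb'] = 'PdbHandler';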

A main element of this extension -- which I haven't yet developed -- is how
it will gracefully degrade for users without WebGL enabled.  IE8 and IE9
don't support WebGL, and IE10 probably won't either.  Safari 5.1.5 supports
WebGL, but not by default.  WebGL is also not supported on many smartphones.

One idea is to fall back to a 2D canvas representation of the model,
perhaps like the 3D-to-2D examples at https://github.com/mrdoob/three.js/.
I see several drawbacks to this.  First, it would not be a fall-back for
clients with JavaScript disabled.  Second, the GLmol molecular viewer
library doesn't currently support 2D canvas fall-back, and it would
probably take substantial time and effort to add that feature.  Third, there
are already browser plug-ins for IE that enable WebGL, e.g.
http://iewebgl.com/, which make a canvas fall-back less compelling there.

Given that, my initial plan for handling browsers without WebGL enabled is
to fall back to a static image of the corresponding protein/DNA structure.
A few years ago I wrote a program to take in a PDB file and output a
high-quality static image of the corresponding structure.  This resulted in
PDBbot (http://commons.wikimedia.org/wiki/User:PDBbot,
http://code.google.com/p/pdbbot/).  That code could likely be repurposed in
this media handling extension to generate a static image upon the upload of
a PDB file.  The PDBbot code is mostly Python 3, and it interacts with GIMP
(via scripts in Scheme) and PyMOL (http://en.wikipedia.org/wiki/PyMOL,
freely licensed:
http://pymol.svn.sourceforge.net/viewvc/pymol/trunk/pymol/LICENSE?revision=3882&view=markup
).

Would requiring Python, GIMP and PyMOL to be installed on the server be
workable for a WMF MediaWiki deployment?  If not, then there is a free web
service developed for Wikipedia (via Gene Wiki) available from the European
Bioinformatics Institute, which points to their pre-rendered static images
for macromolecules.  The static images could thus be retrieved from a
remote server if it wouldn't be feasible to generate them locally on the
upload server.  I see a couple of disadvantages to this approach, e.g.
relying on a remote third-party web service, but I thought I'd put the idea
out for consideration.  If generating static images on the upload server
wouldn't be possible, would this be a workable alternative?
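
If the remote route were preferred, the fetch itself would presumably be
simple -- something like the following, where the URL is made up and I'd
still need to check EBI's actual endpoint:

    // Retrieve a pre-rendered static image from the remote service
    // (hypothetical URL; error handling kept minimal for the sketch).
    $url = 'http://www.ebi.ac.uk/some-service/static/2A07.png';
    $png = Http::get( $url );
    if ( $png === false ) {
        // remote service unavailable -- the main drawback of this approach
    }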

After I get answers to the questions above, I can begin working on that
next major part of the extension.  This is a fairly blocking issue, so
feedback would definitely be appreciated.

Beyond that, and assuming this extension seems viable so far, I've got some
more questions:

1. Once I get the prototype more fully developed, what would be the best
next step for presenting it and getting it code reviewed?  Should I set up a
demo on a separate domain or third-party VPS, or maybe something like
http://deployment.wikimedia.beta.wmflabs.org/wiki/Main_Page?  Or maybe the
former would come before the latter?

2. PDB (.pdb) is a niche file type that has a non-standard MIME type of
chemical/x-pdb.  See
http://en.wikipedia.org/wiki/Protein_Data_Bank_%28file_format%29 for more.
To upload files with this MIME 

Re: [Wikitech-l] State of page view stats

2011-08-12 Thread Emw
 Anyway, I don't say that the project is impossible or unnecessary, but
 there're lots of tradeoffs to be made
 - what kind of real time querying workloads are to be expected, what kind of
 pre-filtering do people expect, etc.

I could be biased here, but I think the canonical use case for someone seeking
page view information would be viewing page view counts for a set of articles --
most times a single article, but also multiple articles -- over an arbitrary
time range.  Narrowing that down, I'm not sure whether the level of demand for
real-time data (say, for the previous hour) would be higher than the demand for
fast query results for more historical data.  Would these two workloads imply
the kind of trade-off you were referring to?  If not, could you give some
examples of what kind of expected workloads/use cases would entail such
trade-offs?  

If ordering pages by page view count for a given time period would imply such a
tradeoff, then I think it'd make sense to deprioritize page ordering.

I'd be really interested to know your thoughts on an efficient schema for
organizing the raw page view data in the archives at 
http://dammit.lt/wikistats/.

Thanks,
Eric




Re: [Wikitech-l] State of page view stats

2011-08-11 Thread Emw
I'd be willing to work on this on a volunteer basis.

I developed http://toolserver.org/~emw/wikistats/, a page view analysis tool
that incorporates lots of features that have been requested of Henrik's tool. 
The main bottleneck has been that, like MZMcBride mentions, an underlying
database of page view data is unavailable.  Henrik's JSON API has limitations
probably tied to the underlying data model.  The fact that there aren't any
other such APIs is arguably the bigger problem.

I wrote down some initial thoughts on how this data reliability, and WMF's page
view data services generally, could be improved at
http://en.wikipedia.org/w/index.php?title=User_talk:Emw&oldid=442596566#wikistats_on_toolserver.
I've also drafted more specific implementation plans.  These plans assume that
I would be working with the basic data in Domas's archives.  There is still a
lot of untapped information in that data -- e.g. hourly views -- and potential
for  mashups with categories, automated inference of trend causes, etc. If more
detailed (but still anonymized) OWA data were available, however, that would
obviously open up the potential for much richer APIs and analysis.

Getting the archived page view data into a database seems very doable.  This
data seems like it would be useful even if there were OWA data available, since
that OWA data wouldn't cover 12/2007 through 2009.  As I see it, the main thing
needed from WMF would be storage space on a publicly-available server.  Then,
optionally, maybe some funds for the cost of cloud services to process and
compress the data, and put it into a database.  Input and advice would be
invaluable, too.
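
For what it's worth, if I remember the format right, each line in those
hourly files is just project code, page title, view count and bytes
transferred, so the loading step itself would be straightforward.  A rough
sketch, with placeholder table and column names:

    // Load one hourly pagecounts file into a database (sketch only).
    // Each line looks like: "en Main_Page 242332 4737756101"
    $fh = gzopen( 'pagecounts-20110801-000000.gz', 'r' );
    while ( ( $line = gzgets( $fh, 4096 ) ) !== false ) {
        $fields = explode( ' ', trim( $line ) );
        if ( count( $fields ) !== 4 ) {
            continue;  // skip malformed lines
        }
        list( $project, $title, $views, $bytes ) = $fields;
        // INSERT INTO page_views (project, title, hour, views) VALUES ( ... )
    }
    gzclose( $fh );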

Eric

