+1

I think everything into Q3 looks like a good way to proceed. There might be
an interesting division of labor in getting these things done (Parsoid job
handling, Cite extension rewrite, API batching), and I'd be willing to help
in the areas where I'd be useful. The plan is ambitious, but the steps laid
out look manageable by themselves. We will see how the target dates collide
with reality, which may also depend on the level of interest.

I'd really like to see a reduction in the CPU spent on refreshLinks jobs, so
anything that helps in that area is welcome. We currently rely on throwing
more processes and hardware at the problem and on de-duplication to at least
stop jobs from piling up (such as when heavily used templates keep getting
edited before the previous jobs finish). De-duplication has its own costs,
and it will make sense to move the queue off the main clusters. Managing
these jobs is getting more difficult. In fact, edits to just a few templates
can account for a majority of the queue, with tens of thousands of entire
pages re-parsed because of some modest template change. I like the idea of
storing dependency information in (or alongside) the HTML as metadata and
using it to recompute only the affected parts of the DOM.
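To make concrete what I mean by dependency metadata, here's a toy sketch in
Python (nothing like Parsoid's actual data model, just the shape of the
idea): fragments of rendered HTML are indexed by the templates they used, so
a template edit only invalidates the fragments that depend on it.

```python
# Toy sketch: store template dependencies alongside rendered HTML
# fragments, so a template edit only forces re-rendering of the
# fragments that actually used it. All names here are hypothetical.

from collections import defaultdict


class FragmentStore:
    """Maps fragment ids to HTML and indexes fragments by template used."""

    def __init__(self):
        self.fragments = {}           # fragment_id -> html
        self.deps = defaultdict(set)  # template -> {fragment_id, ...}

    def store(self, fragment_id, html, templates_used):
        self.fragments[fragment_id] = html
        for tpl in templates_used:
            self.deps[tpl].add(fragment_id)

    def affected_by(self, template):
        """Fragments that must be re-rendered after `template` changes."""
        return sorted(self.deps.get(template, ()))


store = FragmentStore()
store.store("sec-1", "<p>Intro</p>", ["Template:Infobox"])
store.store("sec-2", "<p>Body</p>", ["Template:Cite", "Template:Infobox"])
store.store("sec-3", "<p>Notes</p>", ["Template:Cite"])

# Editing Template:Cite invalidates sec-2 and sec-3, not the whole page.
print(store.affected_by("Template:Cite"))  # ['sec-2', 'sec-3']
```

The win is that a modest template change re-parses a handful of fragments
instead of tens of thousands of entire pages.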

There is certainly discussion to be had about the cleanest way to handle the
trade-offs of when to store updated HTML for a revision (when a
template/file changes, or when a magic word or DPL list should be
re-calculated). It probably will not make sense for old revisions of pages.
If we are storing new versions of HTML, it may make sense to purge the old
ones from external storage when updates are frequent, though that interface
has no deletion support, and deletion runs slightly against the philosophy
of the external storage classes. It's probably not a big deal to change
that, though. I've also been told that the HTML tends to compress well, so
we should not be looking at an order-of-magnitude increase in text storage
requirements (maybe 4X or so, going by some quick tests). I'd like to see
some documented statistics on this though, with samples.
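For what it's worth, the kind of quick test I have in mind is trivial to
script. This Python snippet uses a synthetic sample, so the ratio it prints
is not representative of real parser output; real statistics should be
gathered from actual rendered revisions.

```python
# Measure how well rendered HTML compresses. The sample below is
# synthetic and deliberately repetitive; use real parser output for
# any numbers worth documenting.
import zlib


def compression_ratio(html: str, level: int = 6) -> float:
    """Ratio of raw UTF-8 size to zlib-compressed size."""
    raw = html.encode("utf-8")
    packed = zlib.compress(raw, level)
    return len(raw) / len(packed)


# Rendered HTML is full of repeated tags, attributes, and annotations,
# which is why it tends to compress well.
sample = ('<p about="#mwt1" typeof="mw:Transclusion" '
          'data-mw=\'{"parts":[1]}\'>cell</p>\n') * 500
print(f"{compression_ratio(sample):.1f}x")
```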

I think the VisualEditor + HTML-only method for third parties is
interesting and could probably make good use of ContentHandler. I'm curious
about the exact nature of the HTML validation needed server-side for this
setup, but from what I understand it would not be too complicated, and the
metadata could be handled in a way that does not require blind trust of the
client.
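As a rough illustration of the kind of server-side validation I mean: a
whitelist walk over the submitted HTML, rejecting anything outside an
allowed set of tags and attributes. The lists here are purely illustrative,
not MediaWiki's actual sanitizer rules.

```python
# Sketch: verify client-submitted HTML against a whitelist rather than
# trusting it blindly. Tag/attribute sets are illustrative only.
from html.parser import HTMLParser

ALLOWED_TAGS = {"p", "a", "b", "i", "ul", "ol", "li", "span", "div"}
ALLOWED_ATTRS = {"href", "title", "class", "id"}


class WhitelistValidator(HTMLParser):
    """Collects every tag or attribute outside the whitelist."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            self.violations.append(f"tag <{tag}>")
        for name, _ in attrs:
            if name not in ALLOWED_ATTRS:
                self.violations.append(f"attribute {name!r} on <{tag}>")


def validate(html: str):
    v = WhitelistValidator()
    v.feed(html)
    return v.violations


print(validate('<p class="x">ok</p>'))        # []
print(validate('<script>alert(1)</script>'))  # ['tag <script>']
```

A real implementation would also validate attribute values and the
round-trip metadata, but the point is that the server can check everything
it stores.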



--
View this message in context: 
http://wikimedia.7.n6.nabble.com/RFC-Parsoid-roadmap-tp4994503p4994870.html
Sent from the Wikipedia Developers mailing list archive at Nabble.com.

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
