Dear devs,

I would like to start a discussion about how to reduce the time required to 
generate dump files. A while ago Emmanuel Engelhart opened a bug report 
suggesting that this feature be parallelized, and I would like to go through 
the available options and hopefully determine a course of action.

The current process is, as far as I know, straightforward and sequential: it 
reads table by table and row by row and stores the output. The drawbacks are 
that generating a dump takes ever more time as the different projects continue 
to grow, and that when the process halts or is interrupted it has to start all 
over again.

I believe there are two approaches to parallelizing the export dump:
1) Launch multiple PHP processes that each take care of a particular range of 
ids. This might not be called true parallelization, but it achieves the same 
goal. The reason for this approach is that PHP has very limited (maybe no) 
support for parallelization / multiprocessing; the only thing PHP can do is 
fork a process (I might be incorrect about this).


2) Use a different language with builtin support for multiprocessing, like 
Java or Python. I do not intend to start a heated debate, but I think this 
option should at least be on the table and be discussed. Obviously, an 
important reason not to do it is that it is a different language. I am not 
sure how integral the export functionality is to MediaWiki; if it is integral, 
then this is a dead end.
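To make approach 1 concrete: the coordinator could be a small script that 
splits the page-ID space into ranges and launches one dump process per range. 
The sketch below uses Python only for brevity; the "php dumpBackup.php 
--start/--end" command line is my assumption about the maintenance script's 
interface, so check the real script's options before relying on it.

```python
import subprocess

def partition(max_id, chunk):
    """Split page IDs 1..max_id into contiguous (start, end) ranges."""
    return [(s, min(s + chunk - 1, max_id)) for s in range(1, max_id + 1, chunk)]

def spawn_workers(max_id, chunk=250, dry_run=True):
    """Build one dump command per ID range; launch them all and wait.

    The "dumpBackup.php --start/--end" command line is illustrative --
    verify the real maintenance script's options before using it.
    """
    cmds = [["php", "dumpBackup.php", "--start", str(s), "--end", str(e)]
            for s, e in partition(max_id, chunk)]
    if dry_run:
        return cmds  # just show what would be launched
    procs = [subprocess.Popen(cmd) for cmd in cmds]
    return [p.wait() for p in procs]  # block until every worker exits
```

A nice side effect of per-range workers is that a crash only loses one range, 
which a re-run could regenerate instead of starting the whole dump over.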

However, if the export functionality is primarily used by Wikimedia and nobody 
else, then we might consider a different language. Or we could make a 
standalone app that is not part of MediaWiki and whose use is internal to 
Wikimedia.
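For a feel of what builtin multiprocessing support buys in approach 2, here is 
a minimal Python sketch. The function names are made up and the worker body is 
a toy stand-in: a real worker would export its rows to its own output file, 
and a final step would concatenate the files in order.

```python
from multiprocessing import Pool

def dump_range(bounds):
    """Toy worker: a real one would export rows start..end to its own file."""
    start, end = bounds
    return end - start + 1  # here: just report how many ids were covered

def parallel_dump(max_id, workers=4):
    """Split the ID space across a process pool and combine the results."""
    chunk = -(-max_id // workers)  # ceiling division
    ranges = [(s, min(s + chunk - 1, max_id))
              for s in range(1, max_id + 1, chunk)]
    with Pool(workers) as pool:
        return sum(pool.map(dump_range, ranges))
```

The point is that the pool, the worker lifecycle and the result collection are 
all handled by the standard library, which is exactly what PHP makes hard.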


If I am missing other approaches or solutions, then please chime in. 

Best regards,


Diederik
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
