Re: [Wikitech-l] #switch limits
Tim Starling tstarl...@wikimedia.org variously wrote:
> https://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es_PyrF1-2009&action=edit
> That template alone uses 47MB for 37000 #switch cases
> I tried converting that template with 37000 switch cases to a Lua array. Lua used 6.5MB for the chunk and then another 2.4MB to execute it, so 8.9MB in total compared to 47MB for wikitext

It's only a 400 kB string, and no key is a substring of another key. So just match the regexp /\|lookFor=(.*)$/m, and $1 holds the value. This works great in Perl, PHP, JavaScript...

D'oh, Extension:RegexParserFunctions is not enabled on Wikimedia sites.

Fine, use string functions to look for |lookFor=, look from there onwards for the next '|', and take the substring. D'oh, $wgPFEnableStringFunctions is set false on Wikimedia sites, bug 6455 (a great read).

Fine, use the string lookup functions people have coded in wiki template syntax, e.g. {{Str find0}} – "Very fast zero-based substring search with string support up to *90* characters." D'oh, several orders of magnitude too small.

OK, Lua and Scribunto. Reading the fine tutorial https://www.mediawiki.org/wiki/Lua_scripting/Tutorial :

  local p = {}
  p.bigStr = [[
  |01001=22.4
  |01002=17.3
  ... 36,000 lines
  ]]
  p.findStr = '|' .. p.lookFor .. '='
  p.begin, p.ending = string.find( p.bigStr, p.findStr )
  ... something or other...

Amazingly, my browser and the syntax highlighter in the Module namespace can handle this 400 kB textarea (https://www.mediawiki.org/wiki/Module:SPageBigString.lua); well done! If I just ask for string.len( p.bigStr ), Scribunto loads and executes this module. I dunno how to determine its memory consumption. But when I try to do string.find() I get "Script error", probably because I've never written any Lua before this evening.

Assuming it's possible, what are the obvious flaws in string matching that I'm overlooking? Is there an explanation of how to simulate the Scribunto/Lua calling environment (the frame setup, I guess) in command-line Lua?

This was fun :-)

--
=S Page  software engineer on E3
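For illustration, a minimal Scribunto sketch of the lookup described above, under the assumptions that the "|key=value" data lives in the module itself and that the caller passes the key as the first #invoke argument; the module and function names are made up, not the actual Module:SPageBigString:

  -- Module:BigStringLookup (hypothetical name)
  local p = {}

  p.bigStr = [[
  |01001=22.4
  |01002=17.3
  |01004=21.1
  ]] -- ... ~36,000 more lines in the real data

  function p.lookup( frame )
      local key = frame.args[1]                    -- e.g. '01002'
      local needle = '|' .. key .. '='
      -- plain find (4th argument true) avoids Lua pattern interpretation
      local s, e = string.find( p.bigStr, needle, 1, true )
      if not e then
          return ''                                -- key not present
      end
      -- the value runs from just after '=' up to the next '|' (or the end)
      local nextPipe = string.find( p.bigStr, '|', e + 1, true )
      local value = string.sub( p.bigStr, e + 1, ( nextPipe or 0 ) - 1 )
      return ( string.gsub( value, '%s+$', '' ) ) -- trim the trailing newline
  end

  return p

For quick testing in plain command-line Lua (the "frame setup" question above), it is enough to load the file and call p.lookup( { args = { '01002' } } ), i.e. pass a plain table that mimics the frame object, since the function only ever touches frame.args.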
Re: [Wikitech-l] #switch limits
2012/9/24 Tim Starling tstarl...@wikimedia.org:
> I suppose a nested switch like:
>
> {{#switch: {{{1}}}
> | 0 = {{#switch: {{{2}}} | 0 = zero | 1 = one }}
> | 1 = {{#switch: {{{2}}} | 0 = two | 1 = three }}
> }}
>
> might give you a performance advantage over one of the form:
>
> {{#switch: {{{1}}}{{{2}}}
> | 00 = zero
> | 01 = one
> | 10 = two
> | 11 = three
> }}

I was thinking about something different: splitting a long list into a tree of sub-templates, and using the upper templates to select the right sub-template. This would avoid parsing a single, heavy template, but has the disadvantage of multiple calls to much smaller templates (one for each level); so, if the basic #switch is unexpectedly fast, I don't see a sound reason to add complexity to the code.

Alex
Re: [Wikitech-l] #switch limits
On 21/09/12 17:47, Strainu wrote:
> 2012/9/21 Tim Starling tstarl...@wikimedia.org:
>> On 21/09/12 16:06, Strainu wrote:
>>> I'm just curious: would Lua improve memory usage in this use case?
>>
>> Yes, it's an interesting question. I tried converting that template with 37000 switch cases to a Lua array. Lua used 6.5MB for the chunk and then another 2.4MB to execute it, so 8.9MB in total compared to 47MB for wikitext. So it's an improvement, but we limit Lua memory to 50MB and you would hit that limit long before you loaded 15 such arrays.
>
> I'm not sure how the Lua code would look, but perhaps you can tweak the loading of Lua templates so that you don't load the same code more than once? I'm totally oblivious as to how MediaWiki (or is it PHP?) is linked to Lua right now, but I'm thinking along the lines of a C program which loads a library once, then can use it many times over. With such an approach, you would have 6.5 + 15*2.4 = 42.5 MB of memory (assuming memory cannot be reused between calls).

The Lua code looks like this:

  a = {
      ['01001'] = 22.4,
      ['01002'] = 17.3,
      ['01004'] = 21.1,
      ['01005'] = 20.0,
      ['01006'] = 9.3,
      ['01007'] = 21.2,
      ...
  }

Then presumably you would do something with the a table to generate wikitext.

Lua needs 6.5MB per table for the internal representation of the code itself, i.e. bytecode and supporting structures. Presumably, most of that space would be in the form of instructions like "add an element to the current table with key '01007' and value 21.2". When those instructions are executed, another 2.4MB is needed to store the resulting table. The bytecode is cached between #invoke calls, but the table is not. So if you replaced the individual data template invocations with #invoke calls, the memory requirement would be something like 15*6.5MB + 2.4MB = 99.9MB.

-- Tim Starling
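To make "do something with the a table" concrete, a minimal #invoke accessor around such a data table might look like the sketch below (module, function and argument names are assumptions, not anything Tim posted); per the explanation above, the chunk's bytecode would be cached between #invoke calls but the table it builds would not be:

  local p = {}

  local a = {
      ['01001'] = 22.4,
      ['01002'] = 17.3,
      ['01004'] = 21.1,
      -- ... ~37,000 entries in the real data
  }

  function p.get( frame )
      -- e.g. {{#invoke:Données|get|01002}} would return 17.3
      return tostring( a[ frame.args[1] ] or '' )
  end

  return p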
Re: [Wikitech-l] #switch limits
On 21/09/12 21:37, Alex Brollo wrote:
> I too sometimes use large switches (some hundreds of cases) and I'm far from happy about it. For larger switches I use nested switches, but I find it very difficult to compare the performance of nested switches (e.g. a 1000-element switch can be nested as three levels of 10-element switches) against a single global switch. I imagine there is a performance function depending on the number of switch levels and the number of switch elements, but I presume it would be difficult to calculate; can someone explore the matter by tests?

I suppose a nested switch like:

{{#switch: {{{1}}}
| 0 = {{#switch: {{{2}}} | 0 = zero | 1 = one }}
| 1 = {{#switch: {{{2}}} | 0 = two | 1 = three }}
}}

might give you a performance advantage over one of the form:

{{#switch: {{{1}}}{{{2}}}
| 00 = zero
| 01 = one
| 10 = two
| 11 = three
}}

But it has no significant impact on memory usage, which was the subject of my initial post, and the performance advantage would have to compete with the overhead of using padleft etc. to split the input arguments.

To get a memory usage advantage, you have to split the templates up into smaller data items, like what is done for the English Wikipedia country data, e.g.:

https://en.wikipedia.org/w/index.php?title=Template:Flag&action=edit
http://en.wikipedia.org/w/index.php?title=Template:Country_data_Canada

But it is a time/memory tradeoff. We saw short Olympics articles with rendering times in the tens of seconds due to heavy use of these flag templates. There is a time overhead to loading each template from the database.

> Another way would be to implement a .split() function to transform a string into a list, at least; much better, to implement JSON parsing of a JSON string, to get lists and dictionaries from strings saved into pages. I would guess a dramatic improvement in performance, but I'm far from sure about it.

I'm not sure how that would help. It sounds like you are describing the existing #switch except with a different syntax. Once you're finished parsing the JSON, you presumably have to store the lists and dictionaries in memory for use by the calling templates, and then you would have a similar memory usage to the Lua solution I discussed.

One extraordinary thing about these enormous data templates on the French Wikipedia is that they are not especially slow. The existing optimisations within the wikitext parser seem to work pretty well. We convert the #switch to XML and cache it in memcached, then for subsequent parse operations, it's a fast native XML parse operation followed by a tree traversal. We're seeing hundreds of megabytes of memory usage in 5-10 seconds of rendering time. If the templates were nested as you suggest, it would be even faster.

-- Tim Starling
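The same "split it into smaller data items" idea could be sketched on the Lua side as well. Below is a hedged sketch with entirely hypothetical module names, sharding the keys by their first two digits (which look like a department code) so that a lookup only loads one sub-module instead of the whole data set; as with the flag templates, the first require of each shard still costs a database load:

  local p = {}

  function p.get( frame )
      local key = frame.args[1]                  -- e.g. '01002'
      local prefix = string.sub( key, 1, 2 )     -- shard by the two-digit prefix
      -- e.g. Module:Données_PyrF1-2009/01 (hypothetical page name); each
      -- shard module simply returns a plain key -> value table
      local shard = require( 'Module:Données_PyrF1-2009/' .. prefix )
      return tostring( shard[ key ] or '' )
  end

  return p

Each shard module would then contain nothing but return { ['01001'] = 22.4, ['01002'] = 17.3, ... } for the keys with that prefix.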
Re: [Wikitech-l] #switch limits
I'm just curious: would Lua improve memory usage in this use case?

Strainu

Original Message
From: Tim Starling tstarl...@wikimedia.org
Sent: Fri Sep 21 07:07:34 GMT+03:00 2012
To: wikitech-l@lists.wikimedia.org
Subject: [Wikitech-l] #switch limits

Over the last week, we have noticed very heavy apache memory usage on the main Wikimedia cluster. In some cases, high memory usage resulted in heavy swapping and site-wide performance issues. After some analysis, we've identified the main cause of this high memory usage to be geographical data (données) templates on the French Wikipedia, and to a lesser extent, the same data templates copied to other wikis for use on articles about places in Europe.

Here is an example of a problematic template:

https://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es_PyrF1-2009&action=edit

That template alone uses 47MB for 37000 #switch cases, and one article used about 15 similarly sized templates.

The simplest solution to this problem is for the few Wikipedians involved to stop doing what they are doing, and to remove the template invocations which have already been introduced. Antoine Musso has raised the issue on the French Wikipedia's Bistro and some of the worst cases have already been fixed.

To protect site stability, I've introduced a new preprocessor complexity limit called the preprocessor generated node count, which is incremented by about 6 for each #switch case. When the limit is exceeded, an exception is thrown, preventing the page from being saved or viewed. The limit is currently 4 million (~667,000 #switch cases), and it will soon be reduced to 1.5 million (~250,000 #switch cases). That's a compromise which allows most of the existing geographical pages to keep working, but still allows a memory usage of about 230MB.

At some point, we would like to patch PHP upstream to cause memory for DOM XML trees to be allocated from the PHP request pool, instead of with malloc(). But to deploy that, we would need to reduce the limit to the point where the template DOM cache can easily fit in the PHP memory limit of 128MB.

In the short term, we will be working with the template editors to ensure that all articles can be viewed with a limit of 1.5 million. That's not a very viable solution in the long term, so I'd also like to introduce save-time warnings and tracking categories for pages which use more than, say, 50% of the limit, to encourage authors to fix articles without being directly prompted by WMF staff members.

At some point in the future, you may be able to put this kind of geographical data in Wikidata. Please, template authors, wait patiently, don't implement your own version of Wikidata using wikitext templates.

-- Tim Starling

Sent from my Kindle Fire
Re: [Wikitech-l] #switch limits
On 21/09/12 16:06, Strainu wrote:
> I'm just curious: would Lua improve memory usage in this use case?

Yes, it's an interesting question. I tried converting that template with 37000 switch cases to a Lua array. Lua used 6.5MB for the chunk and then another 2.4MB to execute it, so 8.9MB in total compared to 47MB for wikitext. So it's an improvement, but we limit Lua memory to 50MB and you would hit that limit long before you loaded 15 such arrays.

It's still an O(N) solution. What we really want is to avoid loading the entire French census into memory every time someone wants to read an article about France.

-- Tim Starling
Re: [Wikitech-l] #switch limits
2012/9/21 Tim Starling tstarl...@wikimedia.org:
> On 21/09/12 16:06, Strainu wrote:
>> I'm just curious: would Lua improve memory usage in this use case?
>
> Yes, it's an interesting question. I tried converting that template with 37000 switch cases to a Lua array. Lua used 6.5MB for the chunk and then another 2.4MB to execute it, so 8.9MB in total compared to 47MB for wikitext. So it's an improvement, but we limit Lua memory to 50MB and you would hit that limit long before you loaded 15 such arrays.

I'm not sure how the Lua code would look, but perhaps you can tweak the loading of Lua templates so that you don't load the same code more than once? I'm totally oblivious as to how MediaWiki (or is it PHP?) is linked to Lua right now, but I'm thinking along the lines of a C program which loads a library once, then can use it many times over. With such an approach, you would have 6.5 + 15*2.4 = 42.5 MB of memory (assuming memory cannot be reused between calls).

> It's still an O(N) solution. What we really want is to avoid loading the entire French census into memory every time someone wants to read an article about France.

Well, you said something about Wikidata. But even if the client wiki would not need to load the full census, can it be avoided on Wikidata?

Strainu
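What Strainu describes (loading and parsing the data once, then reusing it across calls) is essentially what Scribunto later gained with mw.loadData, which was not yet available at the time of this thread. A minimal sketch with hypothetical module names; the data module itself would have to contain nothing but a return of the big key/value table:

  -- accessor module (hypothetical); the data lives in a separate module
  -- that just returns a big key -> value table
  local data = mw.loadData( 'Module:Données_PyrF1-2009' )

  local p = {}

  function p.get( frame )
      -- the table is built once per page parse and shared, read-only,
      -- by every #invoke on that page
      return tostring( data[ frame.args[1] ] or '' )
  end

  return p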
Re: [Wikitech-l] #switch limits
2012/9/21 Strainu strain...@gmail.com:
> Well, you said something about Wikidata. But even if the client wiki would not need to load the full census, can it be avoided on Wikidata?

Talking about the template that Tim listed:
https://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es_PyrF1-2009&action=edit

I was trying to understand the template and its usage. As far as I can tell, it maps a ZIP code (or some other identifier) of a commune to a value (maybe a percentage or population; sorry, the documentation did not exist and my French is rusty). So basically it provides all values for a given property. In other words, that wiki page implements a database table with the columns "key" and "value", and holds the whole table. (I think that when Ward Cunningham described a wiki as "the simplest online database that could possibly work", this is *not* what he envisioned.)

In Wikidata we are not storing the data by property, but per item. Put differently, every row in that template would become one statement for the item identified by its key. So Wikidata would not load the whole census data for every article, but only the data for the items that are actually requested.

On the other hand, we would indeed load the whole data for one item on the repository (not on the Wikipedias), which might lead to problems with very big items at some point. We will run tests to see how this behaves once these features have been developed, and then see if we need to do something like partitioning by property groups (similar to what Cassandra does).

I hope that helps,
Denny
Re: [Wikitech-l] #switch limits
On 21.09.2012, 11:47 Strainu wrote:
> 2012/9/21 Tim Starling tstarl...@wikimedia.org:
>> On 21/09/12 16:06, Strainu wrote:
>>> I'm just curious: would Lua improve memory usage in this use case?
>>
>> Yes, it's an interesting question. I tried converting that template with 37000 switch cases to a Lua array. Lua used 6.5MB for the chunk and then another 2.4MB to execute it, so 8.9MB in total compared to 47MB for wikitext. So it's an improvement, but we limit Lua memory to 50MB and you would hit that limit long before you loaded 15 such arrays.
>
> I'm not sure how the Lua code would look, but perhaps you can tweak the loading of Lua templates so that you don't load the same code more than once? I'm totally oblivious as to how MediaWiki (or is it PHP?) is linked to Lua right now, but I'm thinking along the lines of a C program which loads a library once, then can use it many times over.

And what if a page is related to France, Germany and other European countries at once? Loading this information just once isn't helpful; it needs to load just what is needed, otherwise smart Wikipedians will keep inventing creative ways to push the boundaries :)

> With such an approach, you would have 6.5 + 15*2.4 = 42.5 MB of memory (assuming memory cannot be reused between calls).
>
>> It's still an O(N) solution. What we really want is to avoid loading the entire French census into memory every time someone wants to read an article about France.
>
> Well, you said something about Wikidata. But even if the client wiki would not need to load the full census, can it be avoided on Wikidata?

(Mumbles something about databases that don't store all information in one row and don't always read all the rows at once)

--
Best regards,
Max Semenik ([[User:MaxSem]])
Re: [Wikitech-l] #switch limits
I took another look at the output that is created with the data, and I am at once delighted and astonished by the capability and creativity of the Wikipedia community in solving such tasks with MediaWiki template syntax, and horrified by the necessity of the solution taken.

Adding to my own explanation of how Wikidata would help here: we plan to implement some form of query answering capability in phase III, which would actually not work on the full items, as described in my previous mail, but just on some smarter derived representation of the data. So specific queries -- the possible expressivity is not defined yet -- would be performed much more efficiently than performing them on the fly over all relevant items. (That is covered by the technical proposal as item P3.2 in http://meta.wikimedia.org/wiki/Wikidata/Technical_proposal#Technical_requirements_and_rationales_3).

Cheers,
Denny

2012/9/21 Denny Vrandečić denny.vrande...@wikimedia.de:
> 2012/9/21 Strainu strain...@gmail.com:
>> Well, you said something about Wikidata. But even if the client wiki would not need to load the full census, can it be avoided on Wikidata?
>
> Talking about the template that Tim listed:
> https://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es_PyrF1-2009&action=edit
>
> I was trying to understand the template and its usage. As far as I can tell, it maps a ZIP code (or some other identifier) of a commune to a value (maybe a percentage or population; sorry, the documentation did not exist and my French is rusty). So basically it provides all values for a given property. In other words, that wiki page implements a database table with the columns "key" and "value", and holds the whole table. (I think that when Ward Cunningham described a wiki as "the simplest online database that could possibly work", this is *not* what he envisioned.)
>
> In Wikidata we are not storing the data by property, but per item. Put differently, every row in that template would become one statement for the item identified by its key. So Wikidata would not load the whole census data for every article, but only the data for the items that are actually requested.
>
> On the other hand, we would indeed load the whole data for one item on the repository (not on the Wikipedias), which might lead to problems with very big items at some point. We will run tests to see how this behaves once these features have been developed, and then see if we need to do something like partitioning by property groups (similar to what Cassandra does).
>
> I hope that helps,
> Denny

--
Project director Wikidata
Wikimedia Deutschland e.V.
Re: [Wikitech-l] #switch limits
I too sometimes use large switches (some hundreds of cases) and I'm far from happy about it. For larger switches I use nested switches, but I find it very difficult to compare the performance of nested switches (e.g. a 1000-element switch can be nested as three levels of 10-element switches) against a single global switch. I imagine there is a performance function depending on the number of switch levels and the number of switch elements, but I presume it would be difficult to calculate; can someone explore the matter by tests?

Another way would be to implement a .split() function to transform a string into a list, at least; much better, to implement JSON parsing of a JSON string, to get lists and dictionaries from strings saved into pages. I would guess a dramatic improvement in performance, but I'm far from sure about it.

Alex brollo
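As a rough illustration of the .split() idea (not an existing parser function; the "|key=value" format is simply the convention the French data templates already use), a few lines of plain Lua can turn such a string into a table in one pass:

  -- split a '|key=value|key=value...' string into a Lua table
  local function split( bigStr )
      local t = {}
      for key, value in string.gmatch( bigStr, '|(%w+)=([^|]*)' ) do
          t[ key ] = value
      end
      return t
  end

  -- usage:
  --   local t = split( '|01001=22.4|01002=17.3' )
  --   t['01002']  --> '17.3'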
Re: [Wikitech-l] #switch limits
Alternately: if ever there was a case for automatically creating a whole hierarchy of new separate templates for each article, or even just directly editing the articles and putting the data in... Templates would make finding and updating later somewhat easier, I think. Just have one per location code.

George William Herbert

Sent from my iPhone

On Sep 21, 2012, at 4:37 AM, Alex Brollo alex.bro...@gmail.com wrote:
> I too sometimes use large switches (some hundreds of cases) and I'm far from happy about it. For larger switches I use nested switches, but I find it very difficult to compare the performance of nested switches (e.g. a 1000-element switch can be nested as three levels of 10-element switches) against a single global switch. I imagine there is a performance function depending on the number of switch levels and the number of switch elements, but I presume it would be difficult to calculate; can someone explore the matter by tests?
>
> Another way would be to implement a .split() function to transform a string into a list, at least; much better, to implement JSON parsing of a JSON string, to get lists and dictionaries from strings saved into pages. I would guess a dramatic improvement in performance, but I'm far from sure about it.
>
> Alex brollo
Re: [Wikitech-l] #switch limits
Some atomic, page-specific data set is needed, and it's perfectly logical and predictable that creative users will try any trick to force wikicode and template code to get such a result. I deeply appreciate, and am enthusiastic about, the Wikidata project, but I wonder about this issue: is Wikidata a good data container for data sets needed by a single, specific page of a single project?

For example, consider citations of the Bible: they have a widely used structure, something like Genesis 4:5 to point to verse 5 of chapter 4 of Genesis. A good switch can translate this reference into a link+anchor to a Page: page of a Wikisource version of the Bible; a different switch will translate the same reference into a link+anchor pointing to the ns0 version of the same Bible. Can you imagine hosting such a set of data in Wikidata? I can't; some local data container is needed. #switch does the job perfectly, and creative users will find this way and use it, since it's needed to get the result.

Simply build something lighter, more efficient and simpler than #switch to get the same result, and users will use it.

Alex brollo
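A purely illustrative sketch of the kind of local lookup Alex describes; the book names and link targets below are made up for illustration and do not correspond to any actual Wikisource edition:

  -- map book names to (hypothetical) Wikisource page titles
  local books = {
      ['Genesis'] = 'Bible/Genesis',
      ['Exodus']  = 'Bible/Exodus',
  }

  -- turn ('Genesis', 4, 5) into a wikilink with a chapter:verse anchor
  local function cite( book, chapter, verse )
      local target = books[ book ]
      if not target then
          return ''
      end
      return string.format( '[[%s#%s:%s|%s %s:%s]]',
          target, chapter, verse, book, chapter, verse )
  end

  -- cite( 'Genesis', 4, 5 )  -->  [[Bible/Genesis#4:5|Genesis 4:5]]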
[Wikitech-l] #switch limits
Over the last week, we have noticed very heavy apache memory usage on the main Wikimedia cluster. In some cases, high memory usage resulted in heavy swapping and site-wide performance issues. After some analysis, we've identified the main cause of this high memory usage to be geographical data (données) templates on the French Wikipedia, and to a lesser extent, the same data templates copied to other wikis for use on articles about places in Europe.

Here is an example of a problematic template:

https://fr.wikipedia.org/w/index.php?title=Mod%C3%A8le:Donn%C3%A9es_PyrF1-2009&action=edit

That template alone uses 47MB for 37000 #switch cases, and one article used about 15 similarly sized templates.

The simplest solution to this problem is for the few Wikipedians involved to stop doing what they are doing, and to remove the template invocations which have already been introduced. Antoine Musso has raised the issue on the French Wikipedia's Bistro and some of the worst cases have already been fixed.

To protect site stability, I've introduced a new preprocessor complexity limit called the preprocessor generated node count, which is incremented by about 6 for each #switch case. When the limit is exceeded, an exception is thrown, preventing the page from being saved or viewed. The limit is currently 4 million (~667,000 #switch cases), and it will soon be reduced to 1.5 million (~250,000 #switch cases). That's a compromise which allows most of the existing geographical pages to keep working, but still allows a memory usage of about 230MB.

At some point, we would like to patch PHP upstream to cause memory for DOM XML trees to be allocated from the PHP request pool, instead of with malloc(). But to deploy that, we would need to reduce the limit to the point where the template DOM cache can easily fit in the PHP memory limit of 128MB.

In the short term, we will be working with the template editors to ensure that all articles can be viewed with a limit of 1.5 million. That's not a very viable solution in the long term, so I'd also like to introduce save-time warnings and tracking categories for pages which use more than, say, 50% of the limit, to encourage authors to fix articles without being directly prompted by WMF staff members.

At some point in the future, you may be able to put this kind of geographical data in Wikidata. Please, template authors, wait patiently, don't implement your own version of Wikidata using wikitext templates.

-- Tim Starling