Re: [Wikitech-l] Lua deployed to www.mediawiki.org

2012-08-23 Thread Domas Mituzas
Hi!

> As long as people in the templating community were at least consulted with,
> then that's fine. I'm just saying we cannot randomly throw features onto
> users without discussing it with them.

Same way template editors created whatever they created without discussing it with developers, ha ha ha.

BR, (thanks MZMCBRBRBR)
Domas


Re: [Wikitech-l] Lua deployed to www.mediawiki.org

2012-08-22 Thread Domas Mituzas
> ...took place? I'm sure that such discussions has taken place
> somewhere, because if not - that's not very mature behavior for open
> source developer team.

why do you have to be such an ass, by the way?

Domas



Re: [Wikitech-l] How to write a parser

2012-06-20 Thread Domas Mituzas
Well, the CLDR plural-rule syntax is:

condition       = and_condition ('or' and_condition)*
and_condition   = relation ('and' relation)*
relation        = is_relation | in_relation | within_relation | 'n'
is_relation     = expr 'is' ('not')? value
in_relation     = expr ('not')? 'in' range_list
within_relation = expr ('not')? 'within' range_list
expr            = 'n' ('mod' value)?
range_list      = (range | value) (',' range_list)*
value           = digit+
digit           = 0|1|2|3|4|5|6|7|8|9
range           = value'..'value
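
For what it's worth, a grammar this small doesn't strictly need a generator - a hand-rolled evaluator is a few dozen lines. A minimal PHP sketch (function names invented here, no eval, and the bare 'n' relation is ignored), shown against the rules Niklas quotes below:

<?php
// Minimal sketch of an evaluator for the rule grammar above.
// Hypothetical function names; error handling is rudimentary.

function evalPluralRule( $rule, $n ) {
    // condition = and_condition ('or' and_condition)*
    foreach ( preg_split( '/\s+or\s+/', trim( strtolower( $rule ) ) ) as $andCond ) {
        if ( evalAndCondition( $andCond, $n ) ) {
            return true;
        }
    }
    return false;
}

function evalAndCondition( $cond, $n ) {
    // and_condition = relation ('and' relation)*
    foreach ( preg_split( '/\s+and\s+/', $cond ) as $rel ) {
        if ( !evalRelation( trim( $rel ), $n ) ) {
            return false;
        }
    }
    return true;
}

function evalRelation( $rel, $n ) {
    // expr = 'n' ('mod' value)?
    if ( preg_match( '/^n mod (\d+)\s+(.+)$/', $rel, $m ) ) {
        $operand = $n % (int)$m[1];
        $rest = $m[2];
    } elseif ( preg_match( '/^n\s+(.+)$/', $rel, $m ) ) {
        $operand = $n;
        $rest = $m[1];
    } else {
        throw new Exception( "Cannot parse relation: $rel" );
    }

    // is_relation | in_relation | within_relation, each with an optional 'not'
    if ( !preg_match( '/^(is not|not in|not within|is|in|within)\s+(.+)$/', $rest, $m ) ) {
        throw new Exception( "Cannot parse relation: $rel" );
    }
    $negated = strpos( $m[1], 'not' ) !== false;
    $op = trim( str_replace( 'not', '', $m[1] ) );

    if ( $op === 'is' ) {
        $result = ( $operand == (int)$m[2] );
    } else {
        // 'in' matches integer values only, 'within' matches any number in the range
        $result = inRangeList( $operand, $m[2], $op === 'within' );
    }
    return $negated ? !$result : $result;
}

function inRangeList( $x, $rangeList, $allowNonInteger ) {
    // range_list = (range | value) (',' range_list)*
    foreach ( explode( ',', $rangeList ) as $part ) {
        if ( strpos( $part, '..' ) !== false ) {
            list( $lo, $hi ) = explode( '..', $part );
            if ( $x >= (int)$lo && $x <= (int)$hi && ( $allowNonInteger || $x == floor( $x ) ) ) {
                return true;
            }
        } elseif ( $x == (int)$part ) {
            return true;
        }
    }
    return false;
}

// e.g. for the third rule quoted below:
// evalPluralRule( 'n mod 10 in 3..4,9 and n mod 100 not in 10..19,70..79,90..99', 23 ) === true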

Would this one work: 
http://pear.php.net/package/PHP_ParserGenerator

?
Domas

On Jun 20, 2012, at 2:02 PM, Niklas Laxström wrote:

> No, this is not about a wikitext parser. Rather something much simpler.
> 
> Have a look at [1] and you will see rules like:
> n in 0..1
> n is 2
> n mod 10 in 3..4,9 and n mod 100 not in 10..19,70..79,90..99
> 
> Long ago when I wanted to compare the plural rules of MediaWiki and
> CLDR I wrote a parser for the CLDR rule format. Unfortunately my
> implementation uses regular expression and eval, which makes it
> unsuitable for production. Now, writing parsers is not my area of
> expertise, so can you please point me how to do this properly with
> PHP. Bonus points if it is also easily adaptable to JavaScript.
> 
> [1] 
> http://unicode.org/repos/cldr-tmp/trunk/diff/supplemental/language_plural_rules.html
> 
>  -Niklas
> 
> -- 
> Niklas Laxström
> 




Re: [Wikitech-l] PHP 5.4 has been released

2012-03-05 Thread Domas Mituzas
> 
> And we will not be able to use them in core for next five years
> because of shared hosting compatibility. Yay!

Shared hosting compatibility is for old branches ;-) 

Domas



Re: [Wikitech-l] Bump of minimum required PHP version to 5.3 for MediaWiki 1.20

2012-02-21 Thread Domas Mituzas
Hi!

> Agreed. Trying to push people to upgrade to 5.3 will be a colossal waste
> of time. Remember the pushes for PHP5 and to get people to drop PHP4?

None of that frustration was around MediaWiki development - we dropped PHP4 
swiftly, and I guess only Jeffrey Merkey complained. ;-)
If MediaWiki is better on newer PHP, we should use newer PHP. 

Domas




Re: [Wikitech-l] Announcement: Terry Chay joins WMF as Director of Features Engineering

2012-02-21 Thread Domas Mituzas
Oh my, I've been an admirer for many years ;-)

Domas



Re: [Wikitech-l] Dropping StartProfiler.php

2011-12-28 Thread Domas Mituzas
> Good point. Rather than simply simplifying that kind of case we could add  
> to the $wgProfiler array a callback functionality that can return a  
> boolean on whether to start the profiler.

Or which one to start, because, um, we start different ones.
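
A hedged sketch of what that could look like - the 'factory' key and the specific profiler class names below are assumptions for illustration, not the actual $wgProfiler format:

<?php
// Hypothetical callback-based $wgProfiler: decide not only whether to profile,
// but which profiler to start (requires PHP 5.3 closures).
$wgProfiler = array(
    'factory' => function () {
        if ( isset( $_GET['forceprofile'] ) ) {
            return array( 'class' => 'Profiler' );            // full, verbose profiler
        }
        if ( $_SERVER['REMOTE_ADDR'] === '198.51.100.7' ) {   // some debugging client
            return array( 'class' => 'ProfilerSimpleText' );
        }
        // Sample ~1 in 50 requests into the UDP aggregator, otherwise do nothing.
        return mt_rand( 0, 49 ) ? array( 'class' => 'ProfilerStub' )
                                : array( 'class' => 'ProfilerSimpleUDP' );
    },
);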

Domas



Re: [Wikitech-l] Dropping StartProfiler.php

2011-12-27 Thread Domas Mituzas
> We could try to simplify those kind of common cases.

Yet there are cases that are not so common - "start profiler for this page", 
"for this ip", "for this wiki", "for this query string", etc. 
Where would that belong?

Domas


Re: [Wikitech-l] How Does Wikimedia Succeed As A Non-Profit?

2011-11-10 Thread Domas Mituzas
Hi!

> As far as I am aware Medecins Sans Frontieres pay full whack salaries:
> their doctors are not volunteers. Are you suggesting something?

Heh, no, I guess my example was wrong. My point is that nonprofits have people motivated to do the mission-critical things, while other stuff like 'running a website' is not that important - someone's cousin will do that, as long as he has FrontPage installed.

Wikipedia was built as a website, so it got more people who know what a website 
is. :-)

Domas


Re: [Wikitech-l] How Does Wikimedia Succeed As A Non-Profit?

2011-11-10 Thread Domas Mituzas
Hi!

>   Non-profit organizations are famous for having terrible web 
> sites. 

Maybe other non-profit organizations don't conduct their business mainly via their website.
Wikipedia wasn't that beautiful at some point in time either; a volunteer came and redesigned it.

The same way Medecins Sans Frontieres can get highly qualified doctors to join, Wikipedia had lots of motivation to offer technology people - something they could do that was at the very core of the activities.

Domas


Re: [Wikitech-l] page view stats redux

2011-11-07 Thread Domas Mituzas
Hi!

> I had thought to do a daily update.  If it turns out that hourly updates
> are indeed useful, I'll set that up.  I don't know of anyone else that
> has a current mirror.

Yeh, don't believe anything I say; wait for someone on the mailing list to tell you the same before drawing conclusions.

Domas


Re: [Wikitech-l] wikipedia lacks a "share' button

2011-10-21 Thread Domas Mituzas
Hi!

> Gentlemen, where is the Share button?

I have one in my browser, and I think that's where it belongs ;-)

Domas




Re: [Wikitech-l] Mystery of most-viewed pages on En.Wikipedia

2011-10-05 Thread Domas Mituzas
Yo,

> So I guess this was just one IP hitting the same article ~1.5 million
> times per day for 3-4 days, for whatever reason. 

OMG, if anyone can influence quality journalism on examiner.com that easily, we definitely have to go and build proper analysis of the full logs, with all the shiny modern technologies and whatever cluster we can build for that. That would be a bump in program spending \o/

(Though it would be nice to notice crap like that - the web activity, I mean, not the article.)

Domas


Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table

2011-09-20 Thread Domas Mituzas
> 
> Ah, okay.  I remember that's what happened in MyISAM but I figured
> they had that fixed in InnoDB.

InnoDB has optimized path for index builds, not for schema changes.

Domas



Re: [Wikitech-l] Adding MD5 / SHA1 column to revision table

2011-09-19 Thread Domas Mituzas

> 
> * When reverting, do a select count(*) where md5=? and then do something 
> more advanced when more than one match is found

Finally, "we don't need an index on it" becomes "we need an index on it", and storage efficiency becomes much more interesting (binary packing, yay ;-)

So, what are the use cases and how does one index for them? Is it a global hash check, per page? Etc.
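
For the per-page case, a sketch of what a revert check could look like, assuming a hypothetical composite index on (rev_page, rev_sha1) and the usual DB wrapper ($title and $newSha1 are placeholders):

<?php
// Sketch only: find prior revisions of this page with the same content hash.
$dbr = wfGetDB( DB_SLAVE );
$res = $dbr->select(
    'revision',
    array( 'rev_id', 'rev_timestamp' ),
    array(
        'rev_page' => $title->getArticleID(),
        'rev_sha1' => $newSha1,
    ),
    __METHOD__
);
if ( $res->numRows() > 1 ) {
    // more than one match: this is where the "something more advanced" above kicks in
}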

Domas


Re: [Wikitech-l] Help us document MediaWiki's history

2011-09-13 Thread Domas Mituzas
Well played sir. 
:-D

> 
> Just think how different the world might have been ;)

Pity there was no Mongo at that time.

Domas



Re: [Wikitech-l] Help us document MediaWiki's history

2011-09-13 Thread Domas Mituzas
Hi!

> A cool software history project, "Architecture of Open Source
> Applications," asked MediaWiki for a short history of our software.

If you're soliciting feedback, you should note that putting this into an initial version of the ideas list is a somewhat odd way to attract contributions:

"Drawbacks, decisions we have rued
• dependence on MySQL?"

What did you mean here? The word 'rued' sounds relatively rude to me. It is not that we have a strict dependency on MySQL (did you see the code?), nor were there too many bitter regrets (it is somewhat the least problematic part of the technology stack, imo ;)

Domas


Re: [Wikitech-l] State of page view stats

2011-08-12 Thread Domas Mituzas
> Downloading gigs and gigs of raw data and then processing it is generally
> more impractical for end-users.

You were talking about 3.7M articles. :) It is way more practical than working 
with pointwise APIs though :-)

> Any tips? :-)  My thoughts were that the schema used by the GlobalUsage
> extension might be reusable here (storing wiki, page namespace ID, page
> namespace name, and page title).

I don't know what GlobalUsage does, but probably it is all wrong ;-)

> As I recall, the system of determining which domain a request went to is a
> bit esoteric and it might be the worth the cost to store the whole domain
> name in order to cover edge cases (labs wikis, wikimediafoundation.org,
> *.wikimedia.org, etc.).

*shrug*, maybe. If I ran a second pass I'd aim for a cache-oblivious system with compressed data both on-disk and in-cache (currently it is a b-tree with standard b-tree costs).
Then we could actually store more data ;-) Do note, there are _lots_ of data items, and increasing the per-item cost may quadruple resource usage ;-)

Otoh, expanding project names is straightforward, if you know how.

> There's some sort of distinction between projectcounts and pagecounts (again
> with documentation) that could probably stand to be eliminated or
> simplified.

projectcounts are aggregated by project, pagecounts are aggregated by page. If you looked at the data it should be obvious ;-)
And yes, probably the best documentation was in some email somewhere. I should've started a decent project with descriptions and support and whatever.
Maybe once we move data distribution back into WMF proper, there will be no need for it to live somewhere in Germany, as it does nowadays.

> But the biggest improvement would be post-processing (cleaning up) the
> source files. Right now if there are anomalies in the data, every re-user is
> expected to find and fix these on their own. It's _incredibly_ inefficient
> for everyone to adjust the data (for encoding strangeness, for bad clients,
> for data manipulation, for page existence possibly, etc.) rather than having
> the source files come out cleaner.

Raw data is fascinating in that regard though - one can see which clients are bad, what the anomalies are, how they encode titles, which titles are erroneous, etc.
There are zillions of ways to do post-processing, and none of them will match all the needs of every user.

> I think your first-pass was great. But I also think it could be improved.
> :-)

Sure, it can be improved in many ways, including more data (some people ask for (page, geography) aggregations, though with our long tail that means huge dataset growth ;-)

> I meant that it wouldn't be very difficult to write a script to take the raw
> data and put it into a public database on the Toolserver (which probably has
> enough hardware resources for this project currently).

I doubt the Toolserver has enough resources to have this data thrown at it and then queried, unless you simplify the needs a lot.
There's 5G of raw uncompressed data per day in text form, and the long tail makes caching quite painful, unless you go for cache-oblivious methods.

> It's maintainability
> and sustainability that are the bigger concerns. Once you create a public
> database for something like this, people will want it to stick around
> indefinitely. That's quite a load to take on.

I'd love to see all the data preserved indefinitely. It is one of the most interesting datasets around, and its value for the future is quite incredible.

> I'm also likely being incredibly naïve, though I did note somewhere that it
> wouldn't be a particularly small undertaking to do this project well.

Well, the initial work took a few hours ;-) I guess by spending a few more hours we could improve that, if we really knew what we want.

> I'd actually say that having data for non-existent pages is a feature, not a
> bug. There's potential there to catch future redirects and new pages, I
> imagine.

That is one of the reasons we don't eliminate that data from the raw dataset now. I don't see it as a bug, I just see that for long-term aggregations that data could be omitted.

> A user wants to analyze a category with 100 members for the page view data
> of each category member. You think it's a Good Thing that the user has to
> first spend countless hours processing gigabytes of raw data in order to do
> that analysis? It's a Very Bad Thing. And the people who are capable of
> doing analysis aren't always the ones capable of writing the scripts and the
> schemas necessary to get the data into a usable form.

No, I think we should have an API to that data, to fetch small sets of data without much pain.

> The reality is that a large pile of data that's not easily queryable is
> directly equivalent to no data at all, for most users. Echoing what I said
> earlier, it doesn't make much sense for people to be continually forced to
> reinvent the wheel (post-processing raw data and putting it into a queryable
> format).

I agree. By opening up the 

Re: [Wikitech-l] State of page view stats

2011-08-12 Thread Domas Mituzas
Hi!

> Currently, if you want data on, for example, every article on the English
> Wikipedia, you'd have to make 3.7 million individual HTTP requests to
> Henrik's tool. At one per second, you're looking at over a month's worth of
> continuous fetching. This is obviously not practical.

Or you can download raw data. 

> A lot of people were waiting on Wikimedia's Open Web Analytics work to come
> to fruition, but it seems that has been indefinitely put on hold. (Is that
> right?)

That project was pulsing with naiveté, if it ever had to be applied to the wide scope of all our projects ;-)

> Is it worth a Toolserver user's time to try to create a database of
> per-project, per-page page view statistics?

Creating such a database is easy; making it efficient is a bit different :-)

> And, of course, it wouldn't be a bad idea if Domas' first-pass implementation 
> was improved on Wikimedia's side, regardless.

My implementation is for obtaining raw data from our squid tier - what is wrong with it?
Generally I've had ideas about making a query-able data source - it isn't impossible given a decent mix of data structures ;-)

> Thoughts and comments welcome on this. There's a lot of desire to have a
> usable system.

Sure - it's interesting to hear what people think could be useful to do with the dataset; we may facilitate it.

>  But short of believing that in
> December 2010 "User Datagram Protocol" was more interesting to people
> than Julian Assange you would need some other data source to make good
> statistics. 

Yeah, "lies, damn lies and statistics". We need better statistics (adjusted for Wikipedian geekiness) than a full page sample, because you don't believe that general-purpose wiki articles people can use in their work can be more popular than some random guy on the internet and trivia about him.
Dracula is also more popular than Julian Assange, and so is Jenna Jameson ;-)

> http://stats.grok.se/de/201009/Ngai.cc would be another example.


Unfortunately, every time you add the ability to spam something, people will spam. There's also unintentional crap that ends up in HTTP requests because of broken clients. It is easy to filter that out in post-processing, if you want, by applying an article-exists bloom filter ;-) (a sketch follows below)
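
A minimal sketch of that kind of post-processing, assuming a dumped list of existing titles and the 'project title count bytes' line format (file names and filter sizing are made up):

<?php
// Sketch: keep only pagecounts lines whose title passes an article-exists Bloom filter.
class BloomFilter {
    private $bits, $size, $hashes;

    public function __construct( $size = 16777216, $hashes = 4 ) {
        $this->size = $size;
        $this->hashes = $hashes;
        $this->bits = str_repeat( "\0", $size >> 3 );
    }

    private function positions( $key ) {
        $pos = array();
        for ( $i = 0; $i < $this->hashes; $i++ ) {
            $pos[] = hexdec( substr( md5( $i . $key ), 0, 8 ) ) % $this->size;
        }
        return $pos;
    }

    public function add( $key ) {
        foreach ( $this->positions( $key ) as $p ) {
            $this->bits[$p >> 3] = chr( ord( $this->bits[$p >> 3] ) | ( 1 << ( $p & 7 ) ) );
        }
    }

    public function mightContain( $key ) {
        foreach ( $this->positions( $key ) as $p ) {
            if ( !( ord( $this->bits[$p >> 3] ) & ( 1 << ( $p & 7 ) ) ) ) {
                return false;   // definitely not an existing article
            }
        }
        return true;            // probably exists (small false-positive rate)
    }
}

$filter = new BloomFilter();
// In reality you would stream these, not slurp them; the file names are made up.
foreach ( file( 'enwiki-all-titles', FILE_IGNORE_NEW_LINES ) as $title ) {
    $filter->add( $title );
}
foreach ( file( 'pagecounts-20110812-120000', FILE_IGNORE_NEW_LINES ) as $line ) {
    $parts = explode( ' ', $line );
    if ( count( $parts ) === 4 && $parts[0] === 'en' && $filter->mightContain( $parts[1] ) ) {
        echo $line, "\n";
    }
}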

> If the stats.grok.se data actually captures nearly all requests, then I am 
> not sure you realize how low the figures are. 

Low they are - Wikipedia's content is all about a very long tail of data, besides a heavily accessed head. Just graph the top-100 or top-1000 and you will see the shape of the curve:
https://docs.google.com/spreadsheet/pub?hl=en_US&key=0AtHDNfVx0WNhdGhWVlQzRXZuU2podzR2YzdCMk04MlE&hl=en_US&gid=1

> As someone with most of the skills and resources (with the exception of time, 
> possibly) to create a page view stats database, reading something like this 
> makes me think...

Wow.

> Yes, the data is susceptible to manipulation, both intentional and 
> unintentional. 

I wonder how someone with most of the skills and resources wants to solve this problem (besides the aforementioned article-exists filter, which could reduce the dataset quite a lot ;)

> ... you can begin doing real analysis work. Currently, this really isn't 
> possible, and that's a Bad Thing.

Raw data allows you to do whatever analysis you want. Shove it into SPSS/R/.. 
;-) Statistics much?

> The main bottleneck has been that, like MZMcBride mentions, an underlying
> database of page view data is unavailable.  

The underlying database is available, just not in an easily queryable format. There's a distinction there, unless you all imagine a database as something you send SQL to and it gives you data back. Sorted files are databases too ;-)
Anyway, I'm not saying that the project is impossible or unnecessary, but there are lots of tradeoffs to be made - what kind of real-time querying workloads are to be expected, what kind of pre-filtering do people expect, etc.

Of course, we could always use OWA. 

Domas


Re: [Wikitech-l] mysql_* functions and MediaWiki

2011-07-30 Thread Domas Mituzas
> c) Both PDO and MySQLi support "prepared" statements, which could let us 
> introduce a form of statement cache so that if we need to execute the 
> same query multiple times except with different parameters, there is 
> less overhead involved since the statement is already prepared and just 
> needs the data values to use.

That statement cache needs persistent connections, which are somewhat expensive.
Otherwise you're doing two roundtrips instead of one, and there's not much of a performance win anyway.
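
For illustration, a hedged PDO sketch of that tradeoff (DSN, credentials and query are placeholders): the prepared handle only outlives the request on a persistent connection, and with real (non-emulated) prepares the first use costs an extra round trip.

<?php
// Sketch only - not MediaWiki's DB layer.
$dbh = new PDO( 'mysql:host=db1.example;dbname=wiki', 'wikiuser', 'secret', array(
    PDO::ATTR_PERSISTENT => true,          // without this, the "statement cache" dies with the connection
    PDO::ATTR_EMULATE_PREPARES => false,   // real server-side prepare: separate prepare + execute round trips
) );

$stmt = $dbh->prepare(
    'SELECT page_id FROM page WHERE page_namespace = ? AND page_title = ?'
);
foreach ( array( 'PHP', 'MySQL', 'Lua' ) as $title ) {
    // MySQL still re-plans the query on every execute; the saving is mostly in
    // skipping the SQL parse, plus driver-side handling of parameter escaping.
    $stmt->execute( array( 0, $title ) );
    $pageId = $stmt->fetchColumn();
}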

> Another bonus of prepared statements is 
> that it properly escapes everything according to whatever database 
> engine you happen to be using when you substitute in parameters.

I don't think that has been an issue lately.

Domas




Re: [Wikitech-l] Please don't commit broken code

2011-07-29 Thread Domas Mituzas
Hi!

just wanted to point out that there's open-source software for code-review that 
is designed for this (code reviews before commit), it supports both SVN and Git:

http://phabricator.org/

Domas


Re: [Wikitech-l] ICANN expansion, relative URLs

2011-06-18 Thread Domas Mituzas
> I suppose "in theory" having "apple" available is no worse than "apple.com"
> (since you *could* have an "apple.com.mylocaldomain" already and have to
> worry about which takes precedence), but in practice that sounds like a
> crappy thing to do. :)

Well, yes, this is exactly why you don't usually use TLDs as subdomains on top of a company-internal search path.
I guess this makes us switch back to IP addresses, if there's a constant chance of conflicts we can no longer control :-)

With IPv6 that will be even easier. And who needs DNS when we have Google.

Domas


Re: [Wikitech-l] XKCD: Extended Mind

2011-05-25 Thread Domas Mituzas
> Well, in fact, Randall's pretty good about the details (vis 2000 people 
> showing up in a park in Cambridge just because he put an undesignated
> time and lat/lon in a strip)...

Randall: " I drew it based on an older error message where the IP was 
10.0.0.243.  I changed it to 242 (a) because I try not to get too specific with 
those things, and didn't want people poking the actual machine at .243 (if it 
was still there) -- I actually considered putting .276 and seeing how many 
poeple noticed, but figured they'd just think I made a dumb mistake.  and (b) 
as part of this ancient inside joke involving the number 242 ... "

Funny though, it caused way more confusion with .242, as we have random stuff pointing at it ;-)
.243 was the enwiki database box until July 2008 ;-)

Domas


Re: [Wikitech-l] XKCD: Extended Mind

2011-05-25 Thread Domas Mituzas
> 
> I second Domas to check because there may be a super secret conspiracy
> and the drawing may be correct. ;-)

well, $wgContributionTrackingDBserver = 'db9.pmtpa.wmnet'; - though I don't see 
anything on profiling. *sigh*

Domas




Re: [Wikitech-l] XKCD: Extended Mind

2011-05-25 Thread Domas Mituzas
> 
> I would have thought the fact that it was hand drawn would have given
> it away.

well, it is a valid DB IP, so some random extension pulling data from db9 would be plausible.
Worth checking anyway ;-)

Domas


Re: [Wikitech-l] dns issues during downtime

2011-05-25 Thread Domas Mituzas
> What was the problem exactly? I don't see anything about it in the
> server admin log.

The usual: a PowerDNS deadlock. There have been plenty of cases like that in the past.

Domas



Re: [Wikitech-l] XKCD: Extended Mind

2011-05-25 Thread Domas Mituzas

On May 25, 2011, at 9:35 AM, K. Peachey wrote:

> http://xkcd.org/903/
> -Peachey

that error is fake! 10.0.0.242 is the internal services DNS server and is not used to serve en.wikipedia.org - the dberror log does not have a single instance of it! 10.0.6.42, on the other hand...

the incident yesterday was network card drivers / Linux / network cards reacting stupidly to an interface going down and up - we saw some isolated issues similar to that in the past; unfortunately this one knocked out all the databases and they all needed serial console intervention.

Domas


Re: [Wikitech-l] dns issues during downtime

2011-05-24 Thread Domas Mituzas
frankly, the European DNS server had problems we didn't really notice - until the actual maintenance. DNS should've been fully functional, as it is in multiple datacenters.

Domas




Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-09 Thread Domas Mituzas
> I've spent a lot of time profiling and optimising the parser in the
> past. It's a complex process. You can't just look at one number for a
> large amount of very complex text and conclude that you've found an
> optimisation target.

unless it is {{cite}}

Cheers,
Domas



Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-04 Thread Domas Mituzas
Hi!

> A single long line containing no markup is indeed an edge case, but it
> is a good reference case since it is the input where the parser will run
> at its fastest.

Bubblesort will also have O(N) complexity sometimes :-)

> Replacing the spaces with newlines will cause a tenfold increase in the
> execution time.  Sure, in relative numbers less is time spent executing
> regexps, but in absolute numbers, more time is spent there.


Well, this is not fair - you should sum up all the Zend symbols if you compare that way - there are no debugging symbols for libpcre, so you get an aggregated view.
That's the same as saying that 10 is a smaller number than 7, just because you can factorize it ;-)

Comparing apples and oranges doesn't always help; that kind of hand-waving may impress others, but some of us have spent more time looking at that data than just ranting in a single mailing list thread ;-)

Cheers,
Domas


Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Domas Mituzas
Ohi,

> The time it takes to execute the code that glues together the regexps
> will be insignificant compared to actually executing the regexps for any
> article larger than a few hundred bytes.  

Well, you picked an edge case - a long line. Actually, try replacing the spaces with newlines, and you will get a 25x cost difference ;-)

> But the top speed of the parser (in bytes/seconds) will be largely unaffected.

Damn!

Domas


Re: [Wikitech-l] Zend performance (Was: WYSIWYG and parser plans)

2011-05-03 Thread Domas Mituzas
Hi!

> The discussion was concerning parser performance,

the discussion was concerning overall parser performance, not just edge cases. 

> so a profile of only parser execution would have been most relevant.  

Indeed, try parsing decent articles with all their template trees.

> In my profiling data, parser execution dominates, and as you can see, its 
> mostly regexp evaluation.

Indeed, because you don't have anything that would invoke callbacks or branching in the code.

>  (With "php parser", I was referring to "zendparse" and
> "lex_scan", which doesn't seem to use libpcre.  I.e., almost all calls
> to libpcre is made from the wikitext parser.)

I usually don't know what a "php parser" is - opcode caches have been around for the past decade.

Domas


Re: [Wikitech-l] Zend performance (Was: WYSIWYG and parser plans)

2011-05-03 Thread Domas Mituzas
Hi!

> I'm not sure what you are profiling,

Wikipedia :) 

> but when repeatingly requesting a
> preview of an article containing 20 bytes of data consisting of the
> pattern "a a a a a a " I got the below results.  (The php parser doesn't
> seem to depend on perl regexps.)

I'm sure nothing profiles better than a synthetic edge case. What do you mean by it not depending on perl regexps? It is the top symbol in your profile.

> CPU: CPU with timer interrupt, speed 0 MHz (estimated)
> Profiling through timer interrupt
> samples  %        app name           symbol name
> 994      23.4933  libpcre.so.3.12.1  /lib/libpcre.so.3.12.1
> 545      12.8811  libphp5.so         zendparse
> 369       8.7213  libphp5.so         lex_scan
> 256       6.0506  libc-2.11.2.so     memcpy
> 137       3.2380  libphp5.so         zend_hash_find
> 135       3.1907  libphp5.so         _zend_mm_alloc_canary_int
> 105       2.4817  libphp5.so         __i686.get_pc_thunk.bx
> 90        2.1272  libphp5.so         _zend_mm_free_canary_int
> 67        1.5835  libphp5.so         zif_strtr
> 59        1.3945  libphp5.so         zend_mm_add_to_free_list
> 48        1.1345  libphp5.so         zend_mm_remove_from_free_list

Domas




Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-03 Thread Domas Mituzas
> 
> Thoughts? Also, for re-licensing, what level of approval do we need?
> All authors of the parser, or the current people in an svn blame?

Current people are doing 'derivative work' on previous authors' work. I think all are needed. Pain, oh pain.

Domas


Re: [Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-03 Thread Domas Mituzas
Hi!

> I was just talking about this in IRC :). We could re-license the
> parser to be LGPL or BSD so that other implementations can use our
> parser more freely.

This is how WMF staff treats volunteers:

[21:17:23]   domas: and now I took your BSD idea, and didn't give 
you credit
[21:17:38]  * Ryan_Lane wins
[21:17:51]   FLAWLESS VICTORY
[21:17:55]   except for the IRC logs

Domas


[Wikitech-l] Licensing (Was: WYSIWYG and parser plans)

2011-05-03 Thread Domas Mituzas
> Which would also require the linking application to be GPL licensed,
> which is less than ideal. 

Which of course allows me to fork the thread and ask why MediaWiki has to be GPL licensed.

Domas


Re: [Wikitech-l] WYSIWYG and parser plans (was What is wrong with Wikia's WYSIWYG?)

2011-05-03 Thread Domas Mituzas
> It's slightly more difficult, but it definitely isn't any easier

It is much easier to embed it in other languages once you get a shared object with the Parser methods exposed ;-)

Domas


[Wikitech-l] Zend performance (Was: WYSIWYG and parser plans)

2011-05-03 Thread Domas Mituzas
> 
> regexps might be fast, but when you have to run hundreds of them all
> over the place and do stuff in-language then the language becomes the
> bottleneck.

some oprofile data shows that pcre is a few percent of execution time - and there are really lots of Zend internals standing in the way - memory management (HPHP implements it as C++ object allocations via jemalloc), symbol resolution (native calls in C++), etc.

Domas

samples  %       image name             app name               symbol name
492400   9.6648  libphp5.so             libphp5.so             _zend_mm_alloc_int
451573   8.8634  libc-2.7.so            libc-2.7.so            (no symbols)
347812   6.8268  libphp5.so             libphp5.so             zend_hash_find
345665   6.7847  no-vmlinux             no-vmlinux             (no symbols)
330513   6.4873  libphp5.so             libphp5.so             _zend_mm_free_int
225755   4.4311  libpcre.so.3.12.1      libpcre.so.3.12.1      (no symbols)
159925   3.1390  libphp5.so             libphp5.so             zend_do_fcall_common_helper_SPEC
137709   2.7029  libphp5.so             libphp5.so             _zval_ptr_dtor
127233   2.4973  libxml2.so.2.6.31      libxml2.so.2.6.31      (no symbols)
111249   2.1836  libphp5.so             libphp5.so             zend_hash_quick_find
93994    1.8449  libphp5.so             libphp5.so             _zend_hash_quick_add_or_update
84693    1.6623  libphp5.so             libphp5.so             zend_assign_to_variable
84256    1.6538  fss.so                 fss.so                 (no symbols)
56474    1.1085  libphp5.so             libphp5.so             execute
49959    0.9806  libphp5.so             libphp5.so             zend_hash_destroy
48450    0.9510  libz.so.1.2.3.3        libz.so.1.2.3.3        (no symbols)
46967    0.9219  libphp5.so             libphp5.so             ZEND_JMPZ_SPEC_TMP_HANDLER
46523    0.9131  libphp5.so             libphp5.so             _zend_hash_add_or_update
45747    0.8979  libphp5.so             libphp5.so             zend_str_tolower_copy
39154    0.7685  libphp5.so             libphp5.so             zend_fetch_dimension_address
35356    0.6940  libphp5.so             libphp5.so             ZEND_RECV_SPEC_HANDLER
33381    0.6552  libphp5.so             libphp5.so             compare_function
32660    0.6410  libphp5.so             libphp5.so             _zend_hash_index_update_or_next_insert
31815    0.6245  libphp5.so             libphp5.so             zend_parse_va_args
31689    0.6220  libphp5.so             libphp5.so             ZEND_SEND_VAR_SPEC_CV_HANDLER
31554    0.6193  libphp5.so             libphp5.so             _emalloc
30404    0.5968  libphp5.so             libphp5.so             _get_zval_ptr_var
29812    0.5851  libphp5.so             libphp5.so             ZEND_ASSIGN_REF_SPEC_CV_VAR_HANDLER
28092    0.5514  libphp5.so             libphp5.so             ZEND_DO_FCALL_SPEC_CONST_HANDLER
27760    0.5449  libphp5.so             libphp5.so             zend_hash_clean
27589    0.5415  libphp5.so             libphp5.so             zend_fetch_var_address_helper_SPEC_CONST
26731    0.5247  libphp5.so             libphp5.so             _zval_dtor_func
24732    0.4854  libphp5.so             libphp5.so             ZEND_ASSIGN_SPEC_CV_VAR_HANDLER
24732    0.4854  libphp5.so             libphp5.so             ZEND_RECV_INIT_SPEC_CONST_HANDLER
22587    0.4433  libphp5.so             libphp5.so             zend_send_by_var_helper_SPEC_CV
22176    0.4353  libphp5.so             libphp5.so             _efree
21911    0.4301  libphp5.so             libphp5.so             .plt
21102    0.4142  libphp5.so             libphp5.so             ZEND_SEND_VAL_SPEC_CONST_HANDLER
19556    0.3838  libphp5.so             libphp5.so             zend_fetch_property_address_read_helper_SPEC_UNUSED_CONST
18568    0.3645  libphp5.so             libphp5.so             zend_get_property_info
18348    0.3601  libphp5.so             libphp5.so             zend_std_get_method
18279    0.3588  libphp5.so             libphp5.so             zend_get_hash_value
17944    0.3522  libphp5.so             libphp5.so             php_var_unserialize
17461    0.3427  libphp5.so             libphp5.so             _zval_copy_ctor_func
17187    0.3373  libtidy-0.99.so.0.0.0  libtidy-0.99.so.0.0.0  (no symbols)
16341    0.3207  libphp5.so             libphp5.so             zend_get_parameters_ex
16103    0.3161  libphp5.so             libphp5.so             zend_std_read_property
15662    0.3074  libphp5.so             libphp5.so             zend_hash_copy
14678    0.2881  libphp5.so             libphp5.so             zend_binary_strcmp
14556    0.2857  apc.so

Re: [Wikitech-l] facebook like box in mediawiki

2011-04-21 Thread Domas Mituzas
> ZOMG DOMAS IS WORKING FOR TEH ALIENS!!!1!one!

Careful, I can plan an Inception for you all to believe that privacy
policy is bad idea, after I stop hunting dolphins in the cove, of
course in my human avatar.

Domas



Re: [Wikitech-l] facebook like box in mediawiki

2011-04-21 Thread Domas Mituzas
> hell, Aaron Sorkin got an Oscar for dramatizing why you *should* be
> concerned about it.

"Alien" and "Aliens" both won Oscars too.
You *should* be concerned about aliens.

Cheers,
Domas



Re: [Wikitech-l] HipHop

2011-04-05 Thread Domas Mituzas
> For comparison: WYSIFTW parses [[Barak Obama]] in 3.5 sec on my iMac,
> and in 4.4 sec on my MacBook (both Chrome 12).

Try parsing [[Barack Obama]] - 4s spent on parsing a redirect page is quite a lot (albeit it has some vandalism).
OTOH, my macbook shows the raw wikitext pretty much immediately. The parser is definitely the issue.

Domas



Re: [Wikitech-l] HipHop

2011-03-28 Thread Domas Mituzas

On Mar 28, 2011, at 5:28 PM, Aryeh Gregor wrote:

> ... and Facebook ignores that and adds what
> it thinks would be useful? 

Facebook already has features Zend does not:

https://github.com/facebook/hiphop-php/blob/master/doc/extension.new_functions

Stuff like:
* Parallel RPC - MySQL, HTTP, ..
* Background execution, post-send execution, pagelet server
etc

Domas


Re: [Wikitech-l] A good question

2011-03-16 Thread Domas Mituzas

On Mar 16, 2011, at 3:34 PM, Victor Vasiliev wrote:

> Imagine a user approaches you and asks "What are significant changes
> between 1.16 and 1.17?"

Try marketing-l. 

Domas



Re: [Wikitech-l] secure/singer proxy errors

2011-03-14 Thread Domas Mituzas
> For now I have added a sleep() to the code to limit invalidations to
> 100 pages per second per job runner process.

Mass invalidations will create MJ issue in the long tail though... Need 
poolcounter ;-)

Domas


Re: [Wikitech-l] Wikimedia engineering February report

2011-03-04 Thread Domas Mituzas

On Mar 4, 2011, at 6:36 PM, Guillaume Paumier wrote:

> Hi,
> 
> Le vendredi 04 mars 2011 à 17:25 +0100, Krinkle a écrit : 
>> On 4 March 2011, David Gerard wrote:
>> 
>>> On 4 March 2011 09:58, Guillaume Paumier   
>>> wrote:
>>> 
 * posting a link here is a good practice that you'd like us to  
 continue;
>>> 
>>> +1
>> 
>> +1
> 
> Thanks to those who answered. I don't think it's necessary to continue
> to +1; this seems to be the consensus so far, so I'll just do that.

+1

Domas


Re: [Wikitech-l] Sprint to 1.17

2011-02-22 Thread Domas Mituzas
Hi!

> You might know this already, but you can download Oracle Database from 
> oracle.com, and use it for free for the purpose of application 
> development or testing.

I'd think whoever cares already knows about that. I don't see why WMF should be directly working on that, though ;-)

Domas


Re: [Wikitech-l] Using MySQL as a NoSQL

2010-12-24 Thread Domas Mituzas
Hi!

> I was assuming usage of pfsockopen(), of course.

Though protocol is slightly cheaper, you still have to do TCP handshake :)

Domas



Re: [Wikitech-l] Using MySQL as a NoSQL

2010-12-24 Thread Domas Mituzas
Hi!

> I suppose it's possible in theory, but in any case, it's not what
> they're doing.  They *are* going through MySQL, via the HandlerSocket
> plugin.

After reading the code I can surely correct myself - they are calling into some MySQL things (e.g. for table open), but everything else is table handler interfaces :-)
Though I guess what I mean by "going through MySQL" and what you have in mind are entirely different topics :)

> I wonder if they'd get much different performance by just using
> prepared statements and read committed isolation, with the
> transactions spanning multiple requests.  The tables would only get
> locked once per transaction, right?

They wouldn't gain too much perf with PS; transaction establishment is quite cheap in MySQL, especially compared to other vendors.
Do note, prepared statements don't prepare the query plan for you - it is re-established at every execution - nor do they keep table handlers open, IIRC. Generally you just don't have to reparse the text.

> It was an example of a way to get fast results if you don't care about
> your reads being atomic.

InnoDB is faster than MyISAM at high performance workloads.

Domas


Re: [Wikitech-l] Using MySQL as a NoSQL

2010-12-24 Thread Domas Mituzas
Hi!

> It seems from my tinkering that MySQL query cache handling is
> circumvented via HandlerSocket.

On busy systems (I assume we are talking about busy systems, as the discussion is about HS) the query cache is usually eliminated anyway.
Either by compiling it out, or by patching the code not to use the qcache mutexes unless it really, really is enabled. In the worst case, it is just simply disabled. :)

> So if you update/insert/delete via HandlerSocket, then query via SQL
> your not guarenteed to see the changes unless you use SQL_NO_CACHE.

You are probably right. Again, nobody cares about qcache at those performance 
boundaries. 

Domas


Re: [Wikitech-l] Using MySQL as a NoSQL

2010-12-24 Thread Domas Mituzas
Hi!

> This could also reduce memory usage by not using memcached (as often) 
> which, I understand, is a bigger problem.

No, it is not.

First of all, our memcached and database access times are not that far apart - 0.7 vs 1.3 ms (again, memcached has a static response time, whereas the database average is impacted by calculations).
On the other hand, we don't store in memcached what is stored in the database, and we don't store in the database what is stored in memcached.

Think of these as two separate systems, not as complementing each other too much.
We use memcached to offload the application cluster, not the database cluster.

And the database cluster already has over a terabyte of RAM (replicas and whatnot), whereas our memcached lives in a puny 158GB arena.

I described some of fundamental differences of how we use memcached in 
http://dom.as/uc/workbook2007.pdf - pages 11-13. Nothing much changed since 
then. 

Domas


Re: [Wikitech-l] Using MySQL as a NoSQL

2010-12-24 Thread Domas Mituzas
Hi!

A:
> It's easy to get fast results if you don't care about your reads being
> atomic (*), and I find it hard to believe they've managed to get
> atomic reads without going through MySQL.

MySQL's upper layers know nothing much about transactions; it is all engine-specific - BEGIN and COMMIT processing is deferred to the table handlers.
It would be incredibly easy for them to implement repeatable-read snapshots :) (if that's what you mean by an atomic read)

> (*) Among other possibilities, just use MyISAM.

How is that applicable to any discussion? 

Domas


Re: [Wikitech-l] Using MySQL as a NoSQL

2010-12-24 Thread Domas Mituzas
Hi!

> I have recently encountered this text in which the author claims very 
> high MySQL speedups for simple queries

It is not that he speeds up simple queries (you'd maybe notice that if you used InfiniBand, and even then it wouldn't matter much :)
He just avoided hitting some expensive critical sections that make scaling on multicore systems problematic.

> It looks interesting. There are some places where mediawiki could take
> that shortcut if available.

It wouldn't be a shortcut if you had to establish another database connection besides the existing one.

> I wonder if we have such CPU bottleneck, though.

No, not really. Our average DB response time is 1.3ms, measured on the client (do note, this isn't the median and is affected more by heavy queries).

Domas


Re: [Wikitech-l] Convention for logged vs not-logged page requests

2010-10-20 Thread Domas Mituzas
Hi!

> will still return the same results, wouldn't it make more sense to
> teach the stat's logger to ignore both?  Or is there a reason that we
> actually want to track one and not the other?

Pretty URLs are for being pretty URLs (e.g. in your address bar). That leads to a very easy assumption: if there's a pretty URL, it probably indicates a pageview :-) We quite like other pretty URLs for Special pages, e.g. Watchlist or Recentchanges, as we track their accesses.

> It seems like an awful lot of trouble to teach every software author
> that they need to follow a particular convention just so the stats
> engine will work as intended.  It would seem like it would be much
> simpler to teach the stats engine to simply detect and ignore this
> special case.  Or is there a reason that doing so is not possible?

Heh, apparently stats became a big deal lately, so anyone with the power to change that can feel important! ;-)

Anyway, there are a few choices to resolve it on the stats side:

1) Implement pulling of a namespace map for each project, and build out an efficient rules engine (in C) for dealing with this (do note, every project will have a different namespace for this URL). Also, make it extensible, so each developer can declare which names will be not-a-pageview ;-) There's nothing as fun as writing that kind of code, and do note, it won't be just five (or fifty) lines.

2) Add an additional internal header (X-Pageview: true!) that would be logged by the squids inside the stream :) (see the sketch after this list) That probably asks for a large review inside MediaWiki, as well as squid code changes (and of course, a rollout of a new binary). Would be a nice inter-group effort.

3) Not care about inflated per-project numbers, or have people adjust the numbers, as the source data is there (they can filter out the banner loader themselves!)
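
A hedged sketch of the MediaWiki half of option 2 only (hook choice and placement are illustrative; nothing like this exists, and the squid/logging half is the bigger part):

<?php
// Sketch: tag "real" article views with an internal response header that the
// squid log stream could then record. Requires PHP 5.3 closures.
$wgHooks['BeforePageDisplay'][] = function ( $out, $skin ) {
    $title = $out->getTitle();
    $request = $out->getRequest();
    if ( $request->getVal( 'action', 'view' ) === 'view'
        && $title->exists()
        && !$title->isSpecialPage()
    ) {
        $request->response()->header( 'X-Pageview: 1' );
    }
    return true;
};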

You can pick any of these - just make sure it gets into the strategy plan, as we don't decide things on wikitech-l anymore :)
I prefer, hehehe, not doing anything, and just having pretty URLs only for pageviews ;-)

Domas


Re: [Wikitech-l] Collaboration between staff and volunteers: a two-way street

2010-10-20 Thread Domas Mituzas
...
> +4,294,967,295

see what you did to poor Roan, in this "always be positive"
environment this is the only way he can write -1.

Domas



Re: [Wikitech-l] [Announce]: Mark Bergsma promotion to Operations Engineer Programs Manager

2010-09-15 Thread Domas Mituzas
Hi!

> Erik gave an overview of how EPMs work a few days ago:
> http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/49532

What I learned is that the most important information should be put under the most obscure subject lines, so that only the people who really, really care will read it.

Domas


Re: [Wikitech-l] [Announce]: Mark Bergsma promotion to Operations Engineer Programs Manager

2010-09-15 Thread Domas Mituzas
Hello,

> Please join me in congratulating Mark Bergsma on his promotion last week
> to Operations EPM.  

Congratulations on the long title (whatever it means - Enterprise? Executive? Project Manager?)
What is the rationale for another Director of Operations? What will he direct? What is the future structure of 'ops'?
Who participated in the discussion about ops leadership needs?

Domas


[Wikitech-l] On decentralized discussions

2010-09-08 Thread Domas Mituzas

> Why decentralized discussions even more? And is there a reason you
> always seem to spilt your replies to the thread into new treads/topics
> instead of just replying to the original one?


Innovation, maybe?

Domas



Re: [Wikitech-l] Community vs. centralized development

2010-09-08 Thread Domas Mituzas
Hi!

> I created a yahoo group

Why not Facebook Page?!!?

Domas




Re: [Wikitech-l] Community vs. centralized development

2010-09-08 Thread Domas Mituzas
Hi!

> ... there would now be open source hardware 

Do you need an open source "Enter" key?

Domas



Re: [Wikitech-l] Community vs. centralized development

2010-09-07 Thread Domas Mituzas
Hi!

> Back in 2006 I wanted to make search better, and if then it 
> wasn't for Tim Starling to give me shell access and a couple of test 
> servers to play with, I think we would not have the new search, or at 
> least not developed by me.

I probably have a similar story to share :)

Unfortunately, on the infrastructure/operations engineering side, there has been a shift to this whole new "we are a major website that should never go down" concept, which restricts volunteers from getting immediate shell access.
On one hand, some of us were lucky to join when there were no such constraints and could go through the whole learning process; on the other - we wouldn't get such access nowadays.

I quite enjoyed it when our site SLA was "more up than down", or, you know, when none of us had many interests outside making this damn thing work :)

In commercial organizations, besides signing "I won't do crap" contracts, there's also the "loss of income" incentive that protects against doing crap.
Our infrastructure (and infrastructure team) is way smaller than all the aspirations around "being a top site".
How would we match volunteer contributions in this field with the overall demands?

Domas

P.S. There have been way more folks saying "pay me!" than "how can I volunteer?" on the infra side than on the MW dev side, I guess ;-)


Re: [Wikitech-l] Wikimedia logging infrastructure

2010-08-12 Thread Domas Mituzas
> Without having looked at any code, can't the threads just add data to
> a semaphore linked list (fast), and a single separate thread writes
> the stuff to disk occasionally?

Isn't that the usual error that threaded software developers do:

1. get all threads depend on single mutex
2. watch them fight! (you'd get a million wakeups here a second :-)

as a bonus point you get a need to copy data to a separate buffer or frenzy 
memory allocating with another mutex for malloc/free ;-)

Domas


Re: [Wikitech-l] Wikimedia logging infrastructure

2010-08-12 Thread Domas Mituzas
Hi!

> Sure. Make each thread call accept and let the kernel give incoming
> sockets to one of them. There you have the listener done :)
> Solaris used to need an explicit locking, but it is now fixed there, too.

Heh, I somewhat ignored this way - yeah, it would work just fine - one can do 
per-file synchronization rather than per-event, as there's not much state 
involved on either side. 

> Given the following incomint events:
> udp2log has problems
> jeluf created a new wiki
> domas fixed the server
> 
> I call corrupted this:
> jeluf domas
> udp2log has fixed the server
> problems created a new wiki

Well, you wouldn't want to use fwrite() and friends, as their behavior in a threaded environment isn't that useful :)
write()s aren't atomic either, so... what you have to do is:

lock(fd);
write(); write(); write();  (multiple writes may be needed even for a single buffer, in case the first write() is partial)
unlock(fd);

> I don't get it. What is slow on it?
> 
> What it does is:
> 1) Get socket data
> 2) Split line into pieces
> 3) fwrite each line in 16 fds
> 4) Go to 1

1) Get socket data
2) Split packet into lines
3) Write lines into 16 fds
4) Go to 1

> If there's plenty of CPU, the pipes doesn't fill, the fwrite doesn't
> block...
> Why isn't it coping with it?
> Too much time lost in context changes?

There are no context switches, as it is running fully on one core.
"Plenty of CPU" is 100% core use; most of the time is spent in write(), and apparently syscalls aren't free.

Domas


Re: [Wikitech-l] wikipedia is one of the slower sites on the web

2010-08-11 Thread Domas Mituzas
Hi!

<3 enthusiasm :)

> 1)
> This is not a website "http://en.wikipedia.org", is a redirection to this:
> http://en.wikipedia.org/wiki/Main_Page
> Can't "http://en.wikipedia.org/wiki/Main_Page" be served from
> "http://en.wikipedia.org"?

Our major entrance is usually not via the main page, so this would be a niche optimization that does not really matter that much (well, ~2% of article views go to the main page, and only 15% of those are loading http://en.wikipedia.org/, and... :)

> 2)
> The CSS load fine.  \o/

No, they don't, at least not on first pageview. 

> Probabbly the combining effort will save speed anyway.

Yes. We have way too many separate css assets. 

> A bunch of js files!, and load one after another, secuential. This is
> worse than a C program written to a file from disk reading byte by
> byte. !!

Actually, if a program reads byte by byte, the whole page is already cached by the OS, so it is not that expensive ;-)
And yes, we know that we have a bit too many JS files loaded, and there's work underway to fix that (Roan wrote about that).

> Combining will probably save a lot. Or using a strategy to force the
> browser to concurrent download + lineal execute, these files.

:-) Thanks for stating obvious. 

> 
> 5)
> There are a lot of img files. Do the page really need than much? sprinting?.

It is a PITA to sprite (not sprint) community-uploaded images, and again, that would work only for the front page, which is not our main target. The skin should of course be sprited.

> Total: 13.63 seconds.

Quite a slow connection you've got there. I get 1s rendering times with cross-Atlantic trips (and much better times when I'm served by the European caches :)

> You guys want to make this faster with cache optimization. But maybe
> is not bandwith the problem, but latency. Latency accumulate even with
> HEAD request that result in 302.   All the 302 in the world will not
> make the page feel smooth, if already acummulate into 3+ seconds
> territory.   ...Or I am wrong?

You are. First of all, the skin assets are not doing IMS requests; they are all cached.
We force browsers to do IMS on page views so that they pick up edits (it is a wiki).

> Probably is a much better idea to read that book that my post

I'm sorry to disappoint you, but none of the issues you wrote down here are new.
If after reading any books or posts you think we have deficiencies, it is mostly for one of two reasons: either we're lazy and didn't implement something, or it is something we need in order to maintain the wiki model.

Though of course, while it is all fresh and has scarred you for life, we've been doing this for life. ;-)

Domas


Re: [Wikitech-l] Wikimedia logging infrastructure

2010-08-11 Thread Domas Mituzas
Hi!

> Going multithread is really easy for a socket listener.

Really? :) 

> However, not so
> much in the LogProcessors. If they are shared accross threads, you may
> end up with all threads blocked in the fwrite and if they aren't shared,
> the files may easily corrupt (depends on what you are exactly doing with
> them).

I don't really understand what you're saying ;-) Do you mean lost data by 'corrupt'?

> Since the problem is that the socket buffer fills, it surprised me that
> the server didn't increase SO_RCVBUF. That's not a solution but should
> help (already set in /proc/sys/net/core/rmem_default ?).

It is a long-term CPU saturation issue - the mux process isn't fast enough to handle 16 output streams.
Do note, there are quite a few events a second :)

> The real issue is: what are you placing on your pipes that are so slow
> to read from them?
> Optimizing those scripts could be a simpler solution.

No, those scripts are not the bottleneck, there's plenty of CPU available, and 
they are not blocking (for too long, everything is blocking for a certain 
amount of time ;-)

> Wouldn't be hard to make the pipe writes non-blocking, properly blaming
> the slow pipes that couldn't be written

There are no slow pipes. The bottleneck is the udp2log step.

Domas


Re: [Wikitech-l] Wikimedia logging infrastructure

2010-08-10 Thread Domas Mituzas
Hi!
> multiple collectors with distinct log pipes setup. E.g. one machine for
> the sampled logging, and another, independent machine to do all the
> special purpose log streams. I do like more efficient software solutions
> rather than throwing more iron at the problem, though. :)

Frankly, we could have the same on a single machine - e.g. two listeners on the same 
multicast stream - for SMP perf :-)

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Developing true WISIWYG editor for media wiki

2010-08-04 Thread Domas Mituzas
> Unless you use hip-hop to do PHP->C++, then alchemy for C++ -> Flash...
> A really crazy idea :)

crazy uneducated idea :)

Domas

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] New table: globaltemplatelinks

2010-08-03 Thread Domas Mituzas
Hi!

> Can you please read it and give your opinion?

Great job on indexing, man, I see you cover pretty much every use case!

Domas

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] wikipedia is one of the slower sites on the web

2010-08-03 Thread Domas Mituzas
Hi!

> Couldn't you just tag every internal link with
> a separate class for the length of the target article,

Great idea - how come no one ever came up with this? I even have a stylesheet 
ready, here it is (do note, even though it looks big as text, gzip gets it down to 10%, 
so we can support this kind of granularity even up to a megabyte :)

Domas

a { color: blue }
a.1_byte_article { color: red; }
a.2_byte_article { color: red; }
a.3_byte_article { color: red; }
a.4_byte_article { color: red; }
a.5_byte_article { color: red; }
a.6_byte_article { color: red; }
a.7_byte_article { color: red; }
a.8_byte_article { color: red; }
a.9_byte_article { color: red; }
a.10_byte_article { color: red; }
a.11_byte_article { color: red; }
a.12_byte_article { color: red; }
a.13_byte_article { color: red; }
a.14_byte_article { color: red; }
a.15_byte_article { color: red; }
a.16_byte_article { color: red; }
a.17_byte_article { color: red; }
a.18_byte_article { color: red; }
a.19_byte_article { color: red; }
a.20_byte_article { color: red; }
a.21_byte_article { color: red; }
a.22_byte_article { color: red; }
a.23_byte_article { color: red; }
a.24_byte_article { color: red; }
a.25_byte_article { color: red; }
a.26_byte_article { color: red; }
a.27_byte_article { color: red; }
a.28_byte_article { color: red; }
a.29_byte_article { color: red; }
a.30_byte_article { color: red; }
a.31_byte_article { color: red; }
a.32_byte_article { color: red; }
a.33_byte_article { color: red; }
a.34_byte_article { color: red; }
a.35_byte_article { color: red; }
a.36_byte_article { color: red; }
a.37_byte_article { color: red; }
a.38_byte_article { color: red; }
a.39_byte_article { color: red; }
a.40_byte_article { color: red; }
a.41_byte_article { color: red; }
a.42_byte_article { color: red; }
a.43_byte_article { color: red; }
a.44_byte_article { color: red; }
a.45_byte_article { color: red; }
a.46_byte_article { color: red; }
a.47_byte_article { color: red; }
a.48_byte_article { color: red; }
a.49_byte_article { color: red; }
a.50_byte_article { color: red; }
a.51_byte_article { color: red; }
a.52_byte_article { color: red; }
a.53_byte_article { color: red; }
a.54_byte_article { color: red; }
a.55_byte_article { color: red; }
a.56_byte_article { color: red; }
a.57_byte_article { color: red; }
a.58_byte_article { color: red; }
a.59_byte_article { color: red; }
a.60_byte_article { color: red; }
a.61_byte_article { color: red; }
a.62_byte_article { color: red; }
a.63_byte_article { color: red; }
a.64_byte_article { color: red; }
a.65_byte_article { color: red; }
a.66_byte_article { color: red; }
a.67_byte_article { color: red; }
a.68_byte_article { color: red; }
a.69_byte_article { color: red; }
a.70_byte_article { color: red; }
a.71_byte_article { color: red; }
a.72_byte_article { color: red; }
a.73_byte_article { color: red; }
a.74_byte_article { color: red; }
a.75_byte_article { color: red; }
a.76_byte_article { color: red; }
a.77_byte_article { color: red; }
a.78_byte_article { color: red; }
a.79_byte_article { color: red; }
a.80_byte_article { color: red; }
a.81_byte_article { color: red; }
a.82_byte_article { color: red; }
a.83_byte_article { color: red; }
a.84_byte_article { color: red; }
a.85_byte_article { color: red; }
a.86_byte_article { color: red; }
a.87_byte_article { color: red; }
a.88_byte_article { color: red; }
a.89_byte_article { color: red; }
a.90_byte_article { color: red; }
a.91_byte_article { color: red; }
a.92_byte_article { color: red; }
a.93_byte_article { color: red; }
a.94_byte_article { color: red; }
a.95_byte_article { color: red; }
a.96_byte_article { color: red; }
a.97_byte_article { color: red; }
a.98_byte_article { color: red; }
a.99_byte_article { color: red; }
a.100_byte_article { color: red; }
a.101_byte_article { color: red; }
a.102_byte_article { color: red; }
a.103_byte_article { color: red; }
a.104_byte_article { color: red; }
a.105_byte_article { color: red; }
a.106_byte_article { color: red; }
a.107_byte_article { color: red; }
a.108_byte_article { color: red; }
a.109_byte_article { color: red; }
a.110_byte_article { color: red; }
a.111_byte_article { color: red; }
a.112_byte_article { color: red; }
a.113_byte_article { color: red; }
a.114_byte_article { color: red; }
a.115_byte_article { color: red; }
a.116_byte_article { color: red; }
a.117_byte_article { color: red; }
a.118_byte_article { color: red; }
a.119_byte_article { color: red; }
a.120_byte_article { color: red; }
a.121_byte_article { color: red; }
a.122_byte_article { color: red; }
a.123_byte_article { color: red; }
a.124_byte_article { color: red; }
a.125_byte_article { color: red; }
a.126_byte_article { color: red; }
a.127_byte_article { color: red; }
a.128_byte_article { color: red; }
a.129_byte_article { color: red; }
a.130_byte_article { color: red; }
a.131_byte_article { color: red; }
a.132_byte_article { color: red; }
a.133_byte_article { color: red; }
a.134_byte_article { color: red; }
a.135_byte_article { color: red; }
a.136_byte_article { color: red; }

Re: [Wikitech-l] wikipedia is one of the slower sites on the web

2010-08-02 Thread Domas Mituzas
> The first load of the homepage can be slow:
> http://zerror.com/unorganized/wika/lader1.png
> http://en.wikipedia.org/wiki/Main_Page
> (I need a bigger monitor, the escalator don't fit on my screen)

well, no wonder the first page load is sluggish, with 12 style sheets and 12 
JavaScript files - there's plenty of low-hanging fruit there. 

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l



Re: [Wikitech-l] wikipedia is one of the slower sites on the web

2010-08-02 Thread Domas Mituzas
> That's what he did. Read the query.

;-) that's what happens when email gets ahead of coffee.

Domas

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] wikipedia is one of the slower sites on the web

2010-08-02 Thread Domas Mituzas
Hi!

> I.e., only about a quarter of users have been ported to
> user_properties.  Why wasn't a conversion script run here?

In theory, if all properties are at their defaults, the user shouldn't be there. The 
actual check should be against the blob field.

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Caching, was: Re: wikipedia is one of the slower sites on the web

2010-07-30 Thread Domas Mituzas
Hi!

Do note, once you log out, you still have a cookie that prohibits edge caching, 
I think ;-)

> My browser, while reloading,

Don't hit "reload" button, thats what it does - reloads all assets. 

> bits.wikimedia.org  then again and then again here and there  needed
> time to reload a simple, very simple web page: 12 s.

What kind of connection do you have? 
On a simple Eastern European DSL I get under 1s rendering times. 

> I guess, that if a plain html + css cached  version (without any default js
> and perhaps with a single, included css section) of the page could be found,
> such a time would be much shorter.

Though indeed some more work could be done on first-load performance, 
which is what you are measuring with 'reload', it may not be an absolute priority, as no 
skin assets are loaded on subsequent page views. 

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] Caching, was: Re: wikipedia is one of the slower sites on the web

2010-07-30 Thread Domas Mituzas
Hi!

>> That's pretty much the purpose of the caching servers.
> 
> Yes, but I  presume that a big advantage could come  from having a
> simplified, unique, js-free version of the pages online, completely devoid
> of "user preferences" to avoid any need to parse it again when uploaded by
> different users with different preferences profile. Nevertheless I say
> again: it's only a completely layman idea.

I can a bit elaborate on what Daniel said.

Whenever anyone edits a page, there are (in a simplistic view) three caches that 
get populated. 

1. Revision text cache (for operations like diffs, re-parsing for other 
settings, etc)
2. Parser cache (for logged in users)
3. Edge HTTP cache - squid (for anonymous users)

So, anonymous users get pages "completely devoid of user preferences", as they 
are simply the defaults. 
Do note, even though squid cache objects can vary based on Accept-Encoding (we 
narrowed it down to two versions from 10 a few years ago ;-), they map to a single 
parser cache object. 
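
The Accept-Encoding narrowing is essentially a normalization step before the cache variant is chosen - a sketch of the idea (the real logic lives in the cache layer configuration, not in PHP):

<?php
// Collapse the many raw Accept-Encoding strings browsers send into two cache
// variants ("gzip-capable" or not), so Vary: Accept-Encoding doesn't multiply
// the number of cached copies per URL.
function normalizeAcceptEncoding( $header ) {
    return ( stripos( $header, 'gzip' ) !== false ) ? 'gzip' : '';
}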

A parser cache hit may take under 50ms for the backend MediaWiki to deliver. 

Logged-in users, though, bypass our squids; if they don't mess with their 
preferences, they usually hit the same parser cache objects - there's an extremely high 
chance of that. 
Now, if you change a single setting that affects parser cache variation, that 
'extremely high chance' turns into missing those objects - because someone with 
the same settings as you has to have visited the page before. 
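
A minimal sketch of why that happens (made-up function names, not MediaWiki's actual ParserCache API): the cache key carries a fingerprint of every preference that affects rendering, so any non-default combination lands in its own, usually cold, slot.

<?php
// Illustrative only: preference-dependent cache keys fragment the parser cache.
function parserCacheKey( $pageId, array $renderingOptions ) {
    // Only options that change the rendered HTML belong in the key
    // (e.g. stub threshold, date format, math rendering mode).
    ksort( $renderingOptions );
    return "pcache:$pageId:" . md5( serialize( $renderingOptions ) );
}

// Default users all share one hot entry...
$hot  = parserCacheKey( 12345, array( 'stubthreshold' => 0 ) );
// ...while a single changed setting means a different key, and almost certainly a miss.
$cold = parserCacheKey( 12345, array( 'stubthreshold' => 500 ) );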

Delivering a parser cache miss may take from 50ms to 50s, which is what jidanni 
probably hits. 

So, we may have 1000x slower performance for our users because they don't 
really know about our caching internals. 
Our only hope is that most of them are also ignorant that those settings exist 
;-) 

There'd of course be another workaround - precaching objects for every 
variation, at an extremely high cost for relatively low impact. 
The alternative is either having a warning icon (that they'd be able to hide) whenever 
people are in slow-perf mode, or eliminating the choice (you know, the killing-features 
business, which quite often works really well!!! ;-)

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] wikipedia is one of the slower sites on the web

2010-07-29 Thread Domas Mituzas
> This is probably a bad thing.  I'd think that most of the settings
> that fragment the parser cache should be implementable in a
> post-processing stage, which should be more than fast enough to run on
> parser cache hits as well as misses.  But we don't have such a thing.

some of which can even be done with CSS/JS, I guess. 
I'm all for simplifying whatever processing the backend has to do :-) 

Domas

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] wikipedia is one of the slower sites on the web

2010-07-29 Thread Domas Mituzas
> Could you please elaborate on that? Thanks.

we don't have large blinking red lights when people's settings deviate in ways that 
affect the parser cache - that makes them miss the cache, and each pageview is slow. 

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] collab.wikimedia db error in many tools

2010-07-26 Thread Domas Mituzas
Hello,

> collab.wikimedia db error in many tools.
> like luxo's tool, vvv;s sulutil.

That's so sad. It is really depressing that even after I explained to you on IRC 
(when you were reporting a toolserver issue on #wikimedia-tech) that private 
wikis are not supposed to be exposed to toolserver users, you keep spamming 
unrelated mailing lists. 

> How can we fix this problem? I don't know why this problem occurred.

No.

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Architectural revisions to improve category sorting

2010-07-23 Thread Domas Mituzas
Hi!

> No doubt Domas will complain anyway, but without developers adding new
> features, I figure his volunteer DBA work would get very boring.

I don't complain about well designed features, especially if they don't scan 
millions of rows to return 0 ;-)

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] developer

2010-07-15 Thread Domas Mituzas
Hi!

> It's for Wikimedia operations as well as MediaWiki development.  The
> latter tends to take up much more of the list traffic in practice,
> though.

Indeed, the staff-ization of the WMF has made more and more of the communication internal, for 
better or worse.

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding fields to the interwiki table

2010-06-17 Thread Domas Mituzas
> You're right. But centralizing this sort of thing makes long term
> planning for that sort of thing easier. And by putting it in core you
> get more eyes on it and hopefully more people caring :)

Well, it doesn't matter where these things are, in core, or externally
- in both cases people ignore issues filed about problems :)

(e.g. https://bugzilla.wikimedia.org/show_bug.cgi?id=23339 )

Domas

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] Adding fields to the interwiki table

2010-06-16 Thread Domas Mituzas
Hi!

I somewhat stayed out of this thread, as we simply don't use the interwiki table on
WMF sites, so the topic was of no interest to me. :)

> It seems to me that if we want to modify the core to support interwiki
> integration, there are any number of core tables that could benefit from DB
> name fields.

I personally don't like the "interwiki integration", as pretty much
everything has to go through one of these methods:

1. Pulling from all wikis
2. Pushing to all wikis
3. Having central backend

All of these have their own nightmares, and separation quite often
kept us from madness. CentralAuth has added its own share of
inefficiencies that nobody has been working on yet. Having data shared
between multiple systems usually isn't the easiest problem, and it
needs more attention than a one-time feature deployment.

> E.g., user_newtalk could have a DB name field, so that users
> could be informed which wiki(s) they have new messages on.

Neeesss (you're suggesting 2 here... :)

> in the harder cases it will make more sense to have global tables, kinda
> like what CentralAuth sets up, unless we want to do a major revamping of the
> code.

I have no idea what major revamping you have in mind, when it comes to
data sharing.

Do note that we don't have any data consistency framework for
cross-database publishing, so you will always end up with
inconsistencies that are not guarded by transactions. For each
feature, that means building conflict/consistency management... :)

Domas

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] On proper sorting using CLDR

2010-06-11 Thread Domas Mituzas
> 
> That could even be done internally.

Indeed, for some languages sortkeys can be populated implicitly, with certain 
rules applied to the text.
Unfortunately, that would target category sorting only, and not other lists 
(pagelinks, templatelinks, users, etc. ;)

Oh well. This is a difficult topic :) 

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] On proper sorting using CLDR (was: varchar(255) binary in tables.sql)

2010-06-10 Thread Domas Mituzas
Hi!

> Yes it is a technical pain in the arse.The question is one of primacy. Is it
> more important to provide service or are technical considerations of the
> most importance. Yes, we discussed this in the past and we did not agree
> then and we do not agree now.

Well, I agree that it might be a good idea to have language-specific ordering; 
it's just that the costs are quite high and there are not too many people eager to do the 
engineering part of such a project. 
CLDR isn't a panacea; it is a constantly evolving project, with inaccurate stable 
versions (even for well-established languages like mine, heheh) and various 
proposed/testing versions. 

So, to pick a CLDR-based flow and do it properly, it would consist of an infinite 
loop of: 

1. Understanding which languages need a separate collation
2. Evaluating all available collations for a language, attracting input from 
local communities and standardization bodies 
3. Evaluating the algorithmic implications of the chosen collation - then either 
approaching the standards bodies to change it, simplifying it internally (and 
forking), or implementing the algorithms in software (though that is sometimes 
impossible to do efficiently)
4. Porting (3) into a backend of choice
5. Provide upgrade path and conflict resolution method for existing content
6. Provide a framework to do full index rebuilds and switchovers between different 
collations (ok, this probably is a one-time engineering project, albeit quite a 
complex one, as it has to keep (4) and (5) in mind)
7. Monitor for new versions of collations :) 

Multiply all that by the number of languages we have, and do note that there are 
multiple sorting variants per language too (e.g. dictionary vs phonebook 
ordering in Germany). 
So yes, it would be fantastic to have that kind of functionality, but you'd 
need quite some engineering capacity to pull it off.  

And if we get to implementation specifics - ordering rules are the same as equality 
rules, causing quite some confusion in some cases (and some people will 
definitely want terms that sort the same but are not equal... :) 

Of course, we can use community driven sortkey hacks for some features ;-)

> I wonder how our English language readers would react when the sort order
> for their lists would be wrong.

I guess it isn't absolutely tragic for the others, as otherwise we wouldn't see 
projects in other languages at all. Now that's a benchmark! ;-)

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] varchar(255) binary in tables.sql

2010-06-09 Thread Domas Mituzas
Hello,

On Jun 8, 2010, at 11:22 PM, Gerard Meijssen wrote:

> The difference is that is actually does sort according to the CLDR.. It
> would be really nice if we did that.

It does not, it sorts according to the partial UCA implementation. 

We have discussed CLDR in the past - it is a huge collection of distinct 
collations, and even though it is possible to use the LDMLs from the CLDR project, it 
is a PITA, due to both partial UCA support and the continuous effort to rebuild 
indexes, resolve conflicts, and hit all sorts of obscure "linguists are not 
computer scientists" problems :) 

On Jun 8, 2010, at 5:28 PM, Paul Houle wrote:

> As a person who has labored mightily to make sense of dbpedia,  I 
> think that one reason why varbinary is preferable to varchar in many 
> applications in wikimedia is that varchar() string comparisons are case 
> insensitive and varbinary comparisons are case sensitive.

varchar with a case-insensitive collation is case insensitive; varchar with a 
binary/case-sensitive collation is case sensitive.

varbinary() otoh is varchar with the 'binary' character set (if you define the default 
server charset to be binary, as we do on our 5.x boxes, all varchar creation 
will be varbinary). 

>There are 10,000 or so articles in the english wikipedia that have 
> titles that vary only by case.  Load those into a varchar(255) and put a 
> primary key on them and mysql just won't let you do it.

Depends on the collation, but yes, you are right. There are more concerns there, 
not just case sensitivity.
Different collations can map different digraphs or different diacritics to 
different codepoints, causing quite some confusion. 

Like in my language, ą = a, but š > s :) 
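
For a concrete feel of that, a sketch using PHP's intl extension, which wraps ICU (whether the extension is available, and the exact Lithuanian tailoring it ships, are assumptions - results can differ between ICU versions):

<?php
// Locale-aware comparison with ICU via the intl extension.
$coll = new Collator( 'lt_LT' ); // Lithuanian tailoring

// compare() returns <0, 0 or >0, like strcmp(), but per the locale's rules,
// so accented letters and digraphs are ordered the way the language expects
// rather than by code point or by a case-folded binary comparison.
var_dump( $coll->compare( 'ąžuolas', 'ažuolas' ) );
var_dump( $coll->compare( 'šuo', 'suo' ) );

// Sorting a list in place with the same rules:
$words = array( 'žirnis', 'ąsotis', 'šuo', 'sala', 'arklys' );
$coll->sort( $words );
print_r( $words );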

> I looked at a sample of those article and came to the conclusion 
> that the semantic relations between them are complicated enough that 
> they cannot be autosquashed.

Indeed. If you go for CLDR-like national collations, you expose yourself not 
just to case sensitivity, but also to all the different digraph/accented 
character mappings, which add even more confusion to your uniqueness 
constraints. 

On Jun 8, 2010, at 3:58 PM, Ryan Chan wrote:

> obviously, varchar(255) binary does not support character outside of BMP.

It does, if you use the very, very horrible hack of the latin1 character set (but 
I'd always say that is a bad idea and the binary charset, a.k.a. varbinary, should be used 
instead).

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Fixme, please fix me!

2010-06-09 Thread Domas Mituzas
> That isn't a good thing.
Why not?


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] wikistats earlier files

2010-04-30 Thread Domas Mituzas
Hi!

> I have uploaded them to the Internet Archive.
> The oldest one is
> http://www.archive.org/details/wikipedia_visitor_stats_200712

Lars, thanks - awesome job.
Do you have an automated way to do it?
Maybe we could just send them to archive.org directly?

Domas

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] WMF decommissions servers

2010-03-30 Thread Domas Mituzas
Hi,

> No. There are far more articles created every day than are deleted.
> These servers have just been replaced by newer models. The total
> number of servers keeps increasing, but there are always going to be
> old servers that aren't worth the costs to keep running.

at least the crazy growth stopped a few years ago - now it is way 
more manageable and fits within Moore's law ;-)

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] modernizing mediawiki

2010-03-05 Thread Domas Mituzas
Hi!

> A lot of the replies were helpful, in particular Ryan Lane and Yaron's 
> replies. Also made a quick reply to Domas.

Replies are good!

> Overall it's awesome no doubt (otherwise I wouldn't have used it in the first 
> place), but a few of the practices (i.e. editing localsettings file through 
> shell or ftp) and gui/aesthetics should definitely be more 'modern.'

Priorities, priorities. Do note, no GUI will be able to scale to configuration 
needs most easily explained by pointing at http://noc.wikimedia.org/conf/ 

> In case you haven't heard, it's 2010 lol. A lot has changed since then.

Damn, you just shattered my illusion that I'm still young!

> The average person who wants to start a wik is gonna have no idea how to do 
> that, much less even understand what svn up means. While I don't expect to be 
> dumbed down a huge degree, a little bit more simplicity wouldn't hurt would 
> it?

The average person who doesn't know how to do that can always try out service 
providers, and service providers can offer more skins and a configuration 
interface - value added! 
Maybe we could have a shell script that does the upgrade, but a 'one-click upgrade' 
from the web interface is quite an insecure method. 

Maybe a good-enough approach would be a simple shell script that does all 
that... 

>> Feel free to develop it that way. 
> Easier said than done.

Exactly ;-) It is a huge engineering effort that may not be entirely aligned with 
the WMF mission. 
I made quite a few changes to MediaWiki ages ago to better support my 
small company's wiki needs (e.g. no-hassle single sign-on) - and somehow those 
changes got in (probably because, ehem, we didn't have formalized code review 
back then :-)

>> Wikia is heavily modified to give the gui a much more modern feel. Again i'm 
>> mostly focusing on the aesthetics. Unfortunately I don't think wikia 
>> distributes their skins.

https://svn.wikia-code.com/wikia/trunk/skins/

Personally I'd like to see more stuff from Wikia poached into the Wikimedia 
deployment (we're giving Wikia too much time to learn from our mistakes 
before we learn from theirs :) 

> It's not just not my needs. It's about user friendliness for anyone who is 
> using wikimedia to work on their wiki project. While the developers have no 
> obligation to do it, it would be nice if they realized who their users are 
> other than wikimedia.

Everyone realizes that there are users other than Wikimedia. 
It is one of the reasons why MediaWiki has a plethora of features that are not needed 
on Wikimedia sites (and that introduces code complexity). 
It is also one of the reasons why 'mediawiki' is 'mediawiki' and not 'wikimedia 
software'.

Of course, Wikimedia use quite often stands in the way of development (as 
features have to be secure, scale nicely and be maintainable in medium-to-large 
operations environments) - and, unfortunately for feature development but 
fortunately for everyone who runs large MediaWiki instances, those needs have 
to be at the core of the project. 

> These types of replies are hilarious. It's like 
> Iphone user: "Dear Apple, if your iphone had the following features it would 
> be great (A) (B) (C) ... "  

In case you missed it, the iPhone also has a 3rd-party application community. The (a), (b), (c) 
features have either been developed by third parties already, or there's a niche for 
those third parties. 
Of course, the iPhone economy is much fancier than the MediaWiki economy, so the 
niches probably aren't filled here as fast. 

You know, Microsoft didn't write every application for Windows, and Apple doesn't 
own everything that runs on iStuff; lots of platforms have primary goals, and 
secondary goals can be filled by the developer community. 
It's absolutely the same here: you have full power to do whatever you want to do. 

> Apple: "Oh if you want those features, go ahead and develop them on your own."
> If I knew how to I would have done it already. What kind of advice is that? 
> Seriously lol

Seriously lol, you can evangelize your needs, try project-management-like 
activities, sketches, etc. - and try involving other volunteer developers. 
Instead of being an 'entrepreneur', which would be of benefit to everyone, you 
end up being a whiner. 

If you really want to introduce lots of bad analogies, I should try to come up 
with my own. 
"As I don't drive, I need government pay for my personal driver, as they have 
roads out there!" 
I hope the analogy was bad enough! :)

Once you approach a developer community, there's a huge difference between:

"Hello folks, are there any projects for improving manageability/look/etc. for 
third-party users?"
and "I've gone through a lot of frustrations.", "mediawiki and it's 
limitations", "why can't the money be put into making a modern product instead 
of in pockets of the people who run it", etc.

I am amazed by, and glorify, how kindly some members of this mailing list 
manage to take that and still try to put some sense into your head. 

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] hiphop progress

2010-03-03 Thread Domas Mituzas
Jared,

> assert(hash('adler32', 'foo', true) === mhash(MHASH_ADLER32, 'foo'));

Thanks! I would get to that eventually, I guess. Still, there's xdiff and a few 
other things.

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] modernizing mediawiki

2010-03-03 Thread Domas Mituzas
Hi!

> The Wikimedia Foundation makes millions more than Wordpress, but the
> Foundation is running a top 5 website.

wordpress.com is in top20 too :) 

Domas

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] modernizing mediawiki

2010-03-03 Thread Domas Mituzas
Hi!

> I hope I am emailing this to the right group. My concern was about mediawiki 
> and it's limitations, as well as it's outdated methods. As someone wo runs a 
> wiki, I've gone through a lot of frustrations.

Very sad to hear that!

> Mediawiki makes millions more than Wordpress does too

Hahahahaha, ha, hahahahahahahahaha, hahahahah, haha, hahahahahahahaha.

Ha.

Hahaha.

Let me recover, uh, hahahaha, oh, hah, thanks.

First of all, Wordpress is a platform for a commercial product, Wordpress.com, 
backed by a company, Automattic, which has way more funding (it closed a $30M 
investment round two years ago) and nearly 40 employees. 
They have commercial offerings which bring in quite some additional revenue 
they can feed into development. And of course, they have to compete with 
Google's Blogger, SixApart, Facebook, Twitter and others. 

In the large picture, Wikipedia raises money to spread knowledge, and the fact 
that people are using mediawiki in 3rd party environments is a side effect. 

> , why can't the money be put into making a modern product instead of in 
> pockets of the people who run it? I know Wordpress and Mediawiki serve two 
> different purposes, but that's not the point. The point is, one is modern and 
> user friendly (Wordpress), and the other (Mediawiki) is not. Other complaints:

MediaWiki is a very modern product, just not on the visible side (though maybe 
the usability initiative will change that). It has lots of fascinating modern 
things internally :) 
Though of course, with "in pockets of people who run it" you're definitely 
trolling here. :-(

> -Default skins are boring

They were not back in 2005 =) 

> -Very limited in being able to make the wiki look nice like you could with a 
> normal webpage.

Why would that be a priority for foundation developers? 

> -A major pain to update! Wordpress upgrades are so simple.

'svn up' -> done! ;-) Same for Wordpress... :)

> -Better customization so people can get a wiki the way they want.

Feel free to develop it that way. 

> It should be more like the wikis on wikia,

Wikia is mediawiki with extensions. So it is modern, again? 

> except without me having to learn css and php to make those types of 
> customizations.

Why should we be facilitating _your_ needs? 

> Give me some option, some places to put widgets. Not every wiki is going to 
> be as formal as the ones on wikimedia sites.

You can put 'widgets' via extensions. If you need something more, feel free to 
develop that. 

> And don't the people at Wikimedia commons get tired of always having to make 
> changes so it actually suits their site?
> If they had some of the options from the get go, i'm sure they'd appreciate 
> it too.

Maybe. 

> -I don't want to go to my ftp to download my local settings file, add a few 
> lines then reupload it. This is caveman-like behavior for the modern internet.

You can use WebDAV, SFTP, SCP, and your own staging environments.
On the other hand, LocalSettings is the most flexible configuration method; it 
allows managing thousands of wikis in quite a small form factor. 
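
For illustration, a hypothetical single LocalSettings.php driving many wikis (the domain names, database names and selection key are all made up; the real Wikimedia setup visible at noc.wikimedia.org/conf is considerably more involved):

<?php
// Hypothetical multi-wiki LocalSettings.php: one file configuring many wikis.
$wikiDomain = isset( $_SERVER['SERVER_NAME'] ) ? $_SERVER['SERVER_NAME'] : 'default';

// Baseline defaults shared by every wiki.
$wgSitename = 'Example Wiki';
$wgDBname   = 'wiki_default';

// Per-wiki overrides keyed by domain.
$perWiki = array(
    'en.example.org' => array( 'wgDBname' => 'wiki_en', 'wgLanguageCode' => 'en' ),
    'lt.example.org' => array( 'wgDBname' => 'wiki_lt', 'wgLanguageCode' => 'lt' ),
);

if ( isset( $perWiki[$wikiDomain] ) ) {
    foreach ( $perWiki[$wikiDomain] as $setting => $value ) {
        $GLOBALS[$setting] = $value;
    }
}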

> -Being able to manage extensions like wordpress does.

Feel free to develop it :) 
> 
> In short, it's time to spend some money from those millions of dollars from 
> donations to make this software more modern. Being stubborn in modernizing it 
> will only make this software less relevant in the future if other wiki 
> software companies are willing to do things the people at Wikimedia aren't.

The donations are for making the software more modern for Wikimedia sites. 
Funneling them to MediaWiki as an open-source software project is a byproduct. 
:-)

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hiphop progress

2010-03-01 Thread Domas Mituzas
Howdy,

> Looks like a loot of fun :-)

Fun enough to spend my evenings and weekends on it :) 

> this smell like something that can benefict from metadata.
> /* [return  integer] */  function getApparatusId($obj){
>  //body
> }

Indeed - type hints can be quite useful, though hiphop is smart enough to 
figure out from the code that the return will be an integer :)
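
Roughly, the difference looks like this (a toy illustration; exactly what a given hphpc build infers is an assumption on my part):

<?php
// Every return path yields an int, so a static compiler can emit a plain C++ int.
function getApparatusId( $obj ) {
    return $obj === null ? 0 : count( $obj );
}

// The return type depends on runtime data (int or string), so the generated
// code has to fall back to a boxed "can hold anything" value (Variant).
function getApparatusLabel( $obj ) {
    return $obj === null ? 0 : 'apparatus-' . count( $obj );
}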

It is quite interesting to see the enhancements to PHP that have been used inside 
Facebook and are now all released - XHP evolves PHP syntax to fit the web world 
( 
http://www.facebook.com/notes/facebook-engineering/xhp-a-new-way-to-write-php/294003943919
 ), the XBOX thing allows background/async execution of work without standing 
in the way of page rendering, etc. 

> What we can expect?  will future versions of MediaWiki be "hiphop
> compatible"? there will be a fork or snapshot compatible?  The whole
> experiment looks like will help to profile and enhance the engine,
> will it generate a MediaWiki.tar.gz  file we (the users) will able to
> install in our intranetss ??

Well, the build itself is quite portable (you'd have to have a single binary and 
LocalSettings.php ;-) 

Still, the decision to merge certain changes into the MediaWiki codebase (e.g. 
relative includes, rather than $IP-based absolute ones) would be quite 
invasive. 
Also, we'd have to enforce a stricter policy on how some of the dynamic PHP 
features are used. 

I have to deal with three teams here (Wikimedia ops, the MediaWiki development 
community and the hiphop developers) to make this possible. 
Do note, getting it to work for MediaWiki is quite a simple task compared to 
getting it to work in the Wikimedia operations environment. 

What I'd like to see as the final result, though, is a MediaWiki that works fine with 
both Zend and HPHP, with Wikimedia using the latter. 
Unfortunately, I will not be able to visit the Berlin developer meeting to present 
this work to other developers, so I will try to set up some separate discussions. 
You know, most of the work will be coming up with solutions that are acceptable 
to Tim :-) 

> Maybe a blog article about your findings could be nice. It may help
> "write fast PHP code". And will scare littel childrens and PHP
> programmers with a C++ background.

My findings are hectic at the moment, and I don't want to talk too much about 
them until I get a decently working MediaWiki.
BTW, Main_Page and Special:BlankPage were both served in ~12ms. Now I have to 
get the complex parser test cases to work, and such.

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] hiphop progress

2010-03-01 Thread Domas Mituzas
Howdy,

> Most of the code in MediaWiki works just fine with it (since most of
> it is mundane) but things like dynamically including certain files,
> declaring classes, eval() and so on are all out.

There are two types of includes in MediaWiki, the ones I fixed for AutoLoader and 
the ones I didn't - HPHP has all classes loaded, so AutoLoader is redundant. 
Generally, every include that just defines classes/functions is fine with HPHP; 
it is just some of MediaWiki's startup logic (Setup/WebStart) that depends on 
files being included in a certain order, so we have to make sure HipHop understands 
those includes.
There was some different behavior with file inclusion - in Zend you can say 
require("File.php") and it will try the calling script's directory, but if you do 
require("../File.php"), it will be resolved against the current working directory 
rather than the including file's directory. 

We don't have any eval() at the moment, and actually there's a mode in which eval() 
works; people are just too scared of it. 
We had some double class definitions (depending on whether certain components 
are available), as well as double function definitions (ProfilerStub vs 
Profiler).

One of the major problems is simply the still-incomplete set of functions that we'd need:

* session - though we could surely work around it by setting up our own Session 
abstraction, the team at Facebook is already busy implementing full support
* xdiff, mhash - the only two calls to them are from DiffHistoryBlob, so getting 
the feature to work is mandatory for production, not needed for testing (a 
possible stopgap is sketched after this list) :) 
* tidy - we have to call the binary for now
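
A possible stopgap for the mhash half, assuming hash('adler32', ..., true) really matches what DiffHistoryBlob expects from mhash(MHASH_ADLER32, ...) on the target build - that equivalence is an assumption to verify, not something the port guarantees:

<?php
// Hypothetical shim: emulate the one mhash() call MediaWiki needs when the
// mhash extension is absent but the hash extension is available.
if ( !function_exists( 'mhash' ) ) {
    if ( !defined( 'MHASH_ADLER32' ) ) {
        define( 'MHASH_ADLER32', 18 ); // only compared against below; exact value irrelevant here
    }
    function mhash( $hashType, $data ) {
        if ( $hashType === MHASH_ADLER32 ) {
            return hash( 'adler32', $data, true ); // raw binary output, like mhash
        }
        return false; // anything else is not emulated by this shim
    }
}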

function_exists() is somewhat crippled, as far as I understand, so I had to 
work around certain issues there.
There are some other crippled functions, which we hit during testing... 

It is quite fun to hit all the various edge cases in the PHP language (e.g. 
interfaces may have constants) that are broken in hiphop. 
The good thing is having developers carefully reading/looking at those. Some things 
are still broken, some can be worked around in MediaWiki. 

Some of the crashes I hit are quite difficult to reproduce - it is easier to bypass 
that code for now and come up with good reproduction cases later. 

> Even if it wasn't hotspots like the parser could still be compiled
> with hiphop and turned into a PECL extension.

hiphop provides a major boost for the actual MediaWiki initialization too - while 
Zend has to reinitialize objects and data all the time, having all that in the core 
process image is quite efficient. 

> One other nice thing about hiphop is that the compiler output is
> relatively readable compared to most compilers. Meaning that if you

That especially helps with debugging :) 

> need to optimize some particular function it's easy to take the
> generated .cpp output and replace the generated code with something
> more native to C++ that doesn't lose speed because it needs to
> manipulate everything as a php object.

Well, that is not entirely true - if it manipulated everything as a PHP object 
(zval), it would be as slow and inefficient as PHP. The major cost benefit here 
is that it does strict type inference, and falls back to Variant only when it 
cannot come up with a decent type. 
And yes, one can find offending code that causes the expensive paths. I don't 
see manual C++ code optimizations as the way to go though - because they'd be 
overwritten by the next code build.

Anyway, there are lots of interesting problems once we get MediaWiki working on 
it - that is, how we would deploy it, how we would maintain it, etc.
Building on a single box takes around 10 minutes, and the image has to be 
replaced by shutting down the old one and starting the new one, not just overwriting 
the files. 

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hiphop! :)

2010-02-28 Thread Domas Mituzas
> 
> Nevertheless - a process isn't the same process when it's going at 10x
> the speed. This'll be interesting.

not 10x. I did concurrent benchmarks for API requests (e.g. opensearch) on 
modern boxes, and saw:

HipHop: Requests per second:1975.39 [#/sec] (mean)
Zend: Requests per second:371.29 [#/sec] (mean)

these numbers seriously kick ass. I still can't believe I'm seeing 2000 
MediaWiki requests/s from a single box ;-)

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] hiphop! :)

2010-02-27 Thread Domas Mituzas
Hi!

> For those of us not familiar with MediaWiki benchmarking, what kind of
> times were you getting without hiphop?

Zend: 

> Domas, how much hacking did you have to do to MediaWiki to get it to
> compile in Hiphop?

Lots. I'm trying to get basic functionality/prototypes to work.
Some changes had to be made to HipHop itself, some to the generated 
code, and some to MediaWiki. 

MediaWiki's "run wherever I can" dynamic adaptation to any environment isn't 
too helpful sometimes...

Domas


P.S. Zend: 

Concurrency Level:  1
Time taken for tests:   1.444158 seconds
Complete requests:  100
Failed requests:0
Write errors:   0
Total transferred:  138020 bytes
HTML transferred:   109600 bytes
Requests per second:69.24 [#/sec] (mean)
Time per request:   14.442 [ms] (mean)
Time per request:   14.442 [ms] (mean, across all concurrent requests)
Transfer rate:  92.79 [Kbytes/sec] received

Connection Times (ms)
  min  mean[+/-sd] median   max
Connect:00   0.0  0   0
Processing:14   14   0.0 14  14
Waiting:   10   12   1.7 14  14
Total: 14   14   0.0 14  14
WARNING: The median and mean for the waiting time are not within a normal 
deviation
These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
  50% 14
  66% 14
  75% 14
  80% 14
  90% 14
  95% 14
  98% 14
  99% 14
 100% 14 (longest request)


___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


[Wikitech-l] hiphop! :)

2010-02-27 Thread Domas Mituzas


r...@flack:/hiphop/web/phase3/includes# ab -n 100 -c 1 
'http://dom.as:8085/phase3/api.php?action=query&prop=info&titles=Main%20Page'
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking dom.as (be patient).done


Server Software:
Server Hostname:dom.as
Server Port:8085

Document Path:  
/phase3/api.php?action=query&prop=info&titles=Main%20Page
Document Length:991 bytes

Concurrency Level:  1
Time taken for tests:   0.389 seconds
Complete requests:  100
Failed requests:0
Write errors:   0
Total transferred:  116600 bytes
HTML transferred:   99100 bytes
Requests per second:256.87 [#/sec] (mean)
Time per request:   3.893 [ms] (mean)
Time per request:   3.893 [ms] (mean, across all concurrent requests)
Transfer rate:  292.49 [Kbytes/sec] received

Connection Times (ms)
  min  mean[+/-sd] median   max
Connect:00   0.0  0   0
Processing: 34   0.2  4   4
Waiting:24   0.4  4   4
Total:  34   0.2  4   4

Percentage of the requests served within a certain time (ms)
  50%  4
  66%  4
  75%  4
  80%  4
  90%  4
  95%  4
  98%  4
  99%  4
 100%  4 (longest request)
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-23 Thread Domas Mituzas
> is false.  (As near as I can tell, the header is required only
> for those requests that include an "action=" modifier.)

wrong.

"all uncached or uncacheable requests" would probably be more true version of 
it :)

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-17 Thread Domas Mituzas
> 
> It showed that there was quite a bit of bathwater thrown out.  And at least
> one very large baby (Google translation), which was temporarily
> resurrected.  We still don't know how many other, smaller, babies were
> thrown out, and likely never will.

I'm pretty sure that at least 99.9% of the drop in those graphs was exactly 
the activity I was going after.
Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Domas Mituzas
Hi!

> 1) Only if you've already identified the spammer through some other process
> (otherwise, you don't even know if they're using automated software).

You probably don't get the scale of Wikipedia or the scale of the behavior we had to 
deal with if you think it isn't possible to notice its behavior patterns :-)

> 2) It doesn't really show that the user is acting malicious even if you can

Acts like a duck, quacks like a duck :) 

> Regardless, what are you going to do about it?  Block the IP?

Perhaps. 

>  For how long?

Depends. Probably indefinitely.

>  Even if it's dynamic?

Dunno, probably not.

>  Even if it's shared by many others?

Would avoid it.

Anyway, you probably are missing one important point. 
We're trying to make Wikipedia's service better. 

It doesn't always have definite answers, and one has to balance and search for 
decent solutions.
Probably everything looks easier from your armchair. I'd love to have that 
view! :)

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Domas Mituzas
Robert,

> The current English error message text that I see from Python reads:

our error message system became overkill, with all those nice designs and 
multiple languages.
in certain cases serving that message to certain requests caused gigabits of 
bandwidth.
it is also not practical to update it with policies, because, um, it has all 
those nice designs and multiple languages.

we may actually end up having better error messages at some point in the 
future. 

> Everything except the very last line of that is either irrelevant or
> wrong.  And ERR_ACCESS_DENIED, though vaguely informative, provides no
> detail about what happened or how to do things properly.

I agree. 

> This is bad enough for bot operators who are likely to be fairly
> intelligent people, but if we are going to give this to everyone with
> a missing user agent string too (which includes people behind poorly
> behaved proxies and people who use certain anonminizing software out
> of intense desire for "privacy"), then this kind of response really
> starts to send the wrong message.

We're not sending this response to missing UAs, as this response is being sent 
by Squid ACLs, and the UA check is done at the MW side.

Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


Re: [Wikitech-l] User-Agent:

2010-02-16 Thread Domas Mituzas
Anthony,
>> Yes, we will ban all IPs participating in this.
> Guess it's just a matter of time until *reading* Wikipedia is unavailable to
> large portions of the world.

Your insight is entirely bogus here. 

> And "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT
> 5.1)", is pretty much
> useless, unless you've already identified the spammer through some other
> process.

It isn't useless. It clearly shows that the user is acting maliciously by running 
automated software that disguises itself under a common user agent. 

> Do any of the other major websites completely block traffic when they see
> blank user agents?


I don't know about UA policies but...
Various websites have various techniques to deal with such problems.
On the other hand, no other major website has such a scarcity of hardware and/or 
human resources as Wikipedia, given the exposure and API complexity provided. 
BR,
Domas
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l


  1   2   3   >