> Downloading gigs and gigs of raw data and then processing it is generally
> more impractical for end-users.

You were talking about 3.7M articles. :) It is way more practical than working 
with pointwise APIs though :-)

> Any tips? :-)  My thoughts were that the schema used by the GlobalUsage
> extension might be reusable here (storing wiki, page namespace ID, page
> namespace name, and page title).

I don't know what GlobalUsage does, but probably it is all wrong ;-)

> As I recall, the system of determining which domain a request went to is a
> bit esoteric and it might be worth the cost to store the whole domain
> name in order to cover edge cases (labs wikis, wikimediafoundation.org,
> *.wikimedia.org, etc.).

*shrug*, maybe. If I ran a second pass, I'd aim for a cache-oblivious system
with compressed data both on disk and in cache (currently it is a B-tree with
standard B-tree costs).
Then we could actually store more data ;-) Do note, there're _lots_ of data
items, and increasing per-item cost may quadruple resource usage ;-)

OTOH, expanding project names is straightforward, if you know how.
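
A minimal sketch of that expansion in Python, for illustration only. The suffix table is an assumption about how the pagecounts files abbreviate projects (a bare language code meaning Wikipedia, ".b" Wikibooks, and so on) and should be checked against the real files; the name expand_project is made up here, and edge cases such as labs wikis are not covered.

```python
# Assumed suffix -> domain table; verify against the actual data before relying on it.
SUFFIX_TO_DOMAIN = {
    "": "wikipedia.org",
    "b": "wikibooks.org",
    "d": "wiktionary.org",
    "m": "wikimedia.org",
    "n": "wikinews.org",
    "q": "wikiquote.org",
    "s": "wikisource.org",
    "v": "wikiversity.org",
    "w": "mediawiki.org",
}


def expand_project(code):
    """Expand a short project code such as 'en.b' into a host name.

    Returns None for codes this (assumed) table does not know about.
    """
    prefix, _, suffix = code.partition(".")
    domain = SUFFIX_TO_DOMAIN.get(suffix)
    if domain is None or not prefix:
        return None
    return f"{prefix}.{domain}"
```

Storing the short code and expanding it only on output keeps the per-item cost low, which matters with this many items.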

> There's some sort of distinction between projectcounts and pagecounts (again
> with documentation) that could probably stand to be eliminated or
> simplified.

projectcounts are aggregated by project, pagecounts are aggregated by page. If
you looked at the data it should be obvious ;-)
And yes, the best documentation was probably in some email somewhere. I should've
started a decent project with descriptions and support and whatever.
Maybe once we move data distribution back into WMF proper; there's no need for
it to keep living somewhere in Germany, as it does nowadays.
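
To make the projectcounts/pagecounts distinction concrete, here is a rough Python sketch. It assumes each pagecounts line is four space-separated fields (project, title, request count, bytes); that layout and the function names are assumptions for illustration, and the real projectcounts files need not equal these sums exactly.

```python
from collections import Counter


def parse_pagecounts_line(line):
    """Split one line into (project, title, requests, bytes).

    Assumes four space-separated fields; raises ValueError otherwise.
    """
    project, title, requests, size = line.rstrip("\n").split(" ")
    return project, title, int(requests), int(size)


def project_totals(lines):
    """Sum per-page request counts by project.

    This is the per-project view; the input lines are the per-page view.
    """
    totals = Counter()
    for line in lines:
        try:
            project, _title, requests, _size = parse_pagecounts_line(line)
        except ValueError:
            continue  # skip malformed lines instead of failing the whole run
        totals[project] += requests
    return totals


# Made-up sample lines, not real data.
sample = [
    "en Main_Page 42 1000\n",
    "en Special:Random 8 200\n",
    "de Hauptseite 5 300\n",
    "garbage line\n",
]
```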

> But the biggest improvement would be post-processing (cleaning up) the
> source files. Right now if there are anomalies in the data, every re-user is
> expected to find and fix these on their own. It's _incredibly_ inefficient
> for everyone to adjust the data (for encoding strangeness, for bad clients,
> for data manipulation, for page existence possibly, etc.) rather than having
> the source files come out cleaner.

Raw data is fascinating in that regard though - one can see which clients are
bad, what the anomalies are, how titles get encoded, which titles are erroneous,
etc.
There're zillions of ways to do post-processing, and none of them will match
every user's needs.
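
As one example of such post-processing (one choice among many, not a recommended canonical form), here is a Python sketch that percent-decodes a title, rejects anything that is not valid UTF-8, and uses underscores instead of spaces. The function name and the exact rules are assumptions for illustration.

```python
from urllib.parse import unquote


def normalize_title(raw):
    """Return a cleaned-up title, or None if it cannot be decoded as UTF-8."""
    try:
        # errors="strict" makes invalid byte sequences raise instead of
        # being silently replaced.
        title = unquote(raw, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    title = title.replace(" ", "_").strip("_")
    return title or None
```

A different user might well want the rejected titles, which is the argument for keeping the raw files around as they are.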

> I think your first-pass was great. But I also think it could be improved.
> :-)

Sure, it can be improved in many ways, including more data: some people ask
for (page, geography) aggregations, though with our long tail that is huuuuuge
dataset growth ;-)

> I meant that it wouldn't be very difficult to write a script to take the raw
> data and put it into a public database on the Toolserver (which probably has
> enough hardware resources for this project currently).

I doubt Toolserver has enough resources to have this data thrown at it and
then queried, unless you simplify the needs a lot.
There's 5G of raw uncompressed data per day in text form, and the long tail makes
caching quite painful, unless you go for cache-oblivious methods.
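
A back-of-the-envelope view of what that daily volume adds up to, as a tiny Python sketch. It reads "5G" as 5 gigabytes and assumes the daily volume stays flat, which is a simplification.

```python
RAW_GB_PER_DAY = 5  # figure from the paragraph above, read as gigabytes


def raw_volume_gb(days):
    """Uncompressed text volume after the given number of days, in GB."""
    return RAW_GB_PER_DAY * days


print(raw_volume_gb(30))   # one month
print(raw_volume_gb(365))  # one year
```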

> It's maintainability
> and sustainability that are the bigger concerns. Once you create a public
> database for something like this, people will want it to stick around
> indefinitely. That's quite a load to take on.

I'd love to see all the data preserved indefinitely. It is one of the most
interesting datasets around, and its value for the future is quite incredible.

> I'm also likely being incredibly naïve, though I did note somewhere that it
> wouldn't be a particularly small undertaking to do this project well.

Well, the initial work took a few hours ;-) I guess by spending a few more hours
we could improve that, if we really knew what we want.

> I'd actually say that having data for non-existent pages is a feature, not a
> bug. There's potential there to catch future redirects and new pages, I
> imagine.

That is one of the reasons we don't eliminate that data from the raw dataset
now. I don't see it as a bug, I just see that for long-term aggregations that
data could be omitted.

> A user wants to analyze a category with 100 members for the page view data
> of each category member. You think it's a Good Thing that the user has to
> first spend countless hours processing gigabytes of raw data in order to do
> that analysis? It's a Very Bad Thing. And the people who are capable of
> doing analysis aren't always the ones capable of writing the scripts and the
> schemas necessary to get the data into a usable form.

No, I think we should have an API to that data, to fetch small sets of data
without much pain.
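
A sketch of what the lookup behind such an API could look like, using Python's built-in sqlite3 module. The table layout, column names, and function names are all hypothetical; this only shows the "small set of titles in, totals out" shape, not how a real service would store the data.

```python
import sqlite3


def build_index(rows):
    """Load (project, title, day, requests) tuples into an in-memory table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE pageviews ("
        " project TEXT, title TEXT, day TEXT, requests INTEGER,"
        " PRIMARY KEY (project, title, day))"
    )
    db.executemany("INSERT INTO pageviews VALUES (?, ?, ?, ?)", rows)
    return db


def views_for_titles(db, project, titles):
    """Return {title: total requests} for a small list of titles.

    Titles with no recorded views are simply absent from the result.
    """
    titles = list(titles)
    if not titles:
        return {}
    placeholders = ", ".join("?" for _ in titles)
    query = (
        "SELECT title, SUM(requests) FROM pageviews"
        f" WHERE project = ? AND title IN ({placeholders})"
        " GROUP BY title"
    )
    return dict(db.execute(query, [project, *titles]))
```

For the category example, the caller would pass the 100 member titles in one call and get the totals back, without touching the raw files.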

> The reality is that a large pile of data that's not easily queryable is
> directly equivalent to no data at all, for most users. Echoing what I said
> earlier, it doesn't make much sense for people to be continually forced to
> reinvent the wheel (post-processing raw data and putting it into a queryable
> format).

I agree. By opening up the dataset I expected others to build upon that and 
create services. 
Apparently that doesn't happen. As lots of people use the data, I guess there
is a need for it, but not enough will to build anything for others to use, so it
will end up being created in WMF proper.

Building a service where data would be shown on every article is quite a
different task from just supporting analytical workloads.
For now, building a queryable service has been on my todo list, but there were
too many initiatives around that suggested that someone else would do that ;-)

Domas



_______________________________________________
Wikitech-l mailing list
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
