Re: [Wiki-research-l] [Technical][Request for Comment] A new format for the pageview dumps

2015-03-19 Thread Giovanni Luca Ciampaglia
Hi Oliver, Tab-separation would be welcomed. Title normalisation would be *very* useful too. Another thing that could potentially save a lot of space would be to throw out all malformed requests, pieces of javascript, and similar junk. Not sure how difficult that would be though, without doing an

Re: [Wiki-research-l] [Technical][Request for Comment] A new format for the pageview dumps

2015-03-19 Thread aaron shaw
Adding to Giovanni's points (all of which I agree with 100%): - This would be awesome! The pageviews are super useful for many of us, and cleaning them up a bit would save a lot of redundant work down the road. - If you don't have to collapse page views incoming from mobile and

Re: [Wiki-research-l] [Technical][Request for Comment] A new format for the pageview dumps

2015-03-19 Thread Oliver Keyes
Thanks all for the awesome comments :). Will get to them tomorrow morning![1] [1] East coast time. On 19 March 2015 at 20:37, aaron shaw aarons...@northwestern.edu wrote: Adding to Giovanni's points (all of which I agree with 100%): - This would be awesome! The pageviews are super useful for

[Wiki-research-l] [Technical][Request for Comment] A new format for the pageview dumps

2015-03-13 Thread Oliver Keyes
So, we've got a new pageviews definition; it's nicely integrated and spitting out TRUE/FALSE values on each row with the best of 'em. But what does that mean for third-party researchers? Well... not much, at the moment, because the data isn't being released anywhere. But one resource we do have