Re: [Wiki-research-l] Availability of hourly pagecounts files

James Salsman Sat, 11 Jan 2020 16:30:25 -0800

That's fascinating, John; thank you. I'm copying this to wiki-research-l and
Fabian Suchanek, who gave the first part of the Research Showcase last month.


What do you like for coding stories? https://quanteda.io/reference/dfm.html ?
Sentiment is hard because errors are often 180 degrees away from correct.

How do you both feel about Soru et al. (June 2018), "Neural Machine Translation
for Query Construction and Composition"
https://www.researchgate.net/publication/326030040 ?


On Sat, Jan 11, 2020 at 3:46 PM John Urbanik <johnurba...@gmail.com> wrote:
>
> Jim,
>
> I used to work as the chief data scientist at Collin's company.
>
> I'd suggest looking at things like relationships between the views / edits 
> for sets of pages as well as aggregating large sets of page views for 
> different pages in various ways. There isn't a lot of literature that is 
> directly applicable, and I can't disclose the precise methods being used due 
> to NDA.
>
> In general, much of the pageview data is Weibull or GEV distributed on top of
> being non-stationary, so I'd suggest looking into papers from extreme value
> theory literature as well as literature around Hawkes/Queue-Hawkes processes.
> Most traditional ML and signal processing is not very effective without doing
> some pretty substantial pre-processing, and even then things are pretty
> messy, depending on what you're trying to predict; most variables are
> heteroskedastic w.r.t. pageviews and there are a lot of real-world events that
> can cause false positives.
>
> Further, concept drift is pretty rapid in this space and structural breaks 
> happen quite frequently, so the reliability of a given predictor can change 
> extremely rapidly. Understanding how much training data to use for a given 
> prediction problem is itself a super interesting problem since there may be 
> some horizon after which the predictor loses power, but decreasing the 
> horizon too much means overfitting and loss of statistical significance.
>
> Good luck!
>
> John
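
A rough sketch of the training-horizon trade-off John describes above, in case
it helps anyone following along. This is not his method (which is under NDA).
The trailing-mean predictor, the synthetic series with a single level shift,
and the names walk_forward_mae and score_windows are all assumptions made
purely for illustration.

```python
import random
from statistics import mean


def walk_forward_mae(series, window, start):
    """Mean absolute one-step-ahead error of a trailing-mean predictor.

    At each step t the prediction is the mean of the previous `window`
    observations, so `window` plays the role of the training horizon.
    """
    errors = []
    for t in range(start, len(series)):
        prediction = mean(series[t - window:t])
        errors.append(abs(series[t] - prediction))
    return mean(errors)


def score_windows(series, candidates):
    """Score every candidate window on the same evaluation points."""
    start = max(candidates)  # longest window decides where scoring can begin
    return {w: walk_forward_mae(series, w, start) for w in candidates}


# Synthetic stand-in for pageviews (an assumption, not real data):
# noisy level of 100 that jumps to 300 halfway through (a structural break).
rng = random.Random(0)
series = [100 + rng.gauss(0, 20) for _ in range(200)]
series += [300 + rng.gauss(0, 20) for _ in range(200)]

candidates = [1, 5, 20, 150]
scores = score_windows(series, candidates)
best = min(scores, key=scores.get)

for w in candidates:
    print(f"window={w:>3}  MAE={scores[w]:.1f}")
print("lowest-error window:", best)
```

On a series like this, the longest window scores badly because it keeps
averaging in pre-break data long after the shift, while a window of 1 mostly
chases noise. Real pageview data would need a proper model and some
variance-stabilising pre-processing before a comparison like this means much.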

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
