GoranSMilovanovic added a subscriber: JAllemandou.
GoranSMilovanovic added a comment.

@JAllemandou Hey, I need some insight into the production code for the Clickstream dataset, but I can't find the code repository anywhere. Maybe you could help? Thanks. N.B. I am not looking for the Python use cases (I've found them), nor for the SQL extraction of the monthly updates (I've seen that too), but rather for the code that feeds the tables in the clickstream database in Hadoop.

@Lea_WMDE I still need to get back to you on this one. I have studied the existing datasets, but I still have some thinking to do about how to get to what we need based on what is already in production. In general I think it can be done, but it is not going to be "cheap" in any respect (i.e. in time or computational resources).

In general, the prima facie structure of the Clickstream dataset is what we are looking for, except that we need some additional fields (desktop/mobile, logged-in/anonymous) and that our filtering criteria (e.g. how we define a user session) might be different.

Also, I would need to study the ua_parser library (luckily, there's an R version) and find out how Analytics-Engineering used it to filter out spider traffic. In other words, it can be done, and we can have the dataset (someday), but it is going to be complex and take quite some time, especially if I have to produce every bit of it by myself.


TASK DETAIL
https://phabricator.wikimedia.org/T208569

To: GoranSMilovanovic
Cc: JAllemandou, Aklapper, Lea_WMDE, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, D3r1ck01, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331