[MediaWiki-commits] [Gerrit] analytics...WDCM[master]: Labs processing 17 Dec 2017

2017-12-17 Thread GoranSMilovanovic (Code Review)
GoranSMilovanovic has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/398692 )

Change subject: Labs processing 17 Dec 2017
..


Labs processing 17 Dec 2017

Change-Id: I1cfad83ed71be0e046205a5cff09987df0a8ab7f
---
M WDCM_EngineGeo_goransm.R
A WDCM_Process.R
A WDCM_Update_Labs.R
3 files changed, 1,159 insertions(+), 2 deletions(-)

Approvals:
  GoranSMilovanovic: Verified; Looks good to me, approved



diff --git a/WDCM_EngineGeo_goransm.R b/WDCM_EngineGeo_goransm.R
index 635606b..4a04518 100644
--- a/WDCM_EngineGeo_goransm.R
+++ b/WDCM_EngineGeo_goransm.R
@@ -288,14 +288,14 @@
 ### --- join coordinates, items, labels, and usage
 setwd(dataDir)
 
-# - list .tsv files
+# - list .tsv files from dataDir
 lF <- list.files()
 w <- which(grepl("^wdcm_geoitem", lF))
 lF <- lF[w]
 w <- which(grepl(".tsv", lF, fixed = T))
 lF <- lF[w]
 
-# - remove old .csv files:
+# - remove old .csv files from dataDir
 rmF <- list.files()
 w <- which(grepl("^wdcm_geoitem", rmF))
 rmF <- rmF[w]
diff --git a/WDCM_Process.R b/WDCM_Process.R
new file mode 100644
index 0000000..467892d
--- /dev/null
+++ b/WDCM_Process.R
@@ -0,0 +1,910 @@
+
+### ---
+### --- WDCM Process Module, v. Beta 0.1
+### --- Script: WDCM_Process_v2.R, v. Beta 0.1
+### ---
+### --- DESCRIPTION:
+### --- WDCM_Process_v2.R takes a list of .tsv files that present
+### --- the data from wbc_entity_usage tables across the client projects
+### --- fetched from production (stat1005) by WDCM_Search_Clients.R and 
+### --- further pre-processed by WDCM_Pre-Process.R (also on production).
+### --- The goal of this WDCM module/script is to produce (or update) 
+### --- the WDCM Stats Dashboard database.
+### ---
+### --- INPUT: 
+### --- the WDCM_Process_v2.R reads the .tsv input files from:
+### --- /home/goransm/WMDE/WDCM/WDCM_DataIN/WDCM_DataIN_ClientUsage_v2/
+### --- on the wikidataconcepts.eqiad.wmflabs Cloud VPS instance.
+### --- These files are brought to Labs directly from production
+### --- (currently the stat1005.eqiad.wmnet statbox)
+### ---
+### --- OUTPUT: the WDCM Dashboards MariaDB database is updated
+### ---
+
+### ---
+### --- LICENSE:
+### ---
+### --- GPL v2
+### --- This file is part of Wikidata Concepts Monitor (WDCM)
+### ---
+### --- WDCM is free software: you can redistribute it and/or modify
+### --- it under the terms of the GNU General Public License as published by
+### --- the Free Software Foundation, either version 2 of the License, or
+### --- (at your option) any later version.
+### ---
+### --- WDCM is distributed in the hope that it will be useful,
+### --- but WITHOUT ANY WARRANTY; without even the implied warranty of
+### --- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+### --- GNU General Public License for more details.
+### ---
+### --- You should have received a copy of the GNU General Public License
+### --- along with WDCM. If not, see <http://www.gnu.org/licenses/>.
+### ---
+
+### --- Setup
+library(RMySQL)
+library(httr)
+library(XML)
+library(data.table)
+library(dplyr)
+library(tidyr)
+library(readr)
+library(htmltab)
+library(snowfall)
+library(maptpx)
+library(Rtsne)
+
+# - mysql --defaults-file=/home/goransm/mySQL_Credentials/replica.my.cnf -h tools.labsdb u16664__wdcm_p
+# - database: u16664__wdcm_p
+
+### --- functions
+
+# - projectType() to determine project type
+projectType <- function(projectName) {
+  unname(sapply(projectName, function(x) {
+    if (grepl("commons", x, fixed = T)) {"Commons"
+    } else if (grepl("mediawiki|meta|species|wikidata", x)) {"Other"
+    } else if (grepl("wiki$", x)) {"Wikipedia"
+    } else if (grepl("quote$", x)) {"Wikiquote"
+    } else if (grepl("voyage$", x)) {"Wikivoyage"
+    } else if (grepl("news$", x)) {"Wikinews"
+    } else if (grepl("source$", x)) {"Wikisource"
+    } else if (grepl("wiktionary$", x)) {"Wiktionary"
+    } else if (grepl("versity$", x)) {"Wikiversity"
+    } else if (grepl("books$", x)) {"Wikibooks"
+    } else {"Other"}
+  }))
+}
+
+### ---
+### --- NOTE:
+### --- TABLE NAMING CONVENTION FOR v2 (WDCM Stats Dashboard)
+### --- wdcm2_something
+### ---
+
+# - to nohup.out
+print(paste("WDCM Process.R update started at: ", Sys.time(), sep 
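
The projectType() helper added in WDCM_Process.R classifies a client project by its database name. The following is a minimal sketch of how it behaves, not part of the change itself. The function body is copied from the diff (with `T` spelled out as `TRUE`); the sample project names and the `projects` variable are illustrative assumptions and do not come from the commit.

```r
# projectType() as it appears in the diff, re-indented; fixed = T written as TRUE
projectType <- function(projectName) {
  unname(sapply(projectName, function(x) {
    if (grepl("commons", x, fixed = TRUE)) {"Commons"
    } else if (grepl("mediawiki|meta|species|wikidata", x)) {"Other"
    } else if (grepl("wiki$", x)) {"Wikipedia"
    } else if (grepl("quote$", x)) {"Wikiquote"
    } else if (grepl("voyage$", x)) {"Wikivoyage"
    } else if (grepl("news$", x)) {"Wikinews"
    } else if (grepl("source$", x)) {"Wikisource"
    } else if (grepl("wiktionary$", x)) {"Wiktionary"
    } else if (grepl("versity$", x)) {"Wikiversity"
    } else if (grepl("books$", x)) {"Wikibooks"
    } else {"Other"}
  }))
}

# Hypothetical database names, chosen only to exercise several branches
projects <- c("enwiki", "commonswiki", "dewikivoyage",
              "frwiktionary", "wikidatawiki", "enwikiquote")

# One project type per input name, returned as an unnamed character vector
types <- projectType(projects)
print(data.frame(project = projects, type = types))
```

The order of the branches matters: "commonswiki" and "wikidatawiki" both end in "wiki", so they have to be caught by the earlier checks before the `wiki$` pattern would label them as Wikipedia.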

[MediaWiki-commits] [Gerrit] analytics...WDCM[master]: Labs processing 17 Dec 2017

2017-12-17 Thread GoranSMilovanovic (Code Review)
GoranSMilovanovic has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/398692 )

Change subject: Labs processing 17 Dec 2017
..

Labs processing 17 Dec 2017

Change-Id: I1cfad83ed71be0e046205a5cff09987df0a8ab7f
---
M WDCM_EngineGeo_goransm.R
A WDCM_Process.R
A WDCM_Update_Labs.R
3 files changed, 1,159 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.wikimedia.org:29418/analytics/wmde/WDCM refs/changes/92/398692/1
