[MediaWiki-commits] [Gerrit] search/MjoLniR[master]: Collect feature vectors from elasticsearch

2017-05-11 Thread DCausse (Code Review)
DCausse has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/349143 )

Change subject: Collect feature vectors from elasticsearch
..


Collect feature vectors from elasticsearch

Simple and straightforward collection of feature vectors from
elasticsearch. For the moment this skips the kafka middleman
that is planned to be used eventually for shipping data between
analytics and prod networks. That can be added, but seems best to
start with something simple and obvious.

This includes a relatively straightforward way of defining features,
but hopefully as work progresses on the elasticsearch plugin we can
remove that and provide elasticsearch with only the name of some
feature set to collect information about.

Bug: T163407
Change-Id: Iaf3d1eab15728397c8f197c9410477430cdba8a0
---
M .gitignore
A mjolnir/features.py
M mjolnir/spark/__init__.py
A mjolnir/test/fixtures/requests/test_features.sqlite3
A mjolnir/test/test_features.py
M setup.py
6 files changed, 454 insertions(+), 1 deletion(-)

Approvals:
  DCausse: Verified; Looks good to me, approved
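
The feature definitions mentioned in the commit message can be pictured with a minimal sketch. It assumes a feature is little more than a name plus the Elasticsearch clause it generates. The helper `multi_match_feature` and the field names `title` and `opening_text` are hypothetical and are not taken from this change; only the `multi_match` parameters themselves appear in the diff below.

```python
import json


def multi_match_feature(query, fields, minimum_should_match=1,
                        match_type="most_fields"):
    """Hypothetical helper: build the multi_match clause for one feature."""
    if not fields:
        raise ValueError("at least one field is required")
    return {
        "multi_match": {
            "query": query,
            "minimum_should_match": minimum_should_match,
            "type": match_type,
            "fields": list(fields),
        }
    }


# Field names are placeholders chosen for illustration only.
clause = multi_match_feature("example query", ["title", "opening_text"])
print(json.dumps(clause, indent=2, sort_keys=True))
```

Each such clause would be scored against a fixed set of documents, so the score Elasticsearch returns for a hit can be read as that feature's value for the (query, page) pair.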



diff --git a/.gitignore b/.gitignore
index 4b7c536..1e56238 100644
--- a/.gitignore
+++ b/.gitignore
@@ -5,6 +5,7 @@
 
 # Distribution / packaging
 venv/
+build/
 *.egg-info/
 *.egg
 *.log
diff --git a/mjolnir/features.py b/mjolnir/features.py
new file mode 100644
index 000..7a26acd
--- /dev/null
+++ b/mjolnir/features.py
@@ -0,0 +1,346 @@
+"""
+Integration for collecting feature vectors from elasticsearch
+"""
+
+import json
+import mjolnir.spark
+from pyspark.ml.linalg import Vectors
+from pyspark.sql import functions as F
+import random
+import requests
+
+
+def _wrap_with_page_ids(hit_page_ids, should):
+    """Wrap an elasticsearch query with an ids filter.
+
+    Parameters
+    ----------
+    hit_page_ids : list of ints
+        Set of page ids to collect features for
+    should : dict or list of dict
+        Elasticsearch query for a single feature
+
+    Returns
+    -------
+    string
+        JSON encoded elasticsearch query
+    """
+    assert len(hit_page_ids) < 1
+    if not isinstance(should, list):
+        should = [should]
+    return json.dumps({
+        "_source": False,
+        "from": 0,
+        "size": ,
+        "query": {
+            "bool": {
+                "filter": {
+                    'ids': {
+                        'values': map(str, set(hit_page_ids)),
+                    }
+                },
+                "should": should,
+                "disable_coord": True,
+            }
+        }
+    })
+
+
+class ScriptFeature(object):
+    """
+    Query feature using elasticsearch script_score
+
+    ...
+
+    Methods
+    -------
+    make_query(query)
+        Build the elasticsearch query
+    """
+
+    def __init__(self, name, script, lang='expression'):
+        self.name = name
+        self.script = script
+        self.lang = lang
+
+    def make_query(self, query):
+        """Build the elasticsearch query
+
+        Parameters
+        ----------
+        query : string
+            User provided query term (unused)
+        """
+        return {
+            "function_score": {
+                "score_mode": "sum",
+                "boost_mode": "sum",
+                "functions": [
+                    {
+                        "script_score": {
+                            "script": {
+                                "inline": self.script,
+                                "lang": self.lang,
+                            }
+                        }
+                    }
+                ]
+            }
+        }
+
+
+class MultiMatchFeature(object):
+    """
+    Query feature using elasticsearch multi_match
+
+    ...
+
+    Methods
+    -------
+    make_query(query)
+        Build the elasticsearch query
+    """
+    def __init__(self, name, fields, minimum_should_match=1, match_type="most_fields"):
+        """
+
+        Parameters
+        ----------
+        name : string
+            Name of the feature
+        fields : list
+            Fields to perform multi_match against
+        minimum_should_match: int, optional
+            Minimum number of fields that should match. (Default: 1)
+        match_type : string, optional
+            Type of match to perform. (Default: most_fields)
+        """
+        self.name = name
+        assert len(fields) > 0
+        self.fields = fields
+        self.minimum_should_match = minimum_should_match
+        self.match_type = match_type
+
+    def make_query(self, query):
+        """Build the elasticsearch query
+
+        Parameters
+        ----------
+        query : string
+            User provided query term
+        """
+        return {
+            "multi_match": {
+                "query": query,
+                "minimum_should_match": self.minimum_should_match,
+                "type": self.match_type,
+                "fields

[MediaWiki-commits] [Gerrit] search/MjoLniR[master]: Collect feature vectors from elasticsearch

2017-04-19 Thread EBernhardson (Code Review)
EBernhardson has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/349143 )

Change subject: Collect feature vectors from elasticsearch
..

Collect feature vectors from elasticsearch

Simple and straightforward collection of feature vectors from
elasticsearch. For the moment this skips the kafka middleman
that is planned to be used eventually for shipping data between
analytics and prod networks. That can be added, but seems best to
start with something simple and obvious.

This includes a relatively straightforward way of defining features,
but hopefully as work progresses on the elasticsearch plugin we can
remove that and provide elasticsearch with only the name of some
feature set to collect information about.

Bug: T163407
Change-Id: Iaf3d1eab15728397c8f197c9410477430cdba8a0
---
M .gitignore
A mjolnir/features.py
A mjolnir/test/fixtures/requests/test_features.sqlite3
A mjolnir/test/test_features.py
M setup.py
5 files changed, 408 insertions(+), 1 deletion(-)
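
For orientation, the ids-filter wrapping introduced in this patch set might look roughly like the following Python 3 sketch. It restricts scoring to a known set of page ids and places the feature clause under `should`, so that the score of each hit reflects only that feature. The name `wrap_with_page_ids`, the parameter `max_hits`, and its default value are assumptions made for illustration and are not values from the change.

```python
import json


def wrap_with_page_ids(hit_page_ids, should, max_hits=1000):
    """Restrict feature clauses to a fixed set of page ids.

    max_hits is an assumed limit, chosen for illustration only.
    """
    ids = sorted(set(hit_page_ids))
    if len(ids) >= max_hits:
        raise ValueError("too many page ids for one request")
    if not isinstance(should, list):
        should = [should]
    # The diff also sets "disable_coord"; it is left out here to keep
    # the sketch minimal.
    return json.dumps({
        "_source": False,
        "from": 0,
        "size": max_hits,
        "query": {
            "bool": {
                # The filter selects documents without affecting the score.
                "filter": {"ids": {"values": [str(i) for i in ids]}},
                "should": should,
            }
        },
    })


body = wrap_with_page_ids([3, 1, 3], {"match": {"title": "example"}})
print(body)
```

The list comprehension stands in for the `map(str, ...)` call in the diff, which returns a list only on Python 2; on Python 3 a `map` object is not JSON serializable.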


  git pull ssh://gerrit.wikimedia.org:29418/search/MjoLniR refs/changes/43/349143/1

diff --git a/.gitignore b/.gitignore
index 4b7c536..1e56238 100644
--- a/.gitignore
+++ b/.gitignore
@@ -5,6 +5,7 @@
 
 # Distribution / packaging
 venv/
+build/
 *.egg-info/
 *.egg
 *.log
diff --git a/mjolnir/features.py b/mjolnir/features.py
new file mode 100644
index 000..b5504e7
--- /dev/null
+++ b/mjolnir/features.py
@@ -0,0 +1,326 @@
+"""
+Integration for collecting feature vectors from elasticsearch
+"""
+
+from collections import defaultdict
+import json
+import mjolnir.spark
+import pyspark.sql
+from pyspark.sql import functions as F
+import requests
+
+
+def _wrap_with_page_ids(hit_page_ids, should):
+    """Wrap an elasticsearch query with an ids filter.
+
+    Parameters
+    ----------
+    hit_page_ids : list of ints
+        Set of page ids to collect features for
+    should : dict or list of dict
+        Elasticsearch query for a single feature
+
+    Returns
+    -------
+    string
+        JSON encoded elasticsearch query
+    """
+    assert len(hit_page_ids) < 1
+    if type(should) is not list:
+        should = [should]
+    return json.dumps({
+        "_source": False,
+        "from": 0,
+        "size": ,
+        "query": {
+            "bool": {
+                "filter": {
+                    'ids': {
+                        'values': map(str, set(hit_page_ids)),
+                    }
+                },
+                "should": should,
+                "disable_coord": True,
+            }
+        }
+    })
+
+
+class ScriptFeature(object):
+    """
+    Query feature using elasticsearch script_score
+
+    ...
+
+    Methods
+    -------
+    make_query(query)
+        Build the elasticsearch query
+    """
+
+    def __init__(self, name, script, lang='expression'):
+        self.name = name
+        self.script = script
+        self.lang = lang
+
+    def make_query(self, query):
+        """Build the elasticsearch query
+
+        Parameters
+        ----------
+        query : string
+            User provided query term (unused)
+        """
+        return {
+            "function_score": {
+                "score_mode": "sum",
+                "boost_mode": "sum",
+                "functions": [
+                    {
+                        "script_score": {
+                            "script": {
+                                "inline": self.script,
+                                "lang": self.lang,
+                            }
+                        }
+                    }
+                ]
+            }
+        }
+
+
+class MultiMatchFeature(object):
+    """
+    Query feature using elasticsearch multi_match
+
+    ...
+
+    Methods
+    -------
+    make_query(query)
+        Build the elasticsearch query
+    """
+    def __init__(self, name, fields, minimum_should_match=1, match_type="most_fields"):
+        """
+
+        Parameters
+        ----------
+        name : string
+            Name of the feature
+        fields : list
+            Fields to perform multi_match against
+        minimum_should_match: int, optional
+            Minimum number of fields that should match. (Default: 1)
+        match_type : string, optional
+            Type of match to perform. (Default: most_fields)
+        """
+        self.name = name
+        assert len(fields) > 0
+        self.fields = fields
+        self.minimum_should_match = minimum_should_match
+        self.match_type = match_type
+
+    def make_query(self, query):
+        """Build the elasticsearch query
+
+        Parameters
+        ----------
+        query : string
+            User provided query term
+        """
+        return {
+            "multi_match": {
+                "query": query,
+                "minimum_should_match": self.minimum_should_match,
+                "type": self.match_type,
+                "fields": self