[MediaWiki-commits] [Gerrit] search/extra[master]: Add token_count_router

2017-01-03 Thread jenkins-bot (Code Review)
jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/330147 )

Change subject: Add token_count_router
..


Add token_count_router

A simple query wrapper that counts the number of tokens and decides
which sub-query to run by evaluating a set of conditions.

Bug: T152094
Change-Id: I582bf27e77f87f1e1d0f86d81371a46afb4ffcab
---
M README.md
A docs/token_count_router.md
M src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
A src/main/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryBuilder.java
A src/main/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryParser.java
A src/test/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryTest.java
6 files changed, 462 insertions(+), 1 deletion(-)

Approvals:
  EBernhardson: Looks good to me, approved
  jenkins-bot: Verified



diff --git a/README.md b/README.md
index cf444ea..792f53a 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 score functions, and anything else we think we end up creating to make search
 nice for Wikimedia. At this point it only contains:
 
-Filters:
+Queries:
 * [source_regex](docs/source_regex.md) - An nGram accelerated regular
 expression filter that is generally much much faster than sequentially checking
 all documents.
@@ -13,6 +13,8 @@
 independantly. For example, it can be used by multiple processes to reindex
 all documents without any interprocess communication. Added in 1.5.0, 1.4.1,
 and 1.3.0.
+* [token_count_router](docs/token_count_router.md) - Simple query wrapper that
+evaluates some conditions based on the number of tokens of the input query.
 
 Native Scripts:
 * [super_detect_noop](docs/super_detect_noop.md) - Like ```detect_noop``` but
diff --git a/docs/token_count_router.md b/docs/token_count_router.md
new file mode 100644
index 000..bc3a058
--- /dev/null
+++ b/docs/token_count_router.md
@@ -0,0 +1,54 @@
+token_count_router
+==
+
+The ```token_count_router``` is a simple query wrapper that counts the number
+of tokens in the provided text. It then evaluates a set of conditions to decide
+which subquery to run.
+It's useful in case the client would like to activate some proximity rescoring
+features based on the number of tokens and the analyzers available.
+
+Example
+---
+
+```
+GET /_search
+{
+    "token_count_router": {
+        "field": "text",
+        "text": "input query",
+        "conditions" : [
+            {
+                "gte": 2,
+                "query": {
+                    "match_phrase": {
+                        "text": "input query"
+                    }
+                }
+            }
+        ],
+        "fallback": {
+            "match_none": {}
+        }
+    }
+}
+```
+
+A phrase query will be executed if the number of tokens emitted by the
+search analyzer of the `text` field is greater than or equal to `2`.
+A `match_none` query is executed otherwise.
+This moves decision logic based on token count to the backend, so that
+clients can rely on query templates and on analyzer behaviors.
+
+Options
+---
+
+* `field` Use the search analyzer defined for this field.
+* `analyzer` Use this analyzer (`field` or `analyzer` must be defined).
+* `discount_overlaps` Set to true to ignore tokens emitted at the same position (defaults to `true`).
+* `conditions` Array of conditions (the first that matches wins):
+    * `predicate` : can be `eq`, `gt`, `gte`, `lt` or `lte`; the value is the number of tokens to evaluate.
+      `"lt": 10` is true when the number of tokens is lower than 10.
+    * `query` The query to apply if the condition is met.
+* `fallback` The query to apply if none of the conditions applies.
+
+Note that the query parser does not check that the conditions are coherent.
\ No newline at end of file
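
The routing rule described in docs/token_count_router.md above (the first condition that matches wins, otherwise the fallback runs) can be sketched in plain Java as follows. This is only an illustration under stated assumptions, not the code added by this change: the class `TokenCountRouterSketch`, its `Condition` type and the `route` method are hypothetical names, sub-queries are represented by plain strings, and the token count is passed in directly instead of being produced by an analyzer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the documented routing rule; not the plugin's classes.
class TokenCountRouterSketch {

    /** The five predicates listed in the Options section. */
    enum Predicate {
        EQ, GT, GTE, LT, LTE;

        /** Returns true when tokenCount satisfies this predicate against value. */
        boolean test(int tokenCount, int value) {
            switch (this) {
                case EQ:  return tokenCount == value;
                case GT:  return tokenCount > value;
                case GTE: return tokenCount >= value;
                case LT:  return tokenCount < value;
                case LTE: return tokenCount <= value;
                default:  throw new AssertionError(this);
            }
        }
    }

    /** One entry of the "conditions" array: a predicate, its value and a sub-query. */
    static final class Condition {
        final Predicate predicate;
        final int value;
        final String query; // stand-in for a real sub-query

        Condition(Predicate predicate, int value, String query) {
            this.predicate = predicate;
            this.value = value;
            this.query = query;
        }
    }

    /** Returns the query of the first matching condition, or the fallback. */
    static String route(int tokenCount, List<Condition> conditions, String fallback) {
        for (Condition c : conditions) {
            if (c.predicate.test(tokenCount, c.value)) {
                return c.query;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Mirrors the documented example: "gte": 2 selects the phrase query.
        List<Condition> conditions =
            Arrays.asList(new Condition(Predicate.GTE, 2, "match_phrase"));
        System.out.println(route(2, conditions, "match_none")); // prints match_phrase
        System.out.println(route(1, conditions, "match_none")); // prints match_none
    }
}
```

Because conditions are evaluated in order and are not checked for coherence, a condition placed after a broader one (for instance `gte: 5` after `gte: 2`) is never reached in this sketch.
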
diff --git a/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java b/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
index 8fa20a6..8f73054 100644
--- a/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
+++ b/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
@@ -22,6 +22,7 @@
 import org.wikimedia.search.extra.superdetectnoop.VersionedDocumentHandler;
 import org.wikimedia.search.extra.superdetectnoop.WithinAbsoluteHandler;
 import org.wikimedia.search.extra.superdetectnoop.WithinPercentageHandler;
+import org.wikimedia.search.extra.tokencount.TokenCountRouterQueryParser;
 
 /**
  * Setup the Elasticsearch plugin.
@@ -44,6 +45,7 @@
 module.registerQueryParser(SourceRegexQueryParser.class);
 module.registerQueryParser(IdHashModQueryParser.class);
 module.registerQueryParser(FuzzyLikeThisQueryParser.class);
+module.registerQueryParser(TokenCountRouterQueryParser.class);
 }
 
 /**
diff --git a/src/main/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryBuilder.java b/src/main/java/org/wikimedia/search/extra/tokencount

[MediaWiki-commits] [Gerrit] search/extra[master]: Add token_count_router

2017-01-02 Thread DCausse (Code Review)
DCausse has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/330147 )

Change subject: Add token_count_router
..

Add token_count_router

A simple query wrapper that counts the number of tokens and decides
which sub-query to run by evaluating a set of conditions.

Bug: T152094
Change-Id: I582bf27e77f87f1e1d0f86d81371a46afb4ffcab
---
M README.md
A docs/token_count_router.md
M src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
A src/main/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryBuilder.java
A src/main/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryParser.java
A src/test/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryTest.java
6 files changed, 463 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.wikimedia.org:29418/search/extra refs/changes/47/330147/1

diff --git a/README.md b/README.md
index cf444ea..792f53a 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 score functions, and anything else we think we end up creating to make search
 nice for Wikimedia. At this point it only contains:
 
-Filters:
+Queries:
 * [source_regex](docs/source_regex.md) - An nGram accelerated regular
 expression filter that is generally much much faster than sequentially checking
 all documents.
@@ -13,6 +13,8 @@
 independantly. For example, it can be used by multiple processes to reindex
 all documents without any interprocess communication. Added in 1.5.0, 1.4.1,
 and 1.3.0.
+* [token_count_router](docs/token_count_router.md) - Simple query wrapper that
+evaluates some conditions based on the number of tokens of the input query.
 
 Native Scripts:
 * [super_detect_noop](docs/super_detect_noop.md) - Like ```detect_noop``` but
diff --git a/docs/token_count_router.md b/docs/token_count_router.md
new file mode 100644
index 000..a4e9bf0
--- /dev/null
+++ b/docs/token_count_router.md
@@ -0,0 +1,55 @@
+token_count_router
+==
+
+The ```token_count_router``` is a simple query wrapper that counts the number
+of tokens in the provided text. It then evaluates a set of conditions to decide
+which subquery to run.
+It's useful in case the client would like to activate some proximity rescoring
+features based on the number of tokens and the analyzers available.
+
+Example
+---
+
+Analyze a field with trigrams like so:
+```
+GET /_search
+{
+    "token_count_router": {
+        "field": "text",
+        "text": "input query",
+        "conditions" : [
+            {
+                "gte": 2,
+                "query": {
+                    "match_phrase": {
+                        "text": "input query"
+                    }
+                }
+            }
+        ],
+        "fallback": {
+            "match_none": {}
+        }
+    }
+}
+```
+
+A phrase query will be executed if the number of tokens emitted by the
+search analyzer of the `text` field is greater than or equal to `2`.
+A `match_none` query is executed otherwise.
+This moves decision logic based on token count to the backend, so that
+clients can rely on query templates and on analyzer behaviors.
+
+Options
+---
+
+* `field` Use the search analyzer defined for this field.
+* `analyzer` Use this analyzer (`field` or `analyzer` must be defined).
+* `discount_overlaps` Set to true to ignore tokens emitted at the same position (defaults to `true`).
+* `conditions` Array of conditions (the first that matches wins):
+    * `predicate` : can be `eq`, `gt`, `gte`, `lt` or `lte`; the value is the number of tokens to evaluate.
+      `"lt": 10` is true when the number of tokens is lower than 10.
+    * `query` The query to apply if the condition is met.
+* `fallback` The query to apply if none of the conditions applies.
+
+Note that the query parser does not check that the conditions are coherent.
\ No newline at end of file
diff --git a/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java b/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
index 8fa20a6..8f73054 100644
--- a/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
+++ b/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
@@ -22,6 +22,7 @@
 import org.wikimedia.search.extra.superdetectnoop.VersionedDocumentHandler;
 import org.wikimedia.search.extra.superdetectnoop.WithinAbsoluteHandler;
 import org.wikimedia.search.extra.superdetectnoop.WithinPercentageHandler;
+import org.wikimedia.search.extra.tokencount.TokenCountRouterQueryParser;
 
 /**
  * Setup the Elasticsearch plugin.
@@ -44,6 +45,7 @@
 module.registerQueryParser(SourceRegexQueryParser.class);
 module.registerQueryParser(IdHashModQueryParser.class);
 module.registerQueryParser(FuzzyLikeThisQueryParser.class);
+module.registerQueryParser(TokenCountRouterQueryParser.class);
 }
 
 /**
diff --git a/src/main/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryBuilder.java b/src/main/java/org