[MediaWiki-commits] [Gerrit] search/extra[master]: Add token_count_router
jenkins-bot has submitted this change and it was merged. ( https://gerrit.wikimedia.org/r/330147 )

Change subject: Add token_count_router
..

Add token_count_router

A simple query wrapper that counts the number of tokens and decides which
sub-query to run by evaluating a set of conditions.

Bug: T152094
Change-Id: I582bf27e77f87f1e1d0f86d81371a46afb4ffcab
---
M README.md
A docs/token_count_router.md
M src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
A src/main/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryBuilder.java
A src/main/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryParser.java
A src/test/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryTest.java
6 files changed, 462 insertions(+), 1 deletion(-)

Approvals:
  EBernhardson: Looks good to me, approved
  jenkins-bot: Verified

diff --git a/README.md b/README.md
index cf444ea..792f53a 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 score functions, and anything else we think we end up creating to make search
 nice for Wikimedia. At this point it only contains:
 
-Filters:
+Queries:
 * [source_regex](docs/source_regex.md) - An nGram accelerated regular expression
 filter that is generally much much faster than sequentially checking all
 documents.
@@ -13,6 +13,8 @@
 independantly. For example, it can be used by multiple processes to reindex all
 documents without any interprocess communication. Added in 1.5.0, 1.4.1, and
 1.3.0.
+* [token_count_router](docs/token_count_router.md) - Simple query wrapper that
+evaluates some conditions based on the number of tokens of the input query.
 
 Native Scripts:
 * [super_detect_noop](docs/super_detect_noop.md) - Like ```detect_noop``` but
diff --git a/docs/token_count_router.md b/docs/token_count_router.md
new file mode 100644
index 000..bc3a058
--- /dev/null
+++ b/docs/token_count_router.md
@@ -0,0 +1,54 @@
+token_count_router
+==
+
+The ```token_count_router``` is a simple query wrapper that counts the number
+of tokens in the provided text. It then evaluates a set of conditions to decide
+which subquery to run.
+It's useful in case the client would like to activate some proximity rescoring
+features based on the number of tokens and the analyzers available.
+
+Example
+---
+
+```
+GET /_search
+{
+    "token_count_router": {
+        "field": "text",
+        "text": "input query",
+        "conditions" : [
+            {
+                "gte": 2,
+                "query": {
+                    "match_phrase": {
+                        "text": "input query",
+                    }
+                }
+            }
+        ],
+        "fallback": {
+            "match_none": {}
+        }
+    }
+}
+```
+
+A phrase query will be executed if the number of tokens emitted by the
+search analyzer of the `text` field is greater or equal to `2`.
+A `match_none` query is executed otherwise.
+This allows to move some decision logic based on token count to the
+backend allowing to use query templates and analyzer behaviors.
+
+Options
+---
+
+* `field` Use the search analyzer difined for this field.
+* `analyzer` Use this analyzer (`field` or `analyzer` must be defined)
+* `discount_overlaps` Set to true to ignore tokens emitted at the same position (defaults to `true`).
+* `conditions` Array of conditions (the first that matches wins):
+    * `predicate` : can be `eq`, `gt`, `gte`, `lt` or `lte`, the value is the number of tokens to evaluate.
+    `"lt": 10` is true when the number of tokens is lower than 10.
+    * `query` The query to apply if the condition is met.
+* `fallback` The query to apply if none of the conditions applies.
+
+Note that the query parser does not check the conditions coherence.
\ No newline at end of file
diff --git a/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java b/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
index 8fa20a6..8f73054 100644
--- a/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
+++ b/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
@@ -22,6 +22,7 @@
 import org.wikimedia.search.extra.superdetectnoop.VersionedDocumentHandler;
 import org.wikimedia.search.extra.superdetectnoop.WithinAbsoluteHandler;
 import org.wikimedia.search.extra.superdetectnoop.WithinPercentageHandler;
+import org.wikimedia.search.extra.tokencount.TokenCountRouterQueryParser;
 
 /**
  * Setup the Elasticsearch plugin.
@@ -44,6 +45,7 @@
         module.registerQueryParser(SourceRegexQueryParser.class);
         module.registerQueryParser(IdHashModQueryParser.class);
         module.registerQueryParser(FuzzyLikeThisQueryParser.class);
+        module.registerQueryParser(TokenCountRouterQueryParser.class);
     }
 
     /**
diff --git a/src/main/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryBuilder.java b/src/main/java/org/wikimedia/search/extra/tokencount
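
The routing rule described in docs/token_count_router.md (conditions are tried in order, the first that matches wins, otherwise the fallback applies) can be illustrated with a minimal stand-alone sketch. This is not the code from TokenCountRouterQueryBuilder.java, which is cut off above; the class `TokenCountRouterSketch`, the `Condition` type, and the use of plain strings in place of sub-queries are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

/** Hypothetical illustration of first-match-wins routing; not the plugin's classes. */
public class TokenCountRouterSketch {

    /** A condition pairs a test on the token count with the sub-query it selects. */
    static final class Condition {
        final IntPredicate predicate;
        final String query; // a string stands in for a real sub-query here

        Condition(IntPredicate predicate, String query) {
            this.predicate = predicate;
            this.query = query;
        }
    }

    /** Maps the documented predicate names (eq, gt, gte, lt, lte) to comparisons. */
    static IntPredicate predicate(String name, int value) {
        switch (name) {
            case "eq":  return count -> count == value;
            case "gt":  return count -> count > value;
            case "gte": return count -> count >= value;
            case "lt":  return count -> count < value;
            case "lte": return count -> count <= value;
            default: throw new IllegalArgumentException("Unknown predicate: " + name);
        }
    }

    /** Returns the query of the first matching condition, or the fallback if none match. */
    static String route(int tokenCount, List<Condition> conditions, String fallback) {
        for (Condition condition : conditions) {
            if (condition.predicate.test(tokenCount)) {
                return condition.query;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Mirrors the documented example: "gte": 2 selects the phrase query.
        List<Condition> conditions = new ArrayList<>();
        conditions.add(new Condition(predicate("gte", 2), "match_phrase"));

        System.out.println(route(2, conditions, "match_none")); // two tokens: phrase query
        System.out.println(route(1, conditions, "match_none")); // one token: fallback
    }
}
```

Because evaluation stops at the first match and nothing validates the set of conditions, an earlier condition can shadow a later one (for example `gte: 2` listed before `eq: 3`), which is consistent with the note that the query parser does not check the conditions' coherence.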
[MediaWiki-commits] [Gerrit] search/extra[master]: Add token_count_router
DCausse has uploaded a new change for review. ( https://gerrit.wikimedia.org/r/330147 )

Change subject: Add token_count_router
..

Add token_count_router

A simple query wrapper that counts the number of tokens and to decide which
sub-query to run by evaluating a set of conditions.

Bug: T152094
Change-Id: I582bf27e77f87f1e1d0f86d81371a46afb4ffcab
---
M README.md
A docs/token_count_router.md
M src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
A src/main/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryBuilder.java
A src/main/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryParser.java
A src/test/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryTest.java
6 files changed, 463 insertions(+), 1 deletion(-)

git pull ssh://gerrit.wikimedia.org:29418/search/extra refs/changes/47/330147/1

diff --git a/README.md b/README.md
index cf444ea..792f53a 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 score functions, and anything else we think we end up creating to make search
 nice for Wikimedia. At this point it only contains:
 
-Filters:
+Queries:
 * [source_regex](docs/source_regex.md) - An nGram accelerated regular expression
 filter that is generally much much faster than sequentially checking all
 documents.
@@ -13,6 +13,8 @@
 independantly. For example, it can be used by multiple processes to reindex all
 documents without any interprocess communication. Added in 1.5.0, 1.4.1, and
 1.3.0.
+* [token_count_router](docs/token_count_router.md) - Simple query wrapper that
+evaluates some conditions based on the number of tokens of the input query.
 
 Native Scripts:
 * [super_detect_noop](docs/super_detect_noop.md) - Like ```detect_noop``` but
diff --git a/docs/token_count_router.md b/docs/token_count_router.md
new file mode 100644
index 000..a4e9bf0
--- /dev/null
+++ b/docs/token_count_router.md
@@ -0,0 +1,55 @@
+token_count_router
+==
+
+The ```token_count_router``` is a simple query wrapper that counts the number
+of tokens in the provided text. It then evaluates a set of conditions to decide
+which subquery to run.
+It's useful in case the client would like to activate some proximity rescoring
+features based on the number of tokens and the analyzers available.
+
+Example
+---
+
+Analyze a field with trigrams like so:
+```
+GET /_search
+{
+    "token_count_router": {
+        "field": "text",
+        "text": "input query",
+        "conditions" : [
+            {
+                "gte": 2,
+                "query": {
+                    "match_phrase": {
+                        "text": "input query",
+                    }
+                }
+            }
+        ],
+        "fallback": {
+            "match_none": {}
+        }
+    }
+}
+```
+
+A phrase query will be executed if the umber of tokens emitted by the
+search analyzer of the `text` field is greater or equal to `2`.
+A ` match_none` query is executed otherwise.
+This allows to move some decision logic based on token count to the
+backend allowing to use query templates and analyzer behaviors.
+
+Options
+---
+
+* `field` Use the search analyzer difined for this field.
+* `analyzer` Use this analyzer (`field` or `analyzer` must be defined)
+* `discount_overlaps` Set to true to ignore tokens emitted at the same position (defaults to `true`).
+* `conditions` Array of conditions (the first that matches wins):
+    * `predicate` : can be `eq`, `gt`, `gte`, `lt` or `lte`, the value is the number of tokens to evaluate.
+    `"lt": 10` is true when the number of tokens is lower than 10.
+    * `query` The query to apply if the condition is met.
+* `fallback` The query to apply if none of the conditions applies.
+
+Note that the query parser does not check the conditions coherence.
\ No newline at end of file
diff --git a/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java b/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
index 8fa20a6..8f73054 100644
--- a/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
+++ b/src/main/java/org/wikimedia/search/extra/ExtraPlugin.java
@@ -22,6 +22,7 @@
 import org.wikimedia.search.extra.superdetectnoop.VersionedDocumentHandler;
 import org.wikimedia.search.extra.superdetectnoop.WithinAbsoluteHandler;
 import org.wikimedia.search.extra.superdetectnoop.WithinPercentageHandler;
+import org.wikimedia.search.extra.tokencount.TokenCountRouterQueryParser;
 
 /**
  * Setup the Elasticsearch plugin.
@@ -44,6 +45,7 @@
         module.registerQueryParser(SourceRegexQueryParser.class);
         module.registerQueryParser(IdHashModQueryParser.class);
         module.registerQueryParser(FuzzyLikeThisQueryParser.class);
+        module.registerQueryParser(TokenCountRouterQueryParser.class);
     }
 
     /**
diff --git a/src/main/java/org/wikimedia/search/extra/tokencount/TokenCountRouterQueryBuilder.java b/src/main/java/org
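
The `discount_overlaps` option is described as ignoring tokens emitted at the same position. As a rough sketch of what that could mean for the count, assume each token carries a position increment in the way Lucene's `PositionIncrementAttribute` reports it, where an increment of 0 places a token at the same position as the previous one (a synonym, for instance). The class `TokenCountSketch` and the `int[]` input are hypothetical stand-ins; the plugin's actual counting code is not visible in the truncated diff.

```java
/** Hypothetical illustration of discount_overlaps; not the plugin's code. */
public class TokenCountSketch {

    /**
     * Counts tokens given their position increments.
     * When discountOverlaps is true, tokens with increment 0
     * (stacked on the previous token's position) are not counted.
     */
    static int countTokens(int[] positionIncrements, boolean discountOverlaps) {
        int count = 0;
        for (int increment : positionIncrements) {
            if (discountOverlaps && increment == 0) {
                continue; // same position as the previous token: skip it
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed analysis of "input query" where "query" also emits one synonym
        // at the same position: increments 1, 1, 0.
        int[] increments = {1, 1, 0};
        System.out.println(countTokens(increments, true));  // overlaps ignored
        System.out.println(countTokens(increments, false)); // every token counted
    }
}
```

Under this assumption, an analyzer that stacks extra tokens on one position would inflate the count when `discount_overlaps` is `false`, so a condition such as `"gte": 2` could match single-word input.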