[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-05-23 Thread Erik Hatcher (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021381#comment-16021381
 ] 

Erik Hatcher commented on LUCENE-7638:
--

[~mbraun688] looks like from CHANGES and commits this was 6.5 so I marked it as 
so.  

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Fix For: 6.5
>
> Attachments: LUCENE-7638.patch, LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-05-23 Thread Michael Braun (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16021245#comment-16021245
 ] 

Michael Braun commented on LUCENE-7638:
---

This ticket is missing fix versions, but it looks like there were commits. 

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch, LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-02-20 Thread Matt Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875182#comment-15875182
 ] 

Matt Weber commented on LUCENE-7638:


[~jim.ferenczi] [~mikemccand] Could we apply the same articulation points logic 
in analyzeGraphPhrase but generate span queries to essentially act like a 
phrase? 

SpanNear[
  SpanOr[
SpanNear[SpanTerm[new], SpanTerm[york]]
SpanTerm[ny]
  ],
  SpanTerm[city]
]



> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch, LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-02-13 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15863338#comment-15863338
 ] 

Jim Ferenczi commented on LUCENE-7638:
--

Sure, thanks [~mikemccand] and [~mattweber] !

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch, LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-02-10 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15861913#comment-15861913
 ] 

Michael McCandless commented on LUCENE-7638:


[~jim.ferenczi] can this be closed now?

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch, LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-02-07 Thread Matt Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15856188#comment-15856188
 ] 

Matt Weber commented on LUCENE-7638:


[~jim.ferenczi] Sorry so late been swamped.  Anyways, this is great!  I really 
like this approach, awesome job man!

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch, LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-02-07 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855842#comment-15855842
 ] 

ASF subversion and git services commented on LUCENE-7638:
-

Commit 5386116e9a3b54db4674b21e39a41cc4d43553f8 in lucene-solr's branch 
refs/heads/branch_6x from [~jim.ferenczi]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=5386116 ]

LUCENE-7638: Query parsers now analyze the token graph for articulation points 
(or cut vertices) in order to create more efficient queries for multi-token 
synonyms.


> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch, LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-02-07 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15855707#comment-15855707
 ] 

ASF subversion and git services commented on LUCENE-7638:
-

Commit e764d3e0590e998676a18f2b11364cecced4a8ad in lucene-solr's branch 
refs/heads/master from [~jim.ferenczi]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e764d3e ]

LUCENE-7638: Query parsers now analyze the token graph for articulation points 
(or cut vertices) in order to create more efficient queries for multi-token 
synonyms.


> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch, LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-02-05 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15853329#comment-15853329
 ] 

Jim Ferenczi commented on LUCENE-7638:
--

Thanks [~mikemccand].
Just to be clear this patch creates a BooleanQuery with MUST clauses (or a 
PhraseQuery if autoGenerateMultiTermSynonymsPhraseQuery is set to true) for 
each synonym path. 
I'll commit shortly if there are no objections.

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch, LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-02-02 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15849996#comment-15849996
 ] 

Michael McCandless commented on LUCENE-7638:


+1, the new patch looks great!  Thanks [~jim.ferenczi].

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch, LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-01-17 Thread Matt Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826311#comment-15826311
 ] 

Matt Weber commented on LUCENE-7638:


Ok this is great.  So going forward we should assume that synonyms are to 
treated together (single token or multi-token) and ideally multi-token synonyms 
as a phrase.  Would it be best to move this logic into {{GraphQuery}} itself?  
This would make it so we can still detect when we are working with graph 
related queries and be easier to make the various optimizations talked about 
here.  Maybe make {{GraphQuery}} store the graph token stream instead of the 
processed queries and then do the graph processing / query generation when 
rewrite it called?

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-01-17 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826085#comment-15826085
 ] 

Jim Ferenczi commented on LUCENE-7638:
--

Thanks for taking a look @mattweber and [~mikemccand]

{quote}
Can this handle "side paths on side paths" graph structure (I think you called 
this "nested multi-term synonyms")? While no analysis chain can naturally 
produce this today (that I know of!), the TokenStream attributes can easily 
express it. And you could imagine it happening in the future, e.g. if you use 
Kuromoji tokenizer or WordDelimiterGraphFilter followed by a SynonymGraphFilter 
(except we'd need to e.g. use the synonym graph filter from LUCENE-5012, which 
can correctly consume a graph). If this is expected to work maybe we should add 
a test case showing that?
{quote}

I'll add a test case because it's expected to work ;) . This is also the reason 
why this patch does not produce {code}PhraseQuery{code} for synonyms.   For 
simple "side paths" this is easy to do but we would need to switch to Span 
queries for "side paths on side paths" so I though that it could be done in 
another issue. 

{quote}
It seems like you don't need to be using Term here, except at the end to pass 
to the newXXXQuery, since everything is in a single field here, and we are 
hoping to move away from Term entirely (LUCENE-7632)?
{quote}

Thanks, I'll simplify the patch.

{quote}
Holes are challenging for graph token streams ... can you add a test case that 
encounters holes, e.g. simulated StopFilter? There are at least two "fun" 
cases: a hole that cuts the graph entirely into two partitions, and a synonym 
spanning over a hole ... CannedTokenStream is useful for feeding such 
"interesting" cases.
{quote}

I though that holes would not be a problem for boolean queries but now I am not 
sure. I'll test that.

{quote}
The Path.id seems to be used only for tie-breaking on compare, not for lookup 
in the TreeSet as the comment suggests?
{quote}

The comment is misleading. It is needed because I use 
{code}TreeSet#remove{code} which uses compare to check object equality. So the 
{code}Path.id{code} is the unique identifier for the path.
 




> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-01-17 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15825894#comment-15825894
 ] 

Michael McCandless commented on LUCENE-7638:


This is an impressive patch!  Thanks [~jim.ferenczi].

I agree the combinatoric explosion is a serious problem/vulnerability/adversary 
and I love how this patch solves it.

Can this handle "side paths on side paths" graph structure (I think you called 
this "nested multi-term synonyms")?  While no analysis chain can naturally 
produce this today (that I know of!), the {{TokenStream}} attributes can easily 
express it.  And you could imagine it happening in the future, e.g. if you use 
Kuromoji tokenizer or {{WordDelimiterGraphFilter}} followed by a 
{{SynonymGraphFilter}} (except we'd need to e.g. use the synonym graph filter 
from LUCENE-5012, which can correctly consume a graph).  If this is expected to 
work maybe we should add a test case showing that?

It seems like you don't need to be using {{Term}} here, except at the end to 
pass to the {{newXXXQuery}}, since everything is in a single field here, and we 
are hoping to move away from {{Term}} entirely (LUCENE-7632)?

Holes are challenging for graph token streams ... can you add a test case that 
encounters holes, e.g. simulated {{StopFilter}}?  There are at least two "fun" 
cases: a hole that cuts the graph entirely into two partitions, and a synonym 
spanning over a hole ... {{CannedTokenStream}} is useful for feeding such 
"interesting" cases.

The {{Path.id}} seems to be used only for tie-breaking on compare, not for 
lookup in the {{TreeSet}} as the comment suggests?

On {{minShouldMatch}}: I feel it's actually more correct to define its 
semantics as operating on whole synonyms, not the individual tokens in each 
synonym, as this patch does.  Really, had you done the same synonym reduction 
at indexing time instead, so that "new york" in a document caused "ny" to be 
indexed as well, and then searched on "ny", this is the behavior you would see 
("new york" counts as the 1 {{minShouldMatch}}).

Of course, in general I think we are in new (to Lucene) territory here, on how 
graph structured queries should be matched/scored "properly", and I don't 
really know what the answer should be. {{TermAutomatonQuery}} [faces similar 
challenges|https://issues.apache.org/jira/browse/LUCENE-5815?focusedCommentId=14060665=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14060665].
  Maybe there are some IR papers out there that have studied this?

I even think it's odd that we don't use a {{PhraseQuery}} to match that 
syn-expanded "new york" even when the user didn't put double quotes around the 
query: had you done that syn at index time, it would "act like" a 
{{PhraseQuery}} ... but this is an unrelated issue!


> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: 

[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-01-16 Thread Matt Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824462#comment-15824462
 ] 

Matt Weber commented on LUCENE-7638:


[~jim.ferenczi] I have mixed feelings about that as I can see plus and minus of 
both.  When I was originally working on this I essentially decided that 
everything should be passed to each path as if it was the original query.  What 
do you think [~mikemccand]?  Also, there are additional use cases that we 
handle in elasticsearch that have not made their way into Lucene yet that might 
be affected by this.  Boolean with cutoff frequency, prefix queries, etc.  

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-01-16 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824358#comment-15824358
 ] 

Jim Ferenczi commented on LUCENE-7638:
--

[~mattweber] I don't think we lose minimum should match support. It will be 
different but interestingly it would also solve some problems. For instance 
with the all path solution, a synonym like "ny, new york" with a minimum should 
match of 1, searching for "ny" would not return documents matching "new york". 
With the proposed solution each multi-term synonyms is considered as a single 
clause so "ny" and "new york" count for 1.
I like the finite strings solution because expressing the minimum should match 
in percentage gives you correct hits. This is great though it requires to 
duplicate a lot of terms so I wonder if this is something that we should really 
target. By considering each multi-term synonyms as 1 clause we could simplify 
the problem and produce more optimized query ?

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-01-16 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824337#comment-15824337
 ] 

Jim Ferenczi commented on LUCENE-7638:
--

For pure phrase query it's a good fit because it's a proximity query but for 
boolean queries the problem is different. We cannot build the 
TermAutomatonQuery directly, first we need to find the start and end state of 
each multi-term synonyms in the graph. That's what the attached patch is doing 
lazily, for each intersection point it creates a multi-term synonym query. 
Currently the multi-term synonym query is a boolean query but we could change 
the logic and use the TermAutomatonQuery instead or even create a PhaseQuery 
for each path in the multi-term synonym. This patch also handles nested 
multi-term synonyms which makes the detection of intersection points harder. 
Bottom point is that if we are able to extract the multi-term synonyms of the 
graph then we can choose more easily how we want to search and score these 
inner graph. Does this makes sense ?

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-01-16 Thread Matt Weber (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824319#comment-15824319
 ] 

Matt Weber commented on LUCENE-7638:


I think the problem here is that we lose minimum should match support as that 
is applied AFTER query generation by building a new boolean query.  Same thing 
for phrase slop even though that would not be affected by this patch.  If we 
can move this logic into rewrite method of GraphQuery then we could take all 
that information into consideration to build a more efficient query.

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7638) Optimize graph query produced by QueryBuilder

2017-01-16 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15824229#comment-15824229
 ] 

Adrien Grand commented on LUCENE-7638:
--

Maybe {{TermAutomatonQuery}} would be a good fit for that problem?

> Optimize graph query produced by QueryBuilder
> -
>
> Key: LUCENE-7638
> URL: https://issues.apache.org/jira/browse/LUCENE-7638
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Jim Ferenczi
> Attachments: LUCENE-7638.patch
>
>
> The QueryBuilder creates a graph query when the underlying TokenStream 
> contains token with PositionLengthAttribute greater than 1.
> These TokenStreams are in fact graphs (lattice to be more precise) where 
> synonyms can span on multiple terms. 
> Currently the graph query is built by visiting all the path of the graph 
> TokenStream. For instance if you have a synonym like "ny, new york" and you 
> search for "new york city", the query builder would produce two pathes:
> "new york city", "ny city"
> This can quickly explode when the number of multi terms synonyms increase. 
> The query "ny ny" for instance would produce 4 pathes and so on.
> For boolean queries with should or must clauses it should be more efficient 
> to build a boolean query that merges all the intersections in the graph. So 
> instead of "new york city", "ny city" we could produce:
> "+((+new +york) ny) +city"
> The attached patch is a proposal to do that instead of the all path solution.
> The patch transforms multi terms synonyms in graph query for each 
> intersection in the graph. This is not done in this patch but we could also 
> create a specialized query that gives equivalent scores to multi terms 
> synonyms like the SynonymQuery does for single term synonyms.
> For phrase query this patch does not change the current behavior but we could 
> also use the new method to create optimized graph SpanQuery.
> [~mattweber] I think this patch could optimize a lot of cases where multiple 
> muli-terms synonyms are present in a single request. Could you take a look ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org