[jira] [Commented] (LUCENE-7619) Add WordDelimiterGraphFilter

2018-06-08 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506535#comment-16506535
 ] 

David Smiley commented on LUCENE-7619:
--

RE TokenStreamToAutomaton and finalOffsetGapAsHole, shouldn't this be ignored 
when preservePositionIncrements==false?  In other words, I think 
finalOffsetGapAsHole should only have an effect when preservePositionIncrements.

> Add WordDelimiterGraphFilter
> 
>
> Key: LUCENE-7619
> URL: https://issues.apache.org/jira/browse/LUCENE-7619
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
>Priority: Major
> Fix For: 6.5, 7.0
>
> Attachments: LUCENE-7619.patch, LUCENE-7619.patch, LUCENE-7619.patch, 
> after.png, before.png
>
>
> Currently, {{WordDelimiterFilter}} doesn't try to set the {{posLen}} 
> attribute and so it creates graphs like this:
> !before.png!
> but with this patch (still a work in progress) it creates this graph instead:
> !after.png!
> This means (today) positional queries when using WDF at search time are 
> buggy, but since we fixed LUCENE-7603, with this change here you should be 
> able to use positional queries with WDGF.
> I'm also trying to produce holes properly (removes logic from the current WDF 
> that swallows a hole when whole token is just delimiters).
> Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike 
> e.g. {{SynonymGraphFilter}}) because it's already creating the necessary new 
> positions, and its output graph never has side paths, except for single 
> tokens that skip nodes because they have {{posLen > 1}}.  I.e. the only fix 
> to make, I think, is to set {{posLen}} properly.  And it really helps that it 
> does its own "new token buffering + sorting" already.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7619) Add WordDelimiterGraphFilter

2017-03-02 Thread Jigar Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15892428#comment-15892428
 ] 

Jigar Shah commented on LUCENE-7619:


I will go for snapshot for now. Thanks for suggesting direction :)

Many Thanks!

> Add WordDelimiterGraphFilter
> 
>
> Key: LUCENE-7619
> URL: https://issues.apache.org/jira/browse/LUCENE-7619
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: master (7.0), 6.5
>
> Attachments: after.png, before.png, LUCENE-7619.patch, 
> LUCENE-7619.patch, LUCENE-7619.patch
>
>
> Currently, {{WordDelimiterFilter}} doesn't try to set the {{posLen}} 
> attribute and so it creates graphs like this:
> !before.png!
> but with this patch (still a work in progress) it creates this graph instead:
> !after.png!
> This means (today) positional queries when using WDF at search time are 
> buggy, but since we fixed LUCENE-7603, with this change here you should be 
> able to use positional queries with WDGF.
> I'm also trying to produce holes properly (removes logic from the current WDF 
> that swallows a hole when whole token is just delimiters).
> Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike 
> e.g. {{SynonymGraphFilter}}) because it's already creating the necessary new 
> positions, and its output graph never has side paths, except for single 
> tokens that skip nodes because they have {{posLen > 1}}.  I.e. the only fix 
> to make, I think, is to set {{posLen}} properly.  And it really helps that it 
> does its own "new token buffering + sorting" already.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7619) Add WordDelimiterGraphFilter

2017-03-01 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890134#comment-15890134
 ] 

Michael McCandless commented on LUCENE-7619:


Hi [~infodroids], besides WDGF itself, there have also been a number of changes 
to the query parser, to properly consume a graph and build the correct queries, 
e.g. LUCENE-7702, LUCENE-7699, LUCENE-7698, LUCENE-7638, etc.

It may be simpler for you to test a snapshot build of Lucene 6.5.0?

> Add WordDelimiterGraphFilter
> 
>
> Key: LUCENE-7619
> URL: https://issues.apache.org/jira/browse/LUCENE-7619
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: master (7.0), 6.5
>
> Attachments: after.png, before.png, LUCENE-7619.patch, 
> LUCENE-7619.patch, LUCENE-7619.patch
>
>
> Currently, {{WordDelimiterFilter}} doesn't try to set the {{posLen}} 
> attribute and so it creates graphs like this:
> !before.png!
> but with this patch (still a work in progress) it creates this graph instead:
> !after.png!
> This means (today) positional queries when using WDF at search time are 
> buggy, but since we fixed LUCENE-7603, with this change here you should be 
> able to use positional queries with WDGF.
> I'm also trying to produce holes properly (removes logic from the current WDF 
> that swallows a hole when whole token is just delimiters).
> Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike 
> e.g. {{SynonymGraphFilter}}) because it's already creating the necessary new 
> positions, and its output graph never has side paths, except for single 
> tokens that skip nodes because they have {{posLen > 1}}.  I.e. the only fix 
> to make, I think, is to set {{posLen}} properly.  And it really helps that it 
> does its own "new token buffering + sorting" already.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7619) Add WordDelimiterGraphFilter

2017-02-28 Thread Jigar Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15888747#comment-15888747
 ] 

Jigar Shah commented on LUCENE-7619:


Hello [~mikemccand]

+1 

Many thanks for fixing this!

I am using WordDelemeterFilter (which often breaks phrase queries on words with 
puntuations). I am currently using Lucene 6.4.1 in production. Can you please 
suggest which classes I should patch on Lucene 6.4.1 to use this feature. 
Patching just WordDelimiterGraphFilter and using it in token stream instead of 
WordDelemeterFilter be fine? or there are any other dependent classes which I 
have to patch (please provide list if there are other classes too) ? 

Once Lucene 6.5 is released i will upgrade to Lucene 6.5 so i will get better 
tested fix, but for now i would like to patch Lucene 6.4.1 if patch is 
compitible and simple.

> Add WordDelimiterGraphFilter
> 
>
> Key: LUCENE-7619
> URL: https://issues.apache.org/jira/browse/LUCENE-7619
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: master (7.0), 6.5
>
> Attachments: after.png, before.png, LUCENE-7619.patch, 
> LUCENE-7619.patch, LUCENE-7619.patch
>
>
> Currently, {{WordDelimiterFilter}} doesn't try to set the {{posLen}} 
> attribute and so it creates graphs like this:
> !before.png!
> but with this patch (still a work in progress) it creates this graph instead:
> !after.png!
> This means (today) positional queries when using WDF at search time are 
> buggy, but since we fixed LUCENE-7603, with this change here you should be 
> able to use positional queries with WDGF.
> I'm also trying to produce holes properly (removes logic from the current WDF 
> that swallows a hole when whole token is just delimiters).
> Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike 
> e.g. {{SynonymGraphFilter}}) because it's already creating the necessary new 
> positions, and its output graph never has side paths, except for single 
> tokens that skip nodes because they have {{posLen > 1}}.  I.e. the only fix 
> to make, I think, is to set {{posLen}} properly.  And it really helps that it 
> does its own "new token buffering + sorting" already.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7619) Add WordDelimiterGraphFilter

2017-01-17 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826995#comment-15826995
 ] 

ASF subversion and git services commented on LUCENE-7619:
-

Commit 03ffb1287d9908f8e1bb1417b7f18ca4645f209f in lucene-solr's branch 
refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=03ffb12 ]

LUCENE-7619: don't let offsets go backwards


> Add WordDelimiterGraphFilter
> 
>
> Key: LUCENE-7619
> URL: https://issues.apache.org/jira/browse/LUCENE-7619
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: master (7.0), 6.5
>
> Attachments: after.png, before.png, LUCENE-7619.patch, 
> LUCENE-7619.patch, LUCENE-7619.patch
>
>
> Currently, {{WordDelimiterFilter}} doesn't try to set the {{posLen}} 
> attribute and so it creates graphs like this:
> !before.png!
> but with this patch (still a work in progress) it creates this graph instead:
> !after.png!
> This means (today) positional queries when using WDF at search time are 
> buggy, but since we fixed LUCENE-7603, with this change here you should be 
> able to use positional queries with WDGF.
> I'm also trying to produce holes properly (removes logic from the current WDF 
> that swallows a hole when whole token is just delimiters).
> Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike 
> e.g. {{SynonymGraphFilter}}) because it's already creating the necessary new 
> positions, and its output graph never has side paths, except for single 
> tokens that skip nodes because they have {{posLen > 1}}.  I.e. the only fix 
> to make, I think, is to set {{posLen}} properly.  And it really helps that it 
> does its own "new token buffering + sorting" already.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7619) Add WordDelimiterGraphFilter

2017-01-17 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826992#comment-15826992
 ] 

ASF subversion and git services commented on LUCENE-7619:
-

Commit 0bdcfc291fceab26e1c62a7e9791ce417671eacd in lucene-solr's branch 
refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=0bdcfc2 ]

LUCENE-7619: don't let offsets go backwards


> Add WordDelimiterGraphFilter
> 
>
> Key: LUCENE-7619
> URL: https://issues.apache.org/jira/browse/LUCENE-7619
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: master (7.0), 6.5
>
> Attachments: after.png, before.png, LUCENE-7619.patch, 
> LUCENE-7619.patch, LUCENE-7619.patch
>
>
> Currently, {{WordDelimiterFilter}} doesn't try to set the {{posLen}} 
> attribute and so it creates graphs like this:
> !before.png!
> but with this patch (still a work in progress) it creates this graph instead:
> !after.png!
> This means (today) positional queries when using WDF at search time are 
> buggy, but since we fixed LUCENE-7603, with this change here you should be 
> able to use positional queries with WDGF.
> I'm also trying to produce holes properly (removes logic from the current WDF 
> that swallows a hole when whole token is just delimiters).
> Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike 
> e.g. {{SynonymGraphFilter}}) because it's already creating the necessary new 
> positions, and its output graph never has side paths, except for single 
> tokens that skip nodes because they have {{posLen > 1}}.  I.e. the only fix 
> to make, I think, is to set {{posLen}} properly.  And it really helps that it 
> does its own "new token buffering + sorting" already.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7619) Add WordDelimiterGraphFilter

2017-01-17 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826278#comment-15826278
 ] 

ASF subversion and git services commented on LUCENE-7619:
-

Commit 4e4ec082caa7c1fc8fca24ddfb6a8633a4ae9506 in lucene-solr's branch 
refs/heads/branch_6x from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=4e4ec08 ]

LUCENE-7619: add WordDelimiterGraphFilter (replacing WordDelimiterFilter) to 
produce a correct token stream graph when splitting words


> Add WordDelimiterGraphFilter
> 
>
> Key: LUCENE-7619
> URL: https://issues.apache.org/jira/browse/LUCENE-7619
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: master (7.0), 6.5
>
> Attachments: after.png, before.png, LUCENE-7619.patch, 
> LUCENE-7619.patch, LUCENE-7619.patch
>
>
> Currently, {{WordDelimiterFilter}} doesn't try to set the {{posLen}} 
> attribute and so it creates graphs like this:
> !before.png!
> but with this patch (still a work in progress) it creates this graph instead:
> !after.png!
> This means (today) positional queries when using WDF at search time are 
> buggy, but since we fixed LUCENE-7603, with this change here you should be 
> able to use positional queries with WDGF.
> I'm also trying to produce holes properly (removes logic from the current WDF 
> that swallows a hole when whole token is just delimiters).
> Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike 
> e.g. {{SynonymGraphFilter}}) because it's already creating the necessary new 
> positions, and its output graph never has side paths, except for single 
> tokens that skip nodes because they have {{posLen > 1}}.  I.e. the only fix 
> to make, I think, is to set {{posLen}} properly.  And it really helps that it 
> does its own "new token buffering + sorting" already.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7619) Add WordDelimiterGraphFilter

2017-01-17 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15826260#comment-15826260
 ] 

ASF subversion and git services commented on LUCENE-7619:
-

Commit 637915b890d9f0e5cfaa6887609f221029327a25 in lucene-solr's branch 
refs/heads/master from Mike McCandless
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=637915b ]

LUCENE-7619: add WordDelimiterGraphFilter (replacing WordDelimiterFilter) to 
produce a correct token stream graph when splitting words


> Add WordDelimiterGraphFilter
> 
>
> Key: LUCENE-7619
> URL: https://issues.apache.org/jira/browse/LUCENE-7619
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: master (7.0), 6.5
>
> Attachments: after.png, before.png, LUCENE-7619.patch, 
> LUCENE-7619.patch, LUCENE-7619.patch
>
>
> Currently, {{WordDelimiterFilter}} doesn't try to set the {{posLen}} 
> attribute and so it creates graphs like this:
> !before.png!
> but with this patch (still a work in progress) it creates this graph instead:
> !after.png!
> This means (today) positional queries when using WDF at search time are 
> buggy, but since we fixed LUCENE-7603, with this change here you should be 
> able to use positional queries with WDGF.
> I'm also trying to produce holes properly (removes logic from the current WDF 
> that swallows a hole when whole token is just delimiters).
> Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike 
> e.g. {{SynonymGraphFilter}}) because it's already creating the necessary new 
> positions, and its output graph never has side paths, except for single 
> tokens that skip nodes because they have {{posLen > 1}}.  I.e. the only fix 
> to make, I think, is to set {{posLen}} properly.  And it really helps that it 
> does its own "new token buffering + sorting" already.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7619) Add WordDelimiterGraphFilter

2017-01-08 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15810645#comment-15810645
 ] 

David Smiley commented on LUCENE-7619:
--

Very cool!

> Add WordDelimiterGraphFilter
> 
>
> Key: LUCENE-7619
> URL: https://issues.apache.org/jira/browse/LUCENE-7619
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: master (7.0), 6.5
>
> Attachments: LUCENE-7619.patch, after.png, before.png
>
>
> Currently, {{WordDelimiterFilter}} doesn't try to set the {{posLen}} 
> attribute and so it creates graphs like this:
> !before.png!
> but with this patch (still a work in progress) it creates this graph instead:
> !after.png!
> This means (today) positional queries when using WDF at search time are 
> buggy, but since we fixed LUCENE-7603, with this change here you should be 
> able to use positional queries with WDGF.
> I'm also trying to produce holes properly (removes logic from the current WDF 
> that swallows a hole when whole token is just delimiters).
> Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike 
> e.g. {{SynonymGraphFilter}}) because it's already creating the necessary new 
> positions, and its output graph never has side paths, except for single 
> tokens that skip nodes because they have {{posLen > 1}}.  I.e. the only fix 
> to make, I think, is to set {{posLen}} properly.  And it really helps that it 
> does its own "new token buffering + sorting" already.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-7619) Add WordDelimiterGraphFilter

2017-01-04 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15798952#comment-15798952
 ] 

Adrien Grand commented on LUCENE-7619:
--

Wow, I did not think WDF would ever be fixed to produce correct positions, this 
is very exciting!

> Add WordDelimiterGraphFilter
> 
>
> Key: LUCENE-7619
> URL: https://issues.apache.org/jira/browse/LUCENE-7619
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Michael McCandless
>Assignee: Michael McCandless
> Fix For: master (7.0), 6.5
>
> Attachments: LUCENE-7619.patch, after.png, before.png
>
>
> Currently, {{WordDelimiterFilter}} doesn't try to set the {{posLen}} 
> attribute and so it creates graphs like this:
> !before.png!
> but with this patch (still a work in progress) it creates this graph instead:
> !after.png!
> This means (today) positional queries when using WDF at search time are 
> buggy, but since we fixed LUCENE-7603, with this change here you should be 
> able to use positional queries with WDGF.
> I'm also trying to produce holes properly (removes logic from the current WDF 
> that swallows a hole when whole token is just delimiters).
> Surprisingly, it's actually quite easy to tweak WDF to create a graph (unlike 
> e.g. {{SynonymGraphFilter}}) because it's already creating the necessary new 
> positions, and its output graph never has side paths, except for single 
> tokens that skip nodes because they have {{posLen > 1}}.  I.e. the only fix 
> to make, I think, is to set {{posLen}} properly.  And it really helps that it 
> does its own "new token buffering + sorting" already.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org