[jira] [Commented] (LUCENE-4170) TestRandomChains fail with Shingle+CommonGrams

2012-06-27 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402256#comment-13402256
 ] 

Robert Muir commented on LUCENE-4170:
-

I think ShingleFilter has a similar bug: it doesn't look at the existing 
posLength of the input tokens at all; instead it just fills posLength with 
builtGramSize.
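
To make the difference concrete, here is a toy Python model (not Lucene code; the helper names and the `wi_fi`/`network` tokens are made up for illustration). It assumes the tokens in a shingle form a chain in which each token starts where the previous one ends, so one plausible posLength for the shingle is the sum of the inputs' posLengths rather than the gram count:

```python
# Toy model, not the Lucene API: each token is a (term, pos_len) pair.

def poslen_from_gram_size(chain):
    # Behaviour described in the comment above: posLength = number of tokens in the gram.
    return len(chain)

def poslen_from_span(chain):
    # Assumed alternative: posLength = number of positions the chain actually spans.
    return sum(pos_len for _term, pos_len in chain)

# Hypothetical input in which the first token already spans two positions.
chain = [("wi_fi", 2), ("network", 1)]
by_size = poslen_from_gram_size(chain)
by_span = poslen_from_span(chain)
print(by_size, by_span)

# With ordinary (posLength = 1) input the two calculations agree.
linear = [("a", 1), ("b", 1), ("c", 1)]
print(poslen_from_gram_size(linear), poslen_from_span(linear))
```

The two only diverge once some input posLength exceeds 1, which would explain why the problem shows up only when the input is already a graph.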

 TestRandomChains fail with Shingle+CommonGrams
 --

 Key: LUCENE-4170
 URL: https://issues.apache.org/jira/browse/LUCENE-4170
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/analysis
Reporter: Robert Muir
 Attachments: LUCENE-4170.patch


 ant test  -Dtestcase=TestRandomChains -Dtests.method=testRandomChains 
 -Dtests.seed=12635ABB4F789F2A -Dtests.multiplier=3 -Dtests.locale=pt 
 -Dtests.timezone=America/Argentina/Salta -Dargs=-Dfile.encoding=ISO8859-1
 This test has two ShingleFilters followed by a CommonGramsFilter. I think the posLen 
 impls in CommonGrams and/or Shingle have a bug if the input is already a graph.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-4170) TestRandomChains fail with Shingle+CommonGrams

2012-06-27 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402307#comment-13402307
 ] 

Steven Rowe commented on LUCENE-4170:
-

bq. I think shingles has a similar bug: it doesn't look at the existing 
posLength of the input tokens at all, instead it just fills posLength with the 
builtGramSize.

I agree.

However, the problem isn't just position length: ShingleFilter has never 
handled input position increments of zero, so real graph compatibility will 
mean fixing that too.

I think Karl Wettin's ShingleMatrixFilter (deprecated in 3.6, dropped in 4.0) 
was an attempt to permute all combinations of overlapping (poslength=1) terms to 
produce shingles.  ShingleMatrixFilter wouldn't handle poslength > 1, though.

I'm not even sure what token ngramming should mean over an input graph.  The 
trivial case where input tokens' poslength is always one and position 
increment is always one is obviously already handled.

I think both issues should be handled, since poslength > 1 will very likely be 
used with posincr = 0, e.g. synonyms and kuromoji de-compounding.
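
For illustration, a small Python sketch (a toy model, not Lucene code) of how a stream of posincr/poslength values maps onto arcs in a graph. The token values are a made-up decompounding-style example and `to_arcs` is a hypothetical helper; the one convention borrowed from Lucene is that the position counter starts at -1 before the first increment is applied:

```python
def to_arcs(stream):
    """Convert (term, pos_inc, pos_len) triples into (term, start, end) arcs."""
    pos = -1  # position before the first token's increment is applied
    arcs = []
    for term, pos_inc, pos_len in stream:
        pos += pos_inc  # pos_inc == 0 stacks the token on the previous position
        arcs.append((term, pos, pos + pos_len))
    return arcs

# Hypothetical input: "wifi" spans two positions and is stacked with "wi" (pos_inc = 0).
stream = [("wifi", 1, 2), ("wi", 0, 1), ("fi", 1, 1), ("network", 1, 1)]
arcs = to_arcs(stream)
for term, start, end in arcs:
    print(f"{term}: {start} -> {end}")
```

Here `wifi` and `wi` leave the same position but arrive at different ones, so a filter that only counts tokens, without tracking where each one ends, cannot tell which tokens are actually adjacent.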





[jira] [Commented] (LUCENE-4170) TestRandomChains fail with Shingle+CommonGrams

2012-06-27 Thread Steven Rowe (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-4170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402314#comment-13402314
 ] 

Steven Rowe commented on LUCENE-4170:
-

bq. I'm not even sure what token ngramming should mean over an input graph.

A thought problem: run ShingleFilter with mingramsize=2, maxgramsize=3, 
outputUnigrams=true over input {{\[a/1] \[b/1] \[c/1] \[d/1]}} (where {{/n}} 
indicates poslength = {{n}}, and {{\[a b]}} indicates tokens {{a}} and {{b}} 
are at the same position; I'll omit the {{\[]}}'s below when only one token is 
at a given position), then run ShingleFilter again with the same config over 
the first ShingleFilter's output:

{noformat}
shinglefilter(min:2,max:3,unigrams:true) with input:  a/1  b/1  c/1  d/1 

_ token sep: [a/1  a_b/2  a_b_c/3]  [b/1  b_c/2  b_c_d/3]  [c/1  c_d/2]  d/1

shinglefilter(2,3,unigrams) with shinglefilter output above as input:

= token sep: [a/1  a_b/2  a_b_c/3  a=b/2  a=b_c/3  a=b_c_d/4  a=b=c/3  
a=b=c_d/4  a=b_c=d/4  a_b=c/3  a_b=c_d/4  a_b=c=d/4  a_b_c=d/4]  
   [b/1  b_c/2  b_c_d/3  b=c/2  b=c_d/3  b_c=d/3]
   [c/1  c_d/2  c=d/2]
   d/1
{noformat}
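
One way to read the example above is as path enumeration: a shingle is any chain of min..max tokens in which each token starts at the position where the previous one ends, and its poslength is the span of the whole chain. Below is a toy Python model of that reading. It is an assumption made for illustration, not how ShingleFilter is implemented, and `Token`, `shingle_graph` and `render` are hypothetical names:

```python
from collections import namedtuple

# pos is the absolute start position, pos_len the number of positions spanned.
Token = namedtuple("Token", "term pos pos_len")

def shingle_graph(tokens, min_size=2, max_size=3, unigrams=True, sep="_"):
    """Enumerate shingles as chains (paths) through a token graph."""
    by_pos = {}
    for t in tokens:
        by_pos.setdefault(t.pos, []).append(t)

    out = []

    def extend(path):
        n = len(path)
        end = path[-1].pos + path[-1].pos_len
        if n >= min_size or (n == 1 and unigrams):
            start = path[0].pos
            out.append(Token(sep.join(t.term for t in path), start, end - start))
        if n < max_size:
            # Only tokens starting exactly where the chain ends may follow.
            for nxt in by_pos.get(end, []):
                extend(path + [nxt])

    for pos in sorted(by_pos):
        for t in by_pos[pos]:
            extend([t])
    return out

def render(tokens):
    """Group tokens by start position as 'term/poslength' strings."""
    grouped = {}
    for t in tokens:
        grouped.setdefault(t.pos, []).append(f"{t.term}/{t.pos_len}")
    return grouped

linear = [Token(c, i, 1) for i, c in enumerate("abcd")]
first = shingle_graph(linear, sep="_")
second = shingle_graph(first, sep="=")
for pos, terms in sorted(render(second).items()):
    print(pos, " ".join(terms))
```

Under this model the first pass matches the first listing, and the second pass yields the same 13 tokens at the first position (in a different order). At the second position it additionally yields b=c=d/3, which does not appear in the listing above.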

