[jira] [Commented] (LUCENE-7508) [smartcn] tokens are not correctly created if text length > 1024

2016-12-15 Thread Chang KaiShin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15753733#comment-15753733
 ] 

Chang KaiShin commented on LUCENE-7508:
---

After looking into the internal handling of the input text, I found that the source 
already does what I was trying to do: it takes the input text as a stream and 
uses BUFFERMAX (length 1024) to loop through the entire text. The isSafeEnd 
method you mentioned previously is the key method for deciding token 
boundaries. Currently Lucene does not recognize any Chinese sentence breakers, 
so it truncates possible tokens.
After overriding isSafeEnd, the tokenizer finds the last possible break position 
and carries the remaining text after that position over to the next loop, so 
sentences remain intact and are processed correctly.
To Michael McCandless: additional sentence-ending characters are necessary. I 
have also listed some Chinese breakers, such as
';'
'。'
','
'、'
'~'
'('
')'
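For reference, the break-character test described above can be sketched as a standalone predicate (no Lucene dependency). The exact character set is an illustration combining the line-break code points that SegmentingTokenizerBase already treats as safe with the Chinese breakers listed; it is not Lucene's actual default.

```java
// Standalone sketch of an isSafeEnd-style predicate extended with Chinese
// sentence breakers. The character set is illustrative, not Lucene's default.
public class SafeEndSketch {
    public static boolean isSafeEnd(char ch) {
        switch (ch) {
            case 0x000D:   // CR
            case 0x000A:   // LF
            case 0x0085:   // NEL
            case 0x2028:   // line separator
            case 0x2029:   // paragraph separator
            case '。':     // Chinese full stop
            case ',':     // Chinese comma
            case ';':     // Chinese semicolon
            case '、':     // Chinese enumeration comma
            case '~':     // fullwidth tilde
            case '(':     // fullwidth parentheses
            case ')':
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isSafeEnd('。'));  // a safe break
        System.out.println(isSafeEnd('事'));  // an ordinary character, not a break
    }
}
```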

> [smartcn] tokens are not correctly created if text length > 1024
> 
>
> Key: LUCENE-7508
> URL: https://issues.apache.org/jira/browse/LUCENE-7508
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 6.2.1
> Environment: Mac OS X 10.10
>Reporter: peina
>  Labels: chinese, tokenization
> Attachments: lucene-7508-test.patch, lucene-7508.patch
>
>
> If the text length is > 1024, HMMChineseTokenizer fails to split sentences 
> correctly.
> Test Sample:
> public static void main(String[] args) throws IOException{
> Analyzer analyzer = new SmartChineseAnalyzer(); /* will load stopwords */
> //String sentence = 
> "“七八个物管工作人员对我一个文弱书生拳打脚踢,我极力躲避时还被追打。”前天,微信网友爆料称,一名50多岁的江西教师在昆明被物管群殴,手指骨折,向网友求助。教师为何会被物管殴打?事情的真相又是如何?昨天,记者来到圣世一品小区,通过调查了解,事情的起因源于这名教师在小区里帮女儿散发汗蒸馆广告单,被物管保安发现后,引发冲突。对于群殴教师的说法,该小区物管保安队长称:“保安在追的过程中,确实有拉扯,但并没有殴打教师,至于手指骨折是他自己摔伤的。”爆料江西教师在昆明被物管殴打记者注意到,消息于8月27日发出,爆料者称,自己是江西宜丰崇文中学的一名中年教师黄敏。暑假期间来昆明的女儿家度假。他女儿在昆明与人合伙开了一家汗蒸馆,7月30日开业。8月9日下午6点30分许,他到昆明东二环圣世一品小区为女儿的汗蒸馆散发宣传小广告。小区物管前来制止,他就停止发放行为。黄敏称,小区物管保安人员要求他收回散发出去的广告单,他就去收了。物管要求他到办公室里去接受处理,他也配合了。让他没有想到的是,在处理的过程中,七八个年轻的物管人员突然对他拳打脚踢,他极力躲避时还被追着打,而且这一切,是在小区物管领导的注视下发生的。黄敏说,被打后,他立即报了警。除身上多处软组织挫伤外,伤得最严重的是右手大拇指粉碎性骨折,一掌骨骨折。他到云南省第三人民医院住了7天院,医生说无法手术,只能用夹板固定,也不吃药,待其自然修复,至少要3个月以上,右手大拇指还有可能伤残。为证明自己的说法,黄敏还拿出了官渡区公安分局菊花派出所出具的伤情鉴定委托书。他的伤情被鉴定为轻伤二级。说法帮女儿发宣传小广告教师在小区里被殴打昨日,记者者拨通了黄敏的电话。他说,当时他看见该小区的大门没有关,也没有保安值班。于是,他就进到了小区里帮女儿的汗蒸馆发广告单。在楼栋值班的保安没有阻止的前提下,他乘电梯来到了楼上,为了不影响住户,他将名片放在了房门的把手上。被保安发现时,他才发了四五十张。保安问他干什么?他回答,家里开了汗蒸馆,来宣传一下。两名保安叫他不要发了,并要求他到物管办公室等待领导处理。交谈中,由于对方一直在说方言,黄敏只能听清楚的一句话是,物管叫他去收回小广告。他当即同意了,准备去收。这时,小区的七八名工作人员就殴打了他,其中有穿保安服装的,也有身着便衣的。让他气愤的是,他试图逃跑躲起来,依然被追着殴打。黄敏说,女儿将他被打又维权无门的遭遇发到了微信上,希望找到相关视频和照片,还原事件真相。。";
> String sentence = 
> "“七八个物管工作人员对我一个文弱书生拳打脚踢,我极力躲避时还被追打。”前天,微信网友爆料称,一名50多岁的江西教师在昆明被物管群殴,手指骨折,向网友求助。教师为何会被物管殴打?事情的真相又是如何?昨天,记者来到圣世一品小区,通过调查了解,事情的起因源于这名教师在小区里帮女儿散发汗蒸馆广告单,被物管保安发现后,引发冲突。对于群殴教师的说法,该小区物管保安队长称:“保安在追的过程中,确实有拉扯,但并没有殴打教师,至于手指骨折是他自己摔伤的。”爆料江西教师在昆明被物管殴打记者注意到,消息于8月27日发出,爆料者称,自己是江西宜丰崇文中学的一名中年教师黄敏。暑假期间来昆明的女儿家度假。他女儿在昆明与人合伙开了一家汗蒸馆,7月30日开业。8月9日下午6点30分许,他到昆明东二环圣世一品小区为女儿的汗蒸馆散发宣传小广告。小区物管前来制止,他就停止发放行为。黄敏称,小区物管保安人员要求他收回散发出去的广告单,他就去收了。物管要求他到办公室里去接受处理,他也配合了。让他没有想到的是,在处理的过程中,七八个年轻的物管人员突然对他拳打脚踢,他极力躲避时还被追着打,而且这一切,是在小区物管领导的注视下发生的。黄敏说,被打后,他立即报了警。除身上多处软组织挫伤外,伤得最严重的是右手大拇指粉碎性骨折,一掌骨骨折。他到云南省第三人民医院住了7天院,医生说无法手术,只能用夹板固定,也不吃药,待其自然修复,至少要3个月以上,右手大拇指还有可能伤残。为证明自己的说法,黄敏还拿出了官渡区公安分局菊花派出所出具的伤情鉴定委托书。他的伤情被鉴定为轻伤二级。说法帮女儿发宣传小广告教师在小区里被殴打昨日,记者者拨通了黄敏的电话。他说,当时他看见该小区的大门没有关,也没有保安值班。于是,他就进到了小区里帮女儿的汗蒸馆发广告单。在楼栋值班的保安没有阻止的前提下,他乘电梯来到了楼上,为了不影响住户,他将名片放在了房门的把手上。被保安发现时,他才发了四五十张。保安问他干什么?他回答,家里开了汗蒸馆,来宣传一下。两名保安叫他不要发了,并要求他到物管办公室等待领导处理。交谈中,由于对方一直在说方言,黄敏只能听清楚的一句话是,物管叫他去收回小广告。他当即同意了,准备去收。这时,小区的七八名工作人员就殴打了他,其中有穿保安服装的,也有身着便衣的。让他气愤的是,他试图逃跑躲起来,依然被追着殴打。黄敏说,女儿将他被打又维权无门的遭遇发到了微信上,希望找到相关视频和照片,还原事件真相";
> System.out.println(sentence.length());
>// String sentence = "女儿将他被打又维权无门的遭遇发到了微信上,希望找到相关视频和照片,还原事件真相。";
> TokenStream tokens = analyzer.tokenStream("dummyfield", sentence);
> tokens.reset();
> CharTermAttribute termAttr = (CharTermAttribute) 
> tokens.getAttribute(CharTermAttribute.class);
> while (tokens.incrementToken()) {
>  // System.out.println(termAttr.toString());
> }
> 
> analyzer.close();
>   }
> The text length in the above sample is 1027. With this sample, the detected 
> sentences are:
> .
> Sentence:黄敏说,女儿将他被打又维权无门的遭遇发到了微信上,希望找到相关视频和照片,还原事
> Sentence:件真相
> The last 3 characters are detected as an individual sentence, so 还原事件真相 is 
> tokenized as 还原|事|件|真相, when the correct tokens should be 还原|事件|真相.
> Overriding the isSafeEnd method in HMMChineseTokenizer fixes this issue by 
> considering ',' or '。' as a safe end of text:
> public class HMMChineseTokenizer extends SegmentingTokenizerBase {
> 
>  /** For sentence tokenization, these are the unambiguous break positions. */
>   protected boolean isSafeEnd(char ch) {
> switch(ch) {
>   case 0x000D:
>   case 0x000A:
>   case 0x0085:
>   case 0x2028:
>   case 0x2029:
>+   case '。':
>+   case ',':
> return true;
>   default:
> return false;
> }
>   }

[jira] [Commented] (LUCENE-7508) [smartcn] tokens are not correctly created if text length > 1024

2016-12-12 Thread Chang KaiShin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15744174#comment-15744174
 ] 

Chang KaiShin commented on LUCENE-7508:
---

Theoretically the text size could be infinite. The ideal approach is to take the 
input text as a stream instead of a string. I'll try it.
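The streaming idea can be sketched as follows: a fixed buffer is refilled in a loop, each chunk is cut at the last safe break, and the tail after the cut is carried into the next fill so no sentence is split mid-stream. BUFFER_MAX and the break set here are assumptions for illustration, not the actual SegmentingTokenizerBase internals.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustration of processing unbounded input through a fixed buffer: each
// chunk is cut at the last safe break, and the remainder after the cut is
// carried into the next fill so no sentence is split mid-stream.
public class ChunkedReaderSketch {
    static final int BUFFER_MAX = 1024; // assumed buffer size, for illustration

    static boolean isSafeEnd(char ch) {
        return ch == '。' || ch == ',' || ch == '\n';
    }

    static List<String> splitSafely(Reader reader) {
        List<String> pieces = new ArrayList<>();
        char[] buf = new char[BUFFER_MAX];
        int carried = 0; // characters kept from the previous chunk
        try {
            while (true) {
                int read = reader.read(buf, carried, BUFFER_MAX - carried);
                int len = carried + Math.max(read, 0);
                if (len == 0) break; // end of input, nothing carried over
                int cut = len;       // at EOF, flush everything that is left
                if (read != -1) {
                    // cut at the last safe break; if the chunk contains none,
                    // fall back to a hard cut (the failure mode described above)
                    for (int i = len - 1; i >= 0; i--) {
                        if (isSafeEnd(buf[i])) { cut = i + 1; break; }
                    }
                }
                pieces.add(new String(buf, 0, cut));
                carried = len - cut;
                System.arraycopy(buf, cut, buf, 0, carried);
                if (read == -1) break;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pieces;
    }

    public static void main(String[] args) {
        List<String> pieces = splitSafely(new StringReader("一二。三四,五六"));
        System.out.println(pieces); // each chunk ends at a safe break; tail flushed at EOF
    }
}
```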




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (LUCENE-7508) [smartcn] tokens are not correctly created if text length > 1024

2016-12-12 Thread Chang KaiShin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang KaiShin updated LUCENE-7508:
--
Attachment: lucene-7508.patch

A hard-coded buffer size of 1024 exists in SegmentingTokenizerBase.java.
Enlarging it to 2048 makes the test case pass. As for the second issue, I 
don't suggest treating additional punctuation as sentence-ending symbols. 
Lucene's internals seem to perform well at identifying sentences; it would be 
better to open a separate issue if it fails again.







[jira] [Updated] (LUCENE-7508) [smartcn] tokens are not correctly created if text length > 1024

2016-12-12 Thread Chang KaiShin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang KaiShin updated LUCENE-7508:
--
Attachment: lucene-7508-test.patch

Failing test case.




-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (LUCENE-7509) [smartcn] Some chinese text is not tokenized correctly with Chinese punctuation marks appended

2016-12-01 Thread Chang KaiShin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-7509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714052#comment-15714052
 ] 

Chang KaiShin edited comment on LUCENE-7509 at 12/2/16 7:56 AM:


This is not a bug. The underlying Viterbi algorithm that segments Chinese 
sentences is based on the probability of occurrence of the Chinese characters. 
Take the sentence "生活报8月4号" as an example. The character "报" carries two 
meanings: at the end of a name such as "生活报" it means a daily newspaper, but 
in conjunction with other characters it means "to report". Here the algorithm 
segments "报" as an independent word meaning "to report"; in isolation, "生活报" 
is assumed to have a higher chance of meaning the newspaper. You need to add 
such words to the dictionary for the algorithm to learn from, so that you get 
the result you want.

The same reasoning applies to the case "碧绿的眼珠,". It was segmented into 
碧绿|的|眼|珠, and the punctuation "," is a stopword, so the result is 
碧绿|的|眼|珠. I suggest putting the word "眼珠" into the dictionary; that should 
solve the problem.
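The dictionary effect can be illustrated with a toy maximum-probability segmenter (a much-simplified stand-in for the HMM/Viterbi model; the word list, the 4-character window, and the log-probabilities are all invented for illustration). Adding "生活报" as a dictionary word flips the chosen segmentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy maximum-probability word segmenter (Viterbi over a word lattice).
// The log-probabilities are invented; this only illustrates why adding a
// word to the dictionary changes the chosen segmentation.
public class ToySegmenter {
    public static List<String> segment(String text, Map<String, Double> logProb) {
        int n = text.length();
        double[] best = new double[n + 1];
        int[] back = new int[n + 1];
        Arrays.fill(best, Double.NEGATIVE_INFINITY);
        best[0] = 0.0;
        for (int end = 1; end <= n; end++) {
            for (int start = Math.max(0, end - 4); start < end; start++) {
                String word = text.substring(start, end);
                // Unknown single characters get a low score; unknown longer
                // strings are not considered words at all.
                double p = logProb.getOrDefault(word,
                        word.length() == 1 ? -10.0 : Double.NEGATIVE_INFINITY);
                if (best[start] + p > best[end]) {
                    best[end] = best[start] + p;
                    back[end] = start;
                }
            }
        }
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = back[end]) {
            words.add(0, text.substring(back[end], end));
        }
        return words;
    }

    public static void main(String[] args) {
        Map<String, Double> dict = new HashMap<>();
        dict.put("生活", -2.0);
        dict.put("报", -3.0);
        System.out.println(segment("生活报", dict));  // [生活, 报]
        dict.put("生活报", -1.0);  // add the newspaper name to the dictionary
        System.out.println(segment("生活报", dict));  // [生活报]
    }
}
```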


was (Author: gushgg):
This is not a bug. The underlying Viterbi algorithm that segments Chinese 
sentences is based on the probability of occurrence of the Chinese characters. 
Take the sentence "生活报8月4号" as an example. The character "报" carries two 
meanings: at the end of a name such as "生活报" it means a daily newspaper, but 
in conjunction with other characters it means "to report". Here the algorithm 
segments "报" as an independent word meaning "to report"; in isolation, "生活报" 
is assumed to have a higher chance of meaning the newspaper. You need to add 
such words to the dictionary for the algorithm to learn from, so that you get 
the result you want.

> [smartcn] Some chinese text is not tokenized correctly with Chinese 
> punctuation marks appended
> --
>
> Key: LUCENE-7509
> URL: https://issues.apache.org/jira/browse/LUCENE-7509
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 6.2.1
> Environment: Mac OS X 10.10
>Reporter: peina
>  Labels: chinese, tokenization
>
> Some Chinese text is not tokenized correctly when Chinese punctuation marks 
> are appended.
> e.g.
> 碧绿的眼珠 is tokenized as 碧绿|的|眼珠, which is correct.
> But 
> 碧绿的眼珠, (with a Chinese punctuation mark appended) is tokenized as 碧绿|的|眼|珠,
> A similar case happens when numbers are appended to the text.
> e.g.
> 生活报8月4号 --> 生活|报|8|月|4|号
> 生活报 --> 生活报
> Test Sample:
> public static void main(String[] args) throws IOException{
> Analyzer analyzer = new SmartChineseAnalyzer(); /* will load stopwords */
> System.out.println("Sample1===");
> String sentence = "生活报8月4号";
> printTokens(analyzer, sentence);
> sentence = "生活报";
> printTokens(analyzer, sentence);
> System.out.println("Sample2===");
> 
> sentence = "碧绿的眼珠,";
> printTokens(analyzer, sentence);
> sentence = "碧绿的眼珠";
> printTokens(analyzer, sentence);
> 
> analyzer.close();
>   }
>   private static void printTokens(Analyzer analyzer, String sentence) throws 
> IOException{
> System.out.println("sentence:" + sentence);
> TokenStream tokens = analyzer.tokenStream("dummyfield", sentence);
> tokens.reset();
> CharTermAttribute termAttr = (CharTermAttribute) 
> tokens.getAttribute(CharTermAttribute.class);
> while (tokens.incrementToken()) {
>   System.out.println(termAttr.toString());
> }
> tokens.close();
>   }
> Output:
> Sample1===
> sentence:生活报8月4号
> 生活
> 报
> 8
> 月
> 4
> 号
> sentence:生活报
> 生活报
> Sample2===
> sentence:碧绿的眼珠,
> 碧绿
> 的
> 眼
> 珠
> sentence:碧绿的眼珠
> 碧绿
> 的
> 眼珠









[jira] [Commented] (LUCENE-6435) java.util.ConcurrentModificationException: Removal from the cache failed error in SimpleNaiveBayesClassifier

2015-10-02 Thread Chang KaiShin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14940825#comment-14940825
 ] 

Chang KaiShin commented on LUCENE-6435:
---

Good to hear the problem is solved!

> java.util.ConcurrentModificationException: Removal from the cache failed 
> error in SimpleNaiveBayesClassifier
> 
>
> Key: LUCENE-6435
> URL: https://issues.apache.org/jira/browse/LUCENE-6435
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/classification
>Affects Versions: 5.1
>Reporter: Tommaso Teofili
>Assignee: Tommaso Teofili
> Fix For: Trunk, 6.0
>
> Attachments: LUCENE-6435.patch, patch.rtf
>
>
> While using {{SimpleNaiveBayesClassifier}} on a very large index (all Italian 
> Wikipedia articles) I see the following code triggering a 
> {{ConcurrentModificationException}} when evicting the {{Query}} from the 
> {{LRUCache}}.
> {code}
> BooleanQuery booleanQuery = new BooleanQuery();
> BooleanQuery subQuery = new BooleanQuery();
> for (String textFieldName : textFieldNames) {
>   subQuery.add(new BooleanClause(new TermQuery(new Term(textFieldName, 
> word)), BooleanClause.Occur.SHOULD));
> }
> booleanQuery.add(new BooleanClause(subQuery, BooleanClause.Occur.MUST));
> booleanQuery.add(new BooleanClause(new TermQuery(new Term(classFieldName, 
> c)), BooleanClause.Occur.MUST));
> //...
> TotalHitCountCollector totalHitCountCollector = new 
> TotalHitCountCollector();
> indexSearcher.search(booleanQuery, totalHitCountCollector);
> return totalHitCountCollector.getTotalHits();
> {code}
> this is the complete stacktrace:
> {code}
> java.util.ConcurrentModificationException: Removal from the cache failed! 
> This is probably due to a query which has been modified after having been put 
> into  the cache or a badly implemented clone(). Query class: [class 
> org.apache.lucene.search.BooleanQuery], query: [#text:panoram #cat:1356]
>   at 
> __randomizedtesting.SeedInfo.seed([B6513DEC3681FEF5:138235BE33532634]:0)
>   at 
> org.apache.lucene.search.LRUQueryCache.evictIfNecessary(LRUQueryCache.java:285)
>   at 
> org.apache.lucene.search.LRUQueryCache.putIfAbsent(LRUQueryCache.java:268)
>   at 
> org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:569)
>   at 
> org.apache.lucene.search.ConstantScoreWeight.scorer(ConstantScoreWeight.java:82)
>   at org.apache.lucene.search.Weight.bulkScorer(Weight.java:137)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:560)
>   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:367)
>   at 
> org.apache.lucene.classification.SimpleNaiveBayesClassifier.getWordFreqForClass(SimpleNaiveBayesClassifier.java:288)
>   at 
> org.apache.lucene.classification.SimpleNaiveBayesClassifier.calculateLogLikelihood(SimpleNaiveBayesClassifier.java:248)
>   at 
> org.apache.lucene.classification.SimpleNaiveBayesClassifier.assignClassNormalizedList(SimpleNaiveBayesClassifier.java:169)
>   at 
> org.apache.lucene.classification.SimpleNaiveBayesClassifier.assignClass(SimpleNaiveBayesClassifier.java:125)
>   at 
> org.apache.lucene.classification.WikipediaTest.testItalianWikipedia(WikipediaTest.java:126)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1627)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:836)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:872)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:886)
>   at 
> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
>   at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
>   at 
> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:49)
>   at 
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:65)
>   at 
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
>   at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>   at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:365)
>   at 
> 

[jira] [Updated] (LUCENE-6435) java.util.ConcurrentModificationException: Removal from the cache failed error in SimpleNaiveBayesClassifier

2015-09-16 Thread Chang KaiShin (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chang KaiShin updated LUCENE-6435:
--
Attachment: patch.rtf

>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:872)
>   at 
> com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:886)
>   at 
> org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:50)
>   at 
> org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:46)
>   at 
> org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:49)
>   at 
> org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:65)
>   at 
> org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:48)
>   at 
> com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
>   at 
> com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:365)
>   at 
> 

[jira] [Commented] (LUCENE-6435) java.util.ConcurrentModificationException: Removal from the cache failed error in SimpleNaiveBayesClassifier

2015-09-16 Thread Chang KaiShin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747126#comment-14747126
 ] 

Chang KaiShin commented on LUCENE-6435:
---

By running the JUnit test in debugging mode, I get the 
ConcurrentModificationException - Removal from the cache failed! This is 
probably due to a query which has been modified after having been put into the 
cache or a badly implemented clone(). As the description suggests, I started 
looking into the LRUQueryCache class, which holds the most recently used 
queries, and found that TermQuery's hashCode() does not return a consistent 
value. As a result, the HashMap (mostRecentlyUsedQueries) cannot locate the 
entry in order to remove it from the cache. However, I have not yet looked 
into why TermQuery's hashCode() fails to remain stable.
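To illustrate the failure mode described above with a minimal sketch (this is not Lucene code; the MutableKey and HashCodeDemo class names are hypothetical): if a key's hashCode() changes after it has been inserted into a HashMap, a later removal looks in the wrong bucket and silently fails, stranding the entry - the same symptom the cache eviction hits when a cached Query's hash changes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical key whose hashCode() depends on mutable state.
class MutableKey {
    int value;
    MutableKey(int value) { this.value = value; }
    @Override public int hashCode() { return value; } // changes if 'value' changes
    @Override public boolean equals(Object o) {
        return o instanceof MutableKey && ((MutableKey) o).value == value;
    }
}

public class HashCodeDemo {
    public static void main(String[] args) {
        Map<MutableKey, String> cache = new HashMap<>();
        MutableKey key = new MutableKey(1);
        cache.put(key, "cached");            // stored in the bucket for hash 1
        key.value = 2;                       // hashCode() now differs from insertion time
        System.out.println(cache.remove(key)); // prints "null": wrong bucket, removal fails
        System.out.println(cache.size());      // prints "1": the entry is stranded
    }
}
```

This is why the HashMap contract requires keys to be immutable with respect to hashCode() while they are in the map; a Query that mutates after being cached violates it.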


[jira] [Commented] (LUCENE-6435) java.util.ConcurrentModificationException: Removal from the cache failed error in SimpleNaiveBayesClassifier

2015-09-16 Thread Chang KaiShin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14747081#comment-14747081
 ] 

Chang KaiShin commented on LUCENE-6435:
---

The hashCode() of class TermQuery does not remain consistent, which causes 
iterator.remove() to fail.
I'll post a patch later.
