Filtered docs and positions enum
Apologies first for posting this both here and to the Solr list; I wasn't sure where it was most appropriately asked, but since there was no response there I figured I'd try here. I have what I believe to be a fairly unusual use case (I have not seen it mentioned before) that I'm looking for some thoughts on. I need to filter terms based on a user's authorizations; the implementation is currently based on https://github.com/jej2003/lucure-core/blob/master/src/main/java/com/lucure/core/codec/AccessFilteredDocsAndPositionsEnum.java

The current implementation wraps a DocsAndPositionsEnum, but there is an unknown that I am not sure is or is not an issue, around freq() and positions for a particular term. Right now freq() is passed through unmodified from the wrapped DocsAndPositionsEnum, but when a caller calls nextPosition() and encounters a position whose authorizations they don't have, we simply skip it by calling nextPosition() on the wrapped enum again. In that scenario we may have reported freq() as 2 while the caller only has access to 1 position. There is currently no equivalent of the NO_MORE_DOCS constant for positions, so when the visible positions run out we return -1 (though we are considering changing that to Integer.MAX_VALUE). We have already seen possible issues with this in the phrase scorer (hence the idea of returning Integer.MAX_VALUE), but the only way I can see to truly remedy this in the current implementation is to get freq() right from the start, and I can't see how to do that without processing all of the positions up front to compute freq() correctly given the user's authorizations.

OK, that was long, so now for the question: is returning a huge number (say Integer.MAX_VALUE) from nextPosition() OK in situations like this? Are there specific places we should be looking at to verify? Ideally we would instead get the frequencies correct given the authorizations, but if there are no negative consequences to the current approach I would prefer to avoid the up-front processing. As always, any feedback is appreciated.
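For reference, the up-front approach described above could look roughly like the sketch below, against the 4.x DocsAndPositionsEnum API the lucure code is built on. This is not the lucure implementation: canSee() is a hypothetical stand-in for the payload-based authorization check, and a complete version would also have to buffer offsets and payloads, since the wrapped enum has already been advanced past each position by the time the caller asks for them.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.util.BytesRef;

public class BufferedAccessFilteredPositionsEnum extends DocsAndPositionsEnum {
  private final DocsAndPositionsEnum in;
  private final List<Integer> visible = new ArrayList<>();
  private int pos = -1;

  public BufferedAccessFilteredPositionsEnum(DocsAndPositionsEnum in) {
    this.in = in;
  }

  /** Hypothetical authorization test, e.g. against the position's payload. */
  private boolean canSee(BytesRef payload) {
    return true; // replace with the real visibility check
  }

  /** Reads every position of the current doc and keeps only the visible ones. */
  private int fillBuffer() throws IOException {
    visible.clear();
    pos = -1;
    for (int i = 0; i < in.freq(); i++) {
      int p = in.nextPosition();
      if (canSee(in.getPayload())) {
        visible.add(p);
      }
    }
    return visible.size();
  }

  @Override
  public int nextDoc() throws IOException {
    int doc;
    // Skip docs where every position was filtered out, so freq() is never 0.
    while ((doc = in.nextDoc()) != NO_MORE_DOCS && fillBuffer() == 0) {
    }
    return doc;
  }

  @Override
  public int advance(int target) throws IOException {
    int doc = in.advance(target);
    if (doc != NO_MORE_DOCS && fillBuffer() == 0) {
      doc = nextDoc();
    }
    return doc;
  }

  @Override
  public int freq() {
    // Matches exactly the number of positions nextPosition() will return.
    return visible.size();
  }

  @Override
  public int nextPosition() {
    return visible.get(++pos);
  }

  @Override
  public int docID() { return in.docID(); }

  @Override
  public int startOffset() throws IOException { return in.startOffset(); }

  @Override
  public int endOffset() throws IOException { return in.endOffset(); }

  @Override
  public BytesRef getPayload() throws IOException { return in.getPayload(); }

  @Override
  public long cost() { return in.cost(); }
}

With this shape, freq() and nextPosition() always agree, so nothing like Integer.MAX_VALUE ever needs to be handed to scorers such as the phrase scorer.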
How to use case-insensitive search
Dear Team,

I am trying to build a search engine for fetching person info based on name or email ID. For this I have StandardAnalyzer and wildcard queries. If I enter a case-sensitive query I get the result, but how do I go about making it case-insensitive? I mean that searching for "rohan" or "Rohan" should give the same result; currently I only get a result when I search exactly as stored in the DB, i.e. "Rohan", and not for "rohan".

I have posted the same query on Stack Overflow: http://stackoverflow.com/questions/30881355/java-lucene-4-5-how-to-search-by-case-insensitive/30926385#30926385

Please help me out; is there any reference where I can look?

--
Thanks & Regards
Vardhaman B.N
9945840928
Re: How to use case-insensitive search
Add LowercaseFilterFactory to your analysis chain for the fieldType, both at query and index time. You'll need to re-index.

The admin UI/analysis page will help you understand the effects of each analysis step defined in your fieldTypes.

Best,
Erick

On Fri, Aug 14, 2015 at 3:44 AM, vardhaman narasagoudar wrote:
> Dear Team,
> I am trying to build a search engine for fetching person info based on name or email ID. [...]
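That is Solr configuration; in plain Lucene the same effect comes from using an analyzer with a LowerCaseFilter in the chain at both index and query time. A minimal sketch, assuming the Lucene 5.x Analyzer API (on 4.x, createComponents also takes a Reader and LowerCaseFilter a Version argument; the class name here is made up for illustration):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// Sketch: standard tokenization followed by lower-casing, used at both
// index and query time so "Rohan" and "rohan" end up as the same term.
public final class LowercasingAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    TokenStream result = new LowerCaseFilter(source);
    return new TokenStreamComponents(source, result);
  }
}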
Re: How to use case-insensitive search
I was assuming this was a Lucene question...

The StandardAnalyzer already includes the lower-case filter, so queries should be case-insensitive by default. See: https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html

If the question was really how to get case-sensitive queries, simply create your own analyzer without the lower-case filter.

-- Jack Krupansky

On Fri, Aug 14, 2015 at 10:07 AM, Erick Erickson wrote:
> Add LowercaseFilterFactory to your analysis chain for the fieldType, both at query and index time. [...]
Re: How to use case-insensitive search
Hi,

Wildcard queries don't use the Analyzer, so they are case-sensitive. Most of Lucene's query parsers can lowercase terms even when they contain a wildcard, but you have to enable this. In most cases it is recommended to use a plain, simple analyzer for fields that are queried with wildcards; if you also have stemming, it will not work correctly with wildcards.

In general, if your queries require wildcards by default, you should review your analysis! A well-configured analysis chain should allow the user to find things without using wildcards at all.

Uwe

On 14 August 2015 at 16:12:46 CEST, Jack Krupansky wrote:
> I was assuming this was a Lucene question... [...]

--
Uwe Schindler
H.-H.-Meier-Allee 63, 28213 Bremen
http://www.thetaphi.de
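To make that concrete, here is a sketch (Lucene 5.x signatures assumed; on 4.x the QueryParser and StandardAnalyzer constructors also take a Version argument, and the field name "name" is just an example). The classic query parser can lowercase wildcard terms for you; a directly constructed WildcardQuery bypasses analysis entirely, so there you have to lowercase the input yourself to match the lowercased index terms.

import java.util.Locale;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class WildcardCaseDemo {
  public static void main(String[] args) throws Exception {
    // Option 1: let the classic QueryParser lowercase wildcard/prefix/range terms.
    QueryParser parser = new QueryParser("name", new StandardAnalyzer());
    parser.setLowercaseExpandedTerms(true);
    Query parsed = parser.parse("Roh*"); // parses to name:roh*, matching the indexed token "rohan"
    System.out.println(parsed);

    // Option 2: a directly constructed WildcardQuery runs no analyzer,
    // so lowercase the user input before building the term.
    Query direct = new WildcardQuery(new Term("name", "Roh*".toLowerCase(Locale.ROOT)));
    System.out.println(direct);
  }
}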
Getting full English words when tokenizing with SmartChineseAnalyzer
Hi,

I am new to the Lucene analyzers. I would like to get the full English tokens from SmartChineseAnalyzer, but I'm only getting stems. The following code has the test sentence predefined in "testStr":

String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";

The printed tokenized result is:

女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林 first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池 铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanes player 奥 原 希望 这 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secur posit congratul

As you can see, some longer English tokens such as "Japanese", "position" and "congratulations" are cut short during tokenization. I hope I didn't use it wrong.

Test code:

// imports needed for this snippet
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

private static void testChineseTokenizer() {
    String testStr = "女单方面,王适娴second seed和头号种子卫冕冠军西班牙选手马林first seed同处1/4区,3号种子李雪芮和韩国选手Korean player成池铉处在2/4区,不过成池铉先要过日本小将(Japanese player)奥原希望这关。下半区,6号种子王仪涵若想晋级决赛secure position. congratulations.";
    Analyzer analyzer = new SmartChineseAnalyzer();
    List<String> result = new ArrayList<>();
    StringReader sr = new StringReader(testStr);

    try {
        TokenStream stream = analyzer.tokenStream(null, sr);
        CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            result.add(cattr.toString());
        }
        stream.end();
        stream.close();
        sr.close();
        analyzer.close();
        for (String tok : result) {
            System.out.print(" " + tok);
        }
        System.out.println();
    } catch (IOException e) {
        // not thrown because we're using a string reader
    }
}
Re: Getting full English words when tokenizing with SmartChineseAnalyzer
The easiest thing to do is to create your own analyzer, cut and paste the code from org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer into it, and get rid of the line in createComponents(String fieldName, Reader reader) that says

result = new PorterStemFilter(result);

On Fri, Aug 14, 2015 at 11:20 AM, Wayne Xin wrote:
> I am new to the Lucene analyzers. I would like to get the full English tokens from SmartChineseAnalyzer, but I'm only getting stems. [...]
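For a Lucene 5.x build, the copied-and-trimmed class might look roughly like the sketch below. This is not the verbatim SmartChineseAnalyzer source: it assumes the chain is HMMChineseTokenizer plus the default stop set, and simply leaves the PorterStemFilter out so English tokens keep their full form (on 4.x, createComponents also takes a Reader and the tokenizer constructor differs).

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.cn.smart.HMMChineseTokenizer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.analysis.core.StopFilter;
import org.apache.lucene.analysis.util.CharArraySet;

// Sketch: SmartChineseAnalyzer-like chain without the PorterStemFilter.
public final class SmartChineseNoStemAnalyzer extends Analyzer {
  private final CharArraySet stopWords = SmartChineseAnalyzer.getDefaultStopSet();

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer tokenizer = new HMMChineseTokenizer();
    // result = new PorterStemFilter(result) is intentionally omitted here
    TokenStream result = new StopFilter(tokenizer, stopWords);
    return new TokenStreamComponents(tokenizer, result);
  }
}

Swapping this in for SmartChineseAnalyzer in the test method above should leave tokens such as "japanese", "position" and "congratulations" unstemmed.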
Re: Getting full English words when tokenizing with SmartChineseAnalyzer
Thanks Michael, that works well. I'm not sure why SmartChineseAnalyzer is final, otherwise we could just override createComponents().

New output:

女 单 方面 王 适 娴 second seed 和 头号 种子 卫冕 冠军 西班牙 选手 马 林 first seed 同 处 1 4 区 3 号 种子 李 雪 芮 和 韩国 选手 korean player 成 池 铉 处在 2 4 区 不过 成 池 铉 先 要 过 日本 小将 japanese player 奥 原 希望 这 关 下 半 区 6 号 种子 王 仪 涵 若 想 晋级 决赛 secure position congratulations

-Wayne

On 8/14/15, 8:48 AM, "Michael Mastroianni" wrote:
> The easiest thing to do is to create your own analyzer, cut and paste the code from org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer into it [...]
RE: Getting full English words when tokenizing with SmartChineseAnalyzer
Hi,

Since Lucene 5.0 it is much easier to create your own analyzers, without defining your own classes: https://lucene.apache.org/core/5_2_1/analyzers-common/org/apache/lucene/analysis/custom/CustomAnalyzer.html

Using the builder you can create your own analyzer with just a few lines of code. The names and params used are those of the factories known from Apache Solr.

Analyzers are final by design.

Uwe

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Wayne Xin [mailto:wayne_...@hotmail.com]
> Sent: Friday, August 14, 2015 8:44 PM
> To: java-user@lucene.apache.org
> Subject: Re: Getting full English words when tokenizing with SmartChineseAnalyzer
>
> Thanks Michael, that works well. I'm not sure why SmartChineseAnalyzer is final, otherwise we could just override createComponents(). [...]
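As a rough illustration of the builder (a sketch only: it assumes the smartcn analysis module is on the classpath so the factory name "hmmChinese" resolves to HMMChineseTokenizerFactory, and it uses StopFilterFactory with its default English stop words rather than SmartChineseAnalyzer's own stop set):

import java.io.IOException;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class NoStemSmartChinese {
  // Sketch: HMMChinese tokenizer plus a stop filter, with no "porterStem"
  // filter in the chain, so English tokens are left unstemmed.
  public static Analyzer create() throws IOException {
    return CustomAnalyzer.builder()
        .withTokenizer("hmmChinese")
        .addTokenFilter("stop") // pass "words"/"format" params here for a custom stop list
        .build();
  }
}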
Re: Getting full English words when tokenizing with SmartChineseAnalyzer
Thanks Uwe, this seems to be a handy tool. My problem is that I need a better example (a tutorial, maybe) showing which filters a SmartChineseAnalyzer or JapaneseAnalyzer needs by default. In this case, I guess I need an HMMChineseTokenizer and a stop filter, but not a Porter stem filter. I'll give it a try later, but a tutorial would be nice. Thanks for the suggestion, though.

-Wayne

On 8/14/15, 4:40 PM, "Uwe Schindler" wrote:
> Hi,
>
> Since Lucene 5.0 it is much easier to create your own analyzers, without defining your own classes [...]