Looking For Tokenizer With Custom Delimiter

2018-01-08 Thread Armins Stepanjans
Hi,

I am looking for a tokenizer where I can specify the delimiters by which
words are tokenized. For example, if I choose ' ' and '_' as the
delimiters, the following string:
"foo__bar doo"
would be tokenized into:
"foo", "", "bar", "doo"
(The analyzer could further filter out empty tokens, since having the
empty-string token is not critical.)

Is such functionality built into Lucene (I'm working with 7.1.0), and does
this seem like the correct approach to the problem?

Regards,
Armīns


Re: Looking For Tokenizer With Custom Delimiter

2018-01-08 Thread Armins Stepanjans
Thanks, I was able to use the module. However, my Analyzer is not invoked
upon IndexWriter.addDocument(), even though I pass it to the constructor
when creating the IndexWriterConfig. When I test the Analyzer explicitly,
following the instructions under "Invoking the Analyzer" in
http://lucene.apache.org/core/7_1_0/core/org/apache/lucene/analysis/package-summary.html
the Analyzer works as expected.

Do you know what I could be missing?
Please let me know if you need to see any more of my code.

Regards,
Armīns

On Mon, Jan 8, 2018 at 3:27 PM, Uwe Schindler <u...@thetaphi.de> wrote:

> Hi
>
> It is part of the analyzers-common module, it is not included in Lucene's
> core. Lucene's core module only has a single analyzer (StandardAnalyzer)
> and some helper classes, but not the full set of multi-purpose and language
> specific ones.
>
> Uwe


RE: Looking For Tokenizer With Custom Delimiter

2018-01-08 Thread Uwe Schindler
Hi

It is part of the analyzers-common module; it is not included in Lucene's core.
Lucene's core module has only a single analyzer (StandardAnalyzer) and some
helper classes, but not the full set of multi-purpose and language-specific
ones.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
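In a Maven build like the one described elsewhere in this thread, Uwe's answer translates to adding the analyzers-common artifact alongside lucene-core (version matching the 7.1.0 used here):

```xml
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>7.1.0</version>
</dependency>
```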




Re: Looking For Tokenizer With Custom Delimiter

2018-01-08 Thread Armins Stepanjans
Thanks for the solution; however, I am unable to access the CharTokenizer
class when I import it using:

import org.apache.lucene.analysis.util.*;

although I can access classes directly under analysis (or
analysis.standard) just fine with the import statement:
import org.apache.lucene.analysis.*;

Does this appear to be a Lucene-specific problem?

P.S. I'm using Maven to manage my dependencies, with the following two
entries for Lucene:


<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.1.0</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.1.0</version>
</dependency>

Regards,
Armīns



RE: Looking For Tokenizer With Custom Delimiter

2018-01-08 Thread Uwe Schindler
Moin,

This is easy to customize with lambdas! E.g., an elegant way to create a
tokenizer that behaves exactly like WhitespaceTokenizer combined with
LowerCaseFilter is:

Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(
    Character::isWhitespace, Character::toLowerCase);

Adjust the lambdas and you can create a tokenizer based on any character
check; e.g., to split on whitespace or underscore:

Tokenizer tok = CharTokenizer.fromSeparatorCharPredicate(ch ->
    Character.isWhitespace(ch) || ch == '_');

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
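As a plain-JDK illustration (this is not Lucene code, just a sketch of what a separator-char-predicate tokenizer does): characters matching the predicate end the current token, and only non-empty tokens are emitted, so the doubled underscore in "foo__bar doo" does not yield an empty string:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Plain-JDK sketch mirroring the behavior of a tokenizer built via
// CharTokenizer.fromSeparatorCharPredicate: a separator ends the current
// token, and zero-length tokens are never emitted.
public class SeparatorSketch {
    static List<String> tokenize(String input, IntPredicate isSeparator) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            if (isSeparator.test(ch)) {
                if (current.length() > 0) {   // skip empty tokens
                    tokens.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(ch);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Whitespace or underscore as separators, as in the question.
        System.out.println(tokenize("foo__bar doo",
                ch -> Character.isWhitespace(ch) || ch == '_'));
        // prints [foo, bar, doo]
    }
}
```

Note that an empty token between the two underscores is never produced, so the empty-token filtering the original question mentions is unnecessary.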



