Have you tried adding autoGeneratePhraseQueries="true" to the fieldType,
without changing the index analysis behavior?

This works at query time only, and will convert 12-34 to "12 34", as if the
user had entered the query as a phrase. This gives the expected behavior as
long as the tokenization is the same at index time and query time.
This'll work for the 80-IA structure, and I think it'll also work for the
9(1)(vii) example (converting it to "9 1 vii"), but I haven't tested it.
Also, I would think the 12AA example should already be working as you
expect, unless maybe you're already using the WordDelimiterFilterFactory.
When I test the StandardTokenizer on 12AA it preserves the string,
resulting in just one token, 12aa.

autoGeneratePhraseQueries is at least worth a quick try - it doesn't
require reindexing.
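
A minimal sketch of where the attribute goes in schema.xml (the field type
name and the analyzer chain here are assumptions, not your actual
configuration):

```xml
<!-- Hypothetical field type: autoGeneratePhraseQueries is the point here;
     adapt the name and analyzer chain to your own schema. -->
<fieldType name="text_legal" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```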

Two things to note:
1) Don't use autoGeneratePhraseQueries if you have CJK languages; that
probably applies to any language that isn't whitespace-delimited. You
mentioned Indian judgments, so I presume Hindi, which I don't think will be
an issue.
2) In very rare cases you may get a few odd results when the
non-alphanumeric characters differ but generate the same phrase query. E.g.
9(1)(vii) would produce the same phrase as 9-1(vii), but this doesn't seem
worth considering until you know it's a problem.
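
The collision in point 2 can be sketched in Python. This is a rough
stand-in for how the generated phrase query sees the input, not Solr's
actual analysis chain:

```python
import re

def phrase_tokens(query: str) -> list[str]:
    """Rough stand-in: split a query on runs of non-alphanumeric
    characters and lowercase the parts, roughly the token stream the
    generated phrase query would see."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", query) if t]

# Both section references collapse to the same phrase tokens,
# so they would match the same documents.
print(phrase_tokens("9(1)(vii)"))  # ['9', '1', 'vii']
print(phrase_tokens("9-1(vii)"))   # ['9', '1', 'vii']
```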


On Sun, Dec 8, 2013 at 10:29 AM, Upayavira <u...@odoko.co.uk> wrote:

> If you want to just split on whitespace, then the WhitespaceTokenizer
> will do the job.
>
> However, this will mean that these two tokens aren't the same, and won't
> match each other:
>
> cat
> cat.
>
> A simple regex filter could handle those cases by removing a comma or dot
> at the end of a word. There are other similar situations (quotes, colons,
> etc.) that you may want to handle eventually.
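>
> A minimal sketch of such a filter, using solr.PatternReplaceFilterFactory
> (the pattern is an assumption and only handles trailing dots and commas):
>
> ```xml
> <filter class="solr.PatternReplaceFilterFactory"
>         pattern="[.,]+$" replacement=""/>
> ```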
>
> Upayavira
>
> On Sun, Dec 8, 2013, at 11:51 AM, Vulcanoid Developer wrote:
> > Thanks for your email.
> >
> > Great, I will look at the WordDelimiterFilterFactory. Just to be clear,
> > I DON'T want any tokenizing on digits, special characters, punctuation,
> > etc. other than word delimiting on whitespace.
> >
> > All I want for my first version is NO removal of punctuation/special
> > characters at indexing time or at search time, i.e., input as-is and
> > search as-is (like a simple SQL DB?). I was assuming this would be a
> > trivial case with Solr, and I'm not sure what I am missing here.
> >
> > thanks
> > Vulcanoid
> >
> >
> >
> > On Sun, Dec 8, 2013 at 4:33 AM, Upayavira <u...@odoko.co.uk> wrote:
> >
> > > Have you tried a WhitespaceTokenizerFactory followed by the
> > > WordDelimiterFilterFactory? The latter is perhaps more configurable in
> > > what it does. Alternatively, you could use a PatternReplaceFilterFactory
> > > to remove extraneous punctuation that wasn't removed by the whitespace
> > > tokenizer.
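> > >
> > > A sketch of such a chain (the filter options shown are assumptions to
> > > tune, not a tested configuration):
> > >
> > > ```xml
> > > <analyzer>
> > >   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >   <filter class="solr.WordDelimiterFilterFactory"
> > >           generateWordParts="1" generateNumberParts="1"
> > >           catenateAll="1" preserveOriginal="1"/>
> > > </analyzer>
> > > ```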
> > >
> > > Upayavira
> > >
> > > On Sat, Dec 7, 2013, at 06:15 PM, Vulcanoid Developer wrote:
> > > > Hi,
> > > >
> > > > I am new to Solr, and I guess this is a basic tokenizer question, so
> > > > please bear with me.
> > > >
> > > > I am trying to use Solr to index a few (Indian) legal judgments in
> > > > text form and search against them. One of the key points with these
> > > > documents is that the sections/provisions of law usually have
> > > > punctuation/special characters in them. For example, search queries
> > > > will TYPICALLY be section 12AA, section 80-IA, section 9(1)(vii), and
> > > > the text of the judgments themselves will contain this sort of text,
> > > > with section references all over the place.
> > > >
> > > > Now, using a default schema setup with the StandardTokenizer, which
> > > > seems to delimit on whitespace AND punctuation, I get really bad
> > > > results, because it looks like 12AA is split, and results merely
> > > > having 12 and AA in them turn up. It becomes worse with 9(1)(vii),
> > > > with results containing 9 and 1 etc. being turned up.
> > > >
> > > > What is the best solution here? I really just want to index the
> > > > documents as-is and to do whitespace tokenizing on the search, and
> > > > nothing more.
> > > >
> > > > So in other words:
> > > > a) I would like the text document to be indexed as-is, with, say,
> > > > 12AA and 9(1)(vii) stored in the document as they appear.
> > > > b) I would like to be able to search for 12AA and for 9(1)(vii) and
> > > > get proper full matches on them without any splitting up/munging etc.
> > > >
> > > > Any suggestions are appreciated. Thank you for your time.
> > > >
> > > > Thanks
> > > > Vulcanoid
> > >
>
