On 10/27/2014 6:20 AM, Robust Links wrote:
> 1) we want to index and search all tokens in a document (i.e. we do not
> rely on external stores)
> 
> 2) we need search time to be fast, and we are willing to pay for it
> with larger indexing time and index size,
> 
> 3) be able to search ngrams of 3 tokens or fewer (i.e., unigrams,
> bigrams and trigrams) as fast as possible.
> 
> 
> To satisfy (1) we used the default
> <maxFieldLength>2147483647</maxFieldLength> in the solrconfig.xml of
> our 3.6.1 index to specify the total number of tokens to index in an
> article. In Solr 4 we are specifying it via the tokenizer in the
> analyzer chain:
> 
> 
> <tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="2147483647"/>
> 
> 
> To satisfy (2) and (3), in our 3.6.1 index we indexed using the
> following ShingleFilterFactory in the analyzer chain:
> 
> 
> <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
> maxShingleSize="3"/>
> 
> 
> This was based on this thread:
> 
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200808.mbox/%3c856ac15f0808161539p54417df2ga5a6fdfa35889...@mail.gmail.com%3E
> 
> 
> The open questions we are now trying to answer are:
> 
> 
> 1) Is shingling still the best strategy for phrase (ngram) search,
> given our requirements above?
> 
> 2) If not, what would be a better strategy?

The maxFieldLength setting is different from maxTokenLength.  The former
is the maximum number of tokens allowed in a field.  The latter is the
maximum number of characters allowed in *each* token.  Since the value
you were using should be the default value for maxFieldLength, you don't
need it in your config.
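
To make the distinction concrete, here are the two settings from your
own config, with comments (mine) showing which limit each one controls:

<!-- solrconfig.xml (3.x): maximum NUMBER OF TOKENS indexed per field -->
<maxFieldLength>2147483647</maxFieldLength>

<!-- analyzer chain in the schema: maximum NUMBER OF CHARACTERS per token -->
<tokenizer class="solr.ClassicTokenizerFactory" maxTokenLength="2147483647"/>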

As for maxTokenLength, if the older version worked right without that
setting, you probably don't need it now.  Really long tokens are usually
useless, unless a later step in the analysis chain will break them up
into additional tokens (terms).  It's exceptionally rare for anyone to
use or type a "word" that is 256 characters long.  I have seen documents
exceed the token length on keyword fields where the input is separated
only by commas -- there are no spaces for the WhitespaceTokenizer to
split on, so a document with a lot of keywords ends up indexing none of
them, because the tokenizer ignores the input due to its length.  If the
tokenizer had let that input through, the WordDelimiterFilter would have
broken it up into the individual keywords.
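
Here's a rough sketch of the kind of field type I'm describing -- the
field type name is made up, but the chain is the one from that situation:

<fieldType name="keywords_example" class="solr.TextField">
  <analyzer>
    <!-- a long comma-separated value with no spaces reaches the
         tokenizer as one enormous "word" -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- WordDelimiterFilter would split that word on the commas, but
         only if the tokenizer actually emits it in the first place -->
    <filter class="solr.WordDelimiterFilterFactory"/>
  </analyzer>
</fieldType>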

Shingles may or may not be required to match the way you have described.
It all depends on the *exact* nature of your queries.  I haven't wrapped
my head around all the possibilities, so I can't give you a definitive
answer.  Since it's been working on your older index, chances are
excellent that it will continue to work on the newer index.  Shingles
can indeed increase search performance, if the conditions are right.
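
For reference, this is the general shape of a shingled fieldType that
produces unigrams, bigrams, and trigrams at index time.  The name is
made up, and whether the query side should shingle the same way depends
on your query parser, which is exactly the part I haven't thought
through:

<fieldType name="text_shingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <!-- "quick brown fox" becomes: quick, brown, fox,
         "quick brown", "brown fox", "quick brown fox" -->
    <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
            maxShingleSize="3"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
            maxShingleSize="3"/>
  </analyzer>
</fieldType>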

Search performance in general is better in 4.x than it was in 3.x.

It's always a good idea to look at this wiki page (and even dive into
the Lucene javadocs) from time to time in order to determine whether
there's a better way of doing your analysis:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

It sounds like you've been at this a while, so you probably already know
this next part, but it would be irresponsible of me to talk about all
this without mentioning it.  When you change your index analysis, you
must reindex.

http://wiki.apache.org/solr/HowToReindex
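
If you can't simply wipe the index directory and rebuild from your
source system, one common approach (a generic sketch, not specific to
your setup) is to delete everything and then re-send all of your source
documents.  The delete-everything request to the XML update handler
looks like this, followed by a commit:

<delete><query>*:*</query></delete>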

Thanks,
Shawn
