Re: Question on Reverse Indexing

Dmitry Kan Thu, 19 Jan 2012 04:27:47 -0800

A quick observation:

First, both the index and query analysis chains use a custom tokenizer
factory. Could it, by some chance, affect the leading wildcard setting?
That setting does not require storing reversed tokens in the index; it is
just run-time leading wildcard expansion and matching. It is turned off by
default because matching against the entire term dictionary, which may
contain billions of terms (depending on how varied the data is), is
inefficient.
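
To illustrate the difference, here is a minimal sketch in plain Java. It is
not Solr or Lucene code: the class and method names are made up, and a
TreeSet stands in for the sorted term dictionary. It contrasts run-time
expansion of *lock (every term is visited) with the reversed-token approach
(the query becomes the prefix kcol*, which is a range lookup).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical illustration only; a TreeSet stands in for the term dictionary.
class LeadingWildcardSketch {

    // Run-time expansion of "*suffix": no reversed tokens are stored, so
    // every term in the dictionary has to be visited and tested.
    static List<String> scanForSuffix(NavigableSet<String> terms, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.endsWith(suffix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    // With reversed tokens indexed, "*suffix" becomes the prefix query
    // reverse(suffix) + "*": seek to the prefix and stop once terms no
    // longer start with it.
    static List<String> prefixOnReversed(NavigableSet<String> reversedTerms, String suffix) {
        String prefix = new StringBuilder(suffix).reverse().toString();
        List<String> hits = new ArrayList<>();
        for (String reversed : reversedTerms.tailSet(prefix, true)) {
            if (!reversed.startsWith(prefix)) {
                break; // left the contiguous prefix range
            }
            hits.add(new StringBuilder(reversed).reverse().toString());
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(
                List.of("block", "clock", "lock", "locker", "unlock"));
        NavigableSet<String> reversedTerms = new TreeSet<>();
        for (String term : terms) {
            reversedTerms.add(new StringBuilder(term).reverse().toString());
        }
        System.out.println(scanForSuffix(terms, "lock"));
        System.out.println(prefixOnReversed(reversedTerms, "lock"));
    }
}
```

Both methods find the same terms; the first does work proportional to the
size of the whole dictionary, the second only to the number of matches (at
the cost of also indexing the reversed tokens).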


Also, since you said you use 4.0, did you build it from source yourself or
use a pre-built package?

On Thu, Jan 19, 2012 at 1:06 PM, Shyam Bhaskaran <
shyam.bhaska...@synopsys.com> wrote:

> Dimitry,
>
> Yes my app and the Solr admin search results are giving me similar results.
>
> Excerpt from schema.xml:
> As you can see solr.ReversedWildcardFilterFactory is commented out.
>
>  <types>
>   <fieldtype name="string"  class="solr.StrField" sortMissingLast="true"
> omitNorms="true"/>
>    <fieldType name="text_rev" class="solr.TextField"
> sortMissingLast="true" omitNorms="true">
>     <analyzer type="index">
>                <tokenizer
> class="com.es.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
>                <filter class="solr.StopFilterFactory"
> words="stopwords.txt" ignoreCase="true"/>
>                <filter
> class="com.es.solr.backend.analysis.standard.SolvNetFilterFactory"/>
>                <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                <!--filter class="solr.ReversedWildcardFilterFactory"
> withOriginal="true"
>                maxPosAsterisk="3" maxPosQuestion="2"
> maxFractionAsterisk="0.33"/-->
>     </analyzer>
>     <analyzer type="query">
>                <tokenizer
> class="com.es.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
>                 <filter class="solr.StopFilterFactory"
> words="stopwords.txt" ignoreCase="true"/>
>                 <filter class="solr.LowerCaseFilterFactory"/>
>                <filter class="solr.StopFilterFactory"
> words="stopwords.txt" ignoreCase="true"/>
>     </analyzer>
>  </fieldType>
>  </types>
>
>  <fields>
>  <!-- general -->
>  <field name="id"               type="string"    indexed="true"
>  stored="true"  multiValued="false" required="true"/>
>  <field name="doc_type"         type="string"    indexed="true"
>  stored="true"  multiValued="false" />
>  <field name="access_level"     type="string"    indexed="true"
>  stored="true"  multiValued="false" />
>  <field name="primary_product"  type="string"    indexed="true"
>  stored="true"  multiValued="false" />
>  <field name="products"         type="string"    indexed="true"
>  stored="true"  multiValued="true" />
>  <field name="version"          type="string"    indexed="true"
>  stored="true"  multiValued="true" />
>  <field name="dow_version"      type="string"    indexed="true"
>  stored="false" multiValued="false" />
>  <!-- Since for title we are using the regular highlighter no need to
> enabling termVectors. -->
>  <field name="title"            type="text_rev"    indexed="true"
>  stored="true"  multiValued="false" />
>  <field name="date_last_updated" type="string"   indexed="true"
>  stored="true"  multiValued="false" />
>  <field name="file_name"        type="string"    indexed="false"
>  stored="true"  multiValued="false" />
>  <field name="format"           type="string"    indexed="false"
>  stored="true"  multiValued="false" />
>  <!-- Enabling termvector indexing so we could use fast highlighter. -->
>  <field name="indexed_content"  type="text_rev"          indexed="true"
>  stored="true"  multiValued="false" termVectors="true" termPositions="true"
> termOffsets="true"/>
>  <field name="url"              type="string"    indexed="false"
>  stored="true"  multiValued="false" />
>  <field name="doc_sub_type"     type="string"    indexed="true"
>  stored="true"  multiValued="false" />
>  <field name="attachment_urls"  type="string"    indexed="false"
>  stored="true"  multiValued="true" />
>  <!-- Since for attachment_titles we are using the regular highlighter no
> need to enabling termVectors. -->
>  <field name="attachment_titles"  type="text_rev"    indexed="true"
>  stored="true"  multiValued="true"/>
>  <!-- Enabling termvector indexing so we could use fast highlighter. -->
>  <field name="attachment_bodies"  type="text_rev"    indexed="true"
>  stored="true"  multiValued="true" termVectors="true" termPositions="true"
> termOffsets="true" />
>  <field name="teaser"  type="string"    indexed="false"  stored="true"
>  multiValued="false" />
>  <field name="attachment_bodies_teaser"  type="string"    indexed="false"
>  stored="true"  multiValued="true" />
>  </fields>
>
>
> Excerpt from solrconfig.xml:
>
>  <requestHandler name="employee" class="solr.SearchHandler" >
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>     <str name="rows">10</str>
>     <str name="fl">id title teaser url doc_type doc_sub_type products
> date_last_updated version attachment_titles attachment_urls
> attachment_bodies_teaser access_level</str>
>     <str name="hl">true</str>
>     <!--str name="hl.fl">title indexed_content attachment_titles</str-->
>     <str name="hl.fl">title indexed_content attachment_titles
> attachment_bodies</str>
>     <str name="f.title.hl.useFastVectorHighlighter">false</str>
>     <str name="f.title.hl.fragsize">500</str>
>     <str name="f.title.hl.maxAnalyzedChars">500</str>
>     <str name="f.indexed_content.hl.useFastVectorHighlighter">true</str>
>     <str name="f.indexed_content.hl.fragsize">500</str>
>     <str name="f.indexed_content.hl.maxAnalyzedChars">1024000</str>
>     <str name="f.attachment_titles.hl.useFastVectorHighlighter">false</str>
>     <str name="f.attachment_titles.hl.fragsize">500</str>
>     <str name="f.attachment_titles.hl.maxAnalyzedChars">500</str>
> <!--
>     <str name="f.attachment_bodies.hl.useFastVectorHighlighter">false</str>
>     <str name="f.attachment_bodies.hl.fragsize">500</str>
>     <str name="f.attachment_bodies.hl.maxAnalyzedChars">51200</str>
> -->
>     <str name="facet">true</str>
>     <str name="facet.mincount">0</str>
>     <str name="facet.limit">-1</str>
>     <str name="defType">edismax</str>
>     <str name="qf">
>        title^15.0 indexed_content^1.0 attachment_titles^5.0
> attachment_bodies^1.0
>     </str>
>   </lst>
>  </requestHandler>
>
>
> -Shyam
>
> -----Original Message-----
> From: Dmitry Kan [mailto:dmitry....@gmail.com]
> Sent: Thursday, January 19, 2012 4:20 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Question on Reverse Indexing
>
> Oh, I see, I hadn't noticed that you use Solr 4.0. Luke can only read
> indexes up to 3.5 at the moment.
> So when you search with a leading wildcard, do both your app and the SOLR
> admin search give the same results?
>
> Could you show the relevant parts of your schema and solrconfig? For
> example, the type definition(s) from the schema and the searchers from
> solrconfig.
>
> Do you have any custom query parser implementations or any other custom
> components?
>
> On Thu, Jan 19, 2012 at 9:26 AM, Shyam Bhaskaran <
> shyam.bhaska...@synopsys.com> wrote:
>
> > Dimitry,
> >
> > I have used lukeall-3.5.0.jar and when trying to open the index it gives
> > me the error "No Valid Directory at the location, try another location"
> >
> > When using the below command I see this error "luke
> > java.lang.ArrayIndexOutOfBoundsException: 1"
> > java -cp C:\lukeall-3.5.0.jar org.getopt.luke.Luke -index
> > C:\solr\home\data\docs_index\index\
> >
> > We are using Solr 4.0
> >
> > -Shyam
> >
> > -----Original Message-----
> > From: Shyam Bhaskaran [mailto:shyam.bhaska...@synopsys.com]
> > Sent: Thursday, January 19, 2012 11:49 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Question on Reverse Indexing
> >
> > Dimitry,
> >
> > I downloaded Luke but it was not working for me against solr indexes.
> >
> > But using the solr analysis page I did not find any reversed sequences on
> > the field.
> >
> > -Shyam
> >
> >
> > -----Original Message-----
> > From: Shyam Bhaskaran [mailto:shyam.bhaska...@synopsys.com]
> > Sent: Thursday, January 19, 2012 6:29 AM
> > To: solr-user@lucene.apache.org
> > Subject: RE: Question on Reverse Indexing
> >
> > Dimitry,
> >
> > Completed a clean index and I still see the same behavior.
> >
> > I did not use Luke, but from the search page we use, leading wildcard
> > search is still working.
> >
> > -Shyam
> >
> > -----Original Message-----
> > From: Dmitry Kan [mailto:dmitry....@gmail.com]
> > Sent: Wednesday, January 18, 2012 5:07 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Question on Reverse Indexing
> >
> > Shyam,
> >
> > You still didn't say whether you started re-indexing from a clean index,
> > i.e. whether you removed all the data prior to re-indexing.
> > You can use Luke (http://code.google.com/p/luke/) to check the contents
> > of your text field and see if it still contains reversed sequences.
> >
> > On Wed, Jan 18, 2012 at 1:09 PM, Shyam Bhaskaran <
> > shyam.bhaska...@synopsys.com> wrote:
> >
> > > Dimitry,
> > >
> > > We are using Solr 4.0. To check for server caching issues I restarted
> > > our Tomcat web server after performing a re-index.
> > >
> > > For reverse indexing we defined a fieldType "text_rev", and this
> > > fieldType was used for the fields.
> > >
> > >  <fieldType name="text_rev" class="solr.TextField"
> > > sortMissingLast="true" omitNorms="true">
> > >     <analyzer type="index">
> > >                <tokenizer
> > > class="com.es.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
> > >                <filter class="solr.StopFilterFactory"
> > > words="stopwords.txt" ignoreCase="true"/>
> > >                <filter
> > > class="com.es.solr.backend.analysis.standard.SolvNetFilterFactory"/>
> > >                <filter class="solr.SynonymFilterFactory"
> > > synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > >                <filter
> > > class="com.es.solr.backend.analysis.standard.SpecialCharSynonymFilterFactory"/>
> > >                <filter class="solr.LowerCaseFilterFactory"/>
> > >                 <filter class="solr.ReversedWildcardFilterFactory"
> > > withOriginal="true"
> > >                maxPosAsterisk="3" maxPosQuestion="2"
> > > maxFractionAsterisk="0.33"/>
> > >      </analyzer>
> > >     <analyzer type="query">
> > >                <tokenizer
> > > class="com.es.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
> > >                <filter class="solr.StopFilterFactory"
> > > words="stopwords.txt" ignoreCase="true"/>
> > >                <filter
> > > class="com.es.solr.backend.analysis.standard.SolvNetFilterFactory"/>
> > >                <filter class="solr.LowerCaseFilterFactory"/>
> > >                <filter class="solr.StopFilterFactory"
> > > words="stopwords.txt" ignoreCase="true"/>
> > >     </analyzer>
> > >  </fieldType>
> > >
> > > But when it was found that ReversedWildcardFilterFactory is adding
> > > performance burden we removed the ReversedWildcardFilterFactory filter
> > >                 <filter class="solr.ReversedWildcardFilterFactory"
> > > withOriginal="true"
> > >                maxPosAsterisk="3" maxPosQuestion="2"
> > > maxFractionAsterisk="0.33"/>
> > > and the whole collection was re-indexed.
> > >
> > > But even after removing the ReversedWildcardFilterFactory leading wild
> > > card search like *lock is working.
> > >
> > > -Shyam
> > >
> > > -----Original Message-----
> > > From: Dmitry Kan [mailto:dmitry....@gmail.com]
> > > Sent: Wednesday, January 18, 2012 4:26 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Question on Reverse Indexing
> > >
> > > OK. I'm not sure what your system architecture is, but could your
> > > queries stay cached in some server caches even after you have
> > > re-indexed your data?
> > > The way the index-level leading wildcard works (reading SOLR 3.4 code,
> > > but it seems to be true circa 1.4) is that the following check is done
> > > for the analysis chain:
> > >
> > > [code src=SolrQueryParser.java]
> > > boolean allow = false;
> > > ...
> > >          if (factory instanceof ReversedWildcardFilterFactory) {
> > >            allow = true;
> > >            ...
> > >          }
> > > ...
> > >    if (allow) {
> > >      setAllowLeadingWildcard(true);
> > >    }
> > > [/code]
> > >
> > > So, practically, what you described can happen if the
> > > ReversedWildcardFilterFactory is still mentioned in one of your shards.
> > > A weird question, but have you re-indexed your data into a clean index
> > > or on top of the existing one?
> > >
> > > On Wed, Jan 18, 2012 at 12:35 PM, Shyam Bhaskaran <
> > > shyam.bhaska...@synopsys.com> wrote:
> > >
> > > > Dimitry,
> > > >
> > > > Using http://localhost:7070/solr/docs/admin/analysis.jsp I passed the
> > > > query *lock and did not find ReversedWildcardFilterFactory in the
> > > > index analyzer, or any other filter that could do the reversing.
> > > >
> > > > -Shyam
> > > >
> > > > -----Original Message-----
> > > > From: Dmitry Kan [mailto:dmitry....@gmail.com]
> > > > Sent: Wednesday, January 18, 2012 2:26 PM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: Question on Reverse Indexing
> > > >
> > > > Just to play it safe, can you double-check that the reversing is no
> > > > longer happening by issuing a query through the admin analysis page?
> > > >
> > > > Dmitry
> > > >
> > > > On Wed, Jan 18, 2012 at 4:23 AM, Shyam Bhaskaran <
> > > > shyam.bhaska...@synopsys.com> wrote:
> > > >
> > > > > Hi Francois,
> > > > >
> > > > > I understand that disabling ReversedWildcardFilterFactory has
> > > > > improved the performance.
> > > > >
> > > > > But I am puzzled over how a leading wildcard search like *lock is
> > > > > working even though I have now disabled the
> > > > > ReversedWildcardFilterFactory and the indexes have been created
> > > > > without ReversedWildcardFilter.
> > > > >
> > > > > How does reverse indexing work even after disabling
> > > > > ReversedWildcardFilterFactory?
> > > > >
> > > > > Can anyone explain to me how this feature is working?
> > > > >
> > > > > -Shyam
> > > > >
> > > > > -----Original Message-----
> > > > > From: François Schiettecatte [mailto:fschietteca...@gmail.com]
> > > > > Sent: Wednesday, January 18, 2012 7:49 AM
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: Question on Reverse Indexing
> > > > >
> > > > > Using ReversedWildcardFilterFactory will double the size of your
> > > > > dictionary (more or less); maybe the drop in performance that you
> > > > > are seeing is a result of that?
> > > > >
> > > > > François
> > > > >
> > > > > On Jan 17, 2012, at 9:01 PM, Shyam Bhaskaran wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > For reverse indexing we are using the
> > > > > > ReversedWildcardFilterFactory on Solr 4.0:
> > > > > >
> > > > > > <filter class="solr.ReversedWildcardFilterFactory"
> > > > > > withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2"
> > > > > > maxFractionAsterisk="0.33"/>
> > > > > >
> > > > > > ReversedWildcardFilterFactory was helping us to perform leading
> > > > > > wildcard searches like *lock.
> > > > > >
> > > > > > But we observed that the performance of the searches was not good
> > > > > > after introducing the ReversedWildcardFilterFactory filter.
> > > > > >
> > > > > > Hence we disabled the ReversedWildcardFilterFactory filter and
> > > > > > re-created the indexes, and this time we found Solr queries to be
> > > > > > faster.
> > > > > >
> > > > > > But surprisingly, leading wildcard searches were still working in
> > > > > > spite of disabling the ReversedWildcardFilterFactory filter.
> > > > > >
> > > > > > This behavior is puzzling everyone, and we want to know how
> > > > > > reverse indexing can still work.
> > > > > >
> > > > > > Can anyone explain this Solr behavior to me?
> > > > > >
> > > > > > -Shyam
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Dmitry Kan
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Dmitry Kan
> > >
> >
> >
> >
> > --
> > Regards,
> >
> > Dmitry Kan
> >
>
>
>
> --
> Regards,
>
> Dmitry Kan
>



-- 
Regards,

Dmitry Kan
