I am upgrading to Solr 6.6.3 and one of my fields uses text_en_splitting. Are there any recommendations on how to adjust the fieldtype definition for this field?
Thanks
Jay Potharaju

On Wed, Feb 7, 2018 at 5:09 AM, Steve Rowe <sar...@gmail.com> wrote:

> Thanks Webster,
>
> I created https://issues.apache.org/jira/browse/SOLR-11955 to work on this.
>
> --
> Steve
> www.lucidworks.com
>
> > On Feb 6, 2018, at 2:47 PM, Webster Homer <webster.ho...@sial.com> wrote:
> >
> > I noticed that in some of the current example schemas that are shipped with
> > Solr, there is a fieldtype, text_en_splitting, that feeds the output
> > of SynonymGraphFilterFactory into WordDelimiterGraphFilterFactory. So if
> > this isn't supported, the example should probably be updated or removed.
> >
> > On Mon, Feb 5, 2018 at 10:27 AM, Steve Rowe <sar...@gmail.com> wrote:
> >
> >> Hi Александр,
> >>
> >>> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> >>>
> >>> There should be no problem with using them together.
> >>
> >> I believe Shawn is wrong.
> >>
> >> From <http://lucene.apache.org/core/7_2_0/analyzers-common/org/apache/lucene/analysis/synonym/SynonymGraphFilter.html>:
> >>
> >>> NOTE: this cannot consume an incoming graph; results will be undefined.
> >>
> >> Unfortunately, the ref guide entry for Synonym Graph Filter
> >> <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#synonym-graph-filter>
> >> doesn’t include a warning about this, but it should, like
> >> the warning on Word Delimiter Graph Filter
> >> <https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html#word-delimiter-graph-filter>:
> >>
> >>> Note: although this filter produces correct token graphs, it cannot
> >>> consume an input token graph correctly.
> >>
> >> (I’ve just committed a change to the ref guide source to add this also on
> >> the Synonym Graph Filter and Managed Synonym Graph Filter entries, to be
> >> included in the ref guide for Solr 7.3.)
> >>
> >> In short, the combination of the two filters is not supported, because
> >> WDGF produces a token graph, which SGF cannot correctly interpret.
> >>
> >> Other filters also have this issue, see e.g.
> >> <https://issues.apache.org/jira/browse/LUCENE-3475> for ShingleFilter;
> >> this issue has gotten some attention recently, and hopefully it will
> >> inspire fixes elsewhere.
> >>
> >> Patches welcome!
> >>
> >> --
> >> Steve
> >> www.lucidworks.com
> >>
> >>
> >>> On Feb 5, 2018, at 11:19 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> >>>
> >>> On 2/5/2018 3:55 AM, Александр Шестак wrote:
> >>>>
> >>>> Hi, I have misunderstanding about usage of SynonymGraphFilterFactory
> >>>> and WordDelimiterGraphFilterFactory. Can they be used together?
> >>>>
> >>>
> >>> There should be no problem with using them together. But it is always
> >>> possible that the behavior will surprise you, while working 100% as
> >>> designed.
> >>>
> >>>> I have solr type configured in next way
> >>>>
> >>>> <fieldtype name="fulltext_en" class="solr.TextField"
> >>>>            autoGeneratePhraseQueries="true">
> >>>>   <analyzer type="index">
> >>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>     <filter class="solr.WordDelimiterGraphFilterFactory"
> >>>>             generateWordParts="1" generateNumberParts="1"
> >>>>             splitOnNumerics="1"
> >>>>             catenateWords="1" catenateNumbers="1" catenateAll="0"
> >>>>             preserveOriginal="1" protected="protwords_en.txt"/>
> >>>>     <filter class="solr.FlattenGraphFilterFactory"/>
> >>>>   </analyzer>
> >>>>   <analyzer type="query">
> >>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>     <filter class="solr.WordDelimiterGraphFilterFactory"
> >>>>             generateWordParts="1" generateNumberParts="1"
> >>>>             splitOnNumerics="1"
> >>>>             catenateWords="0" catenateNumbers="0" catenateAll="0"
> >>>>             preserveOriginal="1" protected="protwords_en.txt"/>
> >>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>     <filter class="solr.SynonymGraphFilterFactory"
> >>>>             synonyms="synonyms_en.txt" ignoreCase="true" expand="true"/>
> >>>>   </analyzer>
> >>>> </fieldtype>
> >>>>
> >>>> So on query time it uses
> >>>> SynonymGraphFilterFactory after WordDelimiterGraphFilterFactory.
> >>>> Synonyms are configured in next way:
> >>>> b=>b,boron
> >>>> 2=>ii,2
> >>>>
> >>>> Query in solr analysis tool looks so. It is shown that terms after SGF
> >>>> have positions 3 and 4. Is it correct? I thought that they should had
> >>>> 1 and 2 positions.
> >>>>
> >>>
> >>> What matters is the *relative* positions. The exact position number
> >>> doesn't matter much. Something new that the Graph implementations use
> >>> is the position length. That feature is necessary for multi-term
> >>> synonyms to function correctly in phrase queries.
> >>>
> >>> In your analysis screenshot, WDGF creates three tokens. The two tokens
> >>> created by splitting the input are at positions 1 and 2, which I think
> >>> is 100% as expected. It also sets the positionLength of the first term
> >>> to 2, probably because it has split that term into 2 additional terms.
> >>>
> >>> Then the SGF takes those last two terms and expands them. Each of the
> >>> synonyms is at the same position as the original term, and the relative
> >>> positions of the two synonym pairs have not changed -- the second one is
> >>> still one higher than the first. I think the reason that SGF moves the
> >>> positions two higher is because the positionLength on the "b2" term is
> >>> 2, previously set by WDGF. Someone with more knowledge about the Graph
> >>> implementations may have to speak up as to whether this behavior is
> >>> correct.
> >>>
> >>> Because the relative positions of the split terms don't change when SGF
> >>> runs, I think this is probably working as designed.
> >>>
> >>> Thanks,
> >>> Shawn
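
For readers following the discussion of positions and positionLength in the quoted thread, here is a minimal Python sketch of what a "token graph" means. It is a toy model and not Lucene's or Solr's API: the `Token` class and the `to_arcs` and `enumerate_paths` helpers are hypothetical names made up for illustration. The three tokens are the ones Shawn describes for the input "b2": the preserved original with positionLength 2, plus the two split parts at positions 1 and 2.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    """Toy stand-in for an analysis token (hypothetical, not a Lucene class)."""
    term: str
    position_increment: int   # distance from the previous token's position
    position_length: int = 1  # number of positions this token spans


def to_arcs(tokens: List[Token]) -> List[Tuple[str, int, int]]:
    """Turn a token stream into (term, from_node, to_node) arcs."""
    arcs = []
    position = 0
    for tok in tokens:
        position += tok.position_increment
        arcs.append((tok.term, position, position + tok.position_length))
    return arcs


def enumerate_paths(arcs: List[Tuple[str, int, int]]) -> List[List[str]]:
    """List every term sequence that leads from the first node to the last."""
    start = min(a[1] for a in arcs)
    end = max(a[2] for a in arcs)

    def walk(node: int) -> List[List[str]]:
        if node == end:
            return [[]]
        found = []
        for term, src, dst in arcs:
            if src == node:
                for rest in walk(dst):
                    found.append([term] + rest)
        return found

    return walk(start)


# Tokens as described in the thread for the input "b2" (values assumed
# from Shawn's reading of the analysis screenshot).
wdgf_output = [
    Token("b2", position_increment=1, position_length=2),  # preserved original
    Token("b", position_increment=0),                      # same position as "b2"
    Token("2", position_increment=1),                      # one position later
]

arcs = to_arcs(wdgf_output)
paths = enumerate_paths(arcs)
print(arcs)
print(paths)
```

In this model the original term and the pair of split terms are two alternative paths between the same start and end nodes. That is the structure a downstream filter has to understand in order to consume a graph, and it is what the warnings Steve quotes say SynonymGraphFilter and WordDelimiterGraphFilter cannot do with their input.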