SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Raúl Cardozo
I'm migrating from 3.x to 4.x and I'm running some queries to verify that
everything works like before. I've found however that the query galaxy s3
is giving much less results. In 3.x numFound=1628, in 4.x numFound=70.

Here's the relevant schema part:

<fieldtype name="text_pt" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="-" replacement="IIIHYPHENIII"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="IIIHYPHENIII" replacement="-"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" preserveOriginal="1"
            catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false"
            words="portugueseStopWords.txt"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="-" replacement="IIIHYPHENIII"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="IIIHYPHENIII" replacement="-"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" ignoreCase="true"
            synonyms="portugueseSynonyms.txt" expand="true"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            preserveOriginal="1" catenateNumbers="0" catenateAll="0"
            protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false"
            words="portugueseStopWords.txt"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>

The synonyms involved in this query are:

siii, s3
galaxy, galax

My default search operator is AND (in both versions, even if it's
deprecated in 4.x), and the output of the debug is:

SOLR 3.x

<str name="parsedquery">+(title_search_pt:galaxy
title_search_pt:galax) +MultiPhraseQuery(title_search_pt:"(sii s3 s)
3")</str>

SOLR 4.x

<str name="parsedquery">+((title_search_pt:galaxy
title_search_pt:galax)/no_coord) +(+title_search_pt:sii
+title_search_pt:s3 +title_search_pt:s +title_search_pt:3)</str>

The weird thing is that it does not return results like 'galaxy s3'. This
is the debug query:

no match on required clause (+title_search_pt:sii +title_search_pt:s3
+title_search_pt:s +title_search_pt:3)
(NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s), *no
match on required clause (title_search_pt:sii)*
(NON-MATCH) no matching term
(MATCH) weight(title_search_pt:s3 in 1834535)
(MATCH) weight(title_search_pt:s in 1834535)
(MATCH) weight(title_search_pt:3 in 1834535)

How is it that sii is *required* when it should be OR'ed with s and s3?

The analysis output shows that sii has token position 2, like its
synonyms, like so:

position 1   position 2   position 3
galaxy       sii          3
galax        s3
             s

Thanks,

Raúl Cardozo.


Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Chris Hostetter

: I'm migrating from 3.x to 4.x and I'm running some queries to verify that
: everything works like before. I've found however that the query galaxy s3
: is giving much less results. In 3.x numFound=1628, in 4.x numFound=70.

is your entire schema 100% identical in both cases?
what is the luceneMatchVersion set to in your solrconfig.xml?


By the looks of your debug output, it appears that you are using 
autoGeneratePhraseQueries=true in 3x, but have it set to false in 4x -- 
but the fieldType you posted here shows it set to false

: <fieldtype name="text_pt" class="solr.TextField"
: positionIncrementGap="100" autoGeneratePhraseQueries="false">

...i haven't tried to reproduce your specific situation, but that 
configuration doesn't smell right compared with what you are showing for 
the 3x output...

: SOLR 3.x
: 
: <str name="parsedquery">+(title_search_pt:galaxy
: title_search_pt:galax) +MultiPhraseQuery(title_search_pt:"(sii s3 s)
: 3")</str>
: 
: SOLR 4.x
: 
: <str name="parsedquery">+((title_search_pt:galaxy
: title_search_pt:galax)/no_coord) +(+title_search_pt:sii
: +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)</str>


-Hoss


Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Fermin Silva
Whether or not I like the behaviour we are getting in 3.x, I'm required to
keep everything working as close as possible to how it did before.

I have no idea why this is happening, but setting autoGeneratePhraseQueries
to true solved the issue; now I get exactly the same number of results from
both versions!

I wouldn't bother checking why that was so since we'll be moving away from
the older version, which shows the inconsistency.

But thanks a million.

If you have a SO user I can mark yours as answer here:
http://stackoverflow.com/questions/18661996/solr-4-x-vs-3-x-parsedquery-differences

Cheers
On Sep 6, 2013 4:15 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : Our schema is identical except the version.
 : In 3.x it's 1.1 and in 4.x it's 1.5.

 That's kind of a significant difference to leave out -- independent of the
 question you are asking about here, it's going to make quite a few
 differences in how things are being parsed, and what defaults are.

 If i'm understanding correctly: you like the behavior you are getting from
 Solr 3.x where phrases are generated automatically for you.

 what i can't understand, is how/why phrases are being generated
 automatically for you if you have that 'autoGeneratePhraseQueries=false'
 on your fieldType in your 3x schema ... that makes no sense to me.

 if you didn't have autoGeneratePhraseQueries specified at all, then the
 'version=1.1' would explain it (up to version=1.3, the default for
 autoGeneratePhraseQueries was true, but in version=1.4 and above, it
 defaults to false)  but with an explicit
 'autoGeneratePhraseQueries=false' i can't explain why 3x works the way
 you say it works for you.

 Bottom line: if you *want* the auto generated phrase query behavior
 in 4.x, you should just set 'autoGeneratePhraseQueries=true' on your
 fieldType.




 -Hoss



Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Chris Hostetter

: Our schema is identical except the version.
: In 3.x it's 1.1 and in 4.x it's 1.5.

That's kind of a significant difference to leave out -- independent of the
question you are asking about here, it's going to make quite a few
differences in how things are being parsed, and what defaults are.

If i'm understanding correctly: you like the behavior you are getting from 
Solr 3.x where phrases are generated automatically for you.

what i can't understand, is how/why phrases are being generated 
automatically for you if you have that 'autoGeneratePhraseQueries=false' 
on your fieldType in your 3x schema ... that makes no sense to me.

if you didn't have autoGeneratePhraseQueries specified at all, then the 
'version=1.1' would explain it (up to version=1.3, the default for 
autoGeneratePhraseQueries was true, but in version=1.4 and above, it 
defaults to false)  but with an explicit 
'autoGeneratePhraseQueries=false' i can't explain why 3x works the way 
you say it works for you.

Bottom line: if you *want* the auto generated phrase query behavior 
in 4.x, you should just set 'autoGeneratePhraseQueries=true' on your 
fieldType.
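
For reference, the change described above might look like this in the 4.x
schema.xml. This is only a sketch: it reuses the text_pt name and attributes
from the original post, and the analyzer chains are left as a comment rather
than repeated.

```xml
<!-- schema.xml (Solr 4.x). Only the autoGeneratePhraseQueries value differs
     from the fieldtype posted at the top of the thread. -->
<fieldtype name="text_pt" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <!-- index and query analyzers unchanged from the original post -->
</fieldtype>
```

With this in place, the debug parsedquery for galaxy s3 should again show a
MultiPhraseQuery clause, as in the 3.x output.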




-Hoss


Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Shawn Heisey

On 9/6/2013 12:46 PM, Fermin Silva wrote:

> Our schema is identical except the version.
> In 3.x it's 1.1 and in 4.x it's 1.5.
>
> Also in solrconfig.xml we have no lucene version for 3.x (so it's using 2_4
> i believe) and in 4.x we fixed it to 4_4.


The autoGeneratePhraseQueries parameter didn't exist before schema 
version 1.4.


I'm fairly sure that for your schema that is at version 1.1, the 
autoGeneratePhraseQueries value specified in the field definition will 
be ignored and the actual value that gets used will be true, which 
goes along with what Hoss has said.


See the comment about the version in the example schema on any 4.x Solr 
download.
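
As a rough illustration of where that version lives: it is the version
attribute on the root element of schema.xml. The name value below is a
placeholder, and the comment only restates the defaults described in this
thread.

```xml
<!-- schema.xml root element; name="example" is a placeholder -->
<schema name="example" version="1.5">
  <!-- Per this thread: with version 1.3 or lower, autoGeneratePhraseQueries
       is effectively true; with 1.4 or higher it defaults to false unless
       set explicitly on the fieldType. -->
</schema>
```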


Thanks,
Shawn



Re: SOLR 4.x vs 3.x parsedquery differences

2013-09-06 Thread Fermin Silva
Hi,

Our schema is identical except the version.
In 3.x it's 1.1 and in 4.x it's 1.5.

Also in solrconfig.xml we have no lucene version for 3.x (so it's using 2_4
i believe) and in 4.x we fixed it to 4_4.
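
For anyone comparing configs, the setting in question sits in solrconfig.xml.
A minimal sketch, assuming the LUCENE_44 spelling used in the stock 4.4
example config (written above as 4_4); the rest of the file is omitted.

```xml
<!-- solrconfig.xml (Solr 4.x). LUCENE_44 is the spelling assumed here. -->
<config>
  <luceneMatchVersion>LUCENE_44</luceneMatchVersion>
  <!-- rest of the configuration omitted -->
</config>
```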

Thanks
On Sep 6, 2013 3:34 PM, Chris Hostetter hossman_luc...@fucit.org wrote:


 : I'm migrating from 3.x to 4.x and I'm running some queries to verify that
 : everything works like before. I've found however that the query galaxy
 s3
 : is giving much less results. In 3.x numFound=1628, in 4.x numFound=70.

 is your entire schema 100% identical in both cases?
 what is the luceneMatchVersion set to in your solrconfig.xml?


 By the looks of your debug output, it appears that you are using
 autoGeneratePhraseQueries=true in 3x, but have it set to false in 4x --
 but the fieldType you posted here shows it set to false

 : <fieldtype name="text_pt" class="solr.TextField"
 : positionIncrementGap="100" autoGeneratePhraseQueries="false">

 ...i haven't tried to reproduce your specific situation, but that
 configuration doesn't smell right compared with what you are showing for
 the 3x output...

 : SOLR 3.x
 :
 : <str name="parsedquery">+(title_search_pt:galaxy
 : title_search_pt:galax) +MultiPhraseQuery(title_search_pt:"(sii s3 s)
 : 3")</str>
 :
 : SOLR 4.x
 :
 : <str name="parsedquery">+((title_search_pt:galaxy
 : title_search_pt:galax)/no_coord) +(+title_search_pt:sii
 : +title_search_pt:s3 +title_search_pt:s +title_search_pt:3)</str>


 -Hoss