Index not respecting Omit Norms
Please reference the images below:

http://lucene.472066.n3.nabble.com/file/n4153863/Schema.png
http://lucene.472066.n3.nabble.com/file/n4153863/SolrDescriptionSchemaBrowser.png
http://lucene.472066.n3.nabble.com/file/n4153863/SolrDescriptionDebugResults.png

As you can see from the first image, the text field type doesn't define the omitNorms flag, meaning it is set to false. Also on the first image you can see that the description field doesn't define the omitNorms flag, again meaning it is set to false. (The default for omitNorms is false.) This can all be confirmed on the second image, where the Properties and Schema rows have Omit Norms unchecked.

I am having some issues understanding why some results have a fieldNorm of 1 for matches on the description field. As you can see from the third image, the description field has a rather large number of terms in it, yet the fieldNorm is 1.0 for a match on 'supply' in the description field. My guess is that the Omit Norms flag in the 'Index' row is causing the issue.

Questions:

1. From the second picture, can anyone tell me what each row (Properties, Schema and Index) refers to? I think the Properties row refers to the flags set when defining the field type, which for this field is text, and the Schema row refers to the flags set when defining the field, which is description. I'm less sure where the Index row flags come from, but I'm assuming it reflects what the index is really representing?

2. Am I right in assuming the Omit Norms flag in the Index row of the second picture is what is causing the fieldNorm issues in the third image?

3. If I am correct in the above question, how do I fix it?

Additional information:

- I am not using the standard request handler; I am using a custom request handler that uses eDisMax.
- The description_sortAlpha field that the description field is copied to is a text field, *but* it has omitNorms set to true.
- My index analyzers for the description field are: WhitespaceTokenizerFactory, StopFilterFactory, WordDelimiterFilterFactory, LowerCaseFilterFactory and RemoveDuplicatesTokenFilterFactory, in that order.
- My query analyzers for the description field are: WhitespaceTokenizerFactory, SynonymFilterFactory, StopFilterFactory, WordDelimiterFilterFactory, LowerCaseFilterFactory and RemoveDuplicatesTokenFilterFactory, in that order.
- The description field is not the only text field having this Omit Norms issue in the Index row. There are actually a couple of others.

Thanks,
-Tim
Re: Index not respecting Omit Norms
: As you can see from the first image, the text field-type doesn't define the
: omitNorms flag, meaning it is set to false. Also on the first image you can
: see that the description field doesn't define the omitNorms flag, again
: meaning it is set to false. (The default for omitNorms is false.) This can all
	...
: I am having some issues understanding why some results have a fieldNorm set
: to 1 for matches on the description field. As you can see from the third
	...
: From the second picture, can anyone tell me what each row (Properties, Schema
: and Index) refers to? I think the Properties row refers to the flags set
: when defining the Field Type, which for this field is text. The Schema row
: refers to the flags set when defining the field, which is description. I'm
: not as sure where the Index row flags come from, but I'm assuming it defines
: what the index is really representing?

: Am I right in assuming the Omit Norms flag in the Index row of the second
: picture is what is causing fieldNorm issues in the third image?

: If I am correct in the above question, how do I fix it?

From a quick glance at the UI JavaScript code (and the underlying LukeRequestHandler) I'm honestly not sure what the intended difference is between the Properties row and the Schema row.

I can tell you that the Index row represents what information about the field can actually be extracted from the underlying index itself -- completely independently from the schema. The fact that Omit Norms is checked in that row means that there is at least one document in your index that was indexed with omitNorms=true.

Most likely what happened is that you indexed a bunch of docs with omitNorms=true in your schema.xml, then later changed your schema to use norms, but those docs are still there in the index.

-Hoss
http://www.lucidworks.com/
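(Side note for anyone debugging the same thing: the per-field flags the Schema Browser shows come from the LukeRequestHandler, so you can query it directly, e.g. assuming a default install and a core named collection1:

  http://localhost:8983/solr/collection1/admin/luke?fl=description

The "schema" entry in the response is derived from schema.xml, while the "index" entry is read from the index itself; an Omit Norms mismatch between the two confirms stale documents that need reindexing.)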
Re: Norms
On Jul 10, 2013, at 4:39 AM, Daniel Collins danwcoll...@gmail.com wrote:

> QueryNorm is what I'm still trying to get to the bottom of exactly :)

If you have not seen it, some reading from the past here…
https://issues.apache.org/jira/browse/LUCENE-1896

- Mark
Re: Norms
Thanks. Yeah, I don't really want the queryNorm.

On Wed, Jul 10, 2013 at 2:39 AM, Daniel Collins danwcoll...@gmail.com wrote:
> I don't know the full answer to your question, but here's what I can offer.
> Solr offers 2 types of normalisation, FieldNorm and QueryNorm. FieldNorm is,
> as the name suggests, field-level normalisation, based on the length of the
> field, and can be controlled by the omitNorms parameter on the field. [...]

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076
Re: Norms
Norms stay in the index even if you delete all of the data. If you just changed the schema, emptied the index, and tested again, you've still got norms in there. You can examine the index with Luke to verify this.

On 07/09/2013 08:57 PM, William Bell wrote:
> I have a field that has omitNorms=true, but when I look at debugQuery I see
> that the field is being normalized for the score. What can I do to turn off
> normalization in the score? [...]
Re: Norms
I don't know the full answer to your question, but here's what I can offer. Solr offers 2 types of normalisation, FieldNorm and QueryNorm.

FieldNorm is, as the name suggests, field-level normalisation, based on the length of the field, and can be controlled by the omitNorms parameter on the field. In your example, fieldNorm is always 1.0 (see below), so that suggests you have correctly turned off field normalisation on the name_edgy field.

  1.0 = fieldNorm(field=name_edgy, doc=231378)

QueryNorm is what I'm still trying to get to the bottom of exactly :) But it's something that tries to normalise the results of different term queries so they are broadly comparable. You haven't supplied the query you've run, but based on the qf and bf, I'm assuming it breaks down into a DisMax query on 3 fields (name_edgy, name_edge, name_word), so queryNorm is trying to ensure that the results of those 3 queries can be compared. The exact details of it I'm still trying to get to the bottom of (any volunteers with more info chip in!). From earlier answers to the list, queryNorm is calculated in the Similarity object; I need to dig further, but that's probably a good place to start.

On 10 July 2013 04:57, William Bell billnb...@gmail.com wrote:
> I have a field that has omitNorms=true, but when I look at debugQuery I see
> that the field is being normalized for the score. What can I do to turn off
> normalization in the score? [...]
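(A concrete way to get "only tf() and no queryNorm", as the original post asks, is to plug in a custom Similarity. The sketch below is for Lucene/Solr 3.x; the package and class names are made up, and in 4.x DefaultSimilarity lives under org.apache.lucene.search.similarities instead:)

  package com.example;

  import org.apache.lucene.search.DefaultSimilarity;

  // Hypothetical sketch: neutralise queryNorm (and optionally idf) so
  // scores are driven by tf and explicit boosts only.
  public class NoQueryNormSimilarity extends DefaultSimilarity {
      @Override
      public float queryNorm(float sumOfSquaredWeights) {
          return 1.0f; // no per-query score scaling
      }

      // Uncomment to also drop the rare-term (idf) boost:
      // @Override
      // public float idf(int docFreq, int numDocs) {
      //     return 1.0f;
      // }
  }

It would then be registered with a top-level <similarity class="com.example.NoQueryNormSimilarity"/> element in schema.xml.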
Norms
I have a field that has omitNorms=true, but when I look at debugQuery I see that the field is being normalized for the score. What can I do to turn off normalization in the score?

I want a simple way to do 2 things: boost geodist() highest at 1 mile and lowest at 100 miles, plus add a boost for a query=edgefield^5. I only want tf() and no queryNorm. I am not even sure I want idf(), but I can probably live with rare names being boosted. The results are being normalized. See below. I tried dismax and edismax - bf, bq and boost.

<requestHandler name="autoproviderdist" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">none</str>
    <str name="defType">edismax</str>
    <float name="tie">0.01</float>
    <str name="fl">display_name,city_state,prov_url,pwid,city_state_alternative</str>
    <!-- <str name="bq">_val_:sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)^10</str> -->
    <str name="boost">sum(recip(geodist(store_geohash), .5, 6, 6), 0.1)</str>
    <int name="rows">5</int>
    <str name="q.alt">*:*</str>
    <str name="qf">name_edgy^.9 name_edge^.9 name_word</str>
    <str name="group">true</str>
    <str name="group.field">pwid</str>
    <str name="group.main">true</str>
    <!-- <str name="pf">name_edgy</str> do not turn on -->
    <str name="sort">score desc, last_name asc</str>
    <str name="d">100</str>
    <str name="pt">39.740112,-104.984856</str>
    <str name="sfield">store_geohash</str>
    <str name="hl">false</str>
    <str name="hl.fl">name_edgy</str>
    <str name="mm">2<-1 4<-2 6<-3</str>
  </lst>
</requestHandler>

0.058555886 = queryNorm

product of:
  10.854807 = (MATCH) sum of:
    1.8391232 = (MATCH) max plus 0.01 times others of:
      1.8214592 = (MATCH) weight(name_edge:paul^0.9 in 231378), product of:
        0.30982485 = queryWeight(name_edge:paul^0.9), product of:
          0.9 = boost
          5.8789964 = idf(docFreq=26567, maxDocs=3493655)
          *0.058555886 = queryNorm*
        5.8789964 = (MATCH) fieldWeight(name_edge:paul in 231378), product of:
          1.0 = tf(termFreq(name_edge:paul)=1)
          5.8789964 = idf(docFreq=26567, maxDocs=3493655)
          1.0 = fieldNorm(field=name_edge, doc=231378)
      1.7664119 = (MATCH) weight(name_edgy:paul^0.9 in 231378), product of:
        0.30510724 = queryWeight(name_edgy:paul^0.9), product of:
          0.9 = boost
          5.789479 = idf(docFreq=29055, maxDocs=3493655)
          *0.058555886 = queryNorm*
        5.789479 = (MATCH) fieldWeight(name_edgy:paul in 231378), product of:
          1.0 = tf(termFreq(name_edgy:paul)=1)
          5.789479 = idf(docFreq=29055, maxDocs=3493655)
          1.0 = fieldNorm(field=name_edgy, doc=231378)
    9.015684 = (MATCH) max plus 0.01 times others of:
      8.9352665 = (MATCH) weight(name_word:nutting in 231378), product of:
        0.72333425 = queryWeight(name_word:nutting), product of:
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          0.058555886 = queryNorm
        12.352887 = (MATCH) fieldWeight(name_word:nutting in 231378), product of:
          1.0 = tf(termFreq(name_word:nutting)=1)
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          1.0 = fieldNorm(field=name_word, doc=231378)
      8.04174 = (MATCH) weight(name_edgy:nutting^0.9 in 231378), product of:
        0.65100086 = queryWeight(name_edgy:nutting^0.9), product of:
          0.9 = boost
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          *0.058555886 = queryNorm*
        12.352887 = (MATCH) fieldWeight(name_edgy:nutting in 231378), product of:
          1.0 = tf(termFreq(name_edgy:nutting)=1)
          12.352887 = idf(docFreq=40, maxDocs=3493655)
          1.0 = fieldNorm(field=name_edgy, doc=231378)
  1.0855998 = sum(6.0/(0.5*float(geodist(39.74168747663498,-104.9849385023117,39.740112,-104.984856))+6.0),const(0.1))

--
Bill Bell
billnb...@gmail.com
cell 720-256-8076
Re: Why would solr norms come up different from Lucene norms?
Which Similarity class do you use for the Lucene code? Solr has a custom one.

On Fri, May 4, 2012 at 6:30 AM, Benson Margulies bimargul...@gmail.com wrote:
> So, I've got some code that stores the same documents in a Lucene 3.5.0
> index and a Solr 3.5.0 instance. It's only five documents. For a particular
> field, the Solr norm is always 0.625, while the Lucene norm is .5. [...]

--
Lance Norskog
goks...@gmail.com
Re: Why would solr norms come up different from Lucene norms?
On Sat, May 5, 2012 at 7:59 PM, Lance Norskog goks...@gmail.com wrote:
> Which Similarity class do you use for the Lucene code? Solr has a custom one.

I am embarrassed to report that I also have a custom similarity that I didn't know about, and once I configured that into Solr all was well.
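(For the archives: in Solr 3.x a custom Similarity like the one Benson mentions is wired in with a single top-level element in schema.xml; the class name here is just a placeholder:)

  <similarity class="com.example.MyCustomSimilarity"/>

Without that element, Solr falls back to Lucene's DefaultSimilarity, which is why the standalone Lucene code and the Solr instance computed different norms here.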
Why would solr norms come up different from Lucene norms?
So, I've got some code that stores the same documents in a Lucene 3.5.0 index and a Solr 3.5.0 instance. It's only five documents. For a particular field, the Solr norm is always 0.625, while the Lucene norm is .5.

I've watched the code in NormsWriterPerField in both cases. In Solr we've got .577, in naked Lucene it's .5. I tried to check for boosts, and I don't see any non-1.0 document or field boosts.

The Solr field is:

<field name="bt_rni_NameHRK_encodedName" type="text_ws" indexed="true" stored="true" multiValued="false" />
RE: [Solr-3.4] Norms file size is large in case of many unique indexed fields in index
Thank you guys for the responses.

Some background on the task: the problem we are trying to solve with Solr is the following. We have to provide a full-text search over documents that partially consist of fields that are always there and partially of additional metadata as key-value pairs where the keys are not known beforehand. Yet we need to be able to search on the content of that additional metadata. Because we have to provide FTS abilities we have used Solr and not a HashMap or some BigTable. To address the optionality of additional metadata fields and their searchability we have decided to use Solr indexed dynamic fields.

Questions:

1. Yonik, will your approach work for us with the following data:

   doc1 uniqueFields:[100=boo foo roo, 101=bar bar 100 boo]
   doc2 uniqueFields:[101=boo roo, 102=bar foo 101 boo]

   where we want to fetch documents that contain the value 'foo' in metadata with field key 100? (That is, only doc1 should be returned.)

2. Should I post an issue to JIRA about the large index size, or is it expected behaviour in our case?

Thanks,
Ivan

On Thu, Nov 10, 2011 at 10:22 PM, Yonik Seeley yo...@lucidimagination.com wrote:
> You might be able to turn this around and encode the unique field
> information in a multi-valued field [...]
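(A sketch of how question 1 could be made to work with Yonik's scheme, assuming each key=value word is flattened into its own token at indexing time rather than indexing each pair as one long string:

   doc1 uniqueFields: [100=boo, 100=foo, 100=roo, 101=bar, 101=100, 101=boo]
   doc2 uniqueFields: [101=boo, 101=roo, 102=bar, 102=foo, 102=101, 102=boo]

   query: uniqueFields:100=foo

then matches doc1 but not doc2, since doc2's 'foo' is only indexed under key 102.)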
[Solr-3.4] Norms file size is large in case of many unique indexed fields in index
Hello everyone,

We have a large index size in case norms are enabled.

schema.xml type declaration:

<fieldType name="simpleTokenizer" class="solr.TextField" positionIncrementGap="100" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
</fieldType>

fields declaration:

<field name="id" stored="true" indexed="true" required="true" type="string" />
<field name="name" stored="true" indexed="true" type="string" />
<dynamicField name="unique_*" stored="false" indexed="true" type="simpleTokenizer" multiValued="false" />

For 5000 documents (every document has 2 unique fields, 2*5000=10000 unique fields in index), the index size is 48.24 MB. But if we enable omitting norms (omitNorms=true), the index size is 0.56 MB. Next, if we increase the number of unique fields per document to 3 (3*5000=15000 unique fields in index) we receive: 72.23 MB and 0.70 MB respectively. And if we increase the number of documents to 10000 (3*10000=30000 unique fields in index) we receive: 287.54 MB and 1.44 MB respectively.

We've prepared a test application to reproduce the mentioned behavior. It can be downloaded here: https://bitbucket.org/coldserenity/solr-large-index-with-norms

Could anyone point out if the size of the index is as expected in the mentioned cases? And if it is, what configuration can be applied to reduce the size of the index?

Thank you in advance,
Ivan
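(The sizes are consistent with how pre-4.x Lucene stores norms, assuming the non-sparse .nrm format of one byte per document for every field that has norms, whether or not a given document uses that field:

   10000 fields x 5000 docs  = 50,000,000 bytes  ~ 47.7 MB  (observed 48.24 MB)
   15000 fields x 5000 docs  = 75,000,000 bytes  ~ 71.5 MB  (observed 72.23 MB)
   30000 fields x 10000 docs = 300,000,000 bytes ~ 286.1 MB (observed 287.54 MB)

So with many sparse unique fields, norms space grows as fields x documents.)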
Re: [Solr-3.4] Norms file size is large in case of many unique indexed fields in index
What is the point of a unique indexed field? If, for all of your fields, there is only one possible document, you don't need length normalization, scoring, or a search engine at all... just use a HashMap?

On Thu, Nov 10, 2011 at 7:42 AM, Ivan Hrytsyuk ihryts...@softserveinc.com wrote:
> Hello everyone, We have large index size in case norms are enabled. [...]

--
lucidimagination.com
Re: [Solr-3.4] Norms file size is large in case of many unique indexed fields in index
On Thu, Nov 10, 2011 at 7:42 AM, Ivan Hrytsyuk ihryts...@softserveinc.com wrote:
> For 5000 documents (every document has 2 unique fields, 2*5000=10000 unique
> fields in index), index size is 48.24 MB.

You might be able to turn this around and encode the unique field information in a multi-valued field. For example, instead of

  myUniqueField100:foo
  myUniqueField101:bar

you could do

  uniqueFields:[100=foo, 101=bar]

The exact details depend on how you are going to use/query these fields of course.

-Yonik
http://www.lucidimagination.com
Re: Norms - scoring issue
It seems that the fieldNorm difference is coming from the field named 'text', and you didn't include the definition of the text field. Did you omit norms for that field too?

By the way, I see that you have store=true in some places but it should be store*d*=true.

--- On Wed, 9/14/11, Adolfo Castro Menna adolfo.castrome...@gmail.com wrote:
> Hi All, I hope someone could shed some light on the issue I'm facing with
> solr 3.1.0. It looks like it's computing different fieldNorm values despite
> my configuration that aims to ignore it. [...]
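(The two fieldNorm values are themselves consistent with DefaultSimilarity's length norm of 1/sqrt(numTerms) after byte encoding: 0.25 is what a 16-term field encodes to, and 0.21875 is the next representable step down, corresponding to roughly 19-20 terms. So the catch-all 'text' field is simply a bit longer in doc 81 than in doc 69.)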
Re: Norms - scoring issue
Hi Ahmet,

You're right. It was related to the text field, which is the default search field. I also added omitNorms=true in the fieldtype definition and it's now working as expected.

Thanks,
Adolfo.
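(For reference, the fix Adolfo describes looks something like the following; the type name and analyzer are placeholders, not his actual schema:)

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory" />
    </analyzer>
  </fieldType>

Setting omitNorms on the fieldType makes every field of that type norm-free unless an individual field declaration overrides it.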
Norms - scoring issue
Hi All,

I hope someone could shed some light on the issue I'm facing with solr 3.1.0. It looks like it's computing different fieldNorm values despite my configuration that aims to ignore it.

<field name="item_name" type="textgen" indexed="true" store="true" omitNorms="true" omitTermFrequencyAndPositions="true" />
<field name="item_description" type="textTight" indexed="true" store="true" omitNorms="true" omitTermFrequencyAndPositions="true" />
<field name="item_tags" type="text" indexed="true" stored="true" multiValued="true" omitNorms="true" omitTermFrequencyAndPositions="true" />

I also have a custom class that extends DefaultSimilarity to override the idf method.

Query:

<str name="q">item_name:octopus seafood OR item_description:octopus seafood OR item_tags:octopus seafood</str>
<str name="sort">score desc,item_ranking desc</str>

The first 2 results are:

<doc>
  <float name="score">0.5217492</float>
  <str name="item_name">Grilled Octopus</str>
  <arr name="item_tags"><str>Seafood, tapas</str></arr>
</doc>
<doc>
  <float name="score">0.49379835</float>
  <str name="item_name">octopus marisco</str>
  <arr name="item_tags"><str>Appetizer, Mexican, Seafood, food</str></arr>
</doc>

Does anyone know why they get a different score? I'm expecting them to have the same scoring because both matched the two search terms. I checked the debug information and it seems that the difference involves the fieldNorm values.

1) Grilled Octopus

0.52174926 = (MATCH) product of:
  0.7826238 = (MATCH) sum of:
    0.4472136 = (MATCH) weight(item_name:octopus in 69), product of:
      0.4472136 = queryWeight(item_name:octopus), product of:
        1.0 = idf(docFreq=2, maxDocs=449)
        0.4472136 = queryNorm
      1.0 = (MATCH) fieldWeight(item_name:octopus in 69), product of:
        1.0 = tf(termFreq(item_name:octopus)=1)
        1.0 = idf(docFreq=2, maxDocs=449)
        1.0 = fieldNorm(field=item_name, doc=69)
    0.1118034 = (MATCH) weight(text:seafood in 69), product of:
      0.4472136 = queryWeight(text:seafood), product of:
        1.0 = idf(docFreq=8, maxDocs=449)
        0.4472136 = queryNorm
      0.25 = (MATCH) fieldWeight(text:seafood in 69), product of:
        1.0 = tf(termFreq(text:seafood)=1)
        1.0 = idf(docFreq=8, maxDocs=449)
        0.25 = fieldNorm(field=text, doc=69)
    0.1118034 = (MATCH) weight(text:seafood in 69), product of:
      0.4472136 = queryWeight(text:seafood), product of:
        1.0 = idf(docFreq=8, maxDocs=449)
        0.4472136 = queryNorm
      0.25 = (MATCH) fieldWeight(text:seafood in 69), product of:
        1.0 = tf(termFreq(text:seafood)=1)
        1.0 = idf(docFreq=8, maxDocs=449)
        0.25 = fieldNorm(field=text, doc=69)
    0.1118034 = (MATCH) weight(text:seafood in 69), product of:
      0.4472136 = queryWeight(text:seafood), product of:
        1.0 = idf(docFreq=8, maxDocs=449)
        0.4472136 = queryNorm
      0.25 = (MATCH) fieldWeight(text:seafood in 69), product of:
        1.0 = tf(termFreq(text:seafood)=1)
        1.0 = idf(docFreq=8, maxDocs=449)
        0.25 = fieldNorm(field=text, doc=69)
  0.667 = coord(4/6)

2) octopus marisco

0.49379835 = (MATCH) product of:
  0.7406975 = (MATCH) sum of:
    0.4472136 = (MATCH) weight(item_name:octopus in 81), product of:
      0.4472136 = queryWeight(item_name:octopus), product of:
        1.0 = idf(docFreq=2, maxDocs=449)
        0.4472136 = queryNorm
      1.0 = (MATCH) fieldWeight(item_name:octopus in 81), product of:
        1.0 = tf(termFreq(item_name:octopus)=1)
        1.0 = idf(docFreq=2, maxDocs=449)
        1.0 = fieldNorm(field=item_name, doc=81)
    0.09782797 = (MATCH) weight(text:seafood in 81), product of:
      0.4472136 = queryWeight(text:seafood), product of:
        1.0 = idf(docFreq=8, maxDocs=449)
        0.4472136 = queryNorm
      0.21875 = (MATCH) fieldWeight(text:seafood in 81), product of:
        1.0 = tf(termFreq(text:seafood)=1)
        1.0 = idf(docFreq=8, maxDocs=449)
        0.21875 = fieldNorm(field=text, doc=81)
    0.09782797 = (MATCH) weight(text:seafood in 81), product of:
      0.4472136 = queryWeight(text:seafood), product of:
        1.0 = idf(docFreq=8, maxDocs=449)
        0.4472136 = queryNorm
      0.21875 = (MATCH) fieldWeight(text:seafood in 81), product of:
        1.0 = tf(termFreq(text:seafood)=1)
        1.0 = idf(docFreq=8, maxDocs=449)
        0.21875 = fieldNorm(field=text, doc=81)
    0.09782797 = (MATCH) weight(text:seafood in 81), product of:
      0.4472136 = queryWeight(text:seafood), product of:
        1.0 = idf(docFreq=8, maxDocs=449)
        0.4472136 = queryNorm
      0.21875 = (MATCH) fieldWeight(text:seafood in 81), product of:
        1.0 = tf(termFreq(text:seafood)=1)
        1.0 = idf(docFreq=8, maxDocs=449)
        0.21875 = fieldNorm(field=text, doc=81)
  0.667 = coord(4/6)

Thanks in advance,
Adolfo.
Re: Omitting norms question
> Should I include not omit-norms on any fields that I would like to boost
> via a boost-query/function query?

You don't have to set norms to use boost queries or functions. You just have to set them when you want to boost docs or fields at indexing time.

> What about sortable fields? Facetable fields?

You can use both without setting norms as well.

See what norms are for:
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html#lengthNorm%28java.lang.String,%20int%29

blargy wrote:
> Should I include not omit-norms on any fields that I would like to boost
> via a boost-query/function query? For example I have a created_on field on
> one of my documents and I would like to add some sort of function query to
> this field when querying. In this case does this mean I need to have the
> norms? What about sortable fields? Facetable fields? Thanks!
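(Concretely, with hypothetical field names, index-time boosts like the ones below are what get folded into the norm, and are therefore silently lost on a field indexed with omitNorms=true; query-time boost queries and function queries need no norms at all:)

  <add>
    <doc boost="2.0">                                       <!-- document boost -->
      <field name="id">42</field>
      <field name="title" boost="3.0">some title</field>   <!-- field boost -->
      <field name="created_on">2010-03-19T00:00:00Z</field>
    </doc>
  </add>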
Re: Omitting norms question
Ok, so if I wanted to add boost to fields at indexing time then I should include norms. On the other hand, if I just want to boost at query time then it's quite alright to omit norms.

Anyone mind explaining what norms are in layman's terms ;)

Marc Sturlese wrote:
> You don't have to set norms to use boost queries or functions. Just have to
> set them when you want to boost docs or fields at indexing time. [...]
RE: Omitting norms question
Hi blargy,

Norms are:

- a field-specific multiplicative document scoring factor
- the product of three factors: the user-settable 1) field boost and 2) document boost (both default to 1.0), along with the 3) field length norm, defined in DefaultSimilarity as 1/sqrt(# terms)
- encoded as a positive 8-bit float
- range: 6x10^-10 to 7x10^9; accuracy: about 7/10's of a decimal digit. (I have a table of all 256 possible values if you're interested.)

Check out the (fuller, less buggy, and way shinier) explanation at the top of the javadocs page that Marc sent the link to.

Steve

On 03/19/2010 at 10:51 AM, blargy wrote:
> Ok, so if I wanted to add boost to fields at indexing time then I should
> include norms. On the other hand, if I just want to boost at query time
> then it's quite alright to omit norms. Anyone mind explaining what norms
> are in layman's terms ;) [...]
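(A quick way to see the quantization Steve describes; a hedged sketch using org.apache.lucene.util.SmallFloat, the 3-mantissa-bit encoder behind norm bytes in Lucene 2.4+:)

  import org.apache.lucene.util.SmallFloat;

  public class NormSteps {
      public static void main(String[] args) {
          // Between 1.0 and 2.0 the encodable values are 0.125 apart,
          // so intermediate values are truncated down to the nearest step.
          for (float f : new float[] {1.0f, 1.1f, 1.22f, 1.35f, 1.5f}) {
              byte b = SmallFloat.floatToByte315(f);        // encode like a norm
              System.out.println(f + " -> " + SmallFloat.byte315ToFloat(b));
          }
      }
  }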
Changing encoding norms and boosting...
This is related to an earlier posting (http://www.nabble.com/Document-boost-not-as-expected...-tf3476653.html). I am trying to determine a ranking for users that is between 1 and 1.5. Because of the way the encoding norm is stored, if index-time boosting is done, everyone gets a score of 1, 1.25 or 1.5. Is there any way to get around this so that all the values can be retrieved as is (e.g. 1.22, 1.35, etc.)?

Thanks in advance.