list of all possible values for REQUESTSTATUS

2020-12-07 Thread elisabeth benoit
Hello all,

I'm unloading a core with async param then sending query with request id
http://localhost:8983/solr/admin/cores?action=UNLOAD&core=expressions&async=1001
http://localhost:8983/solr/admin/cores?action=REQUESTSTATUS&requestid=1001
and would like to find a piece of documentation listing all possible values
of REQUESTSTATUS. Could someone give me a pointer to the doc? I just can't
find it using a search engine.

I AM NOT looking for
https://lucene.apache.org/solr/guide/8_6/coreadmin-api.html#coreadmin-requeststatus

I would like to have a list of all possible values for STATUS returned
by solr for the query
http://localhost:8983/solr/admin/cores?action=REQUESTSTATUS&requestid=1001

Is there an available doc, or is the only way to download the
solr code and search in it?

Best regards,
Elisabeth
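For reference, a REQUESTSTATUS response can be interpreted as below. This is a sketch, not from the thread: it assumes the four STATUS values (running, completed, failed, notfound) that CoreAdminHandler is believed to report for async requests — verify against your Solr release.

```python
import json

# Assumed STATUS values for /solr/admin/cores?action=REQUESTSTATUS&requestid=...
# "running" means still in flight; "completed"/"failed" are terminal;
# "notfound" means the id is unknown or was already cleaned up.
TERMINAL = {"completed", "failed"}
KNOWN = {"running", "notfound"} | TERMINAL

def interpret(response_body: str) -> tuple:
    """Parse a REQUESTSTATUS JSON response; return (status, is_terminal)."""
    status = json.loads(response_body).get("STATUS", "notfound").lower()
    if status not in KNOWN:
        raise ValueError("unexpected STATUS: " + status)
    return status, status in TERMINAL

# Example response shape (assumed) for requestid=1001
body = '{"responseHeader":{"status":0,"QTime":1},"STATUS":"completed"}'
print(interpret(body))  # -> ('completed', True)
```

A caller would poll until `is_terminal` is true, sleeping between requests.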


doc for REQUESTSTATUS

2020-12-07 Thread elisabeth benoit
Hello all,

I'm unloading a core with async param then sending query with request id

http://localhost:8983/solr/admin/cores?action=UNLOAD&core=expressions&async=1001
http://localhost:8983/solr/admin/cores?action=REQUESTSTATUS&requestid=1001


and would like to find a piece of documentation listing all possible values
of REQUESTSTATUS. Could someone give me a pointer to the doc? I just can't
find it using a search engine.

Best regards,
Elisabeth


write.lock file after unloading core

2020-11-30 Thread elisabeth benoit
Hello all,

We are using solr 7.3.1, with master and slave config.

When we deliver a new index we unload the core, with option delete data dir
= true, then recreate the data folder and copy the new index files into
that folder before sending solr a command to recreate the core (with the
same name).

But we have, at the same time, some batches indexing non stop the core we
just unloaded, and it happens quite frequently that we get an error at
this point: the copy cannot be done, and I guess it is because of a
write.lock file created by a solr index writer in the index directory.

Is it possible, when unloading the core, to stop / kill index writer? I've
tried including a sleep after the unload and before recreation of the index
folder, it seems to work but I was wondering if a better solution exists.

Best regards,
Elisabeth
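A sleep works but is fragile; a slightly safer variant of the same idea (an illustrative sketch, not a Solr API) is to poll for the write.lock file with a deadline before recreating the data folder — though the robust fix is still to stop the indexing batches before the UNLOAD:

```python
import os
import time

def wait_for_unlock(index_dir: str, timeout: float = 30.0, poll: float = 0.5) -> bool:
    """Poll until Lucene's write.lock disappears from index_dir, or give up.

    A blunt workaround: after UNLOAD the IndexWriter may not have released
    the lock yet, so wait for the lock file to vanish before deleting and
    recreating the data folder. Returns True if the dir is lock-free."""
    deadline = time.monotonic() + timeout
    lock = os.path.join(index_dir, "write.lock")
    while time.monotonic() < deadline:
        if not os.path.exists(lock):
            return True
        time.sleep(poll)
    return not os.path.exists(lock)
```

Calling `wait_for_unlock("/var/solr/data/core/data/index")` before the copy replaces the fixed sleep with a bounded wait.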


Re: Ignore accent in a request

2019-02-11 Thread elisabeth benoit
Thanks for the hint. We've been using the char filter for full unidecode
normalization. Is the ICUFoldingFilter supposed to be faster? Or just
simpler to use?
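For comparison, the accent-stripping part of such filters can be approximated in a few lines (a sketch only: NFD decomposition plus dropping combining marks; unlike the mapping file or ICU folding it does not handle ligatures such as Æ → AE):

```python
import unicodedata

def fold(text: str) -> str:
    """Strip combining marks after NFD decomposition, then lowercase.

    Approximates part of what ASCIIFoldingFilter does: 'é' -> 'e',
    'À' -> 'a'. Characters with no canonical decomposition (Æ, Œ, ß)
    pass through unchanged, which is one reason a mapping file or ICU
    folding can normalize more than this does."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

print(fold("je suis avarié"))  # -> je suis avarie
```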

Le lun. 11 févr. 2019 à 09:58, Ere Maijala  a
écrit :

> Please note that mapping characters works well for a small set of
> characters, but if you want full UNICODE normalization, take a look at
> the ICUFoldingFilter:
>
> https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-ICUFoldingFilter
>
> --Ere
>
> elisabeth benoit kirjoitti 8.2.2019 klo 22.47:
> > yes you do
> >
> > and use the char filter at index and query time
> >
> > Le ven. 8 févr. 2019 à 19:20, SAUNIER Maxence  a
> écrit :
> >
> >> For the charFilter, I need to reindex all documents ?
> >>
> >> -Message d'origine-
> >> De : Erick Erickson 
> >> Envoyé : vendredi 8 février 2019 18:03
> >> À : solr-user 
> >> Objet : Re: Ignore accent in a request
> >>
> >> Elisabeth's suggestion is spot on for the accent.
> >>
> >> One other thing I noticed. You are using KeywordTokenizerFactory
> combined
> >> with EdgeNGramFilterFactory. This implies that you can't search for
> >> individual _words_, only prefix queries, i.e.
> >> je
> >> je s
> >> je su
> >> je sui
> >> je suis
> >>
> >> You can't search for "suis" for instance.
> >>
> >> basically this is an efficient way to search anything starting with
> >> three-or-more letter prefixes at the expense of index size. You might be
> >> better off just using wildcards (restrict to three letters at the prefix
> >> though).
> >>
> >> This is perfectly valid, I'm mostly asking if it's your intent.
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Feb 8, 2019 at 9:35 AM SAUNIER Maxence 
> wrote:
> >>>
> >>> Thanks you !
> >>>
> >>> -Message d'origine-
> >>> De : elisabeth benoit  Envoyé : vendredi 8
> >>> février 2019 14:12 À : solr-user@lucene.apache.org Objet : Re: Ignore
> >>> accent in a request
> >>>
> >>> Hello,
> >>>
> >>> We use solr 7 and use
> >>> <charFilter class="solr.MappingCharFilterFactory"
> >>> mapping="mapping-ISOLatin1Accent.txt"/>
> >>>
> >>> with mapping-ISOLatin1Accent.txt
> >>>
> >>> containing lines like
> >>>
> >>> # À => A
> >>> "\u00C0" => "A"
> >>>
> >>> # Á => A
> >>> "\u00C1" => "A"
> >>>
> >>> # Â => A
> >>> "\u00C2" => "A"
> >>>
> >>> # Ã => A
> >>> "\u00C3" => "A"
> >>>
> >>> # Ä => A
> >>> "\u00C4" => "A"
> >>>
> >>> # Å => A
> >>> "\u00C5" => "A"
> >>>
> >>> # Ā Ă Ą => A
> >>> "\u0100" => "A"
> >>> "\u0102" => "A"
> >>> "\u0104" => "A"
> >>>
> >>> # Æ => AE
> >>> "\u00C6" => "AE"
> >>>
> >>> # Ç => C
> >>> "\u00C7" => "C"
> >>>
> >>> # é => e
> >>> "\u00E9" => "e"
> >>>
> >>> Best regards,
> >>> Elisabeth
> >>>
> >>> Le ven. 8 févr. 2019 à 11:18, Gopesh Sharma  >
> >> a écrit :
> >>>
> >>>> We have fixed this type of issue by using Synonyms by adding
> >>>> SynonymFilterFactory(Before Solr 7).
> >>>>
> >>>> -Original Message-
> >>>> From: SAUNIER Maxence 
> >>>> Sent: Friday, February 8, 2019 3:36 PM
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: RE: Ignore accent in a request
> >>>>
> >>>> Hello,
> >>>>
> >>>> Thanks for you answer.
> >>>>
> >>>> I have test :
> >>>>
> >>>> select?defType=dismax&q=je suis avarié&qf=content
> >>>> 90.000 results
> >>>>
> >>>> select?defType=dismax&q=je suis avarie&qf=content
> >>>> 60.000 results
> >>>>
> >>

Re: Ignore accent in a request

2019-02-08 Thread elisabeth benoit
yes you do

and use the char filter at index and query time

Le ven. 8 févr. 2019 à 19:20, SAUNIER Maxence  a écrit :

> For the charFilter, I need to reindex all documents ?
>
> -Message d'origine-
> De : Erick Erickson 
> Envoyé : vendredi 8 février 2019 18:03
> À : solr-user 
> Objet : Re: Ignore accent in a request
>
> Elisabeth's suggestion is spot on for the accent.
>
> One other thing I noticed. You are using KeywordTokenizerFactory combined
> with EdgeNGramFilterFactory. This implies that you can't search for
> individual _words_, only prefix queries, i.e.
> je
> je s
> je su
> je sui
> je suis
>
> You can't search for "suis" for instance.
>
> basically this is an efficient way to search anything starting with
> three-or-more letter prefixes at the expense of index size. You might be
> better off just using wildcards (restrict to three letters at the prefix
> though).
>
> This is perfectly valid, I'm mostly asking if it's your intent.
>
> Best,
> Erick
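A quick sketch of the prefix-only behaviour described above: with KeywordTokenizerFactory the whole input is a single token, so EdgeNGramFilterFactory only ever emits prefixes of the full string, and an inner word like "suis" never becomes a searchable term (the gram sizes mirror the schema quoted later in the thread):

```python
def edge_ngrams(token: str, min_gram: int = 3, max_gram: int = 50) -> list:
    """Emit prefix n-grams of a single token, as EdgeNGramFilter would."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

# With KeywordTokenizer the whole query string is ONE token, so every
# indexed term is a prefix of the full string:
terms = edge_ngrams("je suis")
print(terms)  # -> ['je ', 'je s', 'je su', 'je sui', 'je suis']
```

"suis" is absent from the emitted terms, which is exactly why only prefix queries match.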
>
> On Fri, Feb 8, 2019 at 9:35 AM SAUNIER Maxence  wrote:
> >
> > Thanks you !
> >
> > -Message d'origine-
> > De : elisabeth benoit  Envoyé : vendredi 8
> > février 2019 14:12 À : solr-user@lucene.apache.org Objet : Re: Ignore
> > accent in a request
> >
> > Hello,
> >
> > We use solr 7 and use
> > <charFilter class="solr.MappingCharFilterFactory"
> > mapping="mapping-ISOLatin1Accent.txt"/>
> >
> > with mapping-ISOLatin1Accent.txt
> >
> > containing lines like
> >
> > # À => A
> > "\u00C0" => "A"
> >
> > # Á => A
> > "\u00C1" => "A"
> >
> > # Â => A
> > "\u00C2" => "A"
> >
> > # Ã => A
> > "\u00C3" => "A"
> >
> > # Ä => A
> > "\u00C4" => "A"
> >
> > # Å => A
> > "\u00C5" => "A"
> >
> > # Ā Ă Ą => A
> > "\u0100" => "A"
> > "\u0102" => "A"
> > "\u0104" => "A"
> >
> > # Æ => AE
> > "\u00C6" => "AE"
> >
> > # Ç => C
> > "\u00C7" => "C"
> >
> > # é => e
> > "\u00E9" => "e"
> >
> > Best regards,
> > Elisabeth
> >
> > Le ven. 8 févr. 2019 à 11:18, Gopesh Sharma 
> a écrit :
> >
> > > We have fixed this type of issue by using Synonyms by adding
> > > SynonymFilterFactory(Before Solr 7).
> > >
> > > -Original Message-
> > > From: SAUNIER Maxence 
> > > Sent: Friday, February 8, 2019 3:36 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: RE: Ignore accent in a request
> > >
> > > Hello,
> > >
> > > Thanks for you answer.
> > >
> > > I have test :
> > >
> > > select?defType=dismax&q=je suis avarié&qf=content
> > > 90.000 results
> > >
> > > select?defType=dismax&q=je suis avarie&qf=content
> > > 60.000 results
> > >
> > > With avarié, I dont find documents with avarie and with avarie, I
> > > don't find documents with avarié.
> > >
> > > I want to find they 150.000 documents with avarié or avarie.
> > >
> > > Thanks
> > >
> > > -Message d'origine-
> > > De : Erick Erickson  Envoyé : jeudi 7
> > > février
> > > 2019 19:37 À : solr-user  Objet : Re:
> > > Ignore accent in a request
> > >
> > > exactly _how_ is it "not working"?
> > >
> > > Try building your parameters _up_ rather than starting with a lot, e.g.
> > > select?defType=dismax&q=je suis avarié&qf=title ^^ assumes you
> > > expect a match on title. Then:
> > > select?defType=dismax&q=je suis avarié&qf=title subject
> > >
> > > etc.
> > >
> > > Because mm=757 looks really wrong. From the docs:
> > > Defines the minimum number of clauses that must match, regardless of
> > > how many clauses there are in total.
> > >
> > > edismax is used much more than dismax as it's more flexible, but
> > > that's not germane here.
> > >
> > > finally, try adding &debug=query to the url to see exactly how the
> > > query is parsed.
> > >
> > > Best,
> > > Erick
> > >
> > > On Mon, Feb 4, 2019 at 9:09 AM SAUNIER Maxence 
> wrote:
> > > >
> > > > Hello,
> > > >
> > > > How can I ignore accent in the query result ?
> > > >
> > > > Request :
> > > > http://*:8983/solr/***/select?defType=dismax&q=je+suis+avarié&
> > > > qf=title%5e20+subject%5e15+category%5e1+content%5e0.5&mm=757
> > > >
> > > > I want to have doc with avarié and avarie.
> > > >
> > > > I have add this in my schema :
> > > >
> > > >   {
> > > > "name": "string",
> > > > "positionIncrementGap": "100",
> > > > "analyzer": {
> > > >   "filters": [
> > > > {
> > > >   "class": "solr.LowerCaseFilterFactory"
> > > > },
> > > > {
> > > >   "class": "solr.ASCIIFoldingFilterFactory"
> > > > },
> > > > {
> > > >   "class": "solr.EdgeNGramFilterFactory",
> > > >   "minGramSize": "3",
> > > >   "maxGramSize": "50"
> > > > }
> > > >   ],
> > > >   "tokenizer": {
> > > > "class": "solr.KeywordTokenizerFactory"
> > > >   }
> > > > },
> > > > "stored": true,
> > > > "indexed": true,
> > > > "sortMissingLast": true,
> > > > "class": "solr.TextField"
> > > >   },
> > > >
> > > > But it not working.
> > > >
> > > > Thanks.
> > >
>


Re: Ignore accent in a request

2019-02-08 Thread elisabeth benoit
Hello,

We use solr 7 and use

<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>

with mapping-ISOLatin1Accent.txt

containing lines like

# À => A
"\u00C0" => "A"

# Á => A
"\u00C1" => "A"

# Â => A
"\u00C2" => "A"

# Ã => A
"\u00C3" => "A"

# Ä => A
"\u00C4" => "A"

# Å => A
"\u00C5" => "A"

# Ā Ă Ą => A
"\u0100" => "A"
"\u0102" => "A"
"\u0104" => "A"

# Æ => AE
"\u00C6" => "AE"

# Ç => C
"\u00C7" => "C"

# é => e
"\u00E9" => "e"

Best regards,
Elisabeth

Le ven. 8 févr. 2019 à 11:18, Gopesh Sharma  a
écrit :

> We have fixed this type of issue by using Synonyms by adding
> SynonymFilterFactory(Before Solr 7).
>
> -Original Message-
> From: SAUNIER Maxence 
> Sent: Friday, February 8, 2019 3:36 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Ignore accent in a request
>
> Hello,
>
> Thanks for you answer.
>
> I have test :
>
> select?defType=dismax&q=je suis avarié&qf=content
> 90.000 results
>
> select?defType=dismax&q=je suis avarie&qf=content
> 60.000 results
>
> With avarié, I dont find documents with avarie and with avarie, I don't
> find documents with avarié.
>
> I want to find they 150.000 documents with avarié or avarie.
>
> Thanks
>
> -Message d'origine-
> De : Erick Erickson  Envoyé : jeudi 7 février
> 2019 19:37 À : solr-user  Objet : Re: Ignore
> accent in a request
>
> exactly _how_ is it "not working"?
>
> Try building your parameters _up_ rather than starting with a lot, e.g.
> select?defType=dismax&q=je suis avarié&qf=title ^^ assumes you expect a
> match on title. Then:
> select?defType=dismax&q=je suis avarié&qf=title subject
>
> etc.
>
> Because mm=757 looks really wrong. From the docs:
> Defines the minimum number of clauses that must match, regardless of how
> many clauses there are in total.
>
> edismax is used much more than dismax as it's more flexible, but that's
> not germane here.
>
> finally, try adding &debug=query to the url to see exactly how the query
> is parsed.
>
> Best,
> Erick
>
> On Mon, Feb 4, 2019 at 9:09 AM SAUNIER Maxence  wrote:
> >
> > Hello,
> >
> > How can I ignore accent in the query result ?
> >
> > Request :
> > http://*:8983/solr/***/select?defType=dismax&q=je+suis+avarié&
> > qf=title%5e20+subject%5e15+category%5e1+content%5e0.5&mm=757
> >
> > I want to have doc with avarié and avarie.
> >
> > I have add this in my schema :
> >
> >   {
> > "name": "string",
> > "positionIncrementGap": "100",
> > "analyzer": {
> >   "filters": [
> > {
> >   "class": "solr.LowerCaseFilterFactory"
> > },
> > {
> >   "class": "solr.ASCIIFoldingFilterFactory"
> > },
> > {
> >   "class": "solr.EdgeNGramFilterFactory",
> >   "minGramSize": "3",
> >   "maxGramSize": "50"
> > }
> >   ],
> >   "tokenizer": {
> > "class": "solr.KeywordTokenizerFactory"
> >   }
> > },
> > "stored": true,
> > "indexed": true,
> > "sortMissingLast": true,
> > "class": "solr.TextField"
> >   },
> >
> > But it not working.
> >
> > Thanks.
>


NGramFilterFactory and Similarity

2018-12-11 Thread elisabeth benoit
Hello,

We are trying to use NGramFilterFactory for approximative search with solr
7.

We usually use a similarity with no tf, no idf (our similarity extends
ClassicSimilarity, with tf and idf functions always returning 1).

For ngram search though, it seems inappropriate since it scores a word
matching with one ngram the same as a word matching with, let's say, seven
ngrams.

We would like a similarity that gives a higher score to a document matching
more ngrams, but without using term frequency (we have multivalued fields, a
word might be repeated in more than one entry of our multivalued field, and
we don't want that document to get a higher score because of that).

Has anyone experienced the same issues?

Best regards,
Elisabeth
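One way to model the desired scoring outside Solr (an illustration of the idea, not Lucene similarity code): count distinct matched grams, so repetition across multivalued entries adds nothing, but matching more grams scores higher:

```python
def distinct_gram_overlap(query_grams: list, field_grams: list) -> int:
    """Count distinct query grams present in the field, ignoring repeats.

    Repetition inside a multivalued field adds no credit (no tf), but a
    word matching seven grams beats a word matching one."""
    return len(set(query_grams) & set(field_grams))

q = ["par", "pari", "paris"]
doc_a = ["par", "pari", "paris"] * 3   # repeated entries: no extra credit
doc_b = ["par"]
print(distinct_gram_overlap(q, doc_a), distinct_gram_overlap(q, doc_b))  # -> 3 1
```

In Lucene terms this is roughly "tf clamped to 1 plus a coord-like factor"; whether to implement it as a custom Similarity or a function query is a design choice this sketch doesn't settle.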


solr 7.3.1 how to parse LatLonPointSpatialField in custom .jar

2018-11-20 Thread elisabeth benoit
Hello,

We are using solr with a home made jar with a custom function.

function(0.1,1.0,43.8341851366,5.7818349,43.8342868634,5.7821059,latlng_pi)

where latlng_pi is a document field of type location



In solr 5.5.2, location was defined like this

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

and parsed in the jar like this (with fp being an instance of
org.apache.solr.search.FunctionQParser)

value = fp.parseValueSource()



In solr 7.3.1, we changed the definition to

<fieldType name="location" class="solr.LatLonPointSpatialField" docValues="true"/>


because solr.LatLonType is now deprecated.


we now have an error

"A ValueSource isn't directly available from this field. Instead try a
query using the distance as the score."

from org.apache.solr.schema.AbstractSpatialFieldType

@Override
public ValueSource getValueSource(SchemaField field, QParser parser) {
  // This is different from Solr 3 LatLonType's approach, which uses the
  // MultiValueSource concept to directly expose the x & y pair of
  // FieldCache value sources.
  throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
      "A ValueSource isn't directly available from this field. Instead try a query using the distance as the score.");
}

To correct this error, we tried to see how the value is parsed in
GeoDistValueSourceParser, but it seems to us (we are not java programmers)
very hacky and complicated, and we would like to know if there is a simple
way to parse a LatLonPointSpatialField in our jar.


Thanks,
Elisabeth
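The error above suggests scoring by distance rather than reading coordinates directly. For orientation, the great-circle distance such a distance-as-score query computes can be sketched as follows (pure illustration using the coordinates from the custom function call above; this is not Solr code):

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6371008.8  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# The two corner points from the function call in the message above
d = haversine_m(43.8341851366, 5.7818349, 43.8342868634, 5.7821059)
print(round(d, 1))  # a couple dozen meters
```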


Re: in-places update solr 5.5.2

2017-07-26 Thread elisabeth benoit
Thanks a lot for your answer

2017-07-26 16:35 GMT+02:00 Cassandra Targett <ctarg...@apache.org>:

> The in-place update section you referenced was added in Solr 6.5. On
> p. 224 of the PDF for 5.5, note it says there are only two available
> approaches and the section on in-place updates you see online isn't
> mentioned. I looked into the history of the online page and the
> section on in-place updates was added for Solr 6.5, when SOLR-5944 was
> released.
>
> So, unfortunately, unless someone else has a better option for
> pre-6.5, I believe it was not possible in 5.5.2.
>
> Cassandra
>
> On Wed, Jul 26, 2017 at 2:30 AM, elisabeth benoit
> <elisaelisael...@gmail.com> wrote:
> > Are in place updates available in solr 5.5.2, I find atomic updates in
> the
> > doc
> > https://archive.apache.org/dist/lucene/solr/ref-guide/
> apache-solr-ref-guide-5.5.pdf,
> > which redirects me to the page
> > https://cwiki.apache.org/confluence/display/solr/
> Updating+Parts+of+Documents#UpdatingPartsofDocuments-AtomicUpdates
> > .
> >
> > On that page, for in-place updates, it says
> >
> > the _version_ field is also a non-indexed, non-stored single valued
> > docValues field
> >
> > when I try this with solr 5.5.2 I get an error message
> >
> > org.apache.solr.common.SolrException:org.apache.solr.
> common.SolrException:
> > Unable to use updateLog: _version_ field must exist in schema, using
> > indexed=\"true\" or docValues=\"true\", stored=\"true\" and
> > multiValued=\"false\" (_version_ is not stored
> >
> >
> > What I'm looking for is a way to update one field of a doc without
> erasing
> > the non stored fields. Is this possible in solr 5.5.2?
> >
> > best regards,
> > Elisabeth
>


in-places update solr 5.5.2

2017-07-26 Thread elisabeth benoit
Are in place updates available in solr 5.5.2, I find atomic updates in the
doc
https://archive.apache.org/dist/lucene/solr/ref-guide/apache-solr-ref-guide-5.5.pdf,
which redirects me to the page
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents#UpdatingPartsofDocuments-AtomicUpdates
.

On that page, for in-place updates, it says

the _version_ field is also a non-indexed, non-stored single valued
docValues field

when I try this with solr 5.5.2 I get an error message

org.apache.solr.common.SolrException:org.apache.solr.common.SolrException:
Unable to use updateLog: _version_ field must exist in schema, using
indexed=\"true\" or docValues=\"true\", stored=\"true\" and
multiValued=\"false\" (_version_ is not stored


What I'm looking for is a way to update one field of a doc without erasing
the non stored fields. Is this possible in solr 5.5.2?

best regards,
Elisabeth
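Since in-place updates only arrived in 6.5, the closest 5.5.2 option is an atomic update — with exactly the caveat the question raises: it only preserves fields that are stored, because Solr re-reads the stored document and re-indexes the whole thing. A sketch of the request body shape (the field names are made up):

```python
import json

def atomic_update(doc_id: str, sets: dict) -> str:
    """Build a JSON body for POST /solr/<core>/update.

    Atomic update: only the listed fields change, but Solr internally
    fetches the stored document and re-indexes it, which is why every
    field must be stored for this to be lossless."""
    doc = {"id": doc_id}
    doc.update({field: {"set": value} for field, value in sets.items()})
    return json.dumps([doc])

body = atomic_update("doc42", {"popularity": 10})
print(body)  # -> [{"id": "doc42", "popularity": {"set": 10}}]
```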


solr 5.5.2 bug in edismax pf2 when boosting term

2017-05-18 Thread elisabeth benoit
Hello,

I am using solr 5.5.2.

I am trying to give a lower score to frequent words in query.

The only way I've found so far is to do like

q=avenue^0.1 de champaubert village suisse 75015 paris

where avenue is a frequent word.

The problem is I'm using edismax, and when I add ^0.1 to avenue, it is not
considered anymore in pf2.

I am looking for a work around, or another way to give lower score to
frequent words in solr.

If anyone could help it would be great.

Elisabeth


solr 5.5.2 using omitNorms=False on multivalued fields

2016-10-18 Thread elisabeth benoit
Hello,

I would like to score higher, or even better, to sort documents with the
same text score, based on the norm

for instance, with query "a b d"

document with

a b d

should score higher  than (or appear before)  document with

a b c d

The problem is my field is multivalued, so omitNorms=false is not working.

Does anyone know how to achieve this with a multivalued field on solr 5.5.2?


Best regards,
Elisabeth
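One client-side workaround — purely an illustration, not a Solr feature: index the total token count of the multivalued field in a separate integer field and break score ties with it at query time (or via a secondary sort):

```python
def rank(docs: list) -> list:
    """Order docs by score desc, breaking ties with fewer tokens first.

    'tokens' stands in for a client-maintained length field, since
    omitNorms=false cannot provide length normalization on the
    multivalued field."""
    return sorted(docs, key=lambda d: (-d["score"], d["tokens"]))

docs = [
    {"id": "abcd", "score": 1.0, "tokens": 4},  # field "a b c d"
    {"id": "abd", "score": 1.0, "tokens": 3},   # field "a b d"
]
print([d["id"] for d in rank(docs)])  # -> ['abd', 'abcd']
```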


Re: migration to solr 5.5.2 highlight on ngrams not working

2016-09-22 Thread elisabeth benoit
and as was said in previous post, we can clearly see in analysis output
that end values for edgengrams are good for solr 4.10.1 and not good for
solr 5.5.2


solr 5.5.2

text   raw_bytes         start  end  positionLength  type  position
p      [70]              0      5    1               word  1
pa     [70 61]           0      5    1               word  1
par    [70 61 72]        0      5    1               word  1
pari   [70 61 72 69]     0      5    1               word  1
paris  [70 61 72 69 73]  0      5    1               word  1



end is always set to 5, which is wrong


solr 4.10.1


text   raw_bytes         start  end  positionLength  type  position
p      [70]              0      1    1               word  1
pa     [70 61]           0      2    1               word  1
par    [70 61 72]        0      3    1               word  1
pari   [70 61 72 69]     0      4    1               word  1
paris  [70 61 72 69 73]  0      5    1               word  1

end is set to 1, 2, 3, 4 or 5, matching each edgengram's own length
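What the 4.10.1 column shows can be reproduced by giving each gram its own end offset — a sketch of the offsets an autocomplete highlighter needs:

```python
def edge_ngrams_with_offsets(token: str, min_gram: int, max_gram: int) -> list:
    """Return (gram, start, end) with end equal to the gram's own length,
    as Solr 4.10.1 reports it; 5.5.2 instead reports the parent token's
    end offset for every gram, which breaks partial highlighting."""
    return [(token[:n], 0, n)
            for n in range(min_gram, min(len(token), max_gram) + 1)]

print(edge_ngrams_with_offsets("paris", 1, 50))
# -> [('p', 0, 1), ('pa', 0, 2), ('par', 0, 3), ('pari', 0, 4), ('paris', 0, 5)]
```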


2016-09-22 14:57 GMT+02:00 elisabeth benoit <elisaelisael...@gmail.com>:

>
> Hello
>
> After migrating from solr 4.10.1 to solr 5.5.2, we dont have the same
> behaviour with highlighting on edge ngrams fields.
>
> We're using it for an autocomplete component. With Solr 4.10.1, if request
> is sol, highlighting on solr is <em>sol</em>r
>
> with solr 5.5.2, we have <em>solr</em>.
>
> Same problem as described in http://grokbase.com/t/
> lucene/solr-user/154m4jzv2f/solr-5-hit-highlight-with-
> ngram-edgengram-fields
>
> but nobody answered the post.
>
> Does anyone know we can fix this?
>
> Best regards,
> Elisabeth
>
> Field definition
>
> 
>   
> 
> 
>  pattern="[\s,;:\-\]"/>
>  splitOnNumerics="0"
> generateWordParts="1"
> generateNumberParts="1"
> catenateWords="0"
> catenateNumbers="0"
> catenateAll="0"
> splitOnCaseChange="1"
> preserveOriginal="1"
> types="wdfftypes.txt"
> />
> 
>  minGramSize="1"/>
>   
>   
> 
> 
>  pattern="[\s,;:\-\]"/>
>  splitOnNumerics="0"
> generateWordParts="1"
> generateNumberParts="0"
> catenateWords="0"
> catenateNumbers="0"
> catenateAll="0"
> splitOnCaseChange="0"
> preserveOriginal="1"
> types="wdfftypes.txt"
> />
> 
>
>   
> 
>


migration to solr 5.5.2 highlight on ngrams not working

2016-09-22 Thread elisabeth benoit
Hello

After migrating from solr 4.10.1 to solr 5.5.2, we don't have the same
behaviour with highlighting on edge ngrams fields.

We're using it for an autocomplete component. With Solr 4.10.1, if the
request is sol, the highlighting on solr is <em>sol</em>r

with solr 5.5.2, we have <em>solr</em>.

Same problem as described in
http://grokbase.com/t/lucene/solr-user/154m4jzv2f/solr-5-hit-highlight-with-ngram-edgengram-fields

but nobody answered the post.

Does anyone know we can fix this?

Best regards,
Elisabeth

Field definition


  






  
  






  



Re: solr 5.5.2 dump threads - threads blocked in org.eclipse.jetty.util.BlockingArrayQueue

2016-09-08 Thread elisabeth benoit
Well, we rekicked the machine with puppet, restarted solr and now it seems
ok. Don't know what happened.

2016-09-08 11:38 GMT+02:00 elisabeth benoit <elisaelisael...@gmail.com>:

>
> Hello,
>
>
> We are perf testing solr 5.5.2 (with a limit test, i.e. sending as much
> queries/sec as possible) and we see the cpu never goes over 20%, and
> threads are blocked in org.eclipse.jetty.util.BlockingArrayQueue, as we
> can see in solr admin interface thread dumps
>
> qtp706277948-757 (757)
>
> java.util.concurrent.locks.AbstractQueuedSynchronizer$
> ConditionObject@2c4a56cb
>
>- sun.misc.Unsafe.park​(Native Method)
>- java.util.concurrent.locks.LockSupport.parkNanos​(
>LockSupport.java:215)
>- java.util.concurrent.locks.AbstractQueuedSynchronizer$
>ConditionObject.awaitNanos​(AbstractQueuedSynchronizer.java:2078)
>- org.eclipse.jetty.util.BlockingArrayQueue.poll​(
>BlockingArrayQueue.java:389)
>- org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll​(
>QueuedThreadPool.java:531)
>- org.eclipse.jetty.util.thread.QueuedThreadPool.access$700​(
>QueuedThreadPool.java:47)
>- org.eclipse.jetty.util.thread.QueuedThreadPool$3.run​(
>QueuedThreadPool.java:590)
>- java.lang.Thread.run​(Thread.java:745)
>
>
> We changed two things in jetty configuration,
>
> maxThreads value in /opt/solr/server/solr/jetty.xml
>
> <Set name="maxThreads"><Property name="solr.jetty.threads.max" default="400"/></Set>
>
>
> and we activated the request log, i.e. uncommented the lines
>
> <Ref id="Handlers">
>   <Call name="addHandler">
>     <Arg>
>       <New id="RequestLog" class="org.eclipse.jetty.server.handler.RequestLogHandler">
>         <Set name="requestLog">
>           <New id="RequestLogImpl" class="org.eclipse.jetty.server.AsyncNCSARequestLog">
>             <Set name="filename">/var/solr/logs/requests.log</Set>
>             <Set name="filenameDateFormat">yyyy_MM_dd</Set>
>             <Set name="retainDays">90</Set>
>             <Set name="append">true</Set>
>             <Set name="extended">false</Set>
>             <Set name="logCookies">false</Set>
>             <Set name="logTimeZone">UTC</Set>
>             <Set name="logLatency">true</Set>
>           </New>
>         </Set>
>       </New>
>     </Arg>
>   </Call>
> </Ref>
>
> in jetty.xml
>
>
> We had the same result with maxThreads=10000 (the default value in the
> solr install).
>
>
> Did anyone experience the same issue with solr 5?
>
>
> Best regards,
>
> Elisabeth
>


solr 5.5.2 dump threads - threads blocked in org.eclipse.jetty.util.BlockingArrayQueue

2016-09-08 Thread elisabeth benoit
Hello,


We are perf testing solr 5.5.2 (with a limit test, i.e. sending as much
queries/sec as possible) and we see the cpu never goes over 20%, and
threads are blocked in org.eclipse.jetty.util.BlockingArrayQueue, as we can
see in solr admin interface thread dumps

qtp706277948-757 (757)

java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@2c4a56cb

   - sun.misc.Unsafe.park​(Native Method)
   - java.util.concurrent.locks.LockSupport.parkNanos​(LockSupport.java:215)
   -
   
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos​(AbstractQueuedSynchronizer.java:2078)
   -
   org.eclipse.jetty.util.BlockingArrayQueue.poll​(BlockingArrayQueue.java:389)
   -
   
org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll​(QueuedThreadPool.java:531)
   -
   
org.eclipse.jetty.util.thread.QueuedThreadPool.access$700​(QueuedThreadPool.java:47)
   -
   
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run​(QueuedThreadPool.java:590)
   - java.lang.Thread.run​(Thread.java:745)


We changed two things in jetty configuration,

maxThreads value in /opt/solr/server/solr/jetty.xml

<Set name="maxThreads"><Property name="solr.jetty.threads.max" default="400"/></Set>

and we activated the request log, i.e. uncommented the lines

<Ref id="Handlers">
  <Call name="addHandler">
    <Arg>
      <New id="RequestLog" class="org.eclipse.jetty.server.handler.RequestLogHandler">
        <Set name="requestLog">
          <New id="RequestLogImpl" class="org.eclipse.jetty.server.AsyncNCSARequestLog">
            <Set name="filename">/var/solr/logs/requests.log</Set>
            <Set name="filenameDateFormat">yyyy_MM_dd</Set>
            <Set name="retainDays">90</Set>
            <Set name="append">true</Set>
            <Set name="extended">false</Set>
            <Set name="logCookies">false</Set>
            <Set name="logTimeZone">UTC</Set>
            <Set name="logLatency">true</Set>
          </New>
        </Set>
      </New>
    </Arg>
  </Call>
</Ref>

in jetty.xml


We had the same result with maxThreads=10000 (the default value in the solr
install).


Did anyone experience the same issue with solr 5?


Best regards,

Elisabeth


threads blocked in LRUcache.get() in solr 5.5.2

2016-08-31 Thread elisabeth benoit
Hello,

We are migrating from solr 4.10.1 to solr 5.5.2. We don't use solr cloud.

We installed the service with installation script and kept the default
configuration, except for a few settings about logs and the gc config (the
same used with solr 4.10.1).

We tested today the performances of solr 5.5.2 with a limit test, and got
really really bad performances, some queries taking up to 29 ms (on our
dev server, which are sub dimensioned, but with no perf test, the query
time is still bigger, but not THAT much)

The server has three cores, one of 8g, one of 3g and one of less than 1g.
The machine has 64g of ram and xmx and xms are set to 16g.

We checked the jvm in visualvm and noticed too many threads were created by
jetty. The max threads was set to 10000 in jetty.xml, so we lowered it to
400 (the same number we used with tomcat7).

Then we perf tested again; the queries were still very slow, with not much
of the cpu used: as we saw with top, 16 cores were all used at most at 20%
(some at just 5%). After 30 minutes of test, we could see in visualvm that
the threads were spending 65% of their time in LRUCache.get() and 25% in
LRUCache.put(). We noticed in visualvm the solr threads were mostly
blocked, and then checked the thread dumps in the solr admin interface, and
the blocked ones were waiting for LRUCache.get().

We have queries with filters (fq parameter). We use FastLRUCache for filter
cache and LRUCache for document cache, with a min/max size of 512 for
filter and 15000 for document cache. This may seem small but it's the value
we use with solr 4.10.1 in production with what we consider good enough
performances (less than 40 ms).

Does anyone have an idea what is wrong ? Our configuration is ok with solr
4.10.1.

Best regards,
Elisabeth
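No definitive answer here, but the symptom fits a cache whose get() takes a global lock: Solr's LRUCache is, to my understanding, a synchronized LinkedHashMap, while FastLRUCache allows lock-free reads. A toy model of the single-lock design, purely illustrative and not Solr's code:

```python
import threading
from collections import OrderedDict

class SingleLockLRU:
    """Toy LRU where reads AND writes share one lock, like a synchronized
    LinkedHashMap: under heavy query load, threads serialize on get(),
    matching 'blocked ... waiting for LRUCache.get()' in the dumps."""
    def __init__(self, max_size: int):
        self._data = OrderedDict()
        self._max = max_size
        self._lock = threading.Lock()  # the contention point

    def get(self, key):
        with self._lock:                  # even a cache hit takes the lock,
            if key not in self._data:     # because recency order must mutate
                return None
            self._data.move_to_end(key)   # mark as most recently used
            return self._data[key]

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self._max:
                self._data.popitem(last=False)  # evict least recently used

cache = SingleLockLRU(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")          # touch "a" so "b" becomes the eviction candidate
cache.put("c", 3)       # evicts "b"
print(cache.get("b"))   # -> None
```

The design note is that an LRU read is also a write (recency bookkeeping), which is why a naive implementation cannot make get() lock-free.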


Re: another log question about solr 5

2016-08-25 Thread elisabeth benoit
Thanks! This is very helpful!

Best regards,
Elisabeth

2016-08-25 17:07 GMT+02:00 Shawn Heisey <apa...@elyograg.org>:

> On 8/24/2016 6:01 AM, elisabeth benoit wrote:
> > I was wondering was is the right way to prevent solr 5 from creating a
> new
> > log file at every startup  (and renaming the actual file mv
> > "$SOLR_LOGS_DIR/solr_gc.log" "$SOLR_LOGS_DIR/solr_gc_log_$(date
> > +"%Y%m%d_%H%M")"
>
> I think if you find and comment/remove the command in the startup script
> that renames the logfile, that would do it.  The default log4j config
> will rotate the logfiles.  You can comment the first part of the
> bin/solr section labeled "backup the log files before starting".  I
> would recommend NOT commenting the next part, which rotates the garbage
> collection log.
>
> You should also modify server/resources/log4j.properties to remove all
> mention of the CONSOLE output.  The console logfile is created by shell
> redirection, which means it is never rotated and can fill up your disk.
> It's a duplicate of information that goes into solr.log, so you don't
> need it.  This means removing ", CONSOLE" from the log4j.rootLogger line
> and entirely removing the lines that start with log4j.appender.CONSOLE.
>
> You might also want to adjust the log4j.appender.file.MaxFileSize line
> in log4j.properties -- 4 megabytes is very small, which means that your
> logfile history might not cover enough time to be useful.
>
> Dev note:I think we really need to include gc logfile rotation in the
> startup script.  If the java heap is properly sized, this file won't
> grow super-quickly, but it WILL grow, and that might cause issues.  I
> also think that the MaxFileSize default in log4j.properties needs to be
> larger.
>
> Thanks,
> Shawn
>
>


Re: equivalent of localhost_access_log for solr 5

2016-08-24 Thread elisabeth benoit
Thanks a lot for your answer.

Best regards,
elisabeth

2016-08-24 16:16 GMT+02:00 Shawn Heisey <apa...@elyograg.org>:

> On 8/24/2016 5:44 AM, elisabeth benoit wrote:
> > I'd like to know what is the best way to have the equivalent of tomcat
> > localhost_access_log for solr 5?
>
> I don't know for sure what that is, but it sounds like a request log.
> If you edit server/etc/jetty.xml you will find a commented out section
> of configuration that enables a request log.  The header says "Configure
> Request Log".  If that's what you want, just uncomment that section and
> restart Solr.
>
> Thanks,
> Shawn
>
>


another log question about solr 5

2016-08-24 Thread elisabeth benoit
Hello again,

We're planning on using solr 5.5.2 on production, using installation
script install_solr_service.sh.

I was wondering what is the right way to prevent solr 5 from creating a new
log file at every startup (and renaming the actual file: mv
"$SOLR_LOGS_DIR/solr_gc.log" "$SOLR_LOGS_DIR/solr_gc_log_$(date
+"%Y%m%d_%H%M")")

Thanks,
Elisabeth


equivalent of localhost_access_log for solr 5

2016-08-24 Thread elisabeth benoit
Hello,

I'd like to know what is the best way to have the equivalent of
tomcat localhost_access_log  for solr 5?

Best regards,
Elisabeth


Re: Solr 5.5.2 mm parameter not working the same

2016-07-27 Thread elisabeth benoit
Oh sorry, wrote too fast. Had to change the defaultOperator to OR.

Elisabeth

2016-07-27 10:11 GMT+02:00 elisabeth benoit <elisaelisael...@gmail.com>:

>
> Hello,
>
> We are migrating from solr 4.10.1 to solr 5.5.2, and it seems that the mm
> parameter is not working the same anymore.
>
> In fact, as soon as there is a word not in the index in the query, no
> matter what mm value I send, I get no answer as if my query is a pure AND
> query.
>
> Does anyone have a clue?
>
> Best regards,
> Elisabeth
>
>


Solr 5.5.2 mm parameter not working the same

2016-07-27 Thread elisabeth benoit
Hello,

We are migrating from solr 4.10.1 to solr 5.5.2, and it seems that the mm
parameter is not working the same anymore.

In fact, as soon as there is a word not in the index in the query, no
matter what mm value I send, I get no answer as if my query is a pure AND
query.

Does anyone have a clue?

Best regards,
Elisabeth


Re: solr 5.5.2 loadOnStartUp does not work

2016-07-26 Thread elisabeth benoit
Hello,

Thanks for your answer.

Yes, it seems a little tricky to me.

Best regards,
Elisabeth

2016-07-25 18:06 GMT+02:00 Erick Erickson <erickerick...@gmail.com>:

> "Load" is a little tricky here, it means "load the core and open a
> searcher.
> The core _descriptor_ which is the internal structure of
> core.properties (plus some other info) _is_ loaded and is what's
> used to show the list of available cores. Else how would you
> even know the core existed?
>
> It's not until you actually try to do anything (even click on the
> item in the "cores" drop-down) that the heavy-duty
> work of opening the core actually executes.
>
> So I think it's working as expected,. But do note
> that this whole area (transient cores, loading on
> startup true/false) is intended for stand-alone
> Solr and is unsupported in SolrCloud.
>
> Best,
> Erick
>
> On Mon, Jul 25, 2016 at 6:09 AM, elisabeth benoit
> <elisaelisael...@gmail.com> wrote:
> > Hello,
> >
> > I have a core.properties with content
> >
> > name=indexer
> > loadOnStartup=false
> >
> >
> > but the core is loaded on start up (it appears on the admin interface).
> >
> > I thougth the core would be unloaded on startup. did I miss something?
> >
> >
> > best regards,
> >
> > elisabeth
>


solr 5.5.2 loadOnStartUp does not work

2016-07-25 Thread elisabeth benoit
Hello,

I have a core.properties with content

name=indexer
loadOnStartup=false


but the core is loaded on start up (it appears on the admin interface).

I thought the core would not be loaded on startup. Did I miss something?


best regards,

elisabeth


Re: Boosting exact match fields.

2016-06-16 Thread elisabeth benoit
In addition to what was proposed

We use the technique described here

https://github.com/cominvent/exactmatch

and it works quite well.

Best regards
Elisabeth

2016-06-15 16:32 GMT+02:00 Alessandro Benedetti :

> In addition to what Erick correctly proposed,
> are you storing norms for your field of interest ( to boost documents with
> shorter field values )?
> If you are, I find suspicious "Sony Ear Phones" to win over "Ear Phones"
> for your "Ear Phones" query.
> What are the other factors currently involved in your relevancy score
> calculus ?
>
> Cheers
>
> On Tue, Jun 14, 2016 at 4:48 PM, Erick Erickson 
> wrote:
>
> > If these are the complete field, i.e. your document
> > contains exactly "ear phones" and not "ear phones
> > are great" use a copyField to put it into an "exact_match"
> > field that uses a much simpler analysis chain based
> > on KeywordTokenizer (plus, perhaps things like
> > lowercaseFilter, maybe strip punctuation and the like".
> > Then you add a clause on exact_match boosted
> > really high.
> >
> > Best,
> > Erick
> >
> > On Tue, Jun 14, 2016 at 1:01 AM, Naveen Pajjuri
> >  wrote:
> > > Hi,
> > >
> > > I have documents with a field (data type definition for that field is
> > > below) values as ear phones, sony ear phones, philips ear phones. when
> i
> > > query for earphones sony ear phones is the top result where as i want
> ear
> > > phones as top result. please suggest how to boost exact matches. PS: I
> > have
> > > earphones => ear phones in my synonyms.txt and the datatype definition
> > > for that field keywords is:
> > >
> > > <fieldType name="keywords" class="solr.TextField" positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> > >     <filter class="solr.LowerCaseFilterFactory"/>
> > >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> > >   </analyzer>
> > > </fieldType>
> > >
> > > REGARDS,
> > > Naveen
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>
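
The copyField + KeywordTokenizer recipe described above boils down to indexing
each field value as one single, lightly normalised token. A small Python sketch
of that normalisation (an illustration of the idea, not Solr's actual filter
chain):

```python
import re

def exact_match_analyze(value):
    # KeywordTokenizer-style: treat the whole field value as one token,
    # lowercase it, strip punctuation and collapse whitespace.
    token = re.sub(r"[^\w\s]", "", value.lower())
    return " ".join(token.split())
```

With this, a query for "ear phones" collides with a document whose whole field
is "Ear Phones!" but not with "Sony Ear Phones", which is exactly the
exact-match boost being asked for.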


Re: Solr join between documents

2016-05-21 Thread elisabeth benoit
Ok, thanks for your answer! That's what I thought but just wanted to be
sure.

Best regards,
Elisabeth

2016-05-21 2:02 GMT+02:00 Erick Erickson <erickerick...@gmail.com>:

> Gosh, I'm not even sure how to start to form such a query.
>
> Let's see, you have StreetB in some city identified by postal code P.
>
> Is what you're wanting "return me all pairs of documents within that
> postal code that have all the terms matching and the polygons enclosing
> those streets plus some distance intersect"?
>
> Seems difficult.
>
> Best,
> Erick
>
> On Thu, May 19, 2016 at 8:35 AM, elisabeth benoit
> <elisaelisael...@gmail.com> wrote:
> > Hello all,
> >
> > I was wondering if there was a solr solution for a problem I have (and
> I'm
> > not the only one I guess)
> >
> > We use solr as a search engine for addresses. We sometimes have requests
> > with let's say for instance
> >
> > street A close to street B City postcode
> >
> > I was wondering if some kind of join between two documents is possible in
> > solr?
> >
> > The query would be: find union of two documents matching all words in
> query.
> >
> > Those documents have a latitude and a longitude, and we would fix a max
> > distance between two documents to be eligible for a join.
> >
> > Is there a way to do this?
> >
> > Best regards,
> > Elisabeth
>


Solr join between documents

2016-05-19 Thread elisabeth benoit
Hello all,

I was wondering if there was a solr solution for a problem I have (and I'm
not the only one I guess)

We use solr as a search engine for addresses. We sometimes have requests
with let's say for instance

street A close to street B City postcode

I was wondering if some kind of join between two documents is possible in
solr?

The query would be: find union of two documents matching all words in query.

Those documents have a latitude and a longitude, and we would fix a max
distance between two documents to be eligible for a join.

Is there a way to do this?

Best regards,
Elisabeth
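
Pending a real join, one workable pattern is to fetch candidates for each
street separately and pair them client-side under the distance cap. The
distance check itself is plain haversine; a sketch assuming each document
carries latitude/longitude fields:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres between two (lat, lon) points,
    # used here to enforce a "max distance" condition between candidates.
    rlat1, rlon1, rlat2, rlon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((rlat2 - rlat1) / 2) ** 2
         + cos(rlat1) * cos(rlat2) * sin((rlon2 - rlon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))
```

Pairs whose distance exceeds the chosen maximum are simply discarded before
merging the two result sets.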


deactivate coord scoring factor in pf2 pf3

2016-04-28 Thread elisabeth benoit
Hello all,

I am using Solr 4.10.1. I use edismax, with pf2 to boost documents starting
with the query. I use a start-with token (b) automatically added at index
time, and added to the request at query time.

I have a problem at this point.

request is *q=b saint denis rer*

the start with field is name_sw

first document *name_sw: Saint-Denis-Université*
second document *name_sw: RER Saint-Denis*

So one will get the pf2 start-with boost and not the other. The problem
is that this has an effect on the pf2 scoring for all other words.

In other words, my problem is that the proximity between "saint" and "denis"
is not scored the same for those two documents.

From what I get, this is because of the coord scoring factor used for pf2.

In explain output, for first document

0.52612317 Matches Punished by 0.667 (not all query terms matched)
   0.78918475 Sum of the following:
 0.39459237 names_sw:"b saint"^0.21

 0.39459237 Dismax (take winner of below)
   0.39459237 names_sw:"saint denis"^0.21

   0.37580228 catchall:"saint den"^0.2


*So here, matches punished by 0.66*, which corresponds to coord(2/3)

and final score pf2 for proximity between saint and denis

0.263061593153079 names_sw:"saint denis"^0.21


In explain output, for second document


 0.13153079 Matches Punished by 0.3334 (not all query terms matched)
   0.39459237 Dismax (take winner of below)
 0.39459237 names_sw:"saint denis"^0.21

 0.37580228 catchall:"saint den"^0.2


*So here matches punished by 0.33*, which corresponds to coord(1/3)

and final score pf2 for proximity between saint and denis

0.1315307926306158 names_sw:"saint denis"^0.21


I would like to deactivate coord for pf2 pf3. Does anyone know how I
could do this?


Best regards,

Elisabeth
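
The "Matches Punished by 0.667 / 0.3334" lines in the explain output above are
Lucene's classic coord factor: the summed clause scores are multiplied by
matched/total clauses. A toy reproduction of that arithmetic (my illustration,
not Solr code):

```python
def coord_scored(matching_clause_scores, total_clauses):
    # Classic Lucene coord: sum the scores of the clauses that matched,
    # then multiply by matched/total ("Matches Punished by ..." above).
    matched = len([s for s in matching_clause_scores if s > 0])
    return sum(matching_clause_scores) * matched / total_clauses
```

This reproduces both scores from the explain output (0.78918475 * 2/3 gives
roughly 0.52612317, and 0.39459237 * 1/3 gives roughly 0.13153079), which is
why documents matching different numbers of pf2 clauses end up with different
proximity scores.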


Re: ngrams with position

2016-03-11 Thread elisabeth benoit
Jack, Emir,

Thanks for your answers. Moving ngram logic to client side would be a fast
and easy way to test the solution and compare it with the phonetic one.

Best regards,
Elisabeth
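
Moving the ngram logic to the client, as Emir suggests, could look roughly
like this: generate the grams in application code and pass them to edismax,
letting pf2/pf3 reward grams that occur adjacent to each other, i.e. in the
same relative order as in the query. The field name "grams" is hypothetical:

```python
def ngram_query_params(text, n=3, field="grams"):
    # Client-side gram generation: pad with '_' so word boundaries are
    # represented, then hand the grams to edismax as individual terms.
    padded = "_" * (n - 1) + text + "_"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return {
        "defType": "edismax",
        "q": " ".join(grams),
        "qf": field,
        "pf2": field,  # boost adjacent gram pairs
        "pf3": field,  # boost adjacent gram triples
    }
```

The same gram generation would have to run at index time so that the "grams"
field contains matching tokens.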

2016-03-11 10:52 GMT+01:00 Emir Arnautovic <emir.arnauto...@sematext.com>:

> Hi Elizabeth,
> In order to see if you will get better results, you can move ngram logic
> outside of analysis chain - simplest solution is to move it to client. In
> such setup, you should be able to use pf2 and pf3 and see if that produces
> desired result.
>
> Regards,
> Emir
>
>
> On 10.03.2016 13:47, elisabeth benoit wrote:
>
>> oh yeah, now that you're saying it, yeah you're right, pf2 pf3 will boost
>> proximity between words, not between ngrams.
>>
>> Thanks again,
>> Elisabeth
>>
>> 2016-03-10 12:31 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:
>>
>> The reason pf2 and pf3 seems not a good solution to me is the fact that
>>> the
>>> edismax query parser calculate those grams on top of words shingles.
>>> So it takes the query in input, and produces the shingle based on the
>>> white
>>> space separator.
>>>
>>> i.e. if you search :
>>> "white tiger jumping"
>>>   and pf2 configured on field1.
>>> You are going to end up searching in field1 :
>>> "white tiger", "tiger jumping" .
>>> This is really useful in full text search oriented to phrases and partial
>>> phrases match.
>>> But it has nothing to do with the analysis type associated at query time
>>> at
>>> this moment.
>>> First it is used the query parser tokenisation to build the grams and
>>> then
>>> the query time analysis is applied.
>>> This according to my remembering,
>>> I will double check in the code and let you know.
>>>
>>> Cheers
>>>
>>>
>>> On 10 March 2016 at 11:02, elisabeth benoit <elisaelisael...@gmail.com>
>>> wrote:
>>>
>>> That's the use cas, yes. Find Amsterdam with Asmtreadm.
>>>>
>>>> And yes, we're only doing approximative search if we get 0 result.
>>>>
>>>> I don't quite get why pf2 pf3 not a good solution.
>>>>
>>>> We're actually testing a solution close to phonetic. Some kind of word
>>>> reduction.
>>>>
>>>> Thanks for the suggestion (and the link), this makes me think maybe
>>>> phonetic is the good solution.
>>>>
>>>> Thanks for your help,
>>>> Elisabeth
>>>>
>>>> 2016-03-10 11:32 GMT+01:00 Alessandro Benedetti <abenede...@apache.org
>>>> >:
>>>>
>>>>  If I followed your use case is:
>>>>>
>>>>> I type Asmtreadm and I want document matching Amsterdam ( even if the
>>>>>
>>>> edit
>>>>
>>>>> distance is greater than 2) .
>>>>> First of all is something I hope you do only if you get 0 results, if
>>>>>
>>>> not
>>>
>>>> the overhead can be great and you are going to lose a lot of precision
>>>>> causing confusion in the customer.
>>>>>
>>>>> Pf2 and Pf3 is ngram of white space separated tokens, to make partial
>>>>> phrase query to affect the scoring.
>>>>> Not a good fit for your problem.
>>>>>
>>>>> More than grams, have you considered using some sort of phonetic
>>>>>
>>>> matching ?
>>>>
>>>>> Could this help :
>>>>> https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 10 March 2016 at 08:47, elisabeth benoit <elisaelisael...@gmail.com
>>>>> wrote:
>>>>>
>>>>> I am trying to do approximative search with solr. We've tried fuzzy
>>>>>>
>>>>> search,
>>>>>
>>>>>> and spellcheck search, it's working ok but edit distance is limited
>>>>>>
>>>>> (to 2
>>>>
>>>>> for DirectSolrSpellChecker in solr 4.10.1). With fuzzy operator,
>>>>>>
>>>>> we've
>>>
>>>> had
>>>>>
>>>>>> performance issues, and I don't think you can have an edit distance
>>>>>>
>>>>> more
>>>>
>>>>> than 2.
>>>

Re: ngrams with position

2016-03-10 Thread elisabeth benoit
oh yeah, now that you're saying it, yeah you're right, pf2 pf3 will boost
proximity between words, not between ngrams.

Thanks again,
Elisabeth

2016-03-10 12:31 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:

> The reason pf2 and pf3 seems not a good solution to me is the fact that the
> edismax query parser calculate those grams on top of words shingles.
> So it takes the query in input, and produces the shingle based on the white
> space separator.
>
> i.e. if you search :
> "white tiger jumping"
>  and pf2 configured on field1.
> You are going to end up searching in field1 :
> "white tiger", "tiger jumping" .
> This is really useful in full text search oriented to phrases and partial
> phrases match.
> But it has nothing to do with the analysis type associated at query time at
> this moment.
> First it is used the query parser tokenisation to build the grams and then
> the query time analysis is applied.
> This according to my remembering,
> I will double check in the code and let you know.
>
> Cheers
>
>
> On 10 March 2016 at 11:02, elisabeth benoit <elisaelisael...@gmail.com>
> wrote:
>
> > That's the use cas, yes. Find Amsterdam with Asmtreadm.
> >
> > And yes, we're only doing approximative search if we get 0 result.
> >
> > I don't quite get why pf2 pf3 not a good solution.
> >
> > We're actually testing a solution close to phonetic. Some kind of word
> > reduction.
> >
> > Thanks for the suggestion (and the link), this makes me think maybe
> > phonetic is the good solution.
> >
> > Thanks for your help,
> > Elisabeth
> >
> > 2016-03-10 11:32 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:
> >
> > >  If I followed your use case is:
> > >
> > > I type Asmtreadm and I want document matching Amsterdam ( even if the
> > edit
> > > distance is greater than 2) .
> > > First of all is something I hope you do only if you get 0 results, if
> not
> > > the overhead can be great and you are going to lose a lot of precision
> > > causing confusion in the customer.
> > >
> > > Pf2 and Pf3 is ngram of white space separated tokens, to make partial
> > > phrase query to affect the scoring.
> > > Not a good fit for your problem.
> > >
> > > More than grams, have you considered using some sort of phonetic
> > matching ?
> > > Could this help :
> > > https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
> > >
> > > Cheers
> > >
> > > On 10 March 2016 at 08:47, elisabeth benoit <elisaelisael...@gmail.com
> >
> > > wrote:
> > >
> > > > I am trying to do approximative search with solr. We've tried fuzzy
> > > search,
> > > > and spellcheck search, it's working ok but edit distance is limited
> > (to 2
> > > > for DirectSolrSpellChecker in solr 4.10.1). With fuzzy operator,
> we've
> > > had
> > > > performance issues, and I don't think you can have an edit distance
> > more
> > > > than 2.
> > > >
> > > > What we used to do with a database was more efficient: storing
> trigrams
> > > > with position, and then searching arround that position (not
> precisely
> > at
> > > > that position, since it's approximative search)
> > > >
> > > > Position is to avoid  for a trigram like ams (amsterdam) to get
> answers
> > > > where the same trigram is for instance at the end of the word. I
> would
> > > like
> > > > answers with the same relative position between trigrams to score
> > higher.
> > > > Maybe using edismax'ss pf2 and pf3 is a way to do this. I don't see
> any
> > > > other way. Please tell me if you do.
> > > >
> > > > From you're answer, I get that position is stored, but I dont
> > understand
> > > > how I can preserve relative order between trigrams, apart from using
> > pf2
> > > > pf3.
> > > >
> > > > Best regards,
> > > > Elisabeth
> > > >
> > > > 2016-03-10 0:02 GMT+01:00 Alessandro Benedetti <
> abenede...@apache.org
> > >:
> > > >
> > > > > if you store the positions for your tokens ( and it is by default
> if
> > > you
> > > > > don't omit them), you have the relative position in the index. [1]
> > > > > I attach a blog post of mine, describing a little bit more in
> details the lucene internals.
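
Alessandro's description of how edismax builds pf2/pf3 clauses, namely word
shingles over the whitespace-split raw query, can be sketched in a few lines
(an illustration, not the edismax source):

```python
def word_shingles(query, n):
    # edismax whitespace-splits the raw query first, then builds the
    # pf2/pf3 phrase clauses from sliding windows of n words; field
    # analysis is applied only afterwards.
    words = query.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# word_shingles("white tiger jumping", 2) -> ['white tiger', 'tiger jumping']
```

This is why pf2/pf3 operate on whole words and cannot preserve relative order
between character trigrams produced later by the analysis chain.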

Re: ngrams with position

2016-03-10 Thread elisabeth benoit
That's the use case, yes. Find Amsterdam with Asmtreadm.

And yes, we're only doing approximative search if we get 0 result.

I don't quite get why pf2 pf3 not a good solution.

We're actually testing a solution close to phonetic. Some kind of word
reduction.

Thanks for the suggestion (and the link), this makes me think maybe
phonetic is the good solution.

Thanks for your help,
Elisabeth

2016-03-10 11:32 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:

>  If I followed your use case is:
>
> I type Asmtreadm and I want document matching Amsterdam ( even if the edit
> distance is greater than 2) .
> First of all is something I hope you do only if you get 0 results, if not
> the overhead can be great and you are going to lose a lot of precision
> causing confusion in the customer.
>
> Pf2 and Pf3 is ngram of white space separated tokens, to make partial
> phrase query to affect the scoring.
> Not a good fit for your problem.
>
> More than grams, have you considered using some sort of phonetic matching ?
> Could this help :
> https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
>
> Cheers
>
> On 10 March 2016 at 08:47, elisabeth benoit <elisaelisael...@gmail.com>
> wrote:
>
> > I am trying to do approximative search with solr. We've tried fuzzy
> search,
> > and spellcheck search, it's working ok but edit distance is limited (to 2
> > for DirectSolrSpellChecker in solr 4.10.1). With fuzzy operator, we've
> had
> > performance issues, and I don't think you can have an edit distance more
> > than 2.
> >
> > What we used to do with a database was more efficient: storing trigrams
> > with position, and then searching arround that position (not precisely at
> > that position, since it's approximative search)
> >
> > Position is to avoid  for a trigram like ams (amsterdam) to get answers
> > where the same trigram is for instance at the end of the word. I would
> like
> > answers with the same relative position between trigrams to score higher.
> > Maybe using edismax'ss pf2 and pf3 is a way to do this. I don't see any
> > other way. Please tell me if you do.
> >
> > From you're answer, I get that position is stored, but I dont understand
> > how I can preserve relative order between trigrams, apart from using pf2
> > pf3.
> >
> > Best regards,
> > Elisabeth
> >
> > 2016-03-10 0:02 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:
> >
> > > if you store the positions for your tokens ( and it is by default if
> you
> > > don't omit them), you have the relative position in the index. [1]
> > > I attach a blog post of mine, describing a little bit more in details
> the
> > > lucene internals.
> > >
> > > Apart from that, can you explain the problem you are trying to solve ?
> > > The high level user experience ?
> > > What kind of search/autocompletion/relevancy tuning are you trying to
> > > achieve ?
> > > Maybe we can help better if we start from the problem :)
> > >
> > > Cheers
> > >
> > > [1]
> > >
> > >
> >
> http://alexbenedetti.blogspot.co.uk/2015/07/exploring-solr-internals-lucene.html
> > >
> > > On 9 March 2016 at 15:02, elisabeth benoit <elisaelisael...@gmail.com>
> > > wrote:
> > >
> > > > Hello Alessandro,
> > > >
> > > > You may be right. What would you use to keep relative order between,
> > for
> > > > instance, grams
> > > >
> > > > __a
> > > > _am
> > > > ams
> > > > mst
> > > > ste
> > > > ter
> > > > erd
> > > > rda
> > > > dam
> > > > am_
> > > >
> > > > of amsterdam? pf2 and pf3? That's all I can think about. Please let
> me
> > > know
> > > > if you have more insights.
> > > >
> > > > Best regards,
> > > > Elisabeth
> > > >
> > > > 2016-03-08 17:46 GMT+01:00 Alessandro Benedetti <
> abenede...@apache.org
> > >:
> > > >
> > > > > Elizabeth,
> > > > > out of curiousity, could we know what you are trying to solve with
> > that
> > > > > complex way of tokenisation ?
> > > > > Solr is really good in storing positions along with token, so I am
> > > > curious
> > > > > to know why your are mixing the things up.
> > > > >
> > > > > Cheers
> > > > >
> > > > > On 

Re: ngrams with position

2016-03-10 Thread elisabeth benoit
I am trying to do approximate search with solr. We've tried fuzzy search and
spellcheck search; they work ok, but the edit distance is limited (to 2 for
DirectSolrSpellChecker in solr 4.10.1). With the fuzzy operator, we've had
performance issues, and I don't think you can have an edit distance of more
than 2.

What we used to do with a database was more efficient: storing trigrams
with position, and then searching around that position (not precisely at
that position, since it's approximate search).

The position is there to avoid, for a trigram like ams (amsterdam), getting
answers where the same trigram appears, say, at the end of the word. I would
like answers with the same relative position between trigrams to score higher.
Maybe using edismax's pf2 and pf3 is a way to do this. I don't see any other
way. Please tell me if you do.

From your answer, I get that the position is stored, but I don't understand
how I can preserve the relative order between trigrams, apart from using pf2
and pf3.

Best regards,
Elisabeth

2016-03-10 0:02 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:

> if you store the positions for your tokens ( and it is by default if you
> don't omit them), you have the relative position in the index. [1]
> I attach a blog post of mine, describing a little bit more in details the
> lucene internals.
>
> Apart from that, can you explain the problem you are trying to solve ?
> The high level user experience ?
> What kind of search/autocompletion/relevancy tuning are you trying to
> achieve ?
> Maybe we can help better if we start from the problem :)
>
> Cheers
>
> [1]
>
> http://alexbenedetti.blogspot.co.uk/2015/07/exploring-solr-internals-lucene.html
>
> On 9 March 2016 at 15:02, elisabeth benoit <elisaelisael...@gmail.com>
> wrote:
>
> > Hello Alessandro,
> >
> > You may be right. What would you use to keep relative order between, for
> > instance, grams
> >
> > __a
> > _am
> > ams
> > mst
> > ste
> > ter
> > erd
> > rda
> > dam
> > am_
> >
> > of amsterdam? pf2 and pf3? That's all I can think about. Please let me
> know
> > if you have more insights.
> >
> > Best regards,
> > Elisabeth
> >
> > 2016-03-08 17:46 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:
> >
> > > Elizabeth,
> > > out of curiousity, could we know what you are trying to solve with that
> > > complex way of tokenisation ?
> > > Solr is really good in storing positions along with token, so I am
> > curious
> > > to know why your are mixing the things up.
> > >
> > > Cheers
> > >
> > > On 8 March 2016 at 10:08, elisabeth benoit <elisaelisael...@gmail.com>
> > > wrote:
> > >
> > > > Thanks for your answer Emir,
> > > >
> > > > I'll check that out.
> > > >
> > > > Best regards,
> > > > Elisabeth
> > > >
> > > > 2016-03-08 10:24 GMT+01:00 Emir Arnautovic <
> > emir.arnauto...@sematext.com
> > > >:
> > > >
> > > > > Hi Elisabeth,
> > > > > I don't think there is such token filter, so you would have to
> create
> > > > your
> > > > > own token filter that takes token and emits ngram token of specific
> > > > length.
> > > > > It should not be too hard to create such filter - you can take a
> look
> > > how
> > > > > nagram filter is coded - yours should be simpler than that.
> > > > >
> > > > > Regards,
> > > > > Emir
> > > > >
> > > > >
> > > > > On 08.03.2016 08:52, elisabeth benoit wrote:
> > > > >
> > > > >> Hello,
> > > > >>
> > > > >> I'm using solr 4.10.1. I'd like to index words with ngrams of fix
> > > lenght
> > > > >> with a position in the end.
> > > > >>
> > > > >> For instance, with fix lenght 3, Amsterdam would be something
> like:
> > > > >>
> > > > >>
> > > > >> __a0 (two spaces, shown here as _, added at the beginning)
> > > > >> _am1
> > > > >> ams2
> > > > >> mst3
> > > > >> ste4
> > > > >> ter5
> > > > >> erd6
> > > > >> rda7
> > > > >> dam8
> > > > >> am_9 (one more space at the end)
> > > > >>
> > > > >> The number at the end being the position.
> > > > >>
> > > > >> Does anyone have a clue how to achieve this?
> > > > >>
> > > > >> Best regards,
> > > > >> Elisabeth
> > > > >>
> > > > >>
> > > > > --
> > > > > Monitoring * Alerting * Anomaly Detection * Centralized Log
> > Management
> > > > > Solr & Elasticsearch Support * http://sematext.com/
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > --
> > >
> > > Benedetti Alessandro
> > > Visiting card : http://about.me/alessandro_benedetti
> > >
> > > "Tyger, tyger burning bright
> > > In the forests of the night,
> > > What immortal hand or eye
> > > Could frame thy fearful symmetry?"
> > >
> > > William Blake - Songs of Experience -1794 England
> > >
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: ngrams with position

2016-03-09 Thread elisabeth benoit
Hello Alessandro,

You may be right. What would you use to keep relative order between, for
instance, grams

__a
_am
ams
mst
ste
ter
erd
rda
dam
am_

of amsterdam? pf2 and pf3? That's all I can think about. Please let me know
if you have more insights.

Best regards,
Elisabeth

2016-03-08 17:46 GMT+01:00 Alessandro Benedetti <abenede...@apache.org>:

> Elizabeth,
> out of curiousity, could we know what you are trying to solve with that
> complex way of tokenisation ?
> Solr is really good in storing positions along with token, so I am curious
> to know why your are mixing the things up.
>
> Cheers
>
> On 8 March 2016 at 10:08, elisabeth benoit <elisaelisael...@gmail.com>
> wrote:
>
> > Thanks for your answer Emir,
> >
> > I'll check that out.
> >
> > Best regards,
> > Elisabeth
> >
> > 2016-03-08 10:24 GMT+01:00 Emir Arnautovic <emir.arnauto...@sematext.com
> >:
> >
> > > Hi Elisabeth,
> > > I don't think there is such token filter, so you would have to create
> > your
> > > own token filter that takes token and emits ngram token of specific
> > length.
> > > It should not be too hard to create such filter - you can take a look
> how
> > > nagram filter is coded - yours should be simpler than that.
> > >
> > > Regards,
> > > Emir
> > >
> > >
> > > On 08.03.2016 08:52, elisabeth benoit wrote:
> > >
> > >> Hello,
> > >>
> > >> I'm using solr 4.10.1. I'd like to index words with ngrams of fix
> lenght
> > >> with a position in the end.
> > >>
> > >> For instance, with fix lenght 3, Amsterdam would be something like:
> > >>
> > >>
> > >> __a0 (two spaces, shown here as _, added at the beginning)
> > >> _am1
> > >> ams2
> > >> mst3
> > >> ste4
> > >> ter5
> > >> erd6
> > >> rda7
> > >> dam8
> > >> am_9 (one more space at the end)
> > >>
> > >> The number at the end being the position.
> > >>
> > >> Does anyone have a clue how to achieve this?
> > >>
> > >> Best regards,
> > >> Elisabeth
> > >>
> > >>
> > > --
> > > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > > Solr & Elasticsearch Support * http://sematext.com/
> > >
> > >
> >
>
>
>
> --
> --
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
>


Re: ngrams with position

2016-03-08 Thread elisabeth benoit
Thanks for your answer Emir,

I'll check that out.

Best regards,
Elisabeth

2016-03-08 10:24 GMT+01:00 Emir Arnautovic <emir.arnauto...@sematext.com>:

> Hi Elisabeth,
> I don't think there is such token filter, so you would have to create your
> own token filter that takes token and emits ngram token of specific length.
> It should not be too hard to create such filter - you can take a look how
> nagram filter is coded - yours should be simpler than that.
>
> Regards,
> Emir
>
>
> On 08.03.2016 08:52, elisabeth benoit wrote:
>
>> Hello,
>>
>> I'm using solr 4.10.1. I'd like to index words with ngrams of fix lenght
>> with a position in the end.
>>
>> For instance, with fix lenght 3, Amsterdam would be something like:
>>
>>
>> __a0 (two spaces, shown here as _, added at the beginning)
>> _am1
>> ams2
>> mst3
>> ste4
>> ter5
>> erd6
>> rda7
>> dam8
>> am_9 (one more space at the end)
>>
>> The number at the end being the position.
>>
>> Does anyone have a clue how to achieve this?
>>
>> Best regards,
>> Elisabeth
>>
>>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
>


ngrams with position

2016-03-07 Thread elisabeth benoit
Hello,

I'm using solr 4.10.1. I'd like to index words as ngrams of fixed length,
with a position at the end.

For instance, with fix lenght 3, Amsterdam would be something like:


__a0 (two spaces, shown here as _, added at the beginning)
_am1
ams2
mst3
ste4
ter5
erd6
rda7
dam8
am_9 (one more space at the end)

The number at the end being the position.

Does anyone have a clue how to achieve this?

Best regards,
Elisabeth
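
The tokenisation asked for above can be prototyped outside Solr in a few
lines, writing the padding spaces as underscores for readability (a
client-side sketch, not a Solr token filter):

```python
def positional_ngrams(word, n=3, pad="_"):
    # Pad so the first gram anchors n-1 positions before the word and the
    # last gram one position after it, then append each gram's offset.
    padded = pad * (n - 1) + word + pad
    return [padded[i:i + n] + str(i) for i in range(len(padded) - n + 1)]

# positional_ngrams("amsterdam") returns:
# ['__a0', '_am1', 'ams2', 'mst3', 'ste4', 'ter5', 'erd6', 'rda7', 'dam8', 'am_9']
```

Emitting these strings as pre-analysed tokens (or building them in a custom
token filter, as suggested later in the thread) gives each gram its position
baked into the term itself.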


Re: Boost exact search

2016-02-22 Thread elisabeth benoit
Hello,

There was a discussion on this thread about exact match

http://www.mail-archive.com/solr-user%40lucene.apache.org/msg118115.html


they mention an example on this page


https://github.com/cominvent/exactmatch


Best regards,
Elisabeth

2016-02-19 18:01 GMT+01:00 Loïc Stéphan :

> Hello,
>
>
>
> We try to boost exact search to improve relevance.
>
> We followed this article :
> http://everydaydeveloper.blogspot.fr/2012/02/solr-improve-relevancy-by-boosting.html
> and this
> http://stackoverflow.com/questions/29103155/solr-exact-match-boost-over-text-containing-the-exact-match
>  but it doesn’t work for us.
>
>
>
> What is the best way to do this ?
>
>
>
> Thanks in advance
>
>
>
>
> *--*
>
> *LOIC STEPHAN*
> Responsable TMA
>
> *www.w-seils.com*
>
> *lstep...@w-seils.com*
> Tel *+33 (0)2 28 22 75 42*
>
>
>
>
>


Re: solr 4.10 I change slop in pf2 pf3 and query norm changes

2015-12-21 Thread elisabeth benoit
Hello,

Yes, in the second case I get one document with a higher score. The relative
scoring between documents is not the same anymore.

Best regards,
Elisabeth

2015-12-22 4:39 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:

> I have one query.
> In the second case do you get two records with the same lower scores or
> just one record with a lower score and the other with a higher one?
>
> On Mon, 21 Dec 2015, 18:45 elisabeth benoit <elisaelisael...@gmail.com>
> wrote:
>
> > Hello,
> >
> > I don't think the query is important in this case.
> >
> > After checking out solr's debug output, I dont think the query norm is
> > relevant either.
> >
> > I think the scoring changes because
> >
> > 1) in the first case, I have the same slop for catchall and name fields.
> > Both match pf2 pf3. In this case, solr uses the max of both for scoring
> > pf2 pf3 results.
> >
> > 2) In the second case, I have different slops, so solr uses the sum of
> > values instead of the max.
> >
> >
> >
> > If anyone knows how to work around this, please let me know.
> >
> > Elisabeth
> >
> > 2015-12-21 11:22 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:
> >
> > > What is your query?
> > >
> > > On Mon, 21 Dec 2015, 14:37 elisabeth benoit <elisaelisael...@gmail.com
> >
> > > wrote:
> > >
> > > > Hello all,
> > > >
> > > > I am using solr 4.10.1 and I have configured my pf2 pf3 like this
> > > >
> > > > pf2=catchall~0^0.2 name~0^0.21 synonyms^0.2
> > > > pf3=catchall~0^0.2 name~0^0.21 synonyms^0.2
> > > >
> > > > my search field (qf) is my catchall field
> > > >
> > > > I'v been trying to change slop in pf2, pf3 for catchall and synonyms
> > > (going
> > > > from 0, or default value for synonyms, to 1)
> > > >
> > > > pf2=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > > > pf3=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > > >
> > > > but some results are not ordered the same way anymore even if I get
> the
> > > > same MATCH values in debugQuery output
> > > >
> > > > For instance, for a doc matching bastill in catchall field (and
> nothing
> > > to
> > > > do with pf2, pf3!)
> > > >
> > > > with first pf2, pf3
> > > >
> > > > 0.5163083 = (MATCH) weight(catchall:bastill in 105256)
> > > [NoTFIDFSimilarity],
> > > > result of:
> > > >* 0.5163083 = score(doc=105256,freq=2.0 = termFreq=2.0*
> > > > ), product of:
> > > >  * 0.5163083 = queryWeight,* product of:
> > > > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > > > 0.5163083 = queryNorm
> > > >   1.0 = fieldWeight in 105256, product of:
> > > > 1.0 = tf(freq=2.0), with freq of:
> > > >   2.0 = termFreq=2.0
> > > > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > > > 1.0 = fieldNorm(doc=105256)
> > > >   0.5163083 = (MATCH) weight(catchall:paris in 105256)
> > > > [NoTFIDFSimilarity], result of:
> > > > 0.5163083 = score(doc=105256,freq=6.0 = termFreq=6.0
> > > >
> > > > and when I change pf2 pf3 (the only change, same query, same docs)
> > > >
> > > > 0.47504464 = (MATCH) weight(catchall:paris in 105256)
> > > [NoTFIDFSimilarity],
> > > > result of:
> > > >* 0.47504464 = score(doc=105256,freq=6.0 = termFreq=6.0*
> > > > ), product of:
> > > >  * 0.47504464 = queryWeight*, product of:
> > > > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > > > 0.47504464 = queryNorm
> > > >   1.0 = fieldWeight in 105256, product of:
> > > > 1.0 = tf(freq=6.0), with freq of:
> > > >   6.0 = termFreq=6.0
> > > > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > > > 1.0 = fieldNorm(doc=105256)
> > > >
> > > > so in the end, with same MATCH results, in first case I get two
> > documents
> > > > with same score, and in second case, one document has a higher score.
> > > >
> > > > This seem very very strange. Does anyone have a clue what's going on?
> > > >
> > > > Thanks
> > > > Elisabeth
> > > >
> > > --
> > > Regards,
> > > Binoy Dalal
> > >
> >
> --
> Regards,
> Binoy Dalal
>


solr 4.10 I change slop in pf2 pf3 and query norm changes

2015-12-21 Thread elisabeth benoit
Hello all,

I am using solr 4.10.1 and I have configured my pf2 pf3 like this

catchall~0^0.2 name~0^0.21 synonyms^0.2
catchall~0^0.2 name~0^0.21 synonyms^0.2

my search field (qf) is my catchall field

I've been trying to change the slop in pf2, pf3 for catchall and synonyms (going
from 0, or the default value for synonyms, to 1)

pf2=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
pf3=catchall~1^0.2 name~0^0.21 synonyms~1^0.2

but some results are no longer ordered the same way, even though I get the
same MATCH values in the debugQuery output

For instance, for a doc matching bastill in catchall field (and nothing to
do with pf2, pf3!)

with first pf2, pf3

0.5163083 = (MATCH) weight(catchall:bastill in 105256) [NoTFIDFSimilarity],
result of:
   * 0.5163083 = score(doc=105256,freq=2.0 = termFreq=2.0*
), product of:
 * 0.5163083 = queryWeight,* product of:
1.0 = idf(docFreq=134, maxDocs=12258543)
0.5163083 = queryNorm
  1.0 = fieldWeight in 105256, product of:
1.0 = tf(freq=2.0), with freq of:
  2.0 = termFreq=2.0
1.0 = idf(docFreq=134, maxDocs=12258543)
1.0 = fieldNorm(doc=105256)
  0.5163083 = (MATCH) weight(catchall:paris in 105256)
[NoTFIDFSimilarity], result of:
0.5163083 = score(doc=105256,freq=6.0 = termFreq=6.0

and when I change pf2 pf3 (the only change, same query, same docs)

0.47504464 = (MATCH) weight(catchall:paris in 105256) [NoTFIDFSimilarity],
result of:
   * 0.47504464 = score(doc=105256,freq=6.0 = termFreq=6.0*
), product of:
 * 0.47504464 = queryWeight*, product of:
1.0 = idf(docFreq=10958, maxDocs=12258543)
0.47504464 = queryNorm
  1.0 = fieldWeight in 105256, product of:
1.0 = tf(freq=6.0), with freq of:
  6.0 = termFreq=6.0
1.0 = idf(docFreq=10958, maxDocs=12258543)
1.0 = fieldNorm(doc=105256)

so in the end, with the same MATCH results, in the first case I get two documents
with the same score, and in the second case one document has a higher score.

This seems very strange. Does anyone have a clue what's going on?

Thanks
Elisabeth
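The replies in this thread attribute the difference to how the pf2/pf3 field clauses are combined: with identical slops the per-field phrase scores collapse into a max, while with different slops they are summed. Here is a plain-Java sketch of that max-vs-sum distinction (an illustration only, not Solr source; the `combine` helper and the sample scores are invented):

```java
// Illustration of the two ways per-field phrase-boost scores can combine:
// a disjunction-max (tie = 0) keeps only the best field's score, while a
// plain sum (equivalent to tie = 1) adds them all up.
public class DisMaxSketch {
    // score = max(scores) + tie * (sum of the remaining scores)
    static double combine(double tie, double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        for (double s : scores) {
            if (s > max) max = s;
            sum += s;
        }
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        double[] fieldScores = {2.0, 3.0}; // e.g. catchall and name pf2 clauses
        System.out.println(combine(0.0, fieldScores)); // max behaviour: prints 3.0
        System.out.println(combine(1.0, fieldScores)); // sum behaviour: prints 5.0
    }
}
```

Two documents that tied under the max can therefore separate under the sum, which matches the reordering described in this thread.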


Re: solr 4.10 I change slop in pf2 pf3 and query norm changes

2015-12-21 Thread elisabeth benoit
hello,

That's what I did, as I wrote in my mail yesterday. In the first case, solr
computes the max. In the second case, it sums both results.

That's why I don't get the same relative scoring between docs with the same
query.

2015-12-22 8:30 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:

> Unless the content for both the docs is exactly the same it is highly
> unlikely that you will get the same score for the docs under different
> querying conditions. What you saw in the first case may have been a happy
> coincidence.
> Other than that it is very difficult to say why the scoring is different
> without getting a look at the actual query and the doc content.
>
> If you still wish to dig deeper, try to understand how solr actually scores
> documents that match your query. It takes into account a variety of factors
> to compute the cosine similarity to find the best match.
> You can find this formula and a decent explanation for it in the book solr
> in action or online in the lucene docs:
>
> https://lucene.apache.org/core/3_5_0/api/core/org/apache/lucene/search/Similarity.html
>
> On Tue, 22 Dec 2015, 11:10 elisabeth benoit <elisaelisael...@gmail.com>
> wrote:
>
> > hello,
> >
> > yes in the second case I get one document with a higher score. the
> relative
> > scoring between documents is not the same anymore.
> >
> > best regards,
> > elisabeth
> >
> > 2015-12-22 4:39 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:
> >
> > > I have one query.
> > > In the second case do you get two records with the same lower scores or
> > > just one record with a lower score and the other with a higher one?
> > >
> > > On Mon, 21 Dec 2015, 18:45 elisabeth benoit <elisaelisael...@gmail.com
> >
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I don't think the query is important in this case.
> > > >
> > > > After checking out solr's debug output, I dont think the query norm
> is
> > > > relevant either.
> > > >
> > > > I think the scoring changes because
> > > >
> > > > 1) in first case, I have same slop for catchall and name fields. Bot
> > > match
> > > > pf2 pf3. In this case, solr uses max of both for scoring pf2 pf3
> > results.
> > > >
> > > > 2) In second case, I have different slopes, then solr uses sum of
> > values
> > > > instead of max.
> > > >
> > > >
> > > >
> > > > If anyone knows how to work around this, please let me know.
> > > >
> > > > Elisabeth
> > > >
> > > > 2015-12-21 11:22 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:
> > > >
> > > > > What is your query?
> > > > >
> > > > > On Mon, 21 Dec 2015, 14:37 elisabeth benoit <
> > elisaelisael...@gmail.com
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hello all,
> > > > > >
> > > > > > I am using solr 4.10.1 and I have configured my pf2 pf3 like this
> > > > > >
> > > > > > catchall~0^0.2 name~0^0.21 synonyms^0.2
> > > > > > catchall~0^0.2 name~0^0.21 synonyms^0.2
> > > > > >
> > > > > > my search field (qf) is my catchall field
> > > > > >
> > > > > > I'v been trying to change slop in pf2, pf3 for catchall and
> > synonyms
> > > > > (going
> > > > > > from 0, or default value for synonyms, to 1)
> > > > > >
> > > > > > pf2=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > > > > > pf3=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > > > > >
> > > > > > but some results are not ordered the same way anymore even if I
> get
> > > the
> > > > > > same MATCH values in debugQuery output
> > > > > >
> > > > > > For instance, for a doc matching bastill in catchall field (and
> > > nothing
> > > > > to
> > > > > > do with pf2, pf3!)
> > > > > >
> > > > > > with first pf2, pf3
> > > > > >
> > > > > > 0.5163083 = (MATCH) weight(catchall:bastill in 105256)
> > > > > [NoTFIDFSimilarity],
> > > > > > result of:
> > > > > >* 0.5163083 = score(doc=105256,freq=2.0 = termFreq=2.0*
> > > > > > ), product of:
> > > >

Re: solr 4.10 I change slop in pf2 pf3 and query norm changes

2015-12-21 Thread elisabeth benoit
Hello,

I don't think the query is important in this case.

After checking out solr's debug output, I don't think the query norm is
relevant either.

I think the scoring changes because

1) in the first case, I have the same slop for the catchall and name fields.
Both match pf2 pf3. In this case, solr uses the max of both for scoring pf2 pf3
results.

2) In the second case, I have different slops, so solr uses the sum of the
values instead of the max.



If anyone knows how to work around this, please let me know.

Elisabeth

2015-12-21 11:22 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:

> What is your query?
>
> On Mon, 21 Dec 2015, 14:37 elisabeth benoit <elisaelisael...@gmail.com>
> wrote:
>
> > Hello all,
> >
> > I am using solr 4.10.1 and I have configured my pf2 pf3 like this
> >
> > catchall~0^0.2 name~0^0.21 synonyms^0.2
> > catchall~0^0.2 name~0^0.21 synonyms^0.2
> >
> > my search field (qf) is my catchall field
> >
> > I'v been trying to change slop in pf2, pf3 for catchall and synonyms
> (going
> > from 0, or default value for synonyms, to 1)
> >
> > pf2=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> > pf3=catchall~1^0.2 name~0^0.21 synonyms~1^0.2
> >
> > but some results are not ordered the same way anymore even if I get the
> > same MATCH values in debugQuery output
> >
> > For instance, for a doc matching bastill in catchall field (and nothing
> to
> > do with pf2, pf3!)
> >
> > with first pf2, pf3
> >
> > 0.5163083 = (MATCH) weight(catchall:bastill in 105256)
> [NoTFIDFSimilarity],
> > result of:
> >* 0.5163083 = score(doc=105256,freq=2.0 = termFreq=2.0*
> > ), product of:
> >  * 0.5163083 = queryWeight,* product of:
> > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > 0.5163083 = queryNorm
> >   1.0 = fieldWeight in 105256, product of:
> > 1.0 = tf(freq=2.0), with freq of:
> >   2.0 = termFreq=2.0
> > 1.0 = idf(docFreq=134, maxDocs=12258543)
> > 1.0 = fieldNorm(doc=105256)
> >   0.5163083 = (MATCH) weight(catchall:paris in 105256)
> > [NoTFIDFSimilarity], result of:
> > 0.5163083 = score(doc=105256,freq=6.0 = termFreq=6.0
> >
> > and when I change pf2 pf3 (the only change, same query, same docs)
> >
> > 0.47504464 = (MATCH) weight(catchall:paris in 105256)
> [NoTFIDFSimilarity],
> > result of:
> >* 0.47504464 = score(doc=105256,freq=6.0 = termFreq=6.0*
> > ), product of:
> >  * 0.47504464 = queryWeight*, product of:
> > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > 0.47504464 = queryNorm
> >   1.0 = fieldWeight in 105256, product of:
> > 1.0 = tf(freq=6.0), with freq of:
> >   6.0 = termFreq=6.0
> > 1.0 = idf(docFreq=10958, maxDocs=12258543)
> > 1.0 = fieldNorm(doc=105256)
> >
> > so in the end, with same MATCH results, in first case I get two documents
> > with same score, and in second case, one document has a higher score.
> >
> > This seem very very strange. Does anyone have a clue what's going on?
> >
> > Thanks
> > Elisabeth
> >
> --
> Regards,
> Binoy Dalal
>


Re: pf2 pf3 and stopwords

2015-12-18 Thread elisabeth benoit
ok, thanks a lot for your advice.

i'll try that.



2015-12-17 10:05 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:

> For this case of inversion in particular a slop of 1 won't cause any issues
> since such a reverse match will require the slop to be 2
>
> On Thu, 17 Dec 2015, 14:20 elisabeth benoit <elisaelisael...@gmail.com>
> wrote:
>
> > Inversion (paris charonne or charonne paris) cannot be scored the same.
> >
> > 2015-12-16 11:08 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:
> >
> > > What is your exact use case?
> > >
> > > On Wed, 16 Dec 2015, 13:40 elisabeth benoit <elisaelisael...@gmail.com
> >
> > > wrote:
> > >
> > > > Thanks for your answer.
> > > >
> > > > Actually, using a slop of 1 is something I can't do (because of other
> > > > specifications)
> > > >
> > > > I guess I'll index differently.
> > > >
> > > > Best regards,
> > > > Elisabeth
> > > >
> > > > 2015-12-14 16:24 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:
> > > >
> > > > > Moreover, the stopword de will work on your queries and not on your
> > > > > documents, meaning if you query 'Gare de Saint Lazare', the terms
> > > > actually
> > > > > searched for will be Gare Saint and Lazare, 'de' will be filtered
> > out.
> > > > >
> > > > > On Mon, Dec 14, 2015 at 8:49 PM Binoy Dalal <
> binoydala...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > This isn't a bug. During pf3 matching, since your query has only
> > > three
> > > > > > tokens, the entire query will be treated as a single phrase, and
> > with
> > > > > slop
> > > > > > = 0, any word that comes in the middle of your query  - 'de' in
> > this
> > > > case
> > > > > > will cause the phrase to not be matched. If you want to get
> around
> > > > this,
> > > > > > try setting your slop = 1 in which case it should match Gare
> Saint
> > > > Lazare
> > > > > > even with the de in it.
> > > > > >
> > > > > > On Mon, Dec 14, 2015 at 7:22 PM elisabeth benoit <
> > > > > > elisaelisael...@gmail.com> wrote:
> > > > > >
> > > > > >> Hello,
> > > > > >>
> > > > > >> I am using solr 4.10.1. I have a field with stopwords
> > > > > >>
> > > > > >>
> > > > > >>  > > > > >> words="stopwords.txt"
> > > > > >> enablePositionIncrements="true"/>
> > > > > >>
> > > > > >> And I use pf2 pf3 on that field with a slop of 0.
> > > > > >>
> > > > > >> If the request is "Gare Saint Lazare", and I have a document
> "Gare
> > > de
> > > > > >> Saint
> > > > > >> Lazare", "de" being a stopword, this document doesn't get the
> pf3
> > > > boost,
> > > > > >> because of "de".
> > > > > >>
> > > > > >> I was wondering, is this normal? is this a bug? is something
> wrong
> > > > with
> > > > > my
> > > > > >> configuration?
> > > > > >>
> > > > > >> Best regards,
> > > > > >> Elisabeth
> > > > > >>
> > > > > > --
> > > > > > Regards,
> > > > > > Binoy Dalal
> > > > > >
> > > > > --
> > > > > Regards,
> > > > > Binoy Dalal
> > > > >
> > > >
> > > --
> > > Regards,
> > > Binoy Dalal
> > >
> >
> --
> Regards,
> Binoy Dalal
>


Re: pf2 pf3 and stopwords

2015-12-17 Thread elisabeth benoit
An inversion (paris charonne vs charonne paris) cannot be scored the same.

2015-12-16 11:08 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:

> What is your exact use case?
>
> On Wed, 16 Dec 2015, 13:40 elisabeth benoit <elisaelisael...@gmail.com>
> wrote:
>
> > Thanks for your answer.
> >
> > Actually, using a slop of 1 is something I can't do (because of other
> > specifications)
> >
> > I guess I'll index differently.
> >
> > Best regards,
> > Elisabeth
> >
> > 2015-12-14 16:24 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:
> >
> > > Moreover, the stopword de will work on your queries and not on your
> > > documents, meaning if you query 'Gare de Saint Lazare', the terms
> > actually
> > > searched for will be Gare Saint and Lazare, 'de' will be filtered out.
> > >
> > > On Mon, Dec 14, 2015 at 8:49 PM Binoy Dalal <binoydala...@gmail.com>
> > > wrote:
> > >
> > > > This isn't a bug. During pf3 matching, since your query has only
> three
> > > > tokens, the entire query will be treated as a single phrase, and with
> > > slop
> > > > = 0, any word that comes in the middle of your query  - 'de' in this
> > case
> > > > will cause the phrase to not be matched. If you want to get around
> > this,
> > > > try setting your slop = 1 in which case it should match Gare Saint
> > Lazare
> > > > even with the de in it.
> > > >
> > > > On Mon, Dec 14, 2015 at 7:22 PM elisabeth benoit <
> > > > elisaelisael...@gmail.com> wrote:
> > > >
> > > >> Hello,
> > > >>
> > > >> I am using solr 4.10.1. I have a field with stopwords
> > > >>
> > > >>
> > > >>  > > >> words="stopwords.txt"
> > > >> enablePositionIncrements="true"/>
> > > >>
> > > >> And I use pf2 pf3 on that field with a slop of 0.
> > > >>
> > > >> If the request is "Gare Saint Lazare", and I have a document "Gare
> de
> > > >> Saint
> > > >> Lazare", "de" being a stopword, this document doesn't get the pf3
> > boost,
> > > >> because of "de".
> > > >>
> > > >> I was wondering, is this normal? is this a bug? is something wrong
> > with
> > > my
> > > >> configuration?
> > > >>
> > > >> Best regards,
> > > >> Elisabeth
> > > >>
> > > > --
> > > > Regards,
> > > > Binoy Dalal
> > > >
> > > --
> > > Regards,
> > > Binoy Dalal
> > >
> >
> --
> Regards,
> Binoy Dalal
>


Re: pf2 pf3 and stopwords

2015-12-16 Thread elisabeth benoit
Thanks for your answer.

Actually, using a slop of 1 is something I can't do (because of other
specifications)

I guess I'll index differently.

Best regards,
Elisabeth

2015-12-14 16:24 GMT+01:00 Binoy Dalal <binoydala...@gmail.com>:

> Moreover, the stopword de will work on your queries and not on your
> documents, meaning if you query 'Gare de Saint Lazare', the terms actually
> searched for will be Gare Saint and Lazare, 'de' will be filtered out.
>
> On Mon, Dec 14, 2015 at 8:49 PM Binoy Dalal <binoydala...@gmail.com>
> wrote:
>
> > This isn't a bug. During pf3 matching, since your query has only three
> > tokens, the entire query will be treated as a single phrase, and with
> slop
> > = 0, any word that comes in the middle of your query  - 'de' in this case
> > will cause the phrase to not be matched. If you want to get around this,
> > try setting your slop = 1 in which case it should match Gare Saint Lazare
> > even with the de in it.
> >
> > On Mon, Dec 14, 2015 at 7:22 PM elisabeth benoit <
> > elisaelisael...@gmail.com> wrote:
> >
> >> Hello,
> >>
> >> I am using solr 4.10.1. I have a field with stopwords
> >>
> >>
> >>  >> words="stopwords.txt"
> >> enablePositionIncrements="true"/>
> >>
> >> And I use pf2 pf3 on that field with a slop of 0.
> >>
> >> If the request is "Gare Saint Lazare", and I have a document "Gare de
> >> Saint
> >> Lazare", "de" being a stopword, this document doesn't get the pf3 boost,
> >> because of "de".
> >>
> >> I was wondering, is this normal? is this a bug? is something wrong with
> my
> >> configuration?
> >>
> >> Best regards,
> >> Elisabeth
> >>
> > --
> > Regards,
> > Binoy Dalal
> >
> --
> Regards,
> Binoy Dalal
>


pf2 pf3 and stopwords

2015-12-14 Thread elisabeth benoit
Hello,

I am using solr 4.10.1. I have a field with stopwords

<filter class="solr.StopFilterFactory"
words="stopwords.txt"
enablePositionIncrements="true"/>

And I use pf2 pf3 on that field with a slop of 0.

If the request is "Gare Saint Lazare", and I have a document "Gare de Saint
Lazare", "de" being a stopword, this document doesn't get the pf3 boost,
because of "de".

I was wondering, is this normal? is this a bug? is something wrong with my
configuration?

Best regards,
Elisabeth
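The behaviour can be reasoned about from token positions. With enablePositionIncrements="true", removing the stopword "de" leaves a position gap in the indexed document, so an in-order phrase match needs enough slop to cover that gap. A rough sketch (plain Java, not Lucene source; `minSlopNeeded` is an invented helper that only handles in-order matches):

```java
// Rough illustration of why "Gare de Saint Lazare" misses a slop-0 pf3 match:
// with "de" removed but its position increment preserved, the indexed
// positions are gare=0, saint=2, lazare=3, so matching the in-order phrase
// "gare saint lazare" needs a slop of at least 1.
public class PhraseSlopSketch {
    // Minimal slop needed for an in-order phrase match, given the indexed
    // positions of the query terms: the total size of the position gaps.
    static int minSlopNeeded(int[] positions) {
        int slop = 0;
        for (int i = 1; i < positions.length; i++) {
            slop += positions[i] - positions[i - 1] - 1;
        }
        return slop;
    }

    public static void main(String[] args) {
        System.out.println(minSlopNeeded(new int[]{0, 2, 3})); // "gare de saint lazare" -> prints 1
        System.out.println(minSlopNeeded(new int[]{0, 1, 2})); // "gare saint lazare"    -> prints 0
    }
}
```

So with slop=0 the stopword gap blocks the pf3 boost, while slop=1 would absorb it, which is the workaround proposed in the replies.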


Re: catchall fields or multiple fields

2015-10-14 Thread elisabeth benoit
Thanks for your suggestion Jack. In fact we're doing geographic search
(fields are country, state, county, town, hamlet, district)

So it's difficult to split.

Best regards,
Elisabeth

2015-10-13 16:01 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:

> Performing a sequence of queries can help too. For example, if users
> commonly search for a product name, you could do an initial query on just
> the product name field which should be much faster than searching the text
> of all product descriptions, and highlighting would be less problematic. If
> that initial query comes up empty, then you could move on to the next
> highest most likely field, maybe product title (short one line
> description), and query voluminous fields like detailed product
> descriptions, specifications, and user comments/reviews only as a last
> resort.
>
> -- Jack Krupansky
>
> On Tue, Oct 13, 2015 at 6:17 AM, elisabeth benoit <
> elisaelisael...@gmail.com
> > wrote:
>
> > Thanks to you all for those informed advices.
> >
> > Thanks Trey for your very detailed point of view. This is now very clear
> to
> > me how a search on multiple fields can grow slower than a search on a
> > catchall field.
> >
> > Our actual search model is problematic: we search on a catchall field,
> but
> > need to know which fields match, so we do highlighting on multi fields
> (not
> > indexed, but stored). To improve performance, we want to get rid of
> > highlighting and use the solr explain output. To get the explain output
> on
> > those fields, we need to do a search on those fields.
> >
> > So I guess we have to test if removing highlighting and adding multi
> fields
> > search will improve performances or not.
> >
> > Best regards,
> > Elisabeth
> >
> >
> >
> > 2015-10-12 17:55 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:
> >
> > > I think it may all depend on the nature of your application and how
> much
> > > commonality there is between fields.
> > >
> > > One interesting area is auto-suggest, where you can certainly suggest
> > from
> > > the union of all fields, you may want to give priority to suggestions
> > from
> > > preferred fields. For example, for actual product names or important
> > > keywords rather than random words from the English language that happen
> > to
> > > occur in descriptions, all of which would occur in a catchall.
> > >
> > > -- Jack Krupansky
> > >
> > > On Mon, Oct 12, 2015 at 8:39 AM, elisabeth benoit <
> > > elisaelisael...@gmail.com
> > > > wrote:
> > >
> > > > Hello,
> > > >
> > > > We're using solr 4.10 and storing all data in a catchall field. It
> > seems
> > > to
> > > > me that one good reason for using a catchall field is when using
> > scoring
> > > > with idf (with idf, a word might not have same score in all fields).
> We
> > > got
> > > > rid of idf and are now considering using multiple fields. I remember
> > > > reading somewhere that using a catchall field might speed up
> searching
> > > > time. I was wondering if some of you have any opinion (or experience)
> > > > related to this subject.
> > > >
> > > > Best regards,
> > > > Elisabeth
> > > >
> > >
> >
>


Re: catchall fields or multiple fields

2015-10-13 Thread elisabeth benoit
Thanks to you all for those informed advices.

Thanks Trey for your very detailed point of view. This is now very clear to
me how a search on multiple fields can grow slower than a search on a
catchall field.

Our actual search model is problematic: we search on a catchall field, but
need to know which fields match, so we do highlighting on multi fields (not
indexed, but stored). To improve performance, we want to get rid of
highlighting and use the solr explain output. To get the explain output on
those fields, we need to do a search on those fields.

So I guess we have to test whether removing highlighting and adding multi-field
search improves performance.

Best regards,
Elisabeth



2015-10-12 17:55 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:

> I think it may all depend on the nature of your application and how much
> commonality there is between fields.
>
> One interesting area is auto-suggest, where you can certainly suggest from
> the union of all fields, you may want to give priority to suggestions from
> preferred fields. For example, for actual product names or important
> keywords rather than random words from the English language that happen to
> occur in descriptions, all of which would occur in a catchall.
>
> -- Jack Krupansky
>
> On Mon, Oct 12, 2015 at 8:39 AM, elisabeth benoit <
> elisaelisael...@gmail.com
> > wrote:
>
> > Hello,
> >
> > We're using solr 4.10 and storing all data in a catchall field. It seems
> to
> > me that one good reason for using a catchall field is when using scoring
> > with idf (with idf, a word might not have same score in all fields). We
> got
> > rid of idf and are now considering using multiple fields. I remember
> > reading somewhere that using a catchall field might speed up searching
> > time. I was wondering if some of you have any opinion (or experience)
> > related to this subject.
> >
> > Best regards,
> > Elisabeth
> >
>


catchall fields or multiple fields

2015-10-12 Thread elisabeth benoit
Hello,

We're using solr 4.10 and storing all data in a catchall field. It seems to
me that one good reason for using a catchall field is when using scoring
with idf (with idf, a word might not have same score in all fields). We got
rid of idf and are now considering using multiple fields. I remember
reading somewhere that using a catchall field might speed up searching
time. I was wondering if some of you have any opinion (or experience)
related to this subject.

Best regards,
Elisabeth
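For concreteness, the two setups weighed in this thread look roughly like this as edismax query parameters (the field list comes from the geographic fields mentioned later in the thread; the boosts are invented):

```
# catchall setup: one indexed field holds all the data
qf=catchall

# multi-field setup: search each source field, optionally with boosts
qf=country^2 state^1.5 county town hamlet district
```

The multi-field form also makes per-field match information appear directly in the explain output, which is the motivation discussed in the replies.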


Re: spellcheck enabled but not getting any suggestions.

2015-04-17 Thread elisabeth benoit
Shouldn't you specify a spellcheck.dictionary in your request handler?

Best regards,
Elisabeth

2015-04-17 11:24 GMT+02:00 Derek Poh d...@globalsources.com:

 Hi

 I have enabled spellcheck but am not getting any suggestions with incorrectly
 spelled keywords.
 I added the spellcheck into the /select request handler.

 What steps did I miss out?

 spellcheck list in return result:
 <lst name="spellcheck">
   <lst name="suggestions"/>
 </lst>


 solrconfig.xml:

 <requestHandler name="/select" class="solr.SearchHandler">
   <!-- default values for query parameters can be specified, these
        will be overridden by parameters in the request
     -->
   <lst name="defaults">
     <str name="echoParams">explicit</str>
     <int name="rows">10</int>
     <str name="df">text</str>
     <!-- Spell checking defaults -->
     <str name="spellcheck">on</str>
     <str name="spellcheck.extendedResults">false</str>
     <str name="spellcheck.count">5</str>
     <str name="spellcheck.alternativeTermCount">2</str>
     <str name="spellcheck.maxResultsForSuggest">5</str>
     <str name="spellcheck.collate">true</str>
     <str name="spellcheck.collateExtendedResults">true</str>
     <str name="spellcheck.maxCollationTries">5</str>
     <str name="spellcheck.maxCollations">3</str>
   </lst>

   <!-- append spellchecking to our list of components -->
   <arr name="last-components">
     <str>spellcheck</str>
   </arr>

 </requestHandler>
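Elisabeth's reply above points at the likely gap: the handler defaults never name a dictionary. A hedged sketch of the missing line (the dictionary name "default" is an assumption; it must match a spellchecker declared in the spellcheck searchComponent, which isn't shown here):

```xml
<lst name="defaults">
  ...
  <str name="spellcheck">on</str>
  <str name="spellcheck.dictionary">default</str>
  ...
</lst>
```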





Re: using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1

2015-04-16 Thread elisabeth benoit
For the record, what I finally did was place the words I want spellcheck to
ignore in spellcheck.collateParam.fq and the words I'd like to be checked in
spellcheck.q. The collation query uses spellcheck.collateParam.fq, so all
did_you_mean queries return results containing the words in
spellcheck.collateParam.fq.

Best regards,
Elisabeth
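A sketch of the request shape being described (all values are invented for illustration; the idea is that collation queries inherit spellcheck.collateParam.fq, so the words to keep fixed live there while only spellcheck.q gets corrected):

```
q=paris restarant
spellcheck=true
spellcheck.q=restarant
spellcheck.collateParam.fq=catchall:paris
```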



2015-04-14 17:05 GMT+02:00 elisabeth benoit elisaelisael...@gmail.com:

 Thanks for your answer!

 I didn't realize this what not supposed to be done (conjunction of
 DirectSolrSpellChecker and FileBasedSpellChecker). I got this idea in the
 mailing list while searching for a solution to get a list of words to
 ignore for the DirectSolrSpellChecker.

 Well well well, I'll try removing the check and see what happens. I'm not
 a java programmer, but if I can find a simple solution I'll let you know.

 Thanks again,
 Elisabeth

 2015-04-14 16:29 GMT+02:00 Dyer, James james.d...@ingramcontent.com:

 Elisabeth,

 Currently ConjunctionSolrSpellChecker only supports adding
 WordBreakSolrSpellchecker to IndexBased- FileBased- or
 DirectSolrSpellChecker.  In the future, it would be great if it could
 handle other Spell Checker combinations.  For instance, if you had a
 (e)dismax query that searches multiple fields, to have a separate
 spellchecker for each of them.

 But CSSC is not hardened for this more general usage, as hinted in the
 API doc.  The check done to ensure all spellcheckers use the same
 stringdistance object, I believe, is a safeguard against using this class
 for functionality it is not able to correctly support.  It looks to me that
 SOLR-6271 was opened to fix the bug in that it is comparing references on
 the stringdistance.  This is not a problem with WBSSC because this one does
 not support string distance at all.

 What you're hoping for, however, is that the requirement for the string
 distances be the same to be removed entirely.  You could try modifying the
 code by removing the check.  However beware that you might not get the
 results you desire!  But should this happen, please, go ahead and fix it
 for your use case and then donate the code.  This is something I've
 personally wanted for a long time.

 James Dyer
 Ingram Content Group


 -Original Message-
 From: elisabeth benoit [mailto:elisaelisael...@gmail.com]
 Sent: Tuesday, April 14, 2015 7:37 AM
 To: solr-user@lucene.apache.org
 Subject: using DirectSpellChecker and FileBasedSpellChecker with Solr
 4.10.1

 Hello,

 I am using Solr 4.10.1 and trying to use DirectSolrSpellChecker and
 FileBasedSpellchecker in same request.

 I've applied change from patch 135.patch (cf Solr-6271). I've tried
 running
 the command patch -p1 -i 135.patch --dry-run but it didn't work, maybe
 because the patch was a fix to Solr 4.9, so I just replaced line in
 ConjunctionSolrSpellChecker

 else if (!stringDistance.equals(checker.getStringDistance())) {
  throw new IllegalArgumentException(
  All checkers need to use the same StringDistance.);
}


 by

 else if (!stringDistance.equals(checker.getStringDistance())) {
 throw new IllegalArgumentException(
 All checkers need to use the same StringDistance!!! 1: +
 checker.getStringDistance() +  2:  + stringDistance);
   }

 as it was done in the patch

 but still, when I send a spellcheck request, I get the error

 msg: All checkers need to use the same StringDistance!!!
 1:org.apache.lucene.search.spell.LuceneLevenshteinDistance@15f57db32:
 org.apache.lucene.search.spell.LuceneLevenshteinDistance@280f7e08

 From error message I gather both spellchecker use same distanceMeasure
 LuceneLevenshteinDistance, but they're not same instance of
 LuceneLevenshteinDistance.

 Is the condition all right? What should be done to fix this properly?

 Thanks,
 Elisabeth





Re: using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1

2015-04-14 Thread elisabeth benoit
Thanks for your answer!

I didn't realize this was not supposed to be done (conjunction of
DirectSolrSpellChecker and FileBasedSpellChecker). I got this idea in the
mailing list while searching for a solution to get a list of words to
ignore for the DirectSolrSpellChecker.

Well well well, I'll try removing the check and see what happens. I'm not a
java programmer, but if I can find a simple solution I'll let you know.

Thanks again,
Elisabeth

2015-04-14 16:29 GMT+02:00 Dyer, James james.d...@ingramcontent.com:

 Elisabeth,

 Currently ConjunctionSolrSpellChecker only supports adding
 WordBreakSolrSpellchecker to IndexBased- FileBased- or
 DirectSolrSpellChecker.  In the future, it would be great if it could
 handle other Spell Checker combinations.  For instance, if you had a
 (e)dismax query that searches multiple fields, to have a separate
 spellchecker for each of them.

 But CSSC is not hardened for this more general usage, as hinted in the API
 doc.  The check done to ensure all spellcheckers use the same
 stringdistance object, I believe, is a safeguard against using this class
 for functionality it is not able to correctly support.  It looks to me that
 SOLR-6271 was opened to fix the bug in that it is comparing references on
 the stringdistance.  This is not a problem with WBSSC because this one does
 not support string distance at all.

 What you're hoping for, however, is that the requirement for the string
 distances be the same to be removed entirely.  You could try modifying the
 code by removing the check.  However beware that you might not get the
 results you desire!  But should this happen, please, go ahead and fix it
 for your use case and then donate the code.  This is something I've
 personally wanted for a long time.

 James Dyer
 Ingram Content Group


 -Original Message-
 From: elisabeth benoit [mailto:elisaelisael...@gmail.com]
 Sent: Tuesday, April 14, 2015 7:37 AM
 To: solr-user@lucene.apache.org
 Subject: using DirectSpellChecker and FileBasedSpellChecker with Solr
 4.10.1

 Hello,

 I am using Solr 4.10.1 and trying to use DirectSolrSpellChecker and
 FileBasedSpellchecker in same request.

 I've applied change from patch 135.patch (cf Solr-6271). I've tried running
 the command patch -p1 -i 135.patch --dry-run but it didn't work, maybe
 because the patch was a fix to Solr 4.9, so I just replaced line in
 ConjunctionSolrSpellChecker

 else if (!stringDistance.equals(checker.getStringDistance())) {
  throw new IllegalArgumentException(
  All checkers need to use the same StringDistance.);
}


 by

 else if (!stringDistance.equals(checker.getStringDistance())) {
 throw new IllegalArgumentException(
 All checkers need to use the same StringDistance!!! 1: +
 checker.getStringDistance() +  2:  + stringDistance);
   }

 as it was done in the patch

 but still, when I send a spellcheck request, I get the error

 msg: All checkers need to use the same StringDistance!!! 1:
 org.apache.lucene.search.spell.LuceneLevenshteinDistance@15f57db3 2:
 org.apache.lucene.search.spell.LuceneLevenshteinDistance@280f7e08

 From error message I gather both spellchecker use same distanceMeasure
 LuceneLevenshteinDistance, but they're not same instance of
 LuceneLevenshteinDistance.

 Is the condition all right? What should be done to fix this properly?

 Thanks,
 Elisabeth
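The reference-comparison issue James describes is easy to reproduce outside Solr: a class that does not override equals() inherits Object.equals, which is reference identity, so two logically identical distance objects fail the check. A minimal sketch with illustrative class names (not Solr's actual classes):

```java
// Stand-in for a string-distance implementation. Without this equals()
// override it would fall back to Object.equals (reference identity) --
// the SOLR-6271 symptom; overriding equals() restores value semantics.
class Distance {
    private final String name;
    Distance(String name) { this.name = name; }
    @Override public boolean equals(Object o) {
        return o instanceof Distance && ((Distance) o).name.equals(this.name);
    }
    @Override public int hashCode() { return name.hashCode(); }
}

class DistanceCheck {
    // Mimics the ConjunctionSolrSpellChecker safeguard: both spellcheckers
    // must report an equal string distance.
    static boolean sameDistance(Distance a, Distance b) {
        return a.equals(b);
    }
}
```

Two `new Distance("levenshtein")` instances are distinct references but equal values, which is exactly why a reference check rejects configurations that should be accepted.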



using DirectSpellChecker and FileBasedSpellChecker with Solr 4.10.1

2015-04-14 Thread elisabeth benoit
Hello,

I am using Solr 4.10.1 and trying to use DirectSolrSpellChecker and
FileBasedSpellchecker in same request.

I've applied the change from patch 135.patch (cf. SOLR-6271). I've tried
running the command patch -p1 -i 135.patch --dry-run but it didn't work,
maybe because the patch was a fix for Solr 4.9, so I just replaced the line
in ConjunctionSolrSpellChecker

else if (!stringDistance.equals(checker.getStringDistance())) {
  throw new IllegalArgumentException(
      "All checkers need to use the same StringDistance.");
}


by

else if (!stringDistance.equals(checker.getStringDistance())) {
  throw new IllegalArgumentException(
      "All checkers need to use the same StringDistance!!! 1: " +
      checker.getStringDistance() + " 2: " + stringDistance);
}

as it was done in the patch

but still, when I send a spellcheck request, I get the error

msg: All checkers need to use the same StringDistance!!! 1:
org.apache.lucene.search.spell.LuceneLevenshteinDistance@15f57db3 2:
org.apache.lucene.search.spell.LuceneLevenshteinDistance@280f7e08

From error message I gather both spellchecker use same distanceMeasure
LuceneLevenshteinDistance, but they're not same instance of
LuceneLevenshteinDistance.

Is the condition all right? What should be done to fix this properly?

Thanks,
Elisabeth


Re: prefix length in fuzzy search solr 4.10.1

2014-11-01 Thread elisabeth benoit
ok, thanks for the answer.

best regards,
Elisabeth

2014-10-31 22:04 GMT+01:00 Jack Krupansky j...@basetechnology.com:

 No, but it is a reasonable request, as a global default, a
 collection-specific default, a request-specific default, and on an
 individual fuzzy term.

 -- Jack Krupansky

 -Original Message- From: elisabeth benoit
 Sent: Thursday, October 30, 2014 6:07 AM
 To: solr-user@lucene.apache.org
 Subject: prefix length in fuzzy search solr 4.10.1


 Hello all,

 Is there a parameter in solr 4.10.1 api allowing user to fix prefix length
 in fuzzy search.

 Best regards,
 Elisabeth



prefix length in fuzzy search solr 4.10.1

2014-10-30 Thread elisabeth benoit
Hello all,

Is there a parameter in solr 4.10.1 api allowing user to fix prefix length
in fuzzy search.

Best regards,
Elisabeth


fuzzy search and edismax: how to do not sum up

2014-10-15 Thread elisabeth benoit
Hello all,

We are using solr 4.2.1 (but planning to switch to solr 4.10 very soon).

We are trying to do approximative search using ~ operator.

We use a catchall_light field without stemming (to avoid mixing fuzzy
matching and stemming)

We send a request to solr using the fuzzy operator on non-frequent words

for instance

q=catchall_light:(lyon 69002~1)

our handler uses edismax

that query gives a higher score to the document Lyon, which has postal codes
69001, 69002, 69003, 69004, ...

than to other documents having only Lyon and postal code 69002 (the debug
output is below)

but we do not want to sum up all the scores for the Lyon document.

Does anyone knows if it is possible to change that?

Best regards,
Elisabeth


here is the debug output for Lyon
(we use idf for that field but want to get rid of it)

15.728481 = (MATCH) sum of:
  1.2349477 = (MATCH) weight(catchall_light:lyon in 707758)
[NoTFSimilarity], result of:
1.2349477 = score(doc=707758,freq=1.0 = termFreq=1.0
), product of:
  0.13427915 = queryWeight, product of:
9.196869 = idf(docFreq=2924, maxDocs=10616483)
0.014600528 = queryNorm
  9.196869 = fieldWeight in 707758, product of:
1.0 = tf(freq=1.0), with freq of:
  1.0 = termFreq=1.0
9.196869 = idf(docFreq=2924, maxDocs=10616483)
1.0 = fieldNorm(doc=707758)
  14.493534 = (MATCH) sum of:
1.576392 = (MATCH) weight(catchall_light:69001^0.8 in 707758)
[NoTFSimilarity], result of:
  1.576392 = score(doc=707758,freq=1.0 = termFreq=1.0
), product of:
0.13569424 = queryWeight, product of:
  0.8 = boost
  11.617237 = idf(docFreq=259, maxDocs=10616483)
  0.014600528 = queryNorm
11.617237 = fieldWeight in 707758, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  11.617237 = idf(docFreq=259, maxDocs=10616483)
  1.0 = fieldNorm(doc=707758)
1.8904426 = (MATCH) weight(catchall_light:69002 in 707758)
[NoTFSimilarity], result of:
  1.8904426 = score(doc=707758,freq=1.0 = termFreq=1.0
), product of:
0.16613688 = queryWeight, product of:
  11.378826 = idf(docFreq=329, maxDocs=10616483)
  0.014600528 = queryNorm
11.378826 = fieldWeight in 707758, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  11.378826 = idf(docFreq=329, maxDocs=10616483)
  1.0 = fieldNorm(doc=707758)
1.460347 = (MATCH) weight(catchall_light:69003^0.8 in 707758)
[NoTFSimilarity], result of:
  1.460347 = score(doc=707758,freq=1.0 = termFreq=1.0
), product of:
0.13060425 = queryWeight, product of:
  0.8 = boost
  11.181466 = idf(docFreq=401, maxDocs=10616483)
  0.014600528 = queryNorm
11.181466 = fieldWeight in 707758, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  11.181466 = idf(docFreq=401, maxDocs=10616483)
  1.0 = fieldNorm(doc=707758)
1.7109065 = (MATCH) weight(catchall_light:69004^0.8 in 707758)
[NoTFSimilarity], result of:
  1.7109065 = score(doc=707758,freq=1.0 = termFreq=1.0
), product of:
0.14136517 = queryWeight, product of:
  0.8 = boost
  12.102744 = idf(docFreq=159, maxDocs=10616483)
  0.014600528 = queryNorm
12.102744 = fieldWeight in 707758, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  12.102744 = idf(docFreq=159, maxDocs=10616483)
  1.0 = fieldNorm(doc=707758)
1.5255939 = (MATCH) weight(catchall_light:69005^0.8 in 707758)
[NoTFSimilarity], result of:
  1.5255939 = score(doc=707758,freq=1.0 = termFreq=1.0
), product of:
0.13349001 = queryWeight, product of:
  0.8 = boost
  11.428525 = idf(docFreq=313, maxDocs=10616483)
  0.014600528 = queryNorm
11.428525 = fieldWeight in 707758, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  11.428525 = idf(docFreq=313, maxDocs=10616483)
  1.0 = fieldNorm(doc=707758)
1.6497903 = (MATCH) weight(catchall_light:69006^0.8 in 707758)
[NoTFSimilarity], result of:
  1.6497903 = score(doc=707758,freq=1.0 = termFreq=1.0
), product of:
0.13881733 = queryWeight, product of:
  0.8 = boost
  11.884614 = idf(docFreq=198, maxDocs=10616483)
  0.014600528 = queryNorm
11.884614 = fieldWeight in 707758, product of:
  1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
  11.884614 = idf(docFreq=198, maxDocs=10616483)
  1.0 = fieldNorm(doc=707758)
1.5892421 = (MATCH) weight(catchall_light:69007^0.8 in 707758)
[NoTFSimilarity], result of:
  1.5892421 = score(doc=707758,freq=1.0 = termFreq=1.0
), product of:
0.13624617 = queryWeight, product of:
  0.8 = boost
  11.66449 = idf(docFreq=247, maxDocs=10616483)
  0.014600528 = queryNorm
11.66449 = 
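The per-clause summing visible in this debug output is standard BooleanQuery scoring; a DisjunctionMax-style combination instead keeps only the best clause plus a tie-break fraction of the rest. A hedged sketch of the two combination rules (illustrative, not Solr code):

```java
import java.util.Collections;
import java.util.List;

class ClauseCombiner {
    // BooleanQuery-style: every matching clause contributes, so a document
    // matching many postal codes accumulates score -- the behavior asked about.
    static double sumOf(List<Double> clauseScores) {
        return clauseScores.stream().mapToDouble(Double::doubleValue).sum();
    }

    // DisjunctionMax-style: the best clause dominates; tie (0..1) controls
    // how much the remaining clauses still contribute.
    static double disMax(List<Double> clauseScores, double tie) {
        double max = Collections.max(clauseScores);
        return max + tie * (sumOf(clauseScores) - max);
    }
}
```

With tie = 0 only the strongest postal-code clause would count; with tie = 1 the combination degenerates back to the plain sum.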

Re: does one need to reindex when changing similarity class

2014-10-14 Thread elisabeth benoit
thanks a lot for your answers!

2014-10-14 6:10 GMT+02:00 Jack Krupansky j...@basetechnology.com:

 To correct myself, the selected Similarity class can have a computeNorm
 method that calculates the norm value that will be stored in the index
 when the document is indexed, so changing the Similarity class will require
 reindexing if the implementation of the computeNorm method is different.

 -- Jack Krupansky

 -Original Message- From: Markus Jelsma
 Sent: Monday, October 13, 2014 5:06 PM

 To: solr-user@lucene.apache.org
 Subject: RE: does one need to reindex when changing similarity class

 Yes, if the replacing similarity has a different implementation on norms,
 you should reindex or gradually update all documents within decent time.



 -Original message-

 From:Ahmet Arslan iori...@yahoo.com.INVALID
 Sent: Thursday 9th October 2014 18:27
 To: solr-user@lucene.apache.org
 Subject: Re: does one need to reindex when changing similarity class

 How about SweetSpotSimilarity? Length norm is saved at index time?



 On Thursday, October 9, 2014 5:44 PM, Jack Krupansky 
 j...@basetechnology.com wrote:
 The similarity class is only invoked at query time, so it doesn't
 participate in indexing.

 -- Jack Krupansky




 -Original Message- From: Markus Jelsma
 Sent: Thursday, October 9, 2014 6:59 AM
 To: solr-user@lucene.apache.org
 Subject: RE: does one need to reindex when changing similarity class

 Hi - no you don't have to, although maybe if you changed on how norms are
 encoded.
 Markus



 -Original message-
  From:elisabeth benoit elisaelisael...@gmail.com
  Sent: Thursday 9th October 2014 12:26
  To: solr-user@lucene.apache.org
  Subject: does one need to reindex when changing similarity class
 
  I've read somewhere that we do have to reindex when changing similarity
  class. Is that right?
 
  Thanks again,
  Elisabeth
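Jack's point about computeNorm can be made concrete: the norm is a per-document value computed and encoded at index time, so a similarity that computes it differently leaves stale stored values until the documents are reindexed. A sketch with illustrative formulas (not Lucene's exact code):

```java
class NormSketch {
    // Classic length norm: computed once per field at index time from the
    // term count and stored in the index. Queries only read the stored value,
    // so swapping in a similarity with a different computeNorm() means the
    // stored norms no longer match until documents are reindexed.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // SweetSpotSimilarity-style plateau (illustrative, not the real formula):
    // field lengths inside [min, max] all get the same flat norm, lengths
    // outside decay -- a different index-time encoding, hence the reindex.
    static float sweetSpotNorm(int numTerms, int min, int max) {
        if (numTerms >= min && numTerms <= max) return 1.0f;
        int dist = (numTerms < min) ? min - numTerms : numTerms - max;
        return (float) (1.0 / Math.sqrt(1.0 + dist));
    }
}
```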
 





per field similarity not working with solr 4.2.1

2014-10-09 Thread elisabeth benoit
Hello,

I am using Solr 4.2.1 and I've tried to use a per-field similarity, as
described in

https://apache.googlesource.com/lucene-solr/+/c5bb5cd921e1ce65e18eceb55e738f40591214f0/solr/core/src/test-files/solr/collection1/conf/schema-sim.xml

so in my schema I have

<schema name="search" version="1.4">
  <similarity class="solr.SchemaSimilarityFactory"/>

and a custom similarity in the fieldtype definition

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <similarity class="com.company.lbs.solr.search.similarity.NoTFSimilarity"/>
  <analyzer type="index">
  ...

but it is not working

when I send a request with debugQuery=on, instead of [NoTFSimilarity],
I see []

or to give an example, I have


weight(catchall:bretagn in 2575) []

instead of weight(catchall:bretagn in 2575) [NoTFSimilarity]

Anyone has a clue what I am doing wrong?

Best regards,
Elisabeth


does one need to reindex when changing similarity class

2014-10-09 Thread elisabeth benoit
I've read somewhere that we do have to reindex when changing similarity
class. Is that right?

Thanks again,
Elisabeth


Re: per field similarity not working with solr 4.2.1

2014-10-09 Thread elisabeth benoit
Thanks for the information!

I've been struggling with that debug output. Any other way to know for sure
my similarity class is being used?

Thanks again,
Elisabeth

2014-10-09 13:03 GMT+02:00 Markus Jelsma markus.jel...@openindex.io:

  Hi - it should work, not seeing your implementation in the debug output is
 a known issue.


 -Original message-
  From:elisabeth benoit elisaelisael...@gmail.com
  Sent: Thursday 9th October 2014 12:22
  To: solr-user@lucene.apache.org
  Subject: per field similarity not working with solr 4.2.1
 
  Hello,
 
   I am using Solr 4.2.1 and I've tried to use a per-field similarity, as
  described in
 
 
 https://apache.googlesource.com/lucene-solr/+/c5bb5cd921e1ce65e18eceb55e738f40591214f0/solr/core/src/test-files/solr/collection1/conf/schema-sim.xml
 
  so in my schema I have
 
   <schema name="search" version="1.4">
     <similarity class="solr.SchemaSimilarityFactory"/>
  
   and a custom similarity in the fieldtype definition
  
   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
     <similarity class="com.company.lbs.solr.search.similarity.NoTFSimilarity"/>
     <analyzer type="index">
     ...
 
  but it is not working
 
  when I send a request with debugQuery=on, instead of [
  NoTFSimilarity], I see []
 
  or to give an example, I have
 
 
  weight(catchall:bretagn in 2575) []
 
  instead of weight(catchall:bretagn in 2575) [NoTFSimilarity]
 
  Anyone has a clue what I am doing wrong?
 
  Best regards,
  Elisabeth
 



Re: per field similarity not working with solr 4.2.1

2014-10-09 Thread elisabeth benoit
ok thanks.


I think something is not working here (I'm quite sure my similarity class
is not being used because when I use
SchemaSimilarityFactory and a custom fieldtype similarity definition with
NoTFSimilarity, I don't get the same scoring as when I use NoTFSimilarity
as global similarity; but I'll try to gather more evidences).

Thanks again,
Elisabeth

2014-10-09 15:05 GMT+02:00 Markus Jelsma markus.jel...@openindex.io:

 Well, it is either the output of your calculation or writing something to
 System.out
 Markus



 -Original message-
  From:elisabeth benoit elisaelisael...@gmail.com
  Sent: Thursday 9th October 2014 13:31
  To: solr-user@lucene.apache.org
  Subject: Re: per field similarity not working with solr 4.2.1
 
  Thanks for the information!
 
  I've been struggling with that debug output. Any other way to know for
 sure
  my similarity class is being used?
 
  Thanks again,
  Elisabeth
 
  2014-10-09 13:03 GMT+02:00 Markus Jelsma markus.jel...@openindex.io:
 
    Hi - it should work, not seeing your implementation in the debug output
 is
   a known issue.
  
  
   -Original message-
From:elisabeth benoit elisaelisael...@gmail.com
Sent: Thursday 9th October 2014 12:22
To: solr-user@lucene.apache.org
Subject: per field similarity not working with solr 4.2.1
   
Hello,
   
 I am using Solr 4.2.1 and I've tried to use a per-field similarity,
 as
described in
   
   
  
 https://apache.googlesource.com/lucene-solr/+/c5bb5cd921e1ce65e18eceb55e738f40591214f0/solr/core/src/test-files/solr/collection1/conf/schema-sim.xml
   
so in my schema I have
   
     <schema name="search" version="1.4">
       <similarity class="solr.SchemaSimilarityFactory"/>
    
     and a custom similarity in the fieldtype definition
    
     <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
       <similarity class="com.company.lbs.solr.search.similarity.NoTFSimilarity"/>
       <analyzer type="index">
       ...
   
but it is not working
   
when I send a request with debugQuery=on, instead of [
NoTFSimilarity], I see []
   
or to give an example, I have
   
   
weight(catchall:bretagn in 2575) []
   
instead of weight(catchall:bretagn in 2575) [NoTFSimilarity]
   
Anyone has a clue what I am doing wrong?
   
Best regards,
Elisabeth
   
  
 



looking for a solr/search expert in Paris

2014-09-03 Thread elisabeth benoit
Hello,


We are looking for a solr consultant to help us with our devs using solr.
We've been working on this for a little while, and we feel we need an
expert point of view on what we're doing, who could give us insights about
our solr conf, performance issues, error handling issues (big thing). Well
everything.

The company is in the Paris (France) area. Any suggestion is welcome.

Thanks,
Elisabeth


Re: looking for a solr/search expert in Paris

2014-09-03 Thread elisabeth benoit
Thanks a lot for your answers.

Best regards,
Elisabeth


2014-09-03 17:18 GMT+02:00 Jack Krupansky j...@basetechnology.com:

 Don't forget to check out the Solr Support wiki where consultants
 advertise their services:
 http://wiki.apache.org/solr/Support

 And any Solr or Lucene consultants on this mailing list should be sure
 that they are registered on that support wiki. Hey, it's free! And be
 sure to keep your listing up to date, including regional availability and
 any specialties.

 -- Jack Krupansky

 -Original Message- From: elisabeth benoit
 Sent: Wednesday, September 3, 2014 4:02 AM
 To: solr-user@lucene.apache.org
 Subject: looking for a solr/search expert in Paris


 Hello,


 We are looking for a solr consultant to help us with our devs using solr.
 We've been working on this for a little while, and we feel we need an
 expert point of view on what we're doing, who could give us insights about
 our solr conf, performance issues, error handling issues (big thing). Well
 everything.

  The company is in the Paris (France) area. Any suggestion is welcome.

 Thanks,
 Elisabeth



Re: spatial search: find result in bbox OR first result outside bbox

2014-07-25 Thread elisabeth benoit
Thanks a lot for your answer David!

I'll check that out.

Elisabeth


2014-07-24 20:28 GMT+02:00 david.w.smi...@gmail.com 
david.w.smi...@gmail.com:

 Hi Elisabeth,

 Sorry for not responding sooner; I forgot.

 You’re in need of some spatial nearest-neighbor code I wrote but it isn’t
 open-sourced yet.  It works on the RPT grid.

 Any way, you should consider doing this in two searches: the first query
 tries the bbox provided, and if that returns nothing then issue a second
 for the closest within the a 1000km distance.  The first query is
 straight-forward as documented.  The second would be close to what you gave
 in your example but sort by distance and return rows=1.  It will *not*
 compute the distance to every document, just those within the 1000km radius
 plus some grid internal grid squares *if* you use spatial RPT
 (“location_rpt” in the example schema).  But use LatLonType for optimal
 sorting performance, not RPT.

 With respect to doing this in one search vs two, that would involve writing
 a custom request handler.  I have a patch to make this easier:
 https://issues.apache.org/jira/browse/SOLR-5005.  If in your case there
 are
 absolutely no other filters and it’s not a distributed search (no
 sharding), then you could approach this with a custom query parser that
 generates and executes one query to know if it should return that query or
 return the fallback.

 Please let me know how this goes.

 ~ David Smiley
 Freelance Apache Lucene/Solr Search Consultant/Developer
 http://www.linkedin.com/in/davidwsmiley


 On Tue, Jul 22, 2014 at 3:12 AM, elisabeth benoit 
 elisaelisael...@gmail.com
  wrote:

  Hello,
 
  I am using solr 4.2.1. I have the following use case.
 
  I should find results inside bbox OR if there is none, first result
 outside
  bbox within a 1000 km distance. I was wondering what is the best way to
  proceed.
 
  I was considering doing a geofilt search from the center of my bounding
 box
  and post filtering results.
 
   fq={!geofilt sfield=store}&pt=45.15,-93.85&d=1000
 
  From a performance point of view I don't think it's a good solution
 though,
  since solr will have to calculate every document distance, then sort.
 
  I was wondering if there was another way to do this and avoid sending
 more
  than one request to solr.
 
  Thanks,
  Elisabeth
 



spatial search: find result in bbox OR first result outside bbox

2014-07-22 Thread elisabeth benoit
Hello,

I am using solr 4.2.1. I have the following use case.

I should find results inside bbox OR if there is none, first result outside
bbox within a 1000 km distance. I was wondering what is the best way to
proceed.

I was considering doing a geofilt search from the center of my bounding box
and post filtering results.

fq={!geofilt sfield=store}&pt=45.15,-93.85&d=1000

From a performance point of view I don't think it's a good solution though,
since solr will have to calculate every document distance, then sort.

I was wondering if there was another way to do this and avoid sending more
than one request to solr.

Thanks,
Elisabeth
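Post-filtering by distance, as considered above, needs the great-circle distance that {!geofilt} compares against d (in km); a self-contained haversine sketch of that computation:

```java
class Haversine {
    static final double EARTH_RADIUS_KM = 6371.0;

    // Great-circle distance in km between two lat/lon points given in
    // degrees -- the same quantity a geofilt radius of d=1000 is tested
    // against for every candidate document.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }
}
```

This is why computing it for every matching document and then sorting is the expensive part the question worries about.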


split field on json update

2014-06-12 Thread elisabeth benoit
Hello,

Is it possible, in solr 4.2.1, to split a multivalued field with a json
update as it is possible to do with a csv update?

with csv
/update/csv?f.address.split=true&f.address.separator=%2C&commit=true

with json (using a post)
/update/json

Thanks,
Elisabeth
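In 4.2.1 the JSON handler has no split parameter, so one practical workaround is to split the field client-side before building the JSON document; a minimal sketch (the separator handling mirrors the CSV parameters above):

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class FieldSplitter {
    // Emulates f.address.split=true&f.address.separator=%2C from the CSV
    // handler: turn one separator-joined string into the list of values a
    // multivalued JSON field expects.
    static List<String> splitField(String raw, String separator) {
        return Arrays.stream(raw.split(Pattern.quote(separator)))
                     .map(String::trim)
                     .collect(Collectors.toList());
    }
}
```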


Re: split field on json update

2014-06-12 Thread elisabeth benoit
Thanks for your answer,

best regards,
Elisabeth


2014-06-12 14:07 GMT+02:00 Alexandre Rafalovitch arafa...@gmail.com:

 There is always UpdateRequestProcessor.

 Regards,
 Alex
 On 12/06/2014 7:05 pm, elisabeth benoit elisaelisael...@gmail.com
 wrote:

  Hello,
 
  Is it possible, in solr 4.2.1, to split a multivalued field with a json
  update as it is possible to do with a csv update?
 
  with csv
  /update/csv?f.address.split=true&f.address.separator=%2C&commit=true
 
  with json (using a post)
  /update/json
 
  Thanks,
  Elisabeth
 



Re: permissive mm value and efficient spellchecking

2014-05-16 Thread elisabeth benoit
ok, thanks a lot, I'll check that out.


2014-05-14 14:20 GMT+02:00 Markus Jelsma markus.jel...@openindex.io:

 Elisabeth, i think you are looking for SOLR-3211 that introduced
 spellcheck.collateParam.* to override e.g. dismax settings.

 Markus

 -Original message-
 From:elisabeth benoit elisaelisael...@gmail.com
 Sent:Wed 14-05-2014 14:01
 Subject:permissive mm value and efficient spellchecking
 To:solr-user@lucene.apache.org;
 Hello,

 I'm using solr 4.2.1.

 I use a very permissive value for mm, to be able to find results even if
 request contains non relevant words.

 At the same time, I'd like to be able to do some efficient spellchecking
 with DirectSolrSpellChecker.

 So for instance, if a user searches for "rue de Chraonne Paris", where
 "Chraonne" is misspelled, because of my permissive mm value I get more than
 100 000 results containing the words "rue" and "Paris" ("de" is a stopword),
 which are very frequent terms in my index, but no spellcheck correction for
 "Chraonne". If I set mm=3, then I get the expected spellcheck correction
 value: "rue de Charonne Paris".

 Is there a way to achieve my two goals in a single solr request?

 Thanks,
 Elisabeth
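The spellcheck.collateParam.* mechanism from SOLR-3211 that Markus points to lets the collation-test queries override parameters of the main request, so the user query can keep a permissive mm while collation checking runs with a strict one. A sketch of that per-request override merging (the merging logic is illustrative, not Solr's code):

```java
import java.util.HashMap;
import java.util.Map;

class CollateParams {
    // For collation-test queries, any spellcheck.collateParam.X overrides
    // the main request's X.
    static Map<String, String> paramsForCollation(Map<String, String> request) {
        String prefix = "spellcheck.collateParam.";
        Map<String, String> merged = new HashMap<>();
        // Start from the main request's params...
        request.forEach((k, v) -> { if (!k.startsWith(prefix)) merged.put(k, v); });
        // ...then apply the collateParam overrides under their bare names.
        request.forEach((k, v) -> {
            if (k.startsWith(prefix)) merged.put(k.substring(prefix.length()), v);
        });
        return merged;
    }
}
```

So a request carrying mm=1 plus spellcheck.collateParam.mm=100% would be collation-tested as if mm were 100%.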



permissive mm value and efficient spellchecking

2014-05-14 Thread elisabeth benoit
Hello,

I'm using solr 4.2.1.

I use a very permissive value for mm, to be able to find results even if
request contains non relevant words.

At the same time, I'd like to be able to do some efficient spellchecking
with DirectSolrSpellChecker.

So for instance, if a user searches for "rue de Chraonne Paris", where
"Chraonne" is misspelled, because of my permissive mm value I get more than
100 000 results containing the words "rue" and "Paris" ("de" is a stopword),
which are very frequent terms in my index, but no spellcheck correction for
"Chraonne". If I set mm=3, then I get the expected spellcheck correction
value: "rue de Charonne Paris".

Is there a way to achieve my two goals in a single solr request?

Thanks,
Elisabeth


Re: Re: solr 4.2.1 index gets slower over time

2014-04-02 Thread elisabeth benoit
This sounds interesting, I'll check this out.

Thanks!
Elisabeth


2014-04-02 8:54 GMT+02:00 Dmitry Kan solrexp...@gmail.com:

 Thanks, Markus, that is useful.
 I'm guessing the higher the weight, the longer the op takes?


 On Tue, Apr 1, 2014 at 10:39 PM, Markus Jelsma
 markus.jel...@openindex.iowrote:

  You may want to increase reclaimDeletesWeight for TieredMergePolicy from 2
  to 3 or 4. By default it may keep too many deleted or updated docs in the
  index. This can increase index size by 50%!! Dmitry Kan 
  solrexp...@gmail.com wrote: Elisabeth,
 
  Yes, I believe you are right in that the deletes are part of the optimize
  process. If you delete often, you may consider (if not already) the
  TieredMergePolicy, which is suited for this scenario. Check out this
  relevant discussion I had with Lucene committers:
  https://twitter.com/DmitryKan/status/399820408444051456
 
  HTH,
 
  Dmitry
 
 
  On Tue, Apr 1, 2014 at 11:34 AM, elisabeth benoit 
  elisaelisael...@gmail.com
   wrote:
 
   Thanks a lot for your answers!
  
   Shawn. Our GC configuration has far less parameters defined, so we'll
  check
   this out.
  
   Dimitry, about the expungeDeletes option, we'll add that in the delete
   process. But from what I read, this is done in the optimize process
 (cf.
  
  
 
 http://lucene.472066.n3.nabble.com/Does-expungeDeletes-need-calling-during-an-optimize-td1214083.html
   ).
   Or maybe not?
  
   Thanks again,
   Elisabeth
  
  
   2014-04-01 7:52 GMT+02:00 Dmitry Kan solrexp...@gmail.com:
  
Hi,
   
We have noticed something like this as well, but with older versions
 of
solr, 3.4. In our setup we delete documents pretty often. Internally
 in
Lucene, when a document is client requested to be deleted, it is not
physically deleted, but only marked as deleted. Our original
   optimization
assumption was such that the deleted documents would get physically
removed on each optimize command issued. We started to suspect it
  wasn't
always true as the shards (especially relatively large shards) became
slower over time. So we found out about the expungeDeletes option,
  which
purges the deleted docs and is by default false. We have set it to
   true.
If your solr update lifecycle includes frequent deletes, try this
 out.
   
This of course does not override working towards finding better
GCparameters.
   
   
  
 
 https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching
   
   
On Mon, Mar 31, 2014 at 3:57 PM, elisabeth benoit 
elisaelisael...@gmail.com
 wrote:
   
 Hello,

 We are currently using solr 4.2.1. Our index is updated on a daily
   basis.
 After noticing solr query time has increased (two times the initial
   size)
 without any change in index size or in solr configuration, we tried
  an
 optimize on the index but it didn't fix our problem. We checked the
garbage
 collector, but everything seemed fine. What did in fact fix our
  problem
was
 to delete all documents and reindex from scratch.

 It looks like over time our index gets corrupted and optimize
  doesn't
fix
 it. Does anyone have a clue how to investigate further this
  situation?


 Elisabeth

   
   
   
--
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
   
  
 
 
 
  --
  Dmitry
  Blog: http://dmitrykan.blogspot.com
  Twitter: http://twitter.com/dmitrykan
 



 --
 Dmitry
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan



Re: solr 4.2.1 index gets slower over time

2014-04-01 Thread elisabeth benoit
Thanks a lot for your answers!

Shawn. Our GC configuration has far less parameters defined, so we'll check
this out.

Dimitry, about the expungeDeletes option, we'll add that in the delete
process. But from what I read, this is done in the optimize process (cf.
http://lucene.472066.n3.nabble.com/Does-expungeDeletes-need-calling-during-an-optimize-td1214083.html).
Or maybe not?

Thanks again,
Elisabeth


2014-04-01 7:52 GMT+02:00 Dmitry Kan solrexp...@gmail.com:

 Hi,

 We have noticed something like this as well, but with older versions of
 solr, 3.4. In our setup we delete documents pretty often. Internally in
 Lucene, when a document is client requested to be deleted, it is not
 physically deleted, but only marked as deleted. Our original optimization
 assumption was such that the deleted documents would get physically
 removed on each optimize command issued. We started to suspect it wasn't
 always true as the shards (especially relatively large shards) became
 slower over time. So we found out about the expungeDeletes option, which
 purges the deleted docs and is by default false. We have set it to true.
 If your solr update lifecycle includes frequent deletes, try this out.

 This of course does not override working towards finding better
 GCparameters.

 https://cwiki.apache.org/confluence/display/solr/Near+Real+Time+Searching


 On Mon, Mar 31, 2014 at 3:57 PM, elisabeth benoit 
 elisaelisael...@gmail.com
  wrote:

  Hello,
 
  We are currently using solr 4.2.1. Our index is updated on a daily basis.
  After noticing solr query time has increased (two times the initial size)
  without any change in index size or in solr configuration, we tried an
  optimize on the index but it didn't fix our problem. We checked the
 garbage
  collector, but everything seemed fine. What did in fact fix our problem
 was
  to delete all documents and reindex from scratch.
 
  It looks like over time our index gets corrupted and optimize doesn't
 fix
  it. Does anyone have a clue how to investigate further this situation?
 
 
  Elisabeth
 



 --
 Dmitry
 Blog: http://dmitrykan.blogspot.com
 Twitter: http://twitter.com/dmitrykan
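The marked-versus-physically-deleted distinction Dmitry describes can be sketched with a tombstone bitset: deletes only set a flag, and space is reclaimed only when live documents are copied into a new segment, which is what expungeDeletes forces. Purely illustrative, not Lucene's segment format:

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class Segment {
    // Documents are never removed in place; a delete just sets a tombstone
    // bit ("marked as deleted"), so searches keep paying for dead entries.
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    int add(String doc) { docs.add(doc); return docs.size() - 1; }
    void delete(int id) { deleted.set(id); }
    int liveCount() { return docs.size() - deleted.cardinality(); }
    int physicalSize() { return docs.size(); }

    // expungeDeletes-style rewrite: copy only live docs into a new segment;
    // this is the point at which space is actually reclaimed.
    Segment expunge() {
        Segment compact = new Segment();
        for (int i = 0; i < docs.size(); i++) {
            if (!deleted.get(i)) compact.add(docs.get(i));
        }
        return compact;
    }
}
```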



solr 4.2.1 index gets slower over time

2014-03-31 Thread elisabeth benoit
Hello,

We are currently using solr 4.2.1. Our index is updated on a daily basis.
After noticing solr query time has increased (two times the initial size)
without any change in index size or in solr configuration, we tried an
optimize on the index but it didn't fix our problem. We checked the garbage
collector, but everything seemed fine. What did in fact fix our problem was
to delete all documents and reindex from scratch.

It looks like over time our index gets corrupted and optimize doesn't fix
it. Does anyone have a clue how to investigate further this situation?


Elisabeth


Re: solr 4.2.1 index gets slower over time

2014-03-31 Thread elisabeth benoit
Hello,

Thanks for your answer.

We use JVisualVM. The CPU usage is very high (90%), but the GC activity
shows less than 0.01% average activity. Plus the heap usage stays low
(below 4G while the max heap size is 16G).

Do you have a different tool to suggest to check the GC? Do you think there
is something else me might not see?

Thanks again,
Elisabeth


2014-03-31 16:26 GMT+02:00 Shawn Heisey s...@elyograg.org:

 On 3/31/2014 6:57 AM, elisabeth benoit wrote:
  We are currently using solr 4.2.1. Our index is updated on a daily basis.
  After noticing solr query time has increased (two times the initial size)
  without any change in index size or in solr configuration, we tried an
  optimize on the index but it didn't fix our problem. We checked the
 garbage
  collector, but everything seemed fine. What did in fact fix our problem
 was
  to delete all documents and reindex from scratch.
 
  It looks like over time our index gets corrupted and optimize doesn't
 fix
  it. Does anyone have a clue how to investigate further this situation?

 That seems very odd.  I have one production copy of my index using
 4.2.1, and it has been working fine for quite a long time.  We are
 transitioning to Solr 4.6.1 now, so the other copy is running that
 version.  We do occasionally do a full rebuild, but that is for index
 content, not for any problems.

 When you say you checked your garbage collector, what tools did you use?
  I was having GC pause problems, but I didn't know it until I started
 using different tools.

 Thanks,
 Shawn




Re: How to handle multiple sub second updates to same SOLR Document

2014-01-26 Thread Elisabeth Benoit
yutz

Sent from my iPhone

Le 26 janv. 2014 à 06:13, Shalin Shekhar Mangar shalinman...@gmail.com a 
écrit :

 There is no timestamp versioning as such in Solr but there is a new
 document based versioning which will allow you to specify your own
 (externally assigned) versions.
 
 See the Document Centric Versioning Constraints section at
 https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
 
 Sub-second soft auto commit can be expensive but it is hard to say if
 it will be too expensive for your use-case. You must benchmark it
 yourself.
 
 On Sat, Jan 25, 2014 at 11:51 PM, christopher palm cpa...@gmail.com wrote:
 I have a scenario where the same SOLR document is being updated several
 times within a few ms of each other due to how the source system is sending
 in field updates on the document.
 
 The problem I am trying to solve is that the order of these updates isn’t
 guaranteed once the multi threaded SOLRJ client starts sending them to
 SOLR, and older updates are overlaying the newer updates on the same
 document.
 
 I would like to use a timestamp versioning so that the older document
 change won’t be sent into SOLR, but I didn’t see any automated way of doing
 this based on the document timestamp.
 
 Is there a good way to handle this scenario in SOLR 4.6?
 
  It seems that we would have to be soft auto committing with a subsecond
 level as well, is that even possible?
 
 Thanks,
 
 Chris
 
 
 
 -- 
 Regards,
 Shalin Shekhar Mangar.
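Client-side, the last-write-wins idea behind the document-centric versioning Shalin mentions can be sketched as: only forward an update whose externally assigned (e.g. timestamp) version exceeds the highest one seen for that document. Solr's processor enforces the same rule server-side against a version field; this is just the core logic, with hypothetical names:

```java
import java.util.HashMap;
import java.util.Map;

class VersionGate {
    // Highest externally assigned version accepted so far, per document id.
    private final Map<String, Long> latest = new HashMap<>();

    // Returns true if the update should be sent (its version is newer than
    // anything seen for this id), false if it is stale and must be dropped --
    // so out-of-order sends from a multithreaded client cannot overwrite
    // newer field values with older ones.
    boolean accept(String docId, long version) {
        Long seen = latest.get(docId);
        if (seen != null && seen >= version) return false;
        latest.put(docId, version);
        return true;
    }
}
```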


Re: autocomplete_edge type split words

2013-09-30 Thread elisabeth benoit
in fact, I've removed the autoGeneratePhraseQueries=true, and it doesn't
change anything. behaviour is the same with or without (ie request with
debugQuery=on is the same)

Thanks for your comments.

Best,
Elisabeth
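For reference, an edge-n-gram field indexes prefixes of the whole input string, which is why the query side must be treated as a single phrase: a whitespace-tokenized query is split into words first and stops matching the grams. A minimal sketch of edge-gram generation (gram sizes are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

class EdgeGrams {
    // Index-time edge n-grams of the full input, in the spirit of an
    // EdgeNGramFilter with minGramSize..maxGramSize: "my q" matches because
    // it is a prefix gram of "my query", spaces included.
    static List<String> edgeGrams(String input, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int len = minGram; len <= Math.min(maxGram, input.length()); len++) {
            grams.add(input.substring(0, len));
        }
        return grams;
    }
}
```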


2013/9/28 Erick Erickson erickerick...@gmail.com

 You've probably been doing this right along, but adding
 debug=query will show the parsed query.

 I really question though, your apparent combination of
 autoGeneratePhraseQuery what looks like an ngram field.
 I'm not at all sure how those would interact...

 Best,
 Erick

 On Fri, Sep 27, 2013 at 10:12 AM, elisabeth benoit
 elisaelisael...@gmail.com wrote:
  Yes!
 
  what I've done is set autoGeneratePhraseQueries to true for my field,
 then
  give it a boost (bq=myAutompleteEdgeNGramField=my query with
 spaces^50).
  This only worked with autoGeneratePhraseQueries=true, for a reason I
 didn't
  understand.
 
  since when I did
 
  q= myAutompleteEdgeNGramField=my query with spaces, I didn't need
  autoGeneratePhraseQueries
  set to true.
 
  and, another thing is when I tried
 
  q=myAutocompleteNGramField:(my query with spaces) OR
  myAutompleteEdgeNGramField=my
  query with spaces
 
  (with a request handler with edismax and default operator field = AND),
 the
  request on myAutocompleteNGramField would OR the grams, so I had to put
 an
  AND (myAutocompleteNGramField:(my AND query AND with AND spaces)), which
  was pretty ugly.
 
  I don't always understand what is exactly going on. If you have a pointer
  to some text I could read to get more insights about this, please let me
  know.
 
  Thanks again,
  Best regards,
  Elisabeth
 
 
 
 
  2013/9/27 Erick Erickson erickerick...@gmail.com
 
  Have you looked at autoGeneratePhraseQueries? That might help.
 
  If that doesn't work, you can always do something like add an OR clause
  like
  OR original query
  and optionally boost it high. But I'd start with the autoGenerate bits.
 
  Best,
  Erick
 
 
  On Fri, Sep 27, 2013 at 7:37 AM, elisabeth benoit
  elisaelisael...@gmail.com wrote:
   Thanks for your answer.
  
    So I guess if someone wants to search on two fields, one with a phrase
    query and one with a normal query (split into words), one has to find
    a way to send the query twice: once with quotes and once without...
   
    Best regards,
    Elisabeth
  
  
   2013/9/27 Erick Erickson erickerick...@gmail.com
  
   This is a classic issue where there's confusion between
   the query parser and field analysis.
  
   Early in the process the query parser has to take the input
   and break it up. that's how, for instance, a query like
   text:term1 term2
   gets parsed as
   text:term1 defaultfield:term2
   This happens long before the terms get to the analysis chain
   for the field.
  
   So your only options are to either quote the string or
   escape the spaces.
  
   Best,
   Erick
  
   On Wed, Sep 25, 2013 at 9:24 AM, elisabeth benoit
   elisaelisael...@gmail.com wrote:
Hello,
   
I am using solr 4.2.1 and I have a autocomplete_edge type defined
 in
schema.xml
   
   
<fieldType name="autocomplete_edge" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
  </analyzer>
</fieldType>
   
 When I have a request with more than one word, for instance rue de la,
 my request doesn't match with my autocomplete_edge field unless I use
 quotes around the query. In other words q=rue de la doesn't work and
 q="rue de la" works.

 I've checked the request with debugQuery=on, and I can see that in the
 first case the query is split into words, and I don't understand why,
 since my field type uses KeywordTokenizerFactory.
   
Does anyone have a clue on how I can request my field without using
   quotes?
   
Thanks,
Elisabeth
  
 



Re: autocomplete_edge type split words

2013-09-27 Thread elisabeth benoit
Thanks for your answer.

So I guess if someone wants to search on two fields, one with a phrase query
and one with a normal query (split into words), one has to find a way to
send the query twice: once with quotes and once without...

Best regards,
Elisabeth


2013/9/27 Erick Erickson erickerick...@gmail.com

 This is a classic issue where there's confusion between
 the query parser and field analysis.

 Early in the process the query parser has to take the input
 and break it up. that's how, for instance, a query like
 text:term1 term2
 gets parsed as
 text:term1 defaultfield:term2
 This happens long before the terms get to the analysis chain
 for the field.

 So your only options are to either quote the string or
 escape the spaces.
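Both options can be applied mechanically on the client before the query is sent; a minimal Python sketch (the helper names are illustrative, not part of any Solr client library):

```python
def escape_spaces(term):
    """Backslash-escape spaces so the query parser keeps the term whole."""
    return term.replace(" ", "\\ ")

def quote(term):
    """Alternative: wrap the term in double quotes to form a phrase query."""
    return '"' + term.replace('"', '\\"') + '"'

print(escape_spaces("rue de la"))  # rue\ de\ la
print(quote("rue de la"))          # "rue de la"
```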

 Best,
 Erick

 On Wed, Sep 25, 2013 at 9:24 AM, elisabeth benoit
 elisaelisael...@gmail.com wrote:
  Hello,
 
  I am using solr 4.2.1 and I have a autocomplete_edge type defined in
  schema.xml
 
 
  <fieldType name="autocomplete_edge" class="solr.TextField">
    <analyzer type="index">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
      <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
    </analyzer>
    <analyzer type="query">
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
      <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
    </analyzer>
  </fieldType>
 
  When I have a request with more than one word, for instance rue de la, my
  request doesn't match with my autocomplete_edge field unless I use quotes
  around the query. In other words q=rue de la doesn't work and
  q="rue de la" works.
 
  I've checked the request with debugQuery=on, and I can see that in the
  first case the query is split into words, and I don't understand why,
  since my field type uses KeywordTokenizerFactory.
 
  Does anyone have a clue on how I can request my field without using
 quotes?
 
  Thanks,
  Elisabeth



Re: autocomplete_edge type split words

2013-09-27 Thread elisabeth benoit
Yes!

what I've done is set autoGeneratePhraseQueries to true for my field, then
give it a boost (bq=myAutompleteEdgeNGramField=my query with spaces^50).
This only worked with autoGeneratePhraseQueries=true, for a reason I didn't
understand.

since when I did

q= myAutompleteEdgeNGramField=my query with spaces, I didn't need
autoGeneratePhraseQueries
set to true.

and, another thing is when I tried

q=myAutocompleteNGramField:(my query with spaces) OR
myAutompleteEdgeNGramField=my
query with spaces

(with a request handler with edismax and default operator field = AND), the
request on myAutocompleteNGramField would OR the grams, so I had to put an
AND (myAutocompleteNGramField:(my AND query AND with AND spaces)), which
was pretty ugly.
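A small client-side helper can at least hide that ugliness by building the AND'd clause from the raw query; a purely illustrative Python sketch (not a Solr API):

```python
def and_clause(field, query):
    """Join the query words with AND inside a single field clause, so an
    edismax handler with a default OR operator does not OR the grams."""
    return "%s:(%s)" % (field, " AND ".join(query.split()))

print(and_clause("myAutocompleteNGramField", "my query with spaces"))
# myAutocompleteNGramField:(my AND query AND with AND spaces)
```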

I don't always understand what is exactly going on. If you have a pointer
to some text I could read to get more insights about this, please let me
know.

Thanks again,
Best regards,
Elisabeth




2013/9/27 Erick Erickson erickerick...@gmail.com

 Have you looked at autoGeneratePhraseQueries? That might help.

 If that doesn't work, you can always do something like add an OR clause
 like
 OR original query
 and optionally boost it high. But I'd start with the autoGenerate bits.

 Best,
 Erick


 On Fri, Sep 27, 2013 at 7:37 AM, elisabeth benoit
 elisaelisael...@gmail.com wrote:
  Thanks for your answer.
 
  So I guess if someone wants to search on two fields, one with a phrase
  query and one with a normal query (split into words), one has to find a
  way to send the query twice: once with quotes and once without...
 
  Best regards,
  Elisabeth
 
 
  2013/9/27 Erick Erickson erickerick...@gmail.com
 
  This is a classic issue where there's confusion between
  the query parser and field analysis.
 
  Early in the process the query parser has to take the input
  and break it up. that's how, for instance, a query like
  text:term1 term2
  gets parsed as
  text:term1 defaultfield:term2
  This happens long before the terms get to the analysis chain
  for the field.
 
  So your only options are to either quote the string or
  escape the spaces.
 
  Best,
  Erick
 
  On Wed, Sep 25, 2013 at 9:24 AM, elisabeth benoit
  elisaelisael...@gmail.com wrote:
   Hello,
  
   I am using solr 4.2.1 and I have a autocomplete_edge type defined in
   schema.xml
  
  
    <fieldType name="autocomplete_edge" class="solr.TextField">
      <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
        <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
        <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
      </analyzer>
    </fieldType>
  
    When I have a request with more than one word, for instance rue de la,
    my request doesn't match with my autocomplete_edge field unless I use
    quotes around the query. In other words q=rue de la doesn't work and
    q="rue de la" works.
   
    I've checked the request with debugQuery=on, and I can see that in the
    first case the query is split into words, and I don't understand why,
    since my field type uses KeywordTokenizerFactory.
  
   Does anyone have a clue on how I can request my field without using
  quotes?
  
   Thanks,
   Elisabeth
 



autocomplete_edge type split words

2013-09-25 Thread elisabeth benoit
Hello,

I am using solr 4.2.1 and I have a autocomplete_edge type defined in
schema.xml


<fieldType name="autocomplete_edge" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
    <filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="\s+" replacement=" " replace="all"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{30})(.*)?" replacement="$1" replace="all"/>
  </analyzer>
</fieldType>

When I have a request with more than one word, for instance rue de la, my
request doesn't match with my autocomplete_edge field unless I use quotes
around the query. In other words q=rue de la doesn't work and q="rue de la"
works.

I've checked the request with debugQuery=on, and I can see that in the first
case the query is split into words, and I don't understand why, since my
field type uses KeywordTokenizerFactory.

Does anyone have a clue on how I can request my field without using quotes?

Thanks,
Elisabeth


homogeneous dispersion in a bbox

2013-03-05 Thread elisabeth benoit
Hello,

I'd like to know if there is some specific way, in Solr 3.6.1, to get
something like a homogeneous dispersion of documents in a bbox.

My use case: I have a request returning, let's say, 1000 documents in a
bbox (they all have the same Solr score), and I want only 50 documents,
but not all heaped together in one specific geographical location.

We were thinking of adding a random field to our index and sorting on
that field, but I'm wondering whether Solr already has a solution for
that kind of use case.
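In the absence of a built-in option, one client-side approach is to bucket the returned documents into a coarse grid over the bbox and pick round-robin from the cells; an illustrative Python sketch (assumes a non-degenerate bbox, all names hypothetical):

```python
def thin_results(docs, bbox, rows=3, cols=3, limit=50):
    """Pick up to `limit` docs spread over a rows x cols grid covering bbox.
    docs: list of (lat, lon, doc); bbox: (min_lat, min_lon, max_lat, max_lon).
    Client-side sketch; Solr 3.6 has no built-in result thinning."""
    min_lat, min_lon, max_lat, max_lon = bbox
    cells = {}
    for lat, lon, doc in docs:
        # clamp each document into one grid cell
        r = min(rows - 1, int((lat - min_lat) / (max_lat - min_lat) * rows))
        c = min(cols - 1, int((lon - min_lon) / (max_lon - min_lon) * cols))
        cells.setdefault((r, c), []).append(doc)
    picked = []
    while len(picked) < limit and any(cells.values()):
        for bucket in cells.values():  # round-robin across grid cells
            if bucket and len(picked) < limit:
                picked.append(bucket.pop(0))
    return picked
```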


best regards,
Elisabeth


matching with whole field

2012-08-02 Thread elisabeth benoit
Hello,

I am using Solr 3.4.

I'm trying to define a type that matches only if the request contains
exactly the same words.

Let's say I have two different values for ONLY_EXACT_MATCH_FIELD:

ONLY_EXACT_MATCH_FIELD: salon de coiffure
ONLY_EXACT_MATCH_FIELD: salon de coiffure pour femmes

I would like to match only the first one when requesting Solr with
fq=ONLY_EXACT_MATCH_FIELD:(salon de coiffure)

As far as I understood, the solution is not to tokenize on white
spaces, but to use solr.KeywordTokenizerFactory instead.


My actual type is defined as follows in schema.xml:

<fieldType name="ONLY_EXACT_MATCH_FIELD" class="solr.TextField" omitNorms="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.LengthFilterFactory" min="1" max="100"/>
  </analyzer>
</fieldType>

But matching fields with more than one word doesn't work. Does someone
have a clue what I am doing wrong?

Thanks,
Elisabeth


Re: matching with whole field

2012-08-02 Thread elisabeth benoit
Hello Chantal,

Thanks for your answer.

In fact, my analyzer contains the same tokenizer chain for the query. I just
removed it in my email for readability (but maybe not good for clarity). And
I did check with the admin interface, and it says there is a match. But
with a real query to Solr, it doesn't match.

I've once read in the mailing list that one should not always trust the
admin interface for analysis...

I don't think this should interfere, but my default request handler (the one
used by fq I guess) is not edismax.


If you have more clues, I'd be glad to read.

Thanks again,
Elisabeth



2012/8/2 Chantal Ackermann c.ackerm...@it-agenten.com

 Hi Elisabeth,

 try adding the same tokenizer chain for query, as well, or simply remove
 the type=index from the analyzer element.

 Your chain is analyzing the input of the indexer and removing diacritics
 and lowercasing. With your current setup, the input to the search is not
 analyzed likewise so inputs that are not lowercased or contain diacritics
 will not match.

 You might want to use the analysis frontend in the Admin UI to see how
 input to the indexer and the searcher is transformed and matched.

 Cheers,
 Chantal

 Am 02.08.2012 um 09:56 schrieb elisabeth benoit:

  Hello,
 
  I am using Solr 3.4.
 
  I'm trying to define a type that it is possible to match with only if
  request contains exactly the same words.
 
  Let's say I have two different values for ONLY_EXACT_MATCH_FIELD
 
  ONLY_EXACT_MATCH_FIELD: salon de coiffure
  ONLY_EXACT_MATCH_FIELD: salon de coiffure pour femmes
 
  I would like to match only with the first ont when requesting Solr with
  fq=ONLY_EXACT_MATCH_FIELD:(salon de coiffure)
 
  As far has I understood, the solution is to do not tokenize on white
  spaces, and use instead solr.KeywordTokenizerFactory
 
 
  My actual type is defined as followed in schema.xml
 
  <fieldType name="ONLY_EXACT_MATCH_FIELD" class="solr.TextField" omitNorms="true" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="1" max="100"/>
    </analyzer>
  </fieldType>
 
  But matching with fields with more then one word doesn't work. Does
 someone
  have a clue what I am doing wrong?
 
  Thanks,
  Elisabeth




Re: matching with whole field

2012-08-02 Thread elisabeth benoit
Thank you so much, Franck Brisbart.

It's working!

Best regards,
Elisabeth

2012/8/2 fbrisbart fbrisb...@bestofmedia.com

 It's a parsing problem.
 You must tell the query parser to consider spaces as real characters.
 This should work (backslashing the spaces):
 fq=ONLY_EXACT_MATCH_FIELD:salon\ de\ coiffure

 or you may use something like that :
 fq={!term f=ONLY_EXACT_MATCH_FIELD v=$qq}&qq=salon de coiffure
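Either form can be built programmatically on the client; a Python sketch (the helper names are illustrative, not part of any Solr client library):

```python
def term_fq_params(field, value):
    """Build request params using the {!term} query parser, which looks up
    the raw value as a single term (no whitespace splitting, no analysis)."""
    return {"fq": "{!term f=%s v=$qq}" % field, "qq": value}

def escaped_fq(field, value):
    """Alternative: backslash-escape spaces so the parser keeps one token."""
    return {"fq": "%s:%s" % (field, value.replace(" ", "\\ "))}

print(term_fq_params("ONLY_EXACT_MATCH_FIELD", "salon de coiffure"))
```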


 Hope it helps,
 Franck Brisbart


 Le jeudi 02 août 2012 à 09:56 +0200, elisabeth benoit a écrit :
  Hello,
 
  I am using Solr 3.4.
 
  I'm trying to define a type that it is possible to match with only if
  request contains exactly the same words.
 
  Let's say I have two different values for ONLY_EXACT_MATCH_FIELD
 
  ONLY_EXACT_MATCH_FIELD: salon de coiffure
  ONLY_EXACT_MATCH_FIELD: salon de coiffure pour femmes
 
  I would like to match only with the first ont when requesting Solr with
  fq=ONLY_EXACT_MATCH_FIELD:(salon de coiffure)
 
  As far has I understood, the solution is to do not tokenize on white
  spaces, and use instead solr.KeywordTokenizerFactory
 
 
  My actual type is defined as followed in schema.xml
 
  <fieldType name="ONLY_EXACT_MATCH_FIELD" class="solr.TextField" omitNorms="true" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
      <filter class="solr.ISOLatin1AccentFilterFactory"/>
      <filter class="solr.StandardFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.LengthFilterFactory" min="1" max="100"/>
    </analyzer>
  </fieldType>
 
  But matching with fields with more then one word doesn't work. Does
 someone
  have a clue what I am doing wrong?
 
  Thanks,
  Elisabeth





Re: how to read fieldValueCacheStatistics

2012-05-31 Thread elisabeth benoit
ok, thanks a lot for the answer.

Elisabeth

2012/5/31 Chris Hostetter hossman_luc...@fucit.org


 : When I read fieldValueCache statistics I have something that looks like
 :
 : item_ABC_FACET :
 :
 {field=ABC_FACET,memSize=4224,tindexSize=32,time=92,phase1=92,nTerms=0,bigTerms=0,termInstances=0,uses=11}
 :
 :
 : is there a doc somewhere that explains what are

 ...technically that's one stat, showing you an UnInvertedField
 instance in the cache (that's the string-ification of that
 UnInvertedField)

 the specifics of what those numbers mean are definitely what i would
 consider expert level ... off the top of my head the only ones i am
 fairly sure of are:

 memSize - how many bytes of ram it's using
 time - how long it took to build
 nTerms - number of unique terms in that field
 bigTerms - number of big terms, i.e. terms that have such a high docFreq
 that they weren't un-inverted because it would be too inefficient.

 In general, this level of detail is the kind of thing where you should
 probably review the code.


 -Hoss



Re: Multi-words synonyms matching

2012-05-29 Thread elisabeth benoit
Hello Bernd,

Thanks a lot for your answer. I'll work on this.

Best regards,
Elisabeth

2012/5/29 Bernd Fehling bernd.fehl...@uni-bielefeld.de

 Hello Elisabeth,

 my synonyms.txt is like your 2nd example:

 naturwald, φυσικό\ δάσος, естествена\ гора, prírodný\ les, naravni\ gozd,
 foresta\ naturale, natuurbos, natural\ forest, bosque\ natural,
 természetes\ erdő,
 natūralus\ miškas, prirodna\ šuma, dabiskais\ mežs, floresta\ natural,
 naturskov,
 forêt\ naturelle, naturskog, přírodní\ les, luonnonmetsä, pădure\ naturală,
 las\ naturalny, natürlicher\ wald


 An example from my system with debugging turned on and searching for
 naturwald:

 <lst name="debug">
   <str name="rawquerystring">naturwald</str>
   <str name="querystring">naturwald</str>
   <str name="parsedquery">textth:naturwald textth:φυσικό δάσος textth:естествена гора
   textth:prírodný les textth:naravni gozd textth:foresta naturale textth:natuurbos
   textth:natural forest textth:bosque natural textth:természetes erdő
   textth:natūralus miškas textth:prirodna šuma textth:dabiskais mežs
   textth:floresta natural textth:naturskov textth:forêt naturelle textth:naturskog
   textth:přírodní les textth:luonnonmetsä textth:pădure naturală textth:las naturalny
   textth:natürlicher wald</str>
 ...

 As you can see my search for naturwald extends to single and multiword
 synonyms e.g. forêt naturelle


 My SynonymFilterFactory has the following settings:

 org.apache.solr.analysis.SynonymFilterFactory
 {tokenizerFactory=solr.KeywordTokenizerFactory,
 synonyms=synonyms_eurovoc_desc_desc_ufall.txt, expand=true, format=solr,
 ignoreCase=true,
 luceneMatchVersion=LUCENE_36}

 But as I already mentioned, there is much more work to be done to get it
 running than
 just using SynonymFilterFactory.

 Regards
 Bernd



 Am 23.05.2012 08:49, schrieb elisabeth benoit:
  Hello Bernd,
 
  Thanks for your advice.
 
  I have one question: how did you manage to map one word to a multiwords
  synonym???
 
  I've tried (in synonyms.txt)
 
  mairie, hotel de ville
 
  mairie, hotel\ de\ ville
 
  mairie = mairie, hotel de ville
 
  mairie = mairie, hotel\ de\ ville
 
  but nothing prevents mairie from matching with hotel...
 
  The only way I found is to use
  tokenizerFactory=solr.KeywordTokenizerFactory in my synonyms
 declaration
  in schema.xml, but then since mairie is not alone in my index field, it
  doesn't match.
 
 
  best regards,
  Elisabeth
 
 
 
 
  the only way I found, I schema.xml, is to use
 
 
 
  2012/5/15 Bernd Fehling bernd.fehl...@uni-bielefeld.de
 
  Without reading the whole thread let me say that you should not trust
  the solr admin analysis. It takes the whole multiword search and runs
  it all together at once through each analyzer step (factory).
  But this is not how the real system works. First pitfall, the query
 parser
  is also splitting at white space (if not a phrase query). Due to this,
  a multiword query is send chunk after chunk through the analyzer and,
  second pitfall, each chunk runs through the whole analyzer by its own.
 
  So if you are dealing with multiword synonyms you have the following
  problems. Either you turn your query into a phrase so that the whole
  phrase is analyzed at once and therefore looked up as multiword synonym
  but phrase queries are not analyzed !!! OR you send your query chunk
  by chunk through the analyzer but then they are not multiwords anymore
  and are not found in your synonyms.txt.
 
  From my experience I can say that it requires some deep work to get it
 done
  but it is possible. I have connected a thesaurus to solr which is doing
  query time expansion (no need to reindex if the thesaurus changes).
  The thesaurus holds synonyms and used for terms in 24 languages. So
  it is also some kind of language translation. And naturally the
 thesaurus
  translates from single term to multi term synonyms and vice versa.
 
  Regards,
  Bernd
 
 
  Am 14.05.2012 13:54, schrieb elisabeth benoit:
  Just for the record, I'd like to conclude this thread
 
  First, you were right, there was no behaviour difference between fq
 and q
  parameters.
 
  I realized that:
 
  1) my synonym (hotel de ville) has a stopword in it (de) and since I
 used
  tokenizerFactory=solr.KeywordTokenizerFactory in my synonyms
  declaration,
  there was no stopword removal in the indewed expression, so when
  requesting
  hotel de ville, after stopwords removal in query, Solr was comparing
  hotel de ville
  with hotel ville
 
  but my queries never even got to that point since
 
  2) I made a mistake using mairie alone in the admin interface when
  testing my schema. The real field was something like collectivités
  territoriales mairie,
  so the synonym hotel de ville was not even applied, because of the
  tokenizerFactory=solr.KeywordTokenizerFactory in my synonym
 definition
  not splitting field into words when parsing
 
  So my problem is not solved, and I'm considering solving it outside of
  Solr
  scope, unless someone else has a clue

Re: Multi-words synonyms matching

2012-05-23 Thread elisabeth benoit
Hello Bernd,

Thanks for your advice.

I have one question: how did you manage to map one word to a multiword
synonym???

I've tried (in synonyms.txt)

mairie, hotel de ville

mairie, hotel\ de\ ville

mairie = mairie, hotel de ville

mairie = mairie, hotel\ de\ ville

but nothing prevents mairie from matching with hotel...

The only way I found is to use
tokenizerFactory=solr.KeywordTokenizerFactory in my synonyms declaration
in schema.xml, but then since mairie is not alone in my index field, it
doesn't match.


best regards,
Elisabeth




the only way I found, I schema.xml, is to use



2012/5/15 Bernd Fehling bernd.fehl...@uni-bielefeld.de

 Without reading the whole thread let me say that you should not trust
 the solr admin analysis. It takes the whole multiword search and runs
 it all together at once through each analyzer step (factory).
 But this is not how the real system works. First pitfall: the query parser
 is also splitting at white space (if not a phrase query). Due to this,
 a multiword query is sent chunk after chunk through the analyzer and,
 second pitfall, each chunk runs through the whole analyzer on its own.

 So if you are dealing with multiword synonyms you have the following
 problems. Either you turn your query into a phrase so that the whole
 phrase is analyzed at once and therefore looked up as multiword synonym
 but phrase queries are not analyzed !!! OR you send your query chunk
 by chunk through the analyzer but then they are not multiwords anymore
 and are not found in your synonyms.txt.

 From my experience I can say that it requires some deep work to get it done
 but it is possible. I have connected a thesaurus to solr which is doing
 query time expansion (no need to reindex if the thesaurus changes).
 The thesaurus holds synonyms and used for terms in 24 languages. So
 it is also some kind of language translation. And naturally the thesaurus
 translates from single term to multi term synonyms and vice versa.

 Regards,
 Bernd


 Am 14.05.2012 13:54, schrieb elisabeth benoit:
  Just for the record, I'd like to conclude this thread
 
  First, you were right, there was no behaviour difference between fq and q
  parameters.
 
  I realized that:
 
  1) my synonym (hotel de ville) has a stopword in it (de) and since I used
  tokenizerFactory=solr.KeywordTokenizerFactory in my synonyms
 declaration,
  there was no stopword removal in the indewed expression, so when
 requesting
  hotel de ville, after stopwords removal in query, Solr was comparing
  hotel de ville
  with hotel ville
 
  but my queries never even got to that point since
 
  2) I made a mistake using mairie alone in the admin interface when
  testing my schema. The real field was something like collectivités
  territoriales mairie,
  so the synonym hotel de ville was not even applied, because of the
  tokenizerFactory=solr.KeywordTokenizerFactory in my synonym definition
  not splitting field into words when parsing
 
  So my problem is not solved, and I'm considering solving it outside of
 Solr
  scope, unless someone else has a clue
 
  Thanks again,
  Elisabeth
 
 
 
  2012/4/25 Erick Erickson erickerick...@gmail.com
 
  A little farther down the debug info output you'll find something
  like this (I specified fq=name:features)
 
  <arr name="parsed_filter_queries">
    <str>name:features</str>
  </arr>
 
 
  so it may well give you some clue. But unless I'm reading things wrong,
  your
  q is going against a field that has much more information than the
  CATEGORY_ANALYZED field, is it possible that the data from your
  test cases simply isn't _in_ CATEGORY_ANALYZED?
 
  Best
  Erick
 
  On Wed, Apr 25, 2012 at 9:39 AM, elisabeth benoit
  elisaelisael...@gmail.com wrote:
  I'm not at the office until next Wednesday, and I don't have my Solr
  under
  hand, but isn't debugQuery=on giving informations only about q
 parameter
  matching and nothing about fq parameter? Or do you mean
  parsed_filter_queries gives information about fq?
 
  CATEGORY_ANALYZED is being populated by a copyField instruction in
  schema.xml, and has the same field type as my catchall field, the
 search
  field for my searchHandler (the one being used by q parameter).
 
  CATEGORY (a string) is copied in CATEGORY_ANALYZED (field type is text)
 
  CATEGORY (a string) is copied in catchall field (field type is text),
  and a
  lot of other fields are copied too in that catchall field.
 
  So as far as I can see, the same analysis should be done in both cases,
  but
  obviously I'm missing something, and the only thing I can think of is a
  different behavior between q and fq parameter.
 
  I'll check that parsed_filter_querie first thing in the morning next
  Wednesday.
 
  Thanks a lot for your help.
 
  Elisabeth
 
 
  2012/4/24 Erick Erickson erickerick...@gmail.com
 
  Elisabeth:
 
  What shows up in the debug section of the response when you add
  debugQuery=on? There should be some bit of that section like:
  parsed_filter_queries
 
  My other

Re: solr tokenizer not splitting unbreakable expressions

2012-05-23 Thread elisabeth benoit
Hello Tanguy,

I guess you're right, maybe this shouldn't be done in Solr but inside of
the front-end.

Thanks a lot for your answer.

Elisabeth

2012/5/22 Tanguy Moal tanguy.m...@gmail.com

 Hello Elisabeth,

 Wouldn't it be simpler to have a custom component inside of the
 front-end to your search server that would transform a query like hotel
 de ville paris into "hotel de ville" paris (i.e. turning each
 occurrence of the sequence hotel de ville into a phrase query)?

 Concerning protections inside of the tokenizer, I think that is not
 possible actually.
 The main reason for this could be that the QueryParser will break the query
 on each space before passing each query-part through the analysis of every
 searched field. Hence all the smart things you would put at indexing time
 to wrap a sequence of tokens into a single one are not reproducible at query
 time.

 Please someone correct me if I'm wrong!

 Alternatively, I think you might do so with a custom query parser (in order
 to have phrases sent to the analyzers instead of words). But since
 tokenizers don't have support for protected words list, you would need an
 additional custom token filter that would consume the tokens stream and
 annotate those matching an entry in the protection list.
 Unfortunately, if your protected list is long, you will have performance
 issues. Unless you rely on a dedicated data structure, like Trie-based
 structures (Patricia-trie, ...) You can find solid implementations on the
 Internet (see https://github.com/rkapsi/patricia-trie).

 Then you could make your filter consume a sliding window of tokens while
 the window matches in your trie.
 Once you have a complete match in your trie, the filter can set an
 attribute of the type your choice (e.g. MyCustomKeywordAttribute) on the
 first matching token, and make the attribute be the complete match (e.g.
 Hotel de ville).
 If you don't have a complete match, drop the unmatched tokens leaving them
 unmodified.
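The sliding-window idea can be sketched with a flat set in place of a trie (illustrative Python; for a large protection list a Patricia trie would replace the inner loop, as suggested above):

```python
def group_phrases(query, protected):
    """Greedily group word sequences that appear in `protected` into single
    quoted phrases; leave other words untouched."""
    words = query.split()
    out, i = [], 0
    while i < len(words):
        # try the longest window first, shrinking toward a single word
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if j - i > 1 and candidate in protected:
                out.append('"%s"' % candidate)
                i = j
                break
        else:
            out.append(words[i])  # no protected phrase starts here
            i += 1
    return " ".join(out)

print(group_phrases("hotel de ville paris", {"hotel de ville"}))
# "hotel de ville" paris
```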

 I Hope this helps...

 --
 Tanguy


 2012/5/22 elisabeth benoit elisaelisael...@gmail.com

  Hello,
 
  Does someone know if there is a way to configure a tokenizer to split
  all words on white spaces, excluding a bunch of expressions listed in
  a file?
 
  For instance, if I want hotel de ville not to be split into words, a
  request like hotel de ville paris would be split into two tokens:
 
  hotel de ville and paris instead of 4 tokens
 
  hotel
  de
  ville
  paris
 
  I imagine something like
 
  tokenizer class=solr.StandardTokenizerFactory
  protected=protoexpressions.txt/
 
  Thanks a lot,
  Elisabeth
 



Re: Multi-words synonyms matching

2012-05-14 Thread elisabeth benoit
Just for the record, I'd like to conclude this thread

First, you were right, there was no behaviour difference between fq and q
parameters.

I realized that:

1) my synonym ("hotel de ville") has a stopword in it ("de"), and since I used
tokenizerFactory="solr.KeywordTokenizerFactory" in my synonyms declaration,
there was no stopword removal in the indexed expression; so when requesting
"hotel de ville", after stopword removal in the query, Solr was comparing
"hotel de ville"
with "hotel ville"
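That mismatch is easy to reproduce outside Solr; a minimal simulation (the stopword list and analysis chain here are simplified assumptions, not Solr's actual pipeline):

```python
STOPWORDS = {"de"}  # simplified French stopword list

def query_side(text):
    # query analysis: whitespace split + stopword removal
    return [t for t in text.lower().split() if t not in STOPWORDS]

# the index side used KeywordTokenizerFactory on the synonym entry,
# so the stopword "de" was never removed there
indexed_synonym = "hotel de ville"
query_tokens = query_side("hotel de ville")

print(query_tokens)                               # ['hotel', 'ville']
print(" ".join(query_tokens) == indexed_synonym)  # False -- no match
```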

but my queries never even got to that point since

2) I made a mistake using "mairie" alone in the admin interface when
testing my schema. The real field was something like "collectivités
territoriales mairie",
so the synonym "hotel de ville" was not even applied, because the
tokenizerFactory="solr.KeywordTokenizerFactory" in my synonym definition
does not split the field into words when parsing

So my problem is not solved, and I'm considering solving it outside of Solr
scope, unless someone else has a clue

Thanks again,
Elisabeth



2012/4/25 Erick Erickson erickerick...@gmail.com

 A little farther down the debug info output you'll find something
 like this (I specified fq=name:features)

  <arr name="parsed_filter_queries">
    <str>name:features</str>
  </arr>


 so it may well give you some clue. But unless I'm reading things wrong,
 your
 q is going against a field that has much more information than the
 CATEGORY_ANALYZED field, is it possible that the data from your
 test cases simply isn't _in_ CATEGORY_ANALYZED?

 Best
 Erick

 On Wed, Apr 25, 2012 at 9:39 AM, elisabeth benoit
 elisaelisael...@gmail.com wrote:
  I'm not at the office until next Wednesday, and I don't have my Solr
 under
  hand, but isn't debugQuery=on giving information only about q parameter
  matching and nothing about fq parameter? Or do you mean
  parsed_filter_queries gives information about fq?
 
  CATEGORY_ANALYZED is being populated by a copyField instruction in
  schema.xml, and has the same field type as my catchall field, the search
  field for my searchHandler (the one being used by q parameter).
 
  CATEGORY (a string) is copied in CATEGORY_ANALYZED (field type is text)
 
  CATEGORY (a string) is copied in catchall field (field type is text),
 and a
  lot of other fields are copied too in that catchall field.
 
  So as far as I can see, the same analysis should be done in both cases,
 but
  obviously I'm missing something, and the only thing I can think of is a
  different behavior between q and fq parameter.
 
   I'll check parsed_filter_queries first thing in the morning next
  Wednesday.
 
  Thanks a lot for your help.
 
  Elisabeth
 
 
  2012/4/24 Erick Erickson erickerick...@gmail.com
 
  Elisabeth:
 
  What shows up in the debug section of the response when you add
  debugQuery=on? There should be some bit of that section like:
  parsed_filter_queries
 
  My other question is are you absolutely sure that your
  CATEGORY_ANALYZED field has the correct content?. How does it
  get populated?
 
  Nothing jumps out at me here
 
  Best
  Erick
 
  On Tue, Apr 24, 2012 at 9:55 AM, elisabeth benoit
  elisaelisael...@gmail.com wrote:
   yes, thanks, but this is NOT my question.
  
   I was wondering why I have multiple matches with q=hotel de ville
 and
  no
   match with fq=CATEGORY_ANALYZED:hotel de ville, since in both case
 I'm
   searching in the same solr fieldType.
  
   Why is q parameter behaving differently in that case? Why do the
 quotes
   work in one case and not in the other?
  
   Does anyone know?
  
   Thanks,
   Elisabeth
  
   2012/4/24 Jeevanandam je...@myjeeva.com
  
  
   usage of q and fq
  
   q = is typically the main query for the search request
  
   fq = is Filter Query; generally used to restrict the super set of
   documents without influencing score (more info.
    http://wiki.apache.org/solr/CommonQueryParameters#q
   )
  
   For example:
   
    q="hotel de ville" === returns 100 documents
   
    q="hotel de ville"&fq=price:[100 TO *]&fq=roomType:"King size Bed" ===
    returns 40 documents from super set of 100 documents
  
  
   hope this helps!
  
   - Jeevanandam
  
  
  
   On 24-04-2012 3:08 pm, elisabeth benoit wrote:
  
   Hello,
  
   I'd like to resume this post.
  
   The only way I found to do not split synonyms in words in
 synonyms.txt
  it
   to use the line
  
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
    ignoreCase="true" expand="true"
    tokenizerFactory="solr.KeywordTokenizerFactory"/>
  
   in schema.xml
  
    where tokenizerFactory="solr.KeywordTokenizerFactory"
  
   instructs SynonymFilterFactory not to break synonyms into words on
  white
   spaces when parsing synonyms file.
  
   So now it works fine, mairie is mapped into hotel de ville and
  when I
   send request q=hotel de ville (quotes are mandatory to prevent
  analyzer
   to split hotel de ville on white spaces), I get answers with word
   mairie.
  
   But when I

Re: Multi-words synonyms matching

2012-04-25 Thread elisabeth benoit
I'm not at the office until next Wednesday, and I don't have my Solr under
hand, but isn't debugQuery=on giving information only about q parameter
matching and nothing about fq parameter? Or do you mean
parsed_filter_queries gives information about fq?

CATEGORY_ANALYZED is being populated by a copyField instruction in
schema.xml, and has the same field type as my catchall field, the search
field for my searchHandler (the one being used by q parameter).

CATEGORY (a string) is copied in CATEGORY_ANALYZED (field type is text)

CATEGORY (a string) is copied in catchall field (field type is text), and a
lot of other fields are copied too in that catchall field.

So as far as I can see, the same analysis should be done in both cases, but
obviously I'm missing something, and the only thing I can think of is a
different behavior between q and fq parameter.

I'll check parsed_filter_queries first thing in the morning next
Wednesday.

Thanks a lot for your help.

Elisabeth


2012/4/24 Erick Erickson erickerick...@gmail.com

 Elisabeth:

 What shows up in the debug section of the response when you add
 debugQuery=on? There should be some bit of that section like:
 parsed_filter_queries

 My other question is are you absolutely sure that your
 CATEGORY_ANALYZED field has the correct content?. How does it
 get populated?

 Nothing jumps out at me here

 Best
 Erick

 On Tue, Apr 24, 2012 at 9:55 AM, elisabeth benoit
 elisaelisael...@gmail.com wrote:
  yes, thanks, but this is NOT my question.
 
  I was wondering why I have multiple matches with q="hotel de ville" and
 no
  match with fq=CATEGORY_ANALYZED:"hotel de ville", since in both cases I'm
  searching in the same Solr fieldType.
 
  Why is q parameter behaving differently in that case? Why do the quotes
  work in one case and not in the other?
 
  Does anyone know?
 
  Thanks,
  Elisabeth
 
  2012/4/24 Jeevanandam je...@myjeeva.com
 
 
  usage of q and fq
 
  q = is typically the main query for the search request
 
  fq = is Filter Query; generally used to restrict the super set of
  documents without influencing score (more info.
   http://wiki.apache.org/solr/CommonQueryParameters#q
  )
 
  For example:
  
   q="hotel de ville" === returns 100 documents
  
   q="hotel de ville"&fq=price:[100 TO *]&fq=roomType:"King size Bed" ===
   returns 40 documents from super set of 100 documents
 
 
  hope this helps!
 
  - Jeevanandam
 
 
 
  On 24-04-2012 3:08 pm, elisabeth benoit wrote:
 
  Hello,
 
  I'd like to resume this post.
 
  The only way I found to do not split synonyms in words in synonyms.txt
 it
  to use the line
 
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
   ignoreCase="true" expand="true"
   tokenizerFactory="solr.KeywordTokenizerFactory"/>
 
  in schema.xml
 
  where tokenizerFactory="solr.KeywordTokenizerFactory"
 
  instructs SynonymFilterFactory not to break synonyms into words on
 white
  spaces when parsing synonyms file.
 
  So now it works fine, mairie is mapped into hotel de ville and
 when I
  send request q=hotel de ville (quotes are mandatory to prevent
 analyzer
  to split hotel de ville on white spaces), I get answers with word
  mairie.
 
  But when I use fq parameter (fq=CATEGORY_ANALYZED:hotel de ville), it
  doesn't work!!!
 
  CATEGORY_ANALYZED is same field type as default search field. This
 means
  that when I send q=hotel de ville and fq=CATEGORY_ANALYZED:hotel de
  ville, solr uses the same analyzer, the one with the line
 
   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
   ignoreCase="true" expand="true"
   tokenizerFactory="solr.KeywordTokenizerFactory"/>.
 
  Anyone has a clue what is different between q analysis behaviour and fq
  analysis behaviour?
 
  Thanks a lot
  Elisabeth
 
  2012/4/12 elisabeth benoit elisaelisael...@gmail.com
 
   oh, that's right.
 
  thanks a lot,
  Elisabeth
 
 
  2012/4/11 Jeevanandam Madanagopal je...@myjeeva.com
 
   Elisabeth -
 
  As you described, below mapping might suit for your need.
  mairie => hotel de ville, mairie
 
  mairie gets expanded to hotel de ville and mairie at index time.
  So
  mairie and hotel de ville searchable on document.
 
  However, still white space tokenizer splits at query time will be a
  problem as described by Markus.
 
  --Jeevanandam
 
  On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:
 
    Have you tried the '=>' mapping instead? Something
    like
    hotel de ville => mairie
    might work for you.
  
   Yes, thanks, I've tried it but from what I undestand it doesn't
 solve
  my
   problem, since this means hotel de ville will be replace by mairie
 at
   index time (I use synonyms only at index time). So when user will
 ask
   hôtel de ville, it won't match.
  
   In fact, at index time I have mairie in my data, but I want user
 to be
  able
   to request mairie or hôtel de ville and have mairie as answer,
 and
  not
   have mairie as an answer when requesting hôtel

Re: Multi-words synonyms matching

2012-04-24 Thread elisabeth benoit
Hello,

I'd like to resume this post.

The only way I found not to split synonyms into words in synonyms.txt is
to use the line

 <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"
tokenizerFactory="solr.KeywordTokenizerFactory"/>

in schema.xml

where tokenizerFactory="solr.KeywordTokenizerFactory"

instructs SynonymFilterFactory not to break synonyms into words on white
spaces when parsing the synonyms file.

So now it works fine: "mairie" is mapped into "hotel de ville", and when I
send the request q="hotel de ville" (quotes are mandatory to prevent the
analyzer from splitting "hotel de ville" on white spaces), I get answers
with the word "mairie".

But when I use the fq parameter (fq=CATEGORY_ANALYZED:"hotel de ville"), it
doesn't work!!!

CATEGORY_ANALYZED is the same field type as the default search field. This
means that when I send q="hotel de ville" and fq=CATEGORY_ANALYZED:"hotel de
ville", Solr uses the same analyzer, the one with the line

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"
tokenizerFactory="solr.KeywordTokenizerFactory"/>.

Anyone has a clue what is different between q analysis behaviour and fq
analysis behaviour?

Thanks a lot
Elisabeth

2012/4/12 elisabeth benoit elisaelisael...@gmail.com

 oh, that's right.

 thanks a lot,
 Elisabeth


 2012/4/11 Jeevanandam Madanagopal je...@myjeeva.com

 Elisabeth -

  As you described, the below mapping might suit your need:
  mairie => hotel de ville, mairie
 
  "mairie" gets expanded to "hotel de ville" and "mairie" at index time.  So
  both "mairie" and "hotel de ville" are searchable on the document.
 
  However, the whitespace tokenizer splitting at query time will still be a
  problem, as described by Markus.

 --Jeevanandam

 On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:

  Have you tried the '=>' mapping instead? Something
  like
  hotel de ville => mairie
  might work for you.
 
  Yes, thanks, I've tried it but from what I understand it doesn't solve my
  problem, since this means "hotel de ville" will be replaced by "mairie" at
  index time (I use synonyms only at index time). So when the user asks for
  "hôtel de ville", it won't match.
 
  In fact, at index time I have mairie in my data, but I want user to be
 able
  to request mairie or hôtel de ville and have mairie as answer, and
 not
  have mairie as an answer when requesting hôtel.
 
 
  To map `mairie` to `hotel de ville` as single token you must escape
 your
  white
  space.
 
  mairie, hotel\ de\ ville
 
  This results in  a problem if your tokenizer splits on white space at
  query
  time.
 
  Ok, I guess this means I have a problem. No simple solution since at
 query
  time my tokenizer does split on white spaces.
 
  I guess my problem is more or less one of the problems discussed in
 
 
 http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215
 
 
  Thanks a lot for your answers,
  Elisabeth
 
 
 
 
 
  2012/4/10 Erick Erickson erickerick...@gmail.com
 
   Have you tried the '=>' mapping instead? Something
   like
   hotel de ville => mairie
   might work for you.
 
  Best
  Erick
 
  On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit
  elisaelisael...@gmail.com wrote:
  Hello,
 
  I've read several post on this issue, but can't find a real solution
 to
  my
  multi-words synonyms matching problem.
 
  I have in my synonyms.txt an entry like
 
  mairie, hotel de ville
 
  and my index time analyzer is configured as followed for synonyms.
 
   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
   ignoreCase="true" expand="true"/>
 
  The problem I have is that now mairie matches with hotel and I
 would
  only want mairie to match with hotel de ville and mairie.
 
  When I look into the analyzer, I see that mairie is mapped into
  hotel,
  and words de ville are added in second and third position. To change
  that, I tried to do
 
   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
   ignoreCase="true" expand="true"
   tokenizerFactory="solr.KeywordTokenizerFactory"/> (as I read in one
 post)
 
  and I can see now in the analyzer that mairie is mapped to hotel de
  ville, but now when I have query hotel de ville, it doesn't match
 at
  all
  with mairie.
 
  Anyone has a clue of what I'm doing wrong?
 
  I'm using Solr 3.4.
 
  Thanks,
  Elisabeth
 





Re: Multi-words synonyms matching

2012-04-24 Thread elisabeth benoit
yes, thanks, but this is NOT my question.

I was wondering why I have multiple matches with q="hotel de ville" and no
match with fq=CATEGORY_ANALYZED:"hotel de ville", since in both cases I'm
searching in the same Solr fieldType.

Why is q parameter behaving differently in that case? Why do the quotes
work in one case and not in the other?

Does anyone know?

Thanks,
Elisabeth

2012/4/24 Jeevanandam je...@myjeeva.com


 usage of q and fq

 q = is typically the main query for the search request

 fq = is Filter Query; generally used to restrict the super set of
 documents without influencing score (more info.
  http://wiki.apache.org/solr/CommonQueryParameters#q
 )

 For example:
 
 q="hotel de ville" === returns 100 documents

 q="hotel de ville"&fq=price:[100 TO *]&fq=roomType:"King size Bed" ===
 returns 40 documents from super set of 100 documents
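Such a request can be assembled with standard URL encoding; a quick sketch (host, core and field names are hypothetical, following the example above):

```python
from urllib.parse import urlencode

# repeated "fq" keys are expressed as a list of (key, value) pairs
params = [
    ("q", '"hotel de ville"'),
    ("fq", "price:[100 TO *]"),
    ("fq", 'roomType:"King size Bed"'),
]
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)  # the two fq clauses are sent as separate parameters
```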


 hope this helps!

 - Jeevanandam



 On 24-04-2012 3:08 pm, elisabeth benoit wrote:

 Hello,

 I'd like to resume this post.

 The only way I found not to split synonyms into words in synonyms.txt is
 to use the line

   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
  ignoreCase="true" expand="true"
  tokenizerFactory="solr.KeywordTokenizerFactory"/>

 in schema.xml

  where tokenizerFactory="solr.KeywordTokenizerFactory"

 instructs SynonymFilterFactory not to break synonyms into words on white
 spaces when parsing synonyms file.

  So now it works fine: "mairie" is mapped into "hotel de ville", and when I
  send the request q="hotel de ville" (quotes are mandatory to prevent the
  analyzer from splitting "hotel de ville" on white spaces), I get answers
  with the word "mairie".

  But when I use the fq parameter (fq=CATEGORY_ANALYZED:"hotel de ville"), it
  doesn't work!!!

  CATEGORY_ANALYZED is the same field type as the default search field. This
  means that when I send q="hotel de ville" and fq=CATEGORY_ANALYZED:"hotel de
  ville", Solr uses the same analyzer, the one with the line

  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
  ignoreCase="true" expand="true"
  tokenizerFactory="solr.KeywordTokenizerFactory"/>.

  Anyone has a clue what is different between q analysis behaviour and fq
 analysis behaviour?

 Thanks a lot
 Elisabeth

 2012/4/12 elisabeth benoit elisaelisael...@gmail.com

  oh, that's right.

 thanks a lot,
 Elisabeth


 2012/4/11 Jeevanandam Madanagopal je...@myjeeva.com

  Elisabeth -

 As you described, below mapping might suit for your need.
  mairie => hotel de ville, mairie

 mairie gets expanded to hotel de ville and mairie at index time.  So
 mairie and hotel de ville searchable on document.

 However, still white space tokenizer splits at query time will be a
 problem as described by Markus.

 --Jeevanandam

 On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:

   Have you tried the '=>' mapping instead? Something
   like
   hotel de ville => mairie
   might work for you.
 
  Yes, thanks, I've tried it but from what I undestand it doesn't solve
 my
  problem, since this means hotel de ville will be replace by mairie at
  index time (I use synonyms only at index time). So when user will ask
  hôtel de ville, it won't match.
 
  In fact, at index time I have mairie in my data, but I want user to be
 able
  to request mairie or hôtel de ville and have mairie as answer, and
 not
  have mairie as an answer when requesting hôtel.
 
 
  To map `mairie` to `hotel de ville` as single token you must escape
 your
  white
  space.
 
  mairie, hotel\ de\ ville
 
  This results in  a problem if your tokenizer splits on white space
 at
  query
  time.
 
  Ok, I guess this means I have a problem. No simple solution since at
 query
  time my tokenizer does split on white spaces.
 
  I guess my problem is more or less one of the problems discussed in
 
 

  http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215
 
 
  Thanks a lot for your answers,
  Elisabeth
 
 
 
 
 
  2012/4/10 Erick Erickson erickerick...@gmail.com
 
   Have you tried the '=>' mapping instead? Something
   like
   hotel de ville => mairie
   might work for you.
 
  Best
  Erick
 
  On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit
  elisaelisael...@gmail.com wrote:
  Hello,
 
  I've read several post on this issue, but can't find a real solution
 to
  my
  multi-words synonyms matching problem.
 
  I have in my synonyms.txt an entry like
 
  mairie, hotel de ville
 
  and my index time analyzer is configured as followed for synonyms.
 
   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
   ignoreCase="true" expand="true"/>
 
  The problem I have is that now mairie matches with hotel and I
 would
  only want mairie to match with hotel de ville and mairie.
 
  When I look into the analyzer, I see that mairie is mapped into
  hotel,
  and words de ville are added in second and third position. To
 change
  that, I tried to do
 
  filter class=solr

Re: Multi-words synonyms matching

2012-04-12 Thread elisabeth benoit
oh, that's right.

thanks a lot,
Elisabeth

2012/4/11 Jeevanandam Madanagopal je...@myjeeva.com

 Elisabeth -

  As you described, the below mapping might suit your need:
  mairie => hotel de ville, mairie

  "mairie" gets expanded to "hotel de ville" and "mairie" at index time.  So
  both "mairie" and "hotel de ville" are searchable on the document.

  However, the whitespace tokenizer splitting at query time will still be a
  problem, as described by Markus.

 --Jeevanandam

 On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:

  Have you tried the '=>' mapping instead? Something
  like
  hotel de ville => mairie
  might work for you.
 
  Yes, thanks, I've tried it but from what I understand it doesn't solve my
  problem, since this means "hotel de ville" will be replaced by "mairie" at
  index time (I use synonyms only at index time). So when the user asks for
  "hôtel de ville", it won't match.
 
  In fact, at index time I have mairie in my data, but I want user to be
 able
  to request mairie or hôtel de ville and have mairie as answer, and
 not
  have mairie as an answer when requesting hôtel.
 
 
  To map `mairie` to `hotel de ville` as single token you must escape
 your
  white
  space.
 
  mairie, hotel\ de\ ville
 
  This results in  a problem if your tokenizer splits on white space at
  query
  time.
 
  Ok, I guess this means I have a problem. No simple solution since at
 query
  time my tokenizer does split on white spaces.
 
  I guess my problem is more or less one of the problems discussed in
 
 
 http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215
 
 
  Thanks a lot for your answers,
  Elisabeth
 
 
 
 
 
  2012/4/10 Erick Erickson erickerick...@gmail.com
 
  Have you tried the '=>' mapping instead? Something
  like
  hotel de ville => mairie
  might work for you.
 
  Best
  Erick
 
  On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit
  elisaelisael...@gmail.com wrote:
  Hello,
 
  I've read several post on this issue, but can't find a real solution to
  my
  multi-words synonyms matching problem.
 
  I have in my synonyms.txt an entry like
 
  mairie, hotel de ville
 
  and my index time analyzer is configured as followed for synonyms.
 
   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
   ignoreCase="true" expand="true"/>
 
  The problem I have is that now mairie matches with hotel and I
 would
  only want mairie to match with hotel de ville and mairie.
 
  When I look into the analyzer, I see that mairie is mapped into
  hotel,
  and words de ville are added in second and third position. To change
  that, I tried to do
 
   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
   ignoreCase="true" expand="true"
   tokenizerFactory="solr.KeywordTokenizerFactory"/> (as I read in one
 post)
 
  and I can see now in the analyzer that mairie is mapped to hotel de
  ville, but now when I have query hotel de ville, it doesn't match at
  all
  with mairie.
 
  Anyone has a clue of what I'm doing wrong?
 
  I'm using Solr 3.4.
 
  Thanks,
  Elisabeth
 




Re: Multi-words synonyms matching

2012-04-11 Thread elisabeth benoit
Have you tried the '=>' mapping instead? Something
like
hotel de ville => mairie
might work for you.

Yes, thanks, I've tried it but from what I understand it doesn't solve my
problem, since this means "hotel de ville" will be replaced by "mairie" at
index time (I use synonyms only at index time). So when the user asks for
"hôtel de ville", it won't match.

In fact, at index time I have "mairie" in my data, but I want the user to be
able to request "mairie" or "hôtel de ville" and have "mairie" as an answer,
and not have "mairie" as an answer when requesting "hôtel".


To map `mairie` to `hotel de ville` as a single token you must escape your
white space.

mairie, hotel\ de\ ville

This results in a problem if your tokenizer splits on white space at query
time.

Ok, I guess this means I have a problem. No simple solution since at query
time my tokenizer does split on white spaces.

I guess my problem is more or less one of the problems discussed in

http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215


Thanks a lot for your answers,
Elisabeth





2012/4/10 Erick Erickson erickerick...@gmail.com

  Have you tried the '=>' mapping instead? Something
  like
  hotel de ville => mairie
  might work for you.

 Best
 Erick

 On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit
 elisaelisael...@gmail.com wrote:
  Hello,
 
  I've read several post on this issue, but can't find a real solution to
 my
  multi-words synonyms matching problem.
 
  I have in my synonyms.txt an entry like
 
  mairie, hotel de ville
 
  and my index time analyzer is configured as followed for synonyms.
 
   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
   ignoreCase="true" expand="true"/>
 
  The problem I have is that now mairie matches with hotel and I would
  only want mairie to match with hotel de ville and mairie.
 
  When I look into the analyzer, I see that mairie is mapped into
 hotel,
  and words de ville are added in second and third position. To change
  that, I tried to do
 
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
  ignoreCase="true" expand="true"
  tokenizerFactory="solr.KeywordTokenizerFactory"/> (as I read in one post)
 
  and I can see now in the analyzer that mairie is mapped to hotel de
  ville, but now when I have query hotel de ville, it doesn't match at
 all
  with mairie.
 
  Anyone has a clue of what I'm doing wrong?
 
  I'm using Solr 3.4.
 
  Thanks,
  Elisabeth



Multi-words synonyms matching

2012-04-10 Thread elisabeth benoit
Hello,

I've read several posts on this issue, but can't find a real solution to my
multi-word synonym matching problem.

I have in my synonyms.txt an entry like

mairie, hotel de ville

and my index time analyzer is configured as follows for synonyms:

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>

The problem I have is that now "mairie" matches with "hotel", and I would
only want "mairie" to match with "hotel de ville" and "mairie".

When I look into the analyzer, I see that "mairie" is mapped into "hotel",
and the words "de" "ville" are added in second and third position. To change
that, I tried to do

<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"
tokenizerFactory="solr.KeywordTokenizerFactory"/> (as I read in one post)

and I can now see in the analyzer that "mairie" is mapped to "hotel de
ville", but now when I have the query "hotel de ville", it doesn't match at
all with "mairie".

Anyone has a clue of what I'm doing wrong?

I'm using Solr 3.4.

Thanks,
Elisabeth


disadvantage one field in a catchall field

2012-03-29 Thread elisabeth benoit
Hi all,

I'm using Solr 3.4 with a catchall field and an edismax request handler.
I'd like to score higher answers matching with words not contained in one
of the fields copied into my catchall field.

So my catchall field is called "catchall". It contains, let's say, fields
NAME, CATEGORY, TOWN, WAY and DESCRIPTION.

For one query, I would like to have answers matching NAME, CATEGORY, TOWN
and WAY scored higher, but I still want to search in DESCRIPTION.

I tried

qf=catchall DESCRIPTION^0.001,

but this doesn't seem to change the scoring. When I set debugQuery=on,
parsedquery_toString looks like

(text:paus | DESCRIPTION:pause^0.001) (this seems like an OR to me)

but I see no trace of DESCRIPTION in explain
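The `|` in that parsed query denotes a DisjunctionMaxQuery: per document, each field clause is scored and (with a tie-breaker of 0) only the maximum is kept, which is why a tiny boost on DESCRIPTION never shows up when the catchall field also matches. A small sketch of the idea (the numbers are illustrative, not real Solr scores):

```python
def dismax(field_scores, tie=0.0):
    """Score of a DisjunctionMaxQuery: best field plus tie * the rest."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

catchall_score = 2.0
description_score = 2.0 * 0.001  # same match, down-boosted by ^0.001

print(dismax([catchall_score, description_score]))           # 2.0
print(dismax([catchall_score, description_score], tie=0.1))  # 2.0002
```

With tie=0 the down-boosted DESCRIPTION clause can never lower (or raise) the score when catchall matches, which matches the observation that it leaves no trace in explain.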

One solution I guess would be to keep DESCRIPTION in a separate field, and
not include it in my catchall field. But I wonder if there is a solution
with the catchall field???

Thanks for your help,
Elisabeth


Re: catchall field minus one field

2012-01-12 Thread elisabeth benoit
thanks a lot for your advice, I'll try that.

Best regards,
Elisabeth

2012/1/11 Erick Erickson erickerick...@gmail.com

 Hmmm, Once the data is included in the catch-all, it's indistinguishable
 from
 all the rest of the data, so I don't see how you could do this. A clause
 like:
 -excludeField:[* TO *] would exclude all documents that had any data in
 the field so that's probably not what you want.

 Could you approach it the other way? Do NOT put the special field in
 the catch-all field in the first place, but massage the input to add
 a clause there? I.e. your usual case would have
  catchall:"all your terms" exclude_field:"all your terms", but your
  special one would just be catchall:"all your terms".

 You could set up request handlers to do this under the covers, so your
 queries would really be
 ...solr/usual?q=all your terms
 ...solr/special?q=all your terms
 and two different request handlers (edismax-style I'm thinking)
 would differ only by the qf field containing or not containing
 your special field.
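The two edismax-style handlers Erick describes could differ only in their qf defaults; a sketch of what that might look like in solrconfig.xml (handler names are hypothetical, and "exclude_field" stands in for the special field):

```xml
<requestHandler name="/usual" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- searches the catch-all plus the special field -->
    <str name="qf">catchall exclude_field</str>
  </lst>
</requestHandler>

<requestHandler name="/special" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- same search, minus the special field -->
    <str name="qf">catchall</str>
  </lst>
</requestHandler>
```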

 the other way, of course, would be to have a second catch-all
 field that didn't have your special field, then use one or the other
 depending, but as you say that would increase the size of your
 index...

 Best
 Erick

 On Wed, Jan 11, 2012 at 9:47 AM, elisabeth benoit
 elisaelisael...@gmail.com wrote:
  Hello,
 
  I have a catchall field, and I need to do some request in all fields of
  that catchall field, minus one. To avoid duplicating my index, I'd like
 to
  know if there is a way to use my catch field while excluding that one
 field.
 
  Thanks,
  Elisabeth



catchall field minus one field

2012-01-11 Thread elisabeth benoit
Hello,

I have a catchall field, and I need to do some request in all fields of
that catchall field, minus one. To avoid duplicating my index, I'd like to
know if there is a way to use my catch field while excluding that one field.

Thanks,
Elisabeth


Re: Solr 3.4 problem with words separated by comma without space

2011-12-12 Thread elisabeth benoit
Thanks for the answer.

yes, in fact when I look at the debugQuery output, I notice that "name" and
"number" are never treated as single entries.

I have

(((text:name text:number)) (text:ru) (text:tain) (text:paris))

so "name" and "number" are in the same parentheses, but not exactly treated
as a phrase, as far as I know, since a phrase would be more like text:"name
number".

could you tell me what is the difference between (text:name text:number)
and text:"name number"?

I'll check autoGeneratePhraseQueries.

Best regards,
Elisabeth




2011/12/8 Chris Hostetter hossman_luc...@fucit.org


 : If I check in the solr.admin.analyzer, I get the same analysis for the
 two
 : different requests. But it seems, in fact, that the lacking space after
 : the comma prevents "name" and "number" from matching.

 query analysis is only part of the picture ... Did you look at the
 debugQuery output? ...  I believe you are seeing the effects of the
 QueryParser analyzing "name," distinctly from "number" in one case, vs
 analyzing the entire string "name,number" in the second case, and treating
 the latter as a phrase query (because one input clause produces multiple
 tokens)

 there is a recently added autoGeneratePhraseQueries option that affects
 this.


 -Hoss
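A toy simulation of the effect Hoss describes (this is not Solr's actual parser, just the shape of the behaviour):

```python
def parse(query, stopwords=()):
    """The query parser splits on whitespace first; each clause is then
    analyzed, and a clause yielding several tokens becomes a phrase query."""
    parsed = []
    for clause in query.split():
        tokens = [t for t in clause.replace(",", " ").split()
                  if t not in stopwords]
        if len(tokens) > 1:
            parsed.append('"%s"' % " ".join(tokens))  # phrase query
        else:
            parsed.extend(tokens)
    return parsed

print(parse("name, number rue"))  # ['name', 'number', 'rue'] -- two clauses
print(parse("name,number rue"))   # ['"name number"', 'rue'] -- one clause, phrase
```

This is where autoGeneratePhraseQueries matters: when disabled, the multi-token clause would be emitted as separate optional terms instead of a phrase.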



Solr 3.4 problem with words separated by comma without space

2011-12-08 Thread elisabeth benoit
Hello,

I'm using Solr 3.4, and I'm having a problem with a request returning
different results depending on whether or not there is a space after a comma.

The request "name, number rue taine paris" returns results with 4 words out
of 5 matching (name, number, rue, paris).

The request "name,number rue taine paris" (no space between the comma and
"number") returns no results, unless I set mm=3, and then the matching words
are rue, taine, paris.

If I check in the solr.admin.analyzer, I get the same analysis for the two
different requests. But it seems, in fact, that the lacking space after the
comma prevents "name" and "number" from matching.


My field type is


  <analyzer type="query">
    <!-- standard tokenization -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- normalization of accents, cedillas, oe ligatures,... -->
    <charFilter class="solr.MappingCharFilterFactory"
mapping="mapping-ISOLatin1Accent.txt"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- removal of dots (I.B.M. => IBM) -->
    <filter class="solr.StandardFilterFactory"/>
    <!-- lowercasing -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- punctuation removal -->
    <filter class="solr.PatternReplaceFilterFactory"
pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
    <!-- removal of empty tokens and oversized words -->
    <filter class="solr.LengthFilterFactory" min="1" max="100"/>
    <!-- splitting of compound words -->
    <filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1"
generateWordParts="1"
generateNumberParts="1" catenateWords="0" catenateNumbers="1"
catenateAll="0" preserveOriginal="1"/>
    <!-- removal of elisions (l', qu',...) -->
    <filter class="solr.ElisionFilterFactory"
articles="elisionwords.txt"/>
    <!-- removal of insignificant words -->
    <filter class="solr.StopFilterFactory" ignoreCase="1"
words="stopwords.txt" enablePositionIncrements="true"/>
    <!-- lemmatization (plurals,...) -->
    <filter class="solr.SnowballPorterFilterFactory" language="French"
protected="protwords.txt"/>
    <!-- removal of possible duplicates -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>

Does anyone have a clue?

Thanks,
Elisabeth


Re: Solr 3.4 problem with words separated by comma without space

2011-12-08 Thread elisabeth benoit
same problem with Solr 4.0

2011/12/8 elisabeth benoit elisaelisael...@gmail.com






Re: Solr cache size information

2011-12-04 Thread elisabeth benoit
Thanks a lot for these answers!

Elisabeth

2011/12/4 Erick Erickson erickerick...@gmail.com

 See below:

 On Thu, Dec 1, 2011 at 10:57 AM, elisabeth benoit
 elisaelisael...@gmail.com wrote:
  Hello,
 
  If anybody can help, I'd like to confirm a few things about Solr's caches
  configuration.
 
  If I want to calculate cache size in memory relative to the cache size in
  solrconfig.xml
 
  For Document cache
 
  size in memory = size in solrconfig.xml * average size of all fields
  defined in fl parameter   ???

 pretty much.

 
  For Filter cache
 
  size in memory = size in solrconfig.xml * WHAT (the size of an id) ??? (I
  don't use facet.enum method)
 

 It Depends(tm). Solr tries to do the best thing here, depending upon
 how many docs match the filter query. One method puts in a bitset for
 each
 entry, which is (maxDocs/8) bytes. maxDocs is reported on the admin/stats
 page.

 If the filter cache only hits a few documents, the size is smaller than
 that.

 You can think of this cache as a map where the key is the
 filter query (which is how they're re-used and how autowarm
 works) and the value for each key is the bitset or list. The
 size of the map is bounded by the size in solrconfig.xml.

  For Query result cache
 
  size in memory = size in solrconfig.xml * the size of an id ???
 
 Pretty much. This is the maximum size, but each entry is
 the query plus a list of IDs that's up to queryResultWindowSize
 long. This cache is, by and large, the least of your worries.
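
 [Editor's note: the rules of thumb above can be turned into a rough
 estimate. A sketch only: the 4-byte doc-id size and the worst-case
 bitset-per-entry assumption are illustrative, and real heap use adds
 per-entry overhead for keys and map machinery.]

 ```python
 # Back-of-the-envelope Solr cache sizing, following the rules of thumb
 # above. All numbers are illustrative assumptions, not Solr APIs.

 def filter_cache_bytes(max_docs: int, num_entries: int) -> int:
     """Worst case: each filterCache entry is a bitset of maxDocs/8 bytes."""
     return num_entries * (max_docs // 8)

 def query_result_cache_bytes(num_entries: int, window_size: int,
                              id_bytes: int = 4) -> int:
     """Upper bound: each entry holds up to queryResultWindowSize doc ids."""
     return num_entries * window_size * id_bytes

 # Example: 10M docs, filterCache size=512, queryResultCache size=512,
 # queryResultWindowSize=20.
 print(filter_cache_bytes(10_000_000, 512))   # 640000000 bytes (~640 MB)
 print(query_result_cache_bytes(512, 20))     # 40960 bytes
 ```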


 
  I would also like to know relation between solr's caches sizes and JVM
 max
  size?

 Don't quite know what you're asking for here. There's nothing automatic
 that's sensitive to whether the JVM memory limits are about to be exceeded.
 If the caches get too big, OOMs happen.

 
  If anyone has an answer or a link for further reading to suggest, it
 would
  be greatly appreciated.
 
 There's some information here: http://wiki.apache.org/solr/SolrCaching,
 but
  it often comes down to trying your app and monitoring.

 Here's a work-in-progress that Grant is working on, be aware that it's
 for trunk, not 3x.
 http://java.dzone.com/news/estimating-memory-and-storage


 Best
 Erick

  Thanks,
  Elisabeth



Solr cache size information

2011-12-01 Thread elisabeth benoit
Hello,

If anybody can help, I'd like to confirm a few things about Solr's caches
configuration.

If I want to calculate cache size in memory relative to the cache size in
solrconfig.xml

For Document cache

size in memory = size in solrconfig.xml * average size of all fields
defined in fl parameter   ???

For Filter cache

size in memory = size in solrconfig.xml * WHAT (the size of an id) ??? (I
don't use facet.enum method)

For Query result cache

size in memory = size in solrconfig.xml * the size of an id ???


I would also like to know relation between solr's caches sizes and JVM max
size?

If anyone has an answer or a link for further reading to suggest, it would
be greatly appreciated.

Thanks,
Elisabeth
