Re: FilterCache size should reduce as index grows?

2017-10-05 Thread S G
So for large indexes, there is a chance that a filterCache of 128 entries can cause
bad GC.
And for smaller indexes, it would not matter much because the index is small
and probably all of it is in the OS cache anyway.
So perhaps a default of 64 would be a much saner choice to get the best of
both worlds?
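
For reference, a RAM-bounded filterCache in solrconfig.xml would look roughly
like this (just a sketch; per SOLR-7372 the RAM limit lives on LRUCache and is
spelled maxRamMB in the ref guide, so check the exact attribute name for your
version):

    <filterCache class="solr.LRUCache"
                 size="128"
                 initialSize="128"
                 autowarmCount="0"
                 maxRamMB="64"/>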

On Thu, Oct 5, 2017 at 7:23 AM, Yonik Seeley  wrote:

> On Thu, Oct 5, 2017 at 3:20 AM, Toke Eskildsen  wrote:
> > On Wed, 2017-10-04 at 21:42 -0700, S G wrote:
> >
> > It seems that the memory limit option maxSizeMB was added in Solr 5.2:
> > https://issues.apache.org/jira/browse/SOLR-7372
> > I am not sure if it works with all caches in Solr, but in my world it
> > is way better to define the caches by memory instead of count.
>
> Yes, that will work with the filterCache, but one needs to change the
> cache type as well (maxSizeMB is only an option on LRUCache, and
> filterCache uses FastLRUCache in the default solrconfig.xml)
>
> -Yonik
>


solr and machine learning - recommendations?

2017-10-05 Thread Phil Scadden
Now that I have a big chunk of documents indexed with Solr, I am looking to 
see whether I can try some machine learning tools to extract 
bibliographic references out of the documents. Does anyone have recommendations 
about which kits might be good to play with for something like this?


Re: mm is not working if you have same term multiple times in query

2017-10-05 Thread Chris Hostetter

: I'm using Solr 6.6.0 i have set mm as 100% but when i have the repeated
: search term then mm param is not honoured

: I have 2 docs in index
: Doc1-
: name=lock
: Doc 2-
: name=lock lock
: 
: Now when I'm querying Solr with the query
: 
: http://localhost:8983/solr/test2/select?defType=dismax&qf=name&debugQuery=on&mm=100%25&q=lock%20lock&wt=json

: then it is returning both results but it should return only Doc 2 as no of
: frequency is 2 in query while doc1 has frequency of 1 (lock term frequency).

There's a couple of misconceptions here...

first off: "mm" is a property of the "BooleanQuery" object that contains 
multiple SHOULD clauses -- it has nothign to do with the "frequency" of 
any clause/term -- if your BooleanQuery contains 2 SHOULD clauses, then 
the mm=2 will require that both clauses match.  If the 2 clauses are 
*identical* then BooleanQuery will actally optimize away one instance, and 
reduce the mm=1

second: even if BooleanQuery didn't have that optimization -- which was 
the case until ~6.x -- then your original query would *still* match Doc#1, 
because each clause (aka sub-query) would be evaluated independently.  The 
BooleanQuery would ask clause #1 "do you match doc#1?" and it would say 
"yes" -- then the BooleanQuery would ask clause #2 "do you match doc#1?" 
and it would also say "yes", and so the BooleanQuery would say "I've 
reached the minimum number of SHOULD clauses I was configured to require 
for a match, so doc#1 is a match."


If you have a special-case situation of wanting to require that a term 
occurs at least X times -- the only way I can think of off the top of my 
head to do that would be using the termfreq() function.  

something like...

q={!frange l=}termfreq(text,'lock')

https://lucene.apache.org/solr/guide/function-queries.html#termfreq-function
https://lucene.apache.org/solr/guide/other-parsers.html#function-range-query-parser
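
For instance, a sketch requiring at least two occurrences of "lock" (assuming 
the field from your example docs is "name"; the frange lower bound "l" is the 
minimum term frequency you want to require):

   q={!frange l=2}termfreq(name,'lock')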


But I caution that while this might work in the specific example you gave, 
it's not really a drop-in replacement for how you _thought_ mm should 
work, so a lot of things you might be trying to do with dismax+mm aren't 
going to have any sort of corollary here.

In general I'm curious about your broader-picture goal, and whether there isn't 
some better solution...


https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341




-Hoss
http://www.lucidworks.com/


Re: Solr not preserving milliseconds precision for zero milliseconds

2017-10-05 Thread Chris Hostetter
: > "startTime":"2013-02-10T18:36:07.000Z"
...
: handler. It gets added successfully but when I retrieve this document back
: using "id" I get following.
...
: > "startTime":"2013-02-10T18:36:07Z",
...
: As you can see, the milliseconds precision in date field "startTime" is
: lost. Precision is preserved for non-zero milliseconds but it's being lost
: for zero values. The field type of "startTime" field is as follows.
...
: Does anyone know how I can preserve milliseconds even if its zero? Or is it
: not possible at all?

ms precision is being preserved -- but as you mentioned, the fractional 
seconds you indexed are "0", so they are omitted when writing the 
response; dropping trailing zeros loses no precision.  

This is the canonical formatting specified by the date/time format that 
Solr follows...

https://lucene.apache.org/solr/guide/working-with-dates.html
https://www.w3.org/TR/xmlschema-2/#dateTime

>>> 3.2.7.2 Canonical representation
>>> ...
>>> The fractional second string, if present, must not end in '0';
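
If a client really needs the trailing ".000", it can re-format the value after 
parsing; a minimal sketch with plain java.time (nothing Solr-specific, class 
name made up):

   import java.time.Instant;
   import java.time.ZoneOffset;
   import java.time.format.DateTimeFormatter;

   public class DateFormatDemo {
     public static void main(String[] args) {
       // value exactly as Solr returns it (canonical form, no trailing zeros)
       Instant start = Instant.parse("2013-02-10T18:36:07Z");
       // force three fractional digits when rendering on the client side
       DateTimeFormatter millis =
           DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
                            .withZone(ZoneOffset.UTC);
       System.out.println(millis.format(start)); // 2013-02-10T18:36:07.000Z
     }
   }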



-Hoss
http://www.lucidworks.com/


Re: Solr test runs: test skipping logic

2017-10-05 Thread Chris Hostetter

: I am seeing that in different test runs (e.g., by executing 'ant test' on
: the root folder in 'lucene-solr') a different subset of tests are skipped.
: Where can I find more about it? I am trying to create parity between test
: successes before and after my changes and this is causing  confusion.

The test randomization logic creates an arbitrary "master seed" that is 
assigned by ant.  This master seed is 
then used to generate some randomized default properties for the 
forked JVMs (default timezone, default Locale, default charset, etc...)

Each test class run in a forked JVM then gets its own Random seed 
(generated from the master seed as well) which the Solr test-framework 
uses to randomize some more things (that are specific to the Solr 
test-framework).

In some cases, tests have @Assume or assumeThat(...) logic if we know 
that certain tests are completely incompatible with certain randomized 
aspects of the environment -- for example: some tests won't bother to run 
if the randomized Locale is "tr" because of external third-party 
dependencies that break with this Locale (due to uppercase/lowercase 
behavior).

This is most likely the reason you are seeing a different "set" of tests run 
on different runs.  But if you want true parity between test runs, use the 
same master seed -- which is printed at the beginning of every "ant 
test" run, as well as any time a test fails, and can be overridden on the 
ant command line for future runs.
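
For example (a sketch, using the "tests.seed" property that the randomized 
test runner understands):

   ant test -Dtests.seed=DEADBEEF

re-runs the suite with that master seed instead of generating a new one.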

run "ant test-help" for the specifics.


-Hoss
http://www.lucidworks.com/


Solr not preserving milliseconds precision for zero milliseconds

2017-10-05 Thread Pratik Patel
Hello Everyone,

Say I have a document like one below.


> {
> "id":"test",
> "startTime":"2013-02-10T18:36:07.000Z"
> }


I add this document to solr index using the admin UI and "update" request
handler. It gets added successfully but when I retrieve this document back
using "id" I get following.


 {
> "id":"test",
> "startTime":"2013-02-10T18:36:07Z",
> "_version_":1580456021738913792}]
>   }


As you can see, the milliseconds precision in date field "startTime" is
lost. Precision is preserved for non-zero milliseconds but it's being lost
for zero values. The field type of "startTime" field is as follows.

 docValues="true" precisionStep="0"/>


Does anyone know how I can preserve milliseconds even if its zero? Or is it
not possible at all?

Thanks,
Pratik


Re: Rescoring from 0 - full

2017-10-05 Thread Dariusz Wojtas
Hi,
Your answers have helped me a lot.
I've managed to use the LTRQParserPlugin and it does what I need: full
control over scoring with its re-ranking functionality.
I define my custom features and can pass custom params to them using the
"efi.*" syntax.
Is there something similar for defining weights in the model that uses these
features?
Can I have a single model, but pass feature weights in each request?
How do I pass my custom weights with each request in the example below?

{
  "store" : "myFeaturesStore",
  "name" : "myModel",
  "class" : "org.apache.solr.ltr.model.LinearModel",
  "features" : [
{ "name" : "scorePersonalId" },
{ "name" : "originalScore" }
  ],
  "params" : {
"weights" : {
  "scorePersonalId" : 0.9,
  "originalScore" : 0.1
}
  }
}
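
For reference, the efi.* syntax I mean looks roughly like this (a sketch; 
"personalId" is just a made-up external feature value, not part of the model 
above):

   q=...&rq={!ltr model=myModel reRankDocs=100 efi.personalId=12345}&fl=id,score

and a feature definition can then reference the value as ${personalId} in its 
params.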

I am using SOLR 6.6, soon switching to 7.0

Best regards,
Dariusz Wojtas


On Thu, Sep 21, 2017 at 5:18 PM, Erick Erickson 
wrote:

> Sure, you can take full control of the scoring, just write a custom
> similarity.
>
> What's not at all clear is why you want to. RerankQParserPlugin will
> re-rank the to N documents by pushing them through a different query,
> can you make that work?
>
> Best,
> Erick
>
>
>
> On Thu, Sep 21, 2017 at 4:20 AM, Diego Ceccarelli (BLOOMBERG/ LONDON)
>  wrote:
> > Hi Dariusz,
> > If you use *:* you'll rerank only the top N random documents, as Emir
> said, that will not produce interesting results probably.
> > If you want to replace the original score, you can take a look at the
> learning to rank module [1], that would allow you to reassign a
> > new score to the top N documents returned by your query and then reorder
> them based on that (ignoring the original score, if you want).
> >
> > Cheers,
> > Diego
> >
> > [1] https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank
> >
> > From: solr-user@lucene.apache.org At: 09/21/17 08:49:13
> > To: solr-user@lucene.apache.org
> > Subject: Re: Rescoring from 0 - full
> >
> > Hi Dariusz,
> > You could use fq for filtering (can disable caching to avoid polluting
> filter cache) and q=*:*. That way you’ll get score=1 for all doc and can
> rerank. The issue with this approach is that you rerank top N and without
> score they wouldn’t be ordered so it is no-go.
> > What you could do (did not try) in rescoring divide by score (not sure
> if can access calculated but could calculate) to eliminate score.
> >
> > HTH,
> > Emir
> >
> >> On 20 Sep 2017, at 21:38, Dariusz Wojtas  wrote:
> >>
> >> Hi,
> >> When I use boosting fuctionality, it is always about adding or
> >> multiplicating the score calculated in the 'q' param.
> >> I mau use function queries inside 'q', but this may hit performance on
> >> calling multiple nested functions.
> >> I thaught that 'rerank' could help, but it is still about changing the
> >> original score, not full calculation.
> >>
> >> How can take full control on score in rerank? Is it possible?
> >>
> >> Best regards,
> >> Dariusz Wojtas
> >
> >
>


Re: Recommendations for number of open files?

2017-10-05 Thread Webster Homer
It seems that there was a networking error just prior to the creation of
the 0 length files:
The files from Sep 27 are all written at 17:56.
There was minor packet loss (1 out of 10 packets per 60 second interval)
just prior to that time.

On Thu, Oct 5, 2017 at 3:11 PM, Webster Homer 
wrote:

> buffering is disabled. Indeed we disable it everywhere as all it seems to
> do is leave tlogs around forever.
>
> Autocommit is set to 60 seconds.
>
> The source cdcr request handler looks like this. The first target is the
> problematic one
>
> {"requestHandler":{"/cdcr":{
>   "name":"/cdcr",
>   "class":"solr.CdcrRequestHandler",
>   "replica":[
> {
>   
> "zkHost":"ae1a-ecomqa-mzk01:2181,ae1a-ecomqa-mzk02:2181,ae1a-ecomqa-mzk03:2181/solr",
>   "source":"sial-content-citations",
>   "target":"sial-content-citations"},
> {
>   
> "zkHost":"uc1b-ecomqa-mzk01:2181,uc1b-ecomqa-mzk02:2181,uc1b-ecomqa-mzk03:2181/solr",
>   "source":"sial-content-citations",
>   "target":"sial-content-citations"}],
>   "replicator":{
> "threadPoolSize":2,
> "schedule":1000,
> "batchSize":250},
>   "updateLogSynchronizer":{"schedule":6
>
>
>
> The target looks like:
>
> "requestHandler":{"/cdcr":{
>   "name":"/cdcr",
>   "class":"solr.CdcrRequestHandler",
>   "buffer":{"defaultState":"disabled"}}
>
>
> These are all in our QA environment
>
>
> On Thu, Oct 5, 2017 at 2:43 PM, Erick Erickson 
> wrote:
>
>> OK, never mind about the file handle limits, let's deal with the
>> tlogs. Although unlimited is a good thing.
>>
>> Do you have buffering disabled on the target cluster?
>>
>> Best
>> Erick
>>
>> On Thu, Oct 5, 2017 at 11:19 AM, Webster Homer 
>> wrote:
>> > I wouldn't call it massive. The index is ~9 million documents. So not
>> too
>> > big, the documents themselves are pretty small
>> >
>> > On Thu, Oct 5, 2017 at 12:23 PM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> Well, Lucene keeps an open file handle for _every_ file in _every_
>> >> index directory. So, for instance, let's say a replica has 10
>> >> segments. Each segment is 10-15 individual files. So that's 100-150
>> >> file handles right there. And indexes can have many segments.
>> >>
>> >> Check to see if "cfs" extensions are in your indexing directory,
>> >> that's "compound file system" and if present will reduce the number of
>> >> file handles needed.
>> >>
>> >> A second thing you might be able to do is increase the maximum segment
>> >> size by setting maxMergedSegmentMB in your solrconfig file for
>> >> TieredMergePolicy, something like
>> >> 1
>> >> eventually that'll merge segments into fewer, but that'll take a while.
>> >>
>> >> As to your question, we usually recommend to set the file limit to
>> >> "unlimited". You do have to monitor it however, at some point there's
>> >> a lot of bookkeeping.
>> >>
>> >> one replica trying to open > 8,000 files seems very odd though. Is it
>> >> a massive index? The default max segment size is 5G, so you could have
>> >> a gazillion small segments in which case you might want to split that
>> >> shard up and move the sub-shards to some other machine.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Oct 5, 2017 at 10:02 AM, Webster Homer > >
>> >> wrote:
>> >> > We have begun to see errors around too many open files on one of our
>> >> > solrcloud nodes. One replica tries to open >8000 files. This replica
>> >> tries
>> >> > to startup and then fails the open files are exceeded upon startup
>> as it
>> >> > tries to recover.
>> >> >
>> >> >
>> >> > Our solrclouds have 12 distinct collections. I would think that the
>> >> number
>> >> > of open files would depend upon the number of collections as well as
>> >> > numbers of files per index etc...
>> >> >
>> >> > Our current setting is 8192 open files per process.
>> >> >
>> >> > What values are recommended? is there a normal number of open files?
>> >> >
>> >> > What would lead to there being lots of open files?
>> >> >

Re: Recommendations for number of open files?

2017-10-05 Thread Webster Homer
buffering is disabled. Indeed we disable it everywhere as all it seems to
do is leave tlogs around forever.

Autocommit is set to 60 seconds.

The source cdcr request handler looks like this. The first target is the
problematic one

{"requestHandler":{"/cdcr":{
  "name":"/cdcr",
  "class":"solr.CdcrRequestHandler",
  "replica":[
{
  
"zkHost":"ae1a-ecomqa-mzk01:2181,ae1a-ecomqa-mzk02:2181,ae1a-ecomqa-mzk03:2181/solr",
  "source":"sial-content-citations",
  "target":"sial-content-citations"},
{
  
"zkHost":"uc1b-ecomqa-mzk01:2181,uc1b-ecomqa-mzk02:2181,uc1b-ecomqa-mzk03:2181/solr",
  "source":"sial-content-citations",
  "target":"sial-content-citations"}],
  "replicator":{
"threadPoolSize":2,
"schedule":1000,
"batchSize":250},
  "updateLogSynchronizer":{"schedule":6



The target looks like:

"requestHandler":{"/cdcr":{
  "name":"/cdcr",
  "class":"solr.CdcrRequestHandler",
  "buffer":{"defaultState":"disabled"}}


These are all in our QA environment
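
For anyone following along: buffering on the target was disabled through the 
CDCR API, along the lines of the request below (host is a placeholder, the 
collection name is ours, the action names are from the CDCR API):

   http://<target-host>:8983/solr/sial-content-citations/cdcr?action=DISABLEBUFFER

and action=STATUS shows the current buffer/process state.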


On Thu, Oct 5, 2017 at 2:43 PM, Erick Erickson 
wrote:

> OK, never mind about the file handle limits, let's deal with the
> tlogs. Although unlimited is a good thing.
>
> Do you have buffering disabled on the target cluster?
>
> Best
> Erick
>
> On Thu, Oct 5, 2017 at 11:19 AM, Webster Homer 
> wrote:
> > I wouldn't call it massive. The index is ~9 million documents. So not too
> > big, the documents themselves are pretty small
> >
> > On Thu, Oct 5, 2017 at 12:23 PM, Erick Erickson  >
> > wrote:
> >
> >> Well, Lucene keeps an open file handle for _every_ file in _every_
> >> index directory. So, for instance, let's say a replica has 10
> >> segments. Each segment is 10-15 individual files. So that's 100-150
> >> file handles right there. And indexes can have many segments.
> >>
> >> Check to see if "cfs" extensions are in your indexing directory,
> >> that's "compound file system" and if present will reduce the number of
> >> file handles needed.
> >>
> >> A second thing you might be able to do is increase the maximum segment
> >> size by setting maxMergedSegmentMB in your solrconfig file for
> >> TieredMergePolicy, something like
> >> 1
> >> eventually that'll merge segments into fewer, but that'll take a while.
> >>
> >> As to your question, we usually recommend to set the file limit to
> >> "unlimited". You do have to monitor it however, at some point there's
> >> a lot of bookkeeping.
> >>
> >> one replica trying to open > 8,000 files seems very odd though. Is it
> >> a massive index? The default max segment size is 5G, so you could have
> >> a gazillion small segments in which case you might want to split that
> >> shard up and move the sub-shards to some other machine.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Oct 5, 2017 at 10:02 AM, Webster Homer 
> >> wrote:
> >> > We have begun to see errors around too many open files on one of our
> >> > solrcloud nodes. One replica tries to open >8000 files. This replica
> >> tries
> >> > to startup and then fails the open files are exceeded upon startup as
> it
> >> > tries to recover.
> >> >
> >> >
> >> > Our solrclouds have 12 distinct collections. I would think that the
> >> number
> >> > of open files would depend upon the number of collections as well as
> >> > numbers of files per index etc...
> >> >
> >> > Our current setting is 8192 open files per process.
> >> >
> >> > What values are recommended? is there a normal number of open files?
> >> >
> >> > What would lead to there being lots of open files?
> >> >

Re: Recommendations for number of open files?

2017-10-05 Thread Erick Erickson
OK, never mind about the file handle limits, let's deal with the
tlogs. Although unlimited is a good thing.

Do you have buffering disabled on the target cluster?

Best
Erick

On Thu, Oct 5, 2017 at 11:19 AM, Webster Homer  wrote:
> I wouldn't call it massive. The index is ~9 million documents. So not too
> big, the documents themselves are pretty small
>
> On Thu, Oct 5, 2017 at 12:23 PM, Erick Erickson 
> wrote:
>
>> Well, Lucene keeps an open file handle for _every_ file in _every_
>> index directory. So, for instance, let's say a replica has 10
>> segments. Each segment is 10-15 individual files. So that's 100-150
>> file handles right there. And indexes can have many segments.
>>
>> Check to see if "cfs" extensions are in your indexing directory,
>> that's "compound file system" and if present will reduce the number of
>> file handles needed.
>>
>> A second thing you might be able to do is increase the maximum segment
>> size by setting maxMergedSegmentMB in your solrconfig file for
>> TieredMergePolicy, something like
>> 1
>> eventually that'll merge segments into fewer, but that'll take a while.
>>
>> As to your question, we usually recommend to set the file limit to
>> "unlimited". You do have to monitor it however, at some point there's
>> a lot of bookkeeping.
>>
>> one replica trying to open > 8,000 files seems very odd though. Is it
>> a massive index? The default max segment size is 5G, so you could have
>> a gazillion small segments in which case you might want to split that
>> shard up and move the sub-shards to some other machine.
>>
>> Best,
>> Erick
>>
>> On Thu, Oct 5, 2017 at 10:02 AM, Webster Homer 
>> wrote:
>> > We have begun to see errors around too many open files on one of our
>> > solrcloud nodes. One replica tries to open >8000 files. This replica
>> tries
>> > to startup and then fails the open files are exceeded upon startup as it
>> > tries to recover.
>> >
>> >
>> > Our solrclouds have 12 distinct collections. I would think that the
>> number
>> > of open files would depend upon the number of collections as well as
>> > numbers of files per index etc...
>> >
>> > Our current setting is 8192 open files per process.
>> >
>> > What values are recommended? is there a normal number of open files?
>> >
>> > What would lead to there being lots of open files?
>> >


Re: Recommendations for number of open files?

2017-10-05 Thread Webster Homer
Interestingly, many of these tlog files (5428 out of 8007) have 0
length!? What would cause that? As I stated, this is a cdcr target
collection.

On Thu, Oct 5, 2017 at 1:19 PM, Webster Homer 
wrote:

> I wouldn't call it massive. The index is ~9 million documents. So not too
> big, the documents themselves are pretty small
>
> On Thu, Oct 5, 2017 at 12:23 PM, Erick Erickson 
> wrote:
>
>> Well, Lucene keeps an open file handle for _every_ file in _every_
>> index directory. So, for instance, let's say a replica has 10
>> segments. Each segment is 10-15 individual files. So that's 100-150
>> file handles right there. And indexes can have many segments.
>>
>> Check to see if "cfs" extensions are in your indexing directory,
>> that's "compound file system" and if present will reduce the number of
>> file handles needed.
>>
>> A second thing you might be able to do is increase the maximum segment
>> size by setting maxMergedSegmentMB in your solrconfig file for
>> TieredMergePolicy, something like
>> 1
>> eventually that'll merge segments into fewer, but that'll take a while.
>>
>> As to your question, we usually recommend to set the file limit to
>> "unlimited". You do have to monitor it however, at some point there's
>> a lot of bookkeeping.
>>
>> one replica trying to open > 8,000 files seems very odd though. Is it
>> a massive index? The default max segment size is 5G, so you could have
>> a gazillion small segments in which case you might want to split that
>> shard up and move the sub-shards to some other machine.
>>
>> Best,
>> Erick
>>
>> On Thu, Oct 5, 2017 at 10:02 AM, Webster Homer 
>> wrote:
>> > We have begun to see errors around too many open files on one of our
>> > solrcloud nodes. One replica tries to open >8000 files. This replica
>> tries
>> > to startup and then fails the open files are exceeded upon startup as it
>> > tries to recover.
>> >
>> >
>> > Our solrclouds have 12 distinct collections. I would think that the
>> number
>> > of open files would depend upon the number of collections as well as
>> > numbers of files per index etc...
>> >
>> > Our current setting is 8192 open files per process.
>> >
>> > What values are recommended? is there a normal number of open files?
>> >
>> > What would lead to there being lots of open files?
>> >
>> > --
>> >
>> >
>> > This message and any attachment are confidential and may be privileged
>> or
>> > otherwise protected from disclosure. If you are not the intended
>> recipient,
>> > you must not copy this message or attachment or disclose the contents to
>> > any other person. If you have received this transmission in error,
>> please
>> > notify the sender immediately and delete the message and any attachment
>> > from your system. Merck KGaA, Darmstadt, Germany and any of its
>> > subsidiaries do not accept liability for any omissions or errors in this
>> > message which may arise as a result of E-Mail-transmission or for
>> damages
>> > resulting from any unauthorized changes of the content of this message
>> and
>> > any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its
>> > subsidiaries do not guarantee that this message is free of viruses and
>> does
>> > not accept liability for any damages caused by any virus transmitted
>> > therewith.
>> >
>> > Click http://www.emdgroup.com/disclaimer to access the German, French,
>> > Spanish and Portuguese versions of this disclaimer.
>>
>
>

-- 


This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, 
you must not copy this message or attachment or disclose the contents to 
any other person. If you have received this transmission in error, please 
notify the sender immediately and delete the message and any attachment 
from your system. Merck KGaA, Darmstadt, Germany and any of its 
subsidiaries do not accept liability for any omissions or errors in this 
message which may arise as a result of E-Mail-transmission or for damages 
resulting from any unauthorized changes of the content of this message and 
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its 
subsidiaries do not guarantee that this message is free of viruses and does 
not accept liability for any damages caused by any virus transmitted 
therewith.

Click http://www.emdgroup.com/disclaimer to access the German, French, 
Spanish and Portuguese versions of this disclaimer.


RE: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Allison, Timothy B.
After some more digging, I'm wrong even at the Lucene level.

When I use the CustomAnalyzer and make my UC vowel mock filter MultitermAware, 
I get this with Lucene in trunk:

"the* quick~" name:thE* name:qUIck~2 name:thE name:qUIck

So, there's room for improvement with phrases, but the regular multiterms 
should be ok.

Still no answer for you...

2017-10-05 14:34 GMT+02:00 Allison, Timothy B. :

> There's every chance that I'm missing something at the Solr level, but 
> it _looks_ at the Lucene level, like ComplexPhraseQueryParser is still 
> not applying analysis to multiterms.
>
> When I call this on 7.0.0:
>QueryParser qp = new ComplexPhraseQueryParser(defaultFieldName,
> analyzer);
> return qp.parse(qString);
>
>  where the analyzer is a mock "uppercase vowel" analyzer[1] and the 
> qString is;
>
> "the* quick~" the* quick~ the quick
>
> I get this:
> "the* quick~" name:the* name:quick~2 name:thE name:qUIck



Re: Question regarding Upgrading to SolrCloud

2017-10-05 Thread Cassandra Targett
The 7.0 Ref Guide was released Monday.

An overview of the new replica types is available online here:
https://lucene.apache.org/solr/guide/7_0/shards-and-indexing-data-in-solrcloud.html#types-of-replicas.
The replica type is specified when you either create the collection or
add a replica.
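
For example, a rough sketch of the Collections API parameters involved 
(collection and shard names are illustrative):

   /admin/collections?action=CREATE&name=mycoll&numShards=2&nrtReplicas=1&tlogReplicas=1&pullReplicas=1

or, when adding one later:

   /admin/collections?action=ADDREPLICA&collection=mycoll&shard=shard1&type=pull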

On Thu, Oct 5, 2017 at 9:01 AM, Erick Erickson  wrote:
> Gopesh:
>
> There is brand new functionality in Solr 7, see: SOLR-10233, the
> "PULL" replica type which is a hybrid SolrCloud replica that uses
> master/slave type replication. You should find this in the reference
> guide, the 7.0 ref guide should be published soon. Meanwhile, that
> JIRA will let you know. Also see .../solr/CHANGES.txt. As Emir says,
> though, it would require ZooKeeper.
>
> Really, though, once you move to SolrCloud (if you do) I'd stick with
> the standard NRT replica type unless I had reason to use one of the
> other two (TLOG and PULL), as they're for pretty special situations.
>
> All that said, if you're happy with master/slave there's no compelling
> reason to go to SolrCloud, especially for smaller installations.
>
> Best,
> Erick
>
> On Wed, Oct 4, 2017 at 11:46 PM, Gopesh Sharma
>  wrote:
>> Hello Guys,
>>
>> As of now we are running Solr 3.4 with a Master/Slave configuration. We are 
>> planning to upgrade it to the latest version (6.6 or 7). Questions I have 
>> before upgrading:
>>
>>
>>   1.  Since we do not have a lot of data, is it required to move to 
>> SolrCloud, or can we continue using Master/Slave?
>>   2.  Will support for Master/Slave still be there in future releases, or 
>> do you plan to remove it?
>>   3.  Can we configure master-slave replication in SolrCloud? If yes, 
>> do we need ZooKeeper as well?
>>
>> Thanks,
>> Gopesh Sharma


Re: Recommendations for number of open files?

2017-10-05 Thread Webster Homer
I wouldn't call it massive. The index is ~9 million documents. So not too
big, the documents themselves are pretty small

On Thu, Oct 5, 2017 at 12:23 PM, Erick Erickson 
wrote:

> Well, Lucene keeps an open file handle for _every_ file in _every_
> index directory. So, for instance, let's say a replica has 10
> segments. Each segment is 10-15 individual files. So that's 100-150
> file handles right there. And indexes can have many segments.
>
> Check to see if "cfs" extensions are in your indexing directory,
> that's "compound file system" and if present will reduce the number of
> file handles needed.
>
> A second thing you might be able to do is increase the maximum segment
> size by setting maxMergedSegmentMB in your solrconfig file for
> TieredMergePolicy, something like
> 1
> eventually that'll merge segments into fewer, but that'll take a while.
>
> As to your question, we usually recommend to set the file limit to
> "unlimited". You do have to monitor it however, at some point there's
> a lot of bookkeeping.
>
> one replica trying to open > 8,000 files seems very odd though. Is it
> a massive index? The default max segment size is 5G, so you could have
> a gazillion small segments in which case you might want to split that
> shard up and move the sub-shards to some other machine.
>
> Best,
> Erick
>
> On Thu, Oct 5, 2017 at 10:02 AM, Webster Homer 
> wrote:
> > We have begun to see errors around too many open files on one of our
> > solrcloud nodes. One replica tries to open >8000 files. This replica
> tries
> > to startup and then fails the open files are exceeded upon startup as it
> > tries to recover.
> >
> >
> > Our solrclouds have 12 distinct collections. I would think that the
> number
> > of open files would depend upon the number of collections as well as
> > numbers of files per index etc...
> >
> > Our current setting is 8192 open files per process.
> >
> > What values are recommended? is there a normal number of open files?
> >
> > What would lead to there being lots of open files?
> >


Re: Recommendations for number of open files?

2017-10-05 Thread Webster Homer
The issue is on one of our QA collections, which means I don't have direct
access to the systems; I have to go through the admins.

It does have ".cfs" files in the index.

However, it turns out that the replica in question has 8007 tlog files.
This solrcloud is a target cloud for cdcr.
The replica dies during recovery, I guess it tries to read all those files
to apply them?

How does a cdcr target know when it can delete a tlog? The source
collection has 83 tlog files.

Just to be clear, you suggest a per process open file limit of unlimited?

Thanks

On Thu, Oct 5, 2017 at 12:23 PM, Erick Erickson 
wrote:

> Well, Lucene keeps an open file handle for _every_ file in _every_
> index directory. So, for instance, let's say a replica has 10
> segments. Each segment is 10-15 individual files. So that's 100-150
> file handles right there. And indexes can have many segments.
>
> Check to see if "cfs" extensions are in your indexing directory,
> that's "compound file system" and if present will reduce the number of
> file handles needed.
>
> A second thing you might be able to do is increase the maximum segment
> size by setting maxMergedSegmentMB in your solrconfig file for
> TieredMergePolicy, something like
> 1
> eventually that'll merge segments into fewer, but that'll take a while.
>
> As to your question, we usually recommend to set the file limit to
> "unlimited". You do have to monitor it however, at some point there's
> a lot of bookkeeping.
>
> one replica trying to open > 8,000 files seems very odd though. Is it
> a massive index? The default max segment size is 5G, so you could have
> a gazillion small segments in which case you might want to split that
> shard up and move the sub-shards to some other machine.
>
> Best,
> Erick
>
> On Thu, Oct 5, 2017 at 10:02 AM, Webster Homer 
> wrote:
> > We have begun to see errors around too many open files on one of our
> > solrcloud nodes. One replica tries to open >8000 files. This replica
> tries
> > to startup and then fails the open files are exceeded upon startup as it
> > tries to recover.
> >
> >
> > Our solrclouds have 12 distinct collections. I would think that the
> number
> > of open files would depend upon the number of collections as well as
> > numbers of files per index etc...
> >
> > Our current setting is 8192 open files per process.
> >
> > What values are recommended? is there a normal number of open files?
> >
> > What would lead to there being lots of open files?
> >


Re: Jenkins setup for continuous build

2017-10-05 Thread Chris Hostetter

: I have some custom code in solr (which is not of good quality for
: contributing back) so I need to setup my own continuous build solution. I
: tried jenkins and was hoping that ant build (ant clean compile) in Execute
: Shell textbox will work, but I am stuck at this ivy-fail error:
: 
: To work around it, I also added another step in the 'Execute Shell' (ant
: ivy-bootstrap), which succeeds but 'ant clean compile' still fails with the
: following error. I guess that I am not alone in doing this so there should
: be some standard work around for this.

The ivy bootstrapping is really designed to be for developers to set up 
their ~/.ant/lib directory -- IIRC most of the jenkins build servers out 
there don't use it as part of their job; they instead install Ivy once 
when setting up the jenkins server (in the home dir of the jenkins user)

I suspect the error you are running into may have to do with directory 
permissions on your jenkins server not letting the job write to the 
jenkins home dir, or some other path/permissions incompatibility.

You could consider following the instructions in the ivy-fail warning to 
have ivy-bootstrap put the ivy jar files in a custom path inside the 
workspace of your job, and then use "-lib" to point at that directory when 
running solr tests.

Alternatively, my preference for setting up jenkins jobs these days is to 
use docker, and let all the per-job activity (including the git checkout of 
lucene and the ivy bootstrapping) happen inside the docker container.

For example: this is a set of scripts/configs I use for an "ondemand" 
jenkins job I have, that lets me check out arbitrary branches/commits of 
lucene-solr, apply arbitrary patches, and then run arbitrary build 
commands (ie: ant test) using arbitrary JDK versions -- all configured at 
build time with build params...

https://github.com/hossman/solr-jenkins-docker-tester


: 
: ivy-fail:
:  [echo]
:  [echo]  This build requires Ivy and Ivy could not be found in
: your ant classpath.
:  [echo]
:  [echo]  (Due to classpath issues and the recursive nature of
: the Lucene/Solr
:  [echo]  build system, a local copy of Ivy can not be used an
: loaded dynamically
:  [echo]  by the build.xml)
:  [echo]
:  [echo]  You can either manually install a copy of Ivy 2.3.0
: in your ant classpath:
:  [echo]http://ant.apache.org/manual/install.html#optionalTasks
:  [echo]
:  [echo]  Or this build file can do it for you by running the
: Ivy Bootstrap target:
:  [echo]ant ivy-bootstrap
:  [echo]
:  [echo]  Either way you will only have to install Ivy one time.
:  [echo]
:  [echo]  'ant ivy-bootstrap' will install a copy of Ivy into
: your Ant User Library:
:  [echo]/home/jenkins/.ant/lib
:  [echo]
:  [echo]  If you would prefer, you can have it installed into
: an alternative
:  [echo]  directory using the
: "-Divy_install_path=/some/path/you/choose" option,
:  [echo]  but you will have to specify this path every time you
: build Lucene/Solr
:  [echo]  in the future...
:  [echo]ant ivy-bootstrap -Divy_install_path=/some/path/you/choose
:  [echo]...
:  [echo]ant -lib /some/path/you/choose clean compile
:  [echo]...
:  [echo]ant -lib /some/path/you/choose clean compile
:  [echo]
:  [echo]  If you have already run ivy-bootstrap, and still get
: this message, please
:  [echo]  try using the "--noconfig" option when running ant,
: or editing your global
:  [echo]  ant config to allow the user lib to be loaded.  See
: the wiki for more details:
:  [echo]
: http://wiki.apache.org/lucene-java/DeveloperTips#Problems_with_Ivy.3F
:  [echo]
: 

-Hoss
http://www.lucidworks.com/


RE: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Allison, Timothy B.
Prob the usual reasons...no one has submitted a patch yet, or could be a 
regression after LUCENE-7355.

See also:
https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201407.mbox/%3c1d06a081892adf4589bd83ee24b9dc3025971...@imcmbx02.mitre.org%3E

I'll take a look.


-Original Message-
From: Bjarke Buur Mortensen [mailto:morten...@eluence.com] 
Sent: Thursday, October 5, 2017 8:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Complexphrase treats wildcards differently than other query parsers

Thanks Tim,
that might be what I'm experiencing. I'm actually quite certain of it :-)

Do you remember any reason that multi term analysis is not happening in 
ComplexPhraseQueryParser?

I'm on 6.6.1, so latest on the 6.x branch.
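
For reference, the kind of chain I mean looks roughly like this (a sketch, not 
my exact schema; mapping-FoldToASCII.txt is one of the mapping files shipped 
with the Solr example configs):

   <fieldType name="text_folded" class="solr.TextField">
     <analyzer>
       <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

With lucene/dismax the wildcard term gets normalized through the char filter; 
with complexphrase it apparently does not, which is what prompted my question.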

2017-10-05 14:34 GMT+02:00 Allison, Timothy B. :

> There's every chance that I'm missing something at the Solr level, but 
> it _looks_ at the Lucene level, like ComplexPhraseQueryParser is still 
> not applying analysis to multiterms.
>
> When I call this on 7.0.0:
>QueryParser qp = new ComplexPhraseQueryParser(defaultFieldName,
> analyzer);
> return qp.parse(qString);
>
>  where the analyzer is a mock "uppercase vowel" analyzer[1] and the 
> qString is;
>
> "the* quick~" the* quick~ the quick
>
> I get this:
> "the* quick~" name:the* name:quick~2 name:thE name:qUIck
>
>
> [1] https://github.com/tballison/lucene-addons/blob/master/
> lucene-5205/src/test/java/org/apache/lucene/queryparser/
> spans/TestAdvancedAnalyzers.java#L117
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Thursday, October 5, 2017 8:02 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Complexphrase treats wildcards differently than other 
> query parsers
>
> What version of Solr are you using?
>
> I thought this had been fixed fairly recently, but I can't quickly 
> find the JIRA.  Let me take a look.
>
> Best,
>
>  Tim
>
> This was one of my initial reasons for my SpanQueryParser 
> LUCENE-5205[1] and [2], which handles analysis of multiterms even in phrases.
>
> [1] https://github.com/tballison/lucene-addons/tree/master/lucene-5205
> [2] https://mvnrepository.com/artifact/org.tallison.lucene/
> lucene-5205/6.6-0.1
>
> -Original Message-
> From: Bjarke Buur Mortensen [mailto:morten...@eluence.com]
> Sent: Thursday, October 5, 2017 6:28 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Complexphrase treats wildcards differently than other 
> query parsers
>
> 2017-10-05 11:29 GMT+02:00 Emir Arnautović :
>
> > Hi Bjarke,
> > You are right - I jumped into wrong/old conclusion as the simplest 
> > answer to your question.
>
>
>  No problem :-)
>
> I guess looking at the code could give you an answer.
> >
>
> This is what I would like to avoid out of fear that my head would 
> explode
> ;-)
>
>
> >
> > Thanks,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> > Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> > > On 5 Oct 2017, at 10:44, Bjarke Buur Mortensen 
> > > 
> > wrote:
> > >
> > > Well, according to
> > > https://lucidworks.com/2011/11/29/whats-with-lowercasing-
> > wildcard-multiterm-queries-in-solr/
> > > multiterm means
> > >
> > > wildcard
> > > range
> > > prefix
> > >
> > > so it is that way i'm using the word. That same article explains 
> > > how analysis will be performed with wildcards if the analyzers are 
> > > multi-term aware.
> > > Furthermore, both lucene and dismax do the correct analysis, so I 
> > > don't think you are right in your statement about the majority of 
> > > QPs skipping analysis for wildcards.
> > >
> > > So I'm still confused as to why complexphrase does things differently.
> > >
> > > Thanks,
> > > /Bjarke
> > >
> > > 2017-10-05 10:16 GMT+02:00 Emir Arnautović 
> > > > >:
> > >
> > >> Hi Bjarke,
> > >> It is not multiterm that is causing query parser to skip analysis 
> > >> chain but wildcard. The majority of query parsers do not analyse 
> > >> query string
> > if
> > >> there are wildcards.
> > >>
> > >> HTH
> > >> Emir
> > >> --
> > >> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> > >> Elasticsearch Consulting Support Training - http://sematext.com/
> > >>
> > >>
> > >>
> > >>> On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen 
> > >>> 
> > >> wrote:
> > >>>
> > >>> Hi list,
> > >>>
> > >>> I'm trying to search for the term funktionsnedsättning* In my 
> > >>> analyzer chain I use a MappingCharFilterFactory to change ä to a.
> > >>> So I would expect that funktionsnedsättning* would translate to 
> > >>> funktionsnedsattning*.
> > >>>
> > >>> If I use e.g. the lucene query parser, this is indeed what happens:
> > >>> ...debugQuery=on&defType=lucene&q=funktionsneds%C3%A4ttning* 
> > >>> gives me "rawquerystring":"funktionsnedsättning*", "querystring":
> > >>> 

Re: Recommendations for number of open files?

2017-10-05 Thread Erick Erickson
Well, Lucene keeps an open file handle for _every_ file in _every_
index directory. So, for instance, let's say a replica has 10
segments. Each segment is 10-15 individual files. So that's 100-150
file handles right there. And indexes can have many segments.

Check to see if "cfs" extensions are in your indexing directory,
that's "compound file system" and if present will reduce the number of
file handles needed.

A second thing you might be able to do is increase the maximum segment
size by setting maxMergedSegmentMB in your solrconfig file for
TieredMergePolicy, something like
1
eventually that'll merge segments into fewer, but that'll take a while.
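
(A sketch of what that can look like in solrconfig.xml for Solr 6.x/7.x -- the 
value here is just an example, not a recommendation:)

   <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
     <double name="maxMergedSegmentMB">10000</double>
   </mergePolicyFactory>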

As to your question, we usually recommend to set the file limit to
"unlimited". You do have to monitor it however, at some point there's
a lot of bookkeeping.
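
On Linux that usually means raising the nofile limit for the user running 
Solr; a sketch, assuming Solr runs as a "solr" user and your init setup honors 
/etc/security/limits.conf:

   solr  soft  nofile  65535
   solr  hard  nofile  65535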

one replica trying to open > 8,000 files seems very odd though. Is it
a massive index? The default max segment size is 5G, so you could have
a gazillion small segments in which case you might want to split that
shard up and move the sub-shards to some other machine.

Best,
Erick

On Thu, Oct 5, 2017 at 10:02 AM, Webster Homer  wrote:
> We have begun to see errors around too many open files on one of our
> solrcloud nodes. One replica tries to open >8000 files. This replica tries
> to startup and then fails the open files are exceeded upon startup as it
> tries to recover.
>
>
> Our solrclouds have 12 distinct collections. I would think that the number
> of open files would depend upon the number of collections as well as
> numbers of files per index etc...
>
> Our current setting is 8192 open files per process.
>
> What values are recommended? is there a normal number of open files?
>
> What would lead to there being lots of open files?
>


Recommendations for number of open files?

2017-10-05 Thread Webster Homer
We have begun to see errors around too many open files on one of our
solrcloud nodes. One replica tries to open >8000 files. This replica tries
to start up and then fails because the open-file limit is exceeded on startup
as it tries to recover.


Our solrclouds have 12 distinct collections. I would think that the number
of open files would depend upon the number of collections as well as
numbers of files per index etc...

Our current setting is 8192 open files per process.

What values are recommended? is there a normal number of open files?

What would lead to there being lots of open files?



Re: FilterCache size should reduce as index grows?

2017-10-05 Thread Yonik Seeley
On Thu, Oct 5, 2017 at 3:20 AM, Toke Eskildsen  wrote:
> On Wed, 2017-10-04 at 21:42 -0700, S G wrote:
>
> It seems that the memory limit option maxSizeMB was added in Solr 5.2:
> https://issues.apache.org/jira/browse/SOLR-7372
> I am not sure if it works with all caches in Solr, but in my world it
> is way better to define the caches by memory instead of count.

Yes, that will work with the filterCache, but one needs to change the
cache type as well (maxSizeMB is only an option on LRUCache, and
filterCache uses FastLRUCache in the default solrconfig.xml)

-Yonik


Re: FilterCache size should reduce as index grows?

2017-10-05 Thread Yonik Seeley
On Thu, Oct 5, 2017 at 10:07 AM, Erick Erickson  wrote:
> The other thing I'd point out is that if your hit ratio is low, you
> might as well disable it entirely.

I'd normally recommend against turning it off entirely, except in
*very* custom cases.  Even if the user doesn't reuse filter queries,
Solr itself can reuse them internally in many different ways.  One way is 2-phase
distributed search, for example.  Another is big terms in UIF faceting.
Some of these things were designed with the presence of a filter cache
in mind.

-Yonik


Error adding replica after a delete replica

2017-10-05 Thread Webster Homer
A colleague of mine was testing how solrcloud replica recovery works. We
have had a lot of issues with replicas going into recovery mode, replicas
down, and "recovery failed" states.  So to test, he deleted a healthy
replica in one of our development environments. First the delete operation
timed out, but the replica appears to be gone. However, addReplica always
fails with this error:

Error CREATEing SolrCore 'sial-content-citations_shard1_replica1': Unable
to create core [sial-content-citations_shard1_replica1] Caused by: Lock
held by this virtual machine: /var/solr/data/sial-content-
citations_shard1_replica1/data/index/write.lock

This cloud has 4 nodes. The collection has two shards with two replicas per
shard. They are all hosted in a google cloud environment.

So if the delete removed the replica, why would it still hold a lock? We want
to understand this.

We are using Solr 6.2.0


Re: Solrcloud replication not working

2017-10-05 Thread solr2020
thanks.

We don't see any error message (or any message at all) in the logs. And we
have enough disk space.

We are running Solr as the root user on the Ubuntu box, but the ZooKeeper
process is running as the zookeeper user. Will that cause the problem?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Solrcloud replication not working

2017-10-05 Thread solr2020
Hi,

We are using Solr 6.4.2 in a SolrCloud setup. We have two Solr instances in the
Solr cluster. This SolrCloud is running on Ubuntu. The problem is that replication
is not happening between these two Solr instances: sometimes it replicates
10% of the content and sometimes nothing at all.

In Zookeeper ensemble we have three zookeeper instances running in a
different box.

thanks.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: FilterCache size should reduce as index grows?

2017-10-05 Thread Erick Erickson
The other thing I'd point out is that if your hit ratio is low, you
might as well disable it entirely.

Finally, if you have any a-priori knowledge that certain fq clauses
are very unlikely to be re-used, add {!cache=false}. If you also add
cost=101, then the fq clause will only be evaluated for docs that need
it (this only applies when caching is turned off).
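
As a made-up illustration (field names and values are only placeholders),
an expensive one-off filter could be sent as an uncached post-filter:

  fq={!frange cache=false cost=200 l=0.5}div(popularity,price)

With caching off and cost >= 100, a query that supports post-filtering
(frange does) is only run against documents that have already matched the
main query and the other filters.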

See: http://yonik.com/advanced-filter-caching-in-solr/

Best,
Erick

On Thu, Oct 5, 2017 at 12:20 AM, Toke Eskildsen  wrote:
> On Wed, 2017-10-04 at 21:42 -0700, S G wrote:
>> The bit-vectors in filterCache are as long as the maximum number of
>> documents in a core. If there are a billion docs per core, every bit
>> vector will have a billion bits making its size as 10^9 / 8 = 128 mb
>
> The tricky part here is there are sparse (aka few hits) entries that
> takes up less space. The 1 bit/hit is worst case.
>
> This is both good and bad. The good part is of course that it saves
> memory. The bad part is that it often means that people set the
> filterCache size to a high number and that it works well, right until
> a series of filters with many hits.
>
> It seems that the memory limit option maxSizeMB was added in Solr 5.2:
> https://issues.apache.org/jira/browse/SOLR-7372
> I am not sure if it works with all caches in Solr, but in my world it
> is way better to define the caches by memory instead of count.
>
>> With such a big cache-value per entry,  the default value of 128
>> values in will become 128x128mb = 16gb and would not be very good for
>> a system running below 32 gb of memory.
>
> Sure. The default values are just that. For an index with 1M documents
> and a lot of different filters, 128 would probably be too low.
>
> If someone were to create a well-researched set of config files for
> different scenarios, it would be a welcome addition to our shared
> knowledge pool.
>
>> If such a use-case is anticipated, either the JVM's max memory be
>> increased to beyond 40 gb or the filterCache size be reduced to 32.
>
> Best solution: Use maxSizeMB (if it works)
> Second best solution: Reduce to 32 or less
> Third best, but often used, solution: Hope that most of the entries are
> sparse and will remain so
>
> - Toke Eskildsen, Royal Danish Library
>


Re: Solrcloud replication not working

2017-10-05 Thread Erick Erickson
We need a lot more data to say anything useful, please read:

https://wiki.apache.org/solr/UsingMailingLists

What do you see in your Solr logs? What have you tried to do to
diagnose this? Do you have enough disk space?

Best,
Erick

On Thu, Oct 5, 2017 at 6:56 AM, solr2020  wrote:
> Hi,
>
> We are using Solr 6.4.2 & SolrCloud setup. We have two solr instances in the
> solr cluster.This solrcloud running in ubuntu OS. The problem is replication
> is not happening between these two solr instances. sometimes it replicate
> 10% of the content and sometimes not.
>
> In Zookeeper ensemble we have three zookeeper instances running in a
> different box.
>
> thanks.
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Question regarding Upgrading to SolrCloud

2017-10-05 Thread Erick Erickson
Gopesh:

There is brand new functionality in Solr 7, see: SOLR-10233, the
"PULL" replica type which is a hybrid SolrCloud replica that uses
master/slave type replication. You should find this in the reference
guide, the 7.0 ref guide should be published soon. Meanwhile, that
JIRA will let you know. Also see .../solr/CHANGES.txt. As Emir says,
though, it would require ZooKeeper.

Really, though, once you move to SolrCloud (if you do) I'd stick with
the standard NRT replica type unless I had reason to use one of the
other two (TLOG and PULL), as they're for pretty special situations.
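
(For illustration only -- the collection name and counts below are
placeholders, not a recommendation -- the replica types are requested when
the collection is created, roughly like:

  /admin/collections?action=CREATE&name=mycoll&numShards=2&nrtReplicas=0&tlogReplicas=1&pullReplicas=2
)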

All that said, if you're happy with master/slave there's no compelling
reason to go to SolrCloud, especially for smaller installations.

Best,
Erick

On Wed, Oct 4, 2017 at 11:46 PM, Gopesh Sharma
 wrote:
> Hello Guys,
>
> As of now we are running Solr 3.4 with Master Slave Configuration. We are 
> planning to upgrade it to the lastest version (6.6 or 7). Questions I have 
> before upgrading
>
>
>   1.  Since we do not have a lot of data, is it required to move to SolrCloud 
> or continue using it Master Slave
>   2.  Is the support for Master Slave will be there in the future release or 
> do you plan to remove it.
>   3.  Can we configure master-slave replication in Solr Cloud, if yes then do 
> we need zookeeper as well.
>
> Thanks,
> Gopesh Sharma



Re: tf function query

2017-10-05 Thread Erick Erickson
What would you  expect as output? tf(field, "a OR b AND c NOT d"). I'm
not sure what term frequency would even mean in that situation.

tf is a pretty simple function: it expects a single term, and there's
no way I know of to do what you're asking.

Best,
Erick

On Thu, Oct 5, 2017 at 3:14 AM, Dmitry Kan  wrote:
> Hi,
>
> According to
> https://lucene.apache.org/solr/guide/6_6/function-queries.html#FunctionQueries-AvailableFunctions
>
> tf(field, term) requires a term as a second parameter. Is there a
> possibility to pass in an entire input query (multiterm and boolean) to the
> function?
>
> The context here is that we don't use edismax parser to apply multifield
> boosts, but instead use a custom ranking function.
>
> Would appreciate any thoughts,
>
> Dmitry
>
> --
> Dmitry Kan
> Luke Toolbox: http://github.com/DmitryKey/luke
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
> SemanticAnalyzer: https://semanticanalyzer.info


Re: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Bjarke Buur Mortensen
Thanks Tim,
that might be what I'm experiencing. I'm actually quite certain of it :-)

Do you remember any reason that multi term analysis is not happening in
ComplexPhraseQueryParser?

I'm on 6.6.1, so latest on the 6.x branch.

2017-10-05 14:34 GMT+02:00 Allison, Timothy B. :

> There's every chance that I'm missing something at the Solr level, but it
> _looks_ at the Lucene level, like ComplexPhraseQueryParser is still not
> applying analysis to multiterms.
>
> When I call this on 7.0.0:
>QueryParser qp = new ComplexPhraseQueryParser(defaultFieldName,
> analyzer);
> return qp.parse(qString);
>
>  where the analyzer is a mock "uppercase vowel" analyzer[1] and the
> qString is;
>
> "the* quick~" the* quick~ the quick
>
> I get this:
> "the* quick~" name:the* name:quick~2 name:thE name:qUIck
>
>
> [1] https://github.com/tballison/lucene-addons/blob/master/
> lucene-5205/src/test/java/org/apache/lucene/queryparser/
> spans/TestAdvancedAnalyzers.java#L117
>
> -Original Message-
> From: Allison, Timothy B. [mailto:talli...@mitre.org]
> Sent: Thursday, October 5, 2017 8:02 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Complexphrase treats wildcards differently than other query
> parsers
>
> What version of Solr are you using?
>
> I thought this had been fixed fairly recently, but I can't quickly find
> the JIRA.  Let me take a look.
>
> Best,
>
>  Tim
>
> This was one of my initial reasons for my SpanQueryParser LUCENE-5205[1]
> and [2], which handles analysis of multiterms even in phrases.
>
> [1] https://github.com/tballison/lucene-addons/tree/master/lucene-5205
> [2] https://mvnrepository.com/artifact/org.tallison.lucene/
> lucene-5205/6.6-0.1
>
> -Original Message-
> From: Bjarke Buur Mortensen [mailto:morten...@eluence.com]
> Sent: Thursday, October 5, 2017 6:28 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Complexphrase treats wildcards differently than other query
> parsers
>
> 2017-10-05 11:29 GMT+02:00 Emir Arnautović :
>
> > Hi Bjarke,
> > You are right - I jumped into wrong/old conclusion as the simplest
> > answer to your question.
>
>
>  No problem :-)
>
> I guess looking at the code could give you an answer.
> >
>
> This is what I would like to avoid out of fear that my head would explode
> ;-)
>
>
> >
> > Thanks,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection Solr &
> > Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> > > On 5 Oct 2017, at 10:44, Bjarke Buur Mortensen
> > > 
> > wrote:
> > >
> > > Well, according to
> > > https://lucidworks.com/2011/11/29/whats-with-lowercasing-
> > wildcard-multiterm-queries-in-solr/
> > > multiterm means
> > >
> > > wildcard
> > > range
> > > prefix
> > >
> > > so it is that way i'm using the word. That same article explains how
> > > analysis will be performed with wildcards if the analyzers are
> > > multi-term aware.
> > > Furthermore, both lucene and dismax do the correct analysis, so I
> > > don't think you are right in your statement about the majority of
> > > QPs skipping analysis for wildcards.
> > >
> > > So I'm still confused as to why complexphrase does things differently.
> > >
> > > Thanks,
> > > /Bjarke
> > >
> > > 2017-10-05 10:16 GMT+02:00 Emir Arnautović
> > > > >:
> > >
> > >> Hi Bjarke,
> > >> It is not multiterm that is causing query parser to skip analysis
> > >> chain but wildcard. The majority of query parsers do not analyse
> > >> query string
> > if
> > >> there are wildcards.
> > >>
> > >> HTH
> > >> Emir
> > >> --
> > >> Monitoring - Log Management - Alerting - Anomaly Detection Solr &
> > >> Elasticsearch Consulting Support Training - http://sematext.com/
> > >>
> > >>
> > >>
> > >>> On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen
> > >>> 
> > >> wrote:
> > >>>
> > >>> Hi list,
> > >>>
> > >>> I'm trying to search for the term funktionsnedsättning* In my
> > >>> analyzer chain I use a MappingCharFilterFactory to change ä to a.
> > >>> So I would expect that funktionsnedsättning* would translate to
> > >>> funktionsnedsattning*.
> > >>>
> > >>> If I use e.g. the lucene query parser, this is indeed what happens:
> > >>> ...debugQuery=on=lucene=funktionsneds%C3%A4ttning* gives
> > >>> me "rawquerystring":"funktionsnedsättning*", "querystring":
> > >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> > >> funktionsnedsattning*"
> > >>> and 15 documents returned.
> > >>>
> > >>> Trying the same with complexphrase gives me:
> > >>> ...debugQuery=on=complexphrase=funktionsneds%C3%A4ttning
> > >>> *
> > >> gives me
> > >>> "rawquerystring":"funktionsnedsättning*", "querystring":
> > >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> > >> funktionsnedsättning*"
> > >>> and 0 documents. Notice how ä has not been changed to a.
> > >>>
> > >>> How can this be? Is complexphrase somehow skipping the 

RE: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Allison, Timothy B.
There's every chance that I'm missing something at the Solr level, but at
the Lucene level it _looks_ like ComplexPhraseQueryParser is still not
applying analysis to multiterms.

When I call this on 7.0.0:
    QueryParser qp = new ComplexPhraseQueryParser(defaultFieldName, analyzer);
    return qp.parse(qString);

where the analyzer is a mock "uppercase vowel" analyzer[1] and the qString is:

"the* quick~" the* quick~ the quick

I get this:
"the* quick~" name:the* name:quick~2 name:thE name:qUIck


[1] 
https://github.com/tballison/lucene-addons/blob/master/lucene-5205/src/test/java/org/apache/lucene/queryparser/spans/TestAdvancedAnalyzers.java#L117

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Thursday, October 5, 2017 8:02 AM
To: solr-user@lucene.apache.org
Subject: RE: Complexphrase treats wildcards differently than other query parsers

What version of Solr are you using?

I thought this had been fixed fairly recently, but I can't quickly find the 
JIRA.  Let me take a look.

Best,

 Tim

This was one of my initial reasons for my SpanQueryParser LUCENE-5205[1] and 
[2], which handles analysis of multiterms even in phrases.

[1] https://github.com/tballison/lucene-addons/tree/master/lucene-5205
[2] https://mvnrepository.com/artifact/org.tallison.lucene/lucene-5205/6.6-0.1 

-Original Message-
From: Bjarke Buur Mortensen [mailto:morten...@eluence.com]
Sent: Thursday, October 5, 2017 6:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Complexphrase treats wildcards differently than other query parsers

2017-10-05 11:29 GMT+02:00 Emir Arnautović :

> Hi Bjarke,
> You are right - I jumped into wrong/old conclusion as the simplest 
> answer to your question.


 No problem :-)

I guess looking at the code could give you an answer.
>

This is what I would like to avoid out of fear that my head would explode
;-)


>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 5 Oct 2017, at 10:44, Bjarke Buur Mortensen 
> > 
> wrote:
> >
> > Well, according to
> > https://lucidworks.com/2011/11/29/whats-with-lowercasing-
> wildcard-multiterm-queries-in-solr/
> > multiterm means
> >
> > wildcard
> > range
> > prefix
> >
> > so it is that way i'm using the word. That same article explains how 
> > analysis will be performed with wildcards if the analyzers are 
> > multi-term aware.
> > Furthermore, both lucene and dismax do the correct analysis, so I 
> > don't think you are right in your statement about the majority of 
> > QPs skipping analysis for wildcards.
> >
> > So I'm still confused as to why complexphrase does things differently.
> >
> > Thanks,
> > /Bjarke
> >
> > 2017-10-05 10:16 GMT+02:00 Emir Arnautović 
> > >:
> >
> >> Hi Bjarke,
> >> It is not multiterm that is causing query parser to skip analysis 
> >> chain but wildcard. The majority of query parsers do not analyse 
> >> query string
> if
> >> there are wildcards.
> >>
> >> HTH
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> >> Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen 
> >>> 
> >> wrote:
> >>>
> >>> Hi list,
> >>>
> >>> I'm trying to search for the term funktionsnedsättning* In my 
> >>> analyzer chain I use a MappingCharFilterFactory to change ä to a.
> >>> So I would expect that funktionsnedsättning* would translate to 
> >>> funktionsnedsattning*.
> >>>
> >>> If I use e.g. the lucene query parser, this is indeed what happens:
> >>> ...debugQuery=on=lucene=funktionsneds%C3%A4ttning* gives 
> >>> me "rawquerystring":"funktionsnedsättning*", "querystring":
> >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> >> funktionsnedsattning*"
> >>> and 15 documents returned.
> >>>
> >>> Trying the same with complexphrase gives me:
> >>> ...debugQuery=on=complexphrase=funktionsneds%C3%A4ttning
> >>> *
> >> gives me
> >>> "rawquerystring":"funktionsnedsättning*", "querystring":
> >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> >> funktionsnedsättning*"
> >>> and 0 documents. Notice how ä has not been changed to a.
> >>>
> >>> How can this be? Is complexphrase somehow skipping the analysis 
> >>> chain
> for
> >>> multiterms, even though components and in particular 
> >>> MappingCharFilterFactory are Multi-term aware
> >>>
> >>> Are there any configuration gotchas that I'm not aware of?
> >>>
> >>> Thanks for the help,
> >>> Bjarke Buur Mortensen
> >>> Senior Software Engineer, Eluence A/S
> >>
> >>
>
>


RE: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Allison, Timothy B.
What version of Solr are you using?

I thought this had been fixed fairly recently, but I can't quickly find the 
JIRA.  Let me take a look.

Best,

 Tim

This was one of my initial reasons for my SpanQueryParser LUCENE-5205[1] and 
[2], which handles analysis of multiterms even in phrases.

[1] https://github.com/tballison/lucene-addons/tree/master/lucene-5205
[2] https://mvnrepository.com/artifact/org.tallison.lucene/lucene-5205/6.6-0.1 

-Original Message-
From: Bjarke Buur Mortensen [mailto:morten...@eluence.com] 
Sent: Thursday, October 5, 2017 6:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Complexphrase treats wildcards differently than other query parsers

2017-10-05 11:29 GMT+02:00 Emir Arnautović :

> Hi Bjarke,
> You are right - I jumped into wrong/old conclusion as the simplest 
> answer to your question.


 No problem :-)

I guess looking at the code could give you an answer.
>

This is what I would like to avoid out of fear that my head would explode
;-)


>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 5 Oct 2017, at 10:44, Bjarke Buur Mortensen 
> > 
> wrote:
> >
> > Well, according to
> > https://lucidworks.com/2011/11/29/whats-with-lowercasing-
> wildcard-multiterm-queries-in-solr/
> > multiterm means
> >
> > wildcard
> > range
> > prefix
> >
> > so it is that way i'm using the word. That same article explains how 
> > analysis will be performed with wildcards if the analyzers are 
> > multi-term aware.
> > Furthermore, both lucene and dismax do the correct analysis, so I 
> > don't think you are right in your statement about the majority of 
> > QPs skipping analysis for wildcards.
> >
> > So I'm still confused as to why complexphrase does things differently.
> >
> > Thanks,
> > /Bjarke
> >
> > 2017-10-05 10:16 GMT+02:00 Emir Arnautović 
> > >:
> >
> >> Hi Bjarke,
> >> It is not multiterm that is causing query parser to skip analysis 
> >> chain but wildcard. The majority of query parsers do not analyse 
> >> query string
> if
> >> there are wildcards.
> >>
> >> HTH
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> >> Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen 
> >>> 
> >> wrote:
> >>>
> >>> Hi list,
> >>>
> >>> I'm trying to search for the term funktionsnedsättning* In my 
> >>> analyzer chain I use a MappingCharFilterFactory to change ä to a.
> >>> So I would expect that funktionsnedsättning* would translate to 
> >>> funktionsnedsattning*.
> >>>
> >>> If I use e.g. the lucene query parser, this is indeed what happens:
> >>> ...debugQuery=on=lucene=funktionsneds%C3%A4ttning* gives 
> >>> me "rawquerystring":"funktionsnedsättning*", "querystring":
> >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> >> funktionsnedsattning*"
> >>> and 15 documents returned.
> >>>
> >>> Trying the same with complexphrase gives me:
> >>> ...debugQuery=on=complexphrase=funktionsneds%C3%A4ttning
> >>> *
> >> gives me
> >>> "rawquerystring":"funktionsnedsättning*", "querystring":
> >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> >> funktionsnedsättning*"
> >>> and 0 documents. Notice how ä has not been changed to a.
> >>>
> >>> How can this be? Is complexphrase somehow skipping the analysis 
> >>> chain
> for
> >>> multiterms, even though components and in particular 
> >>> MappingCharFilterFactory are Multi-term aware
> >>>
> >>> Are there any configuration gotchas that I'm not aware of?
> >>>
> >>> Thanks for the help,
> >>> Bjarke Buur Mortensen
> >>> Senior Software Engineer, Eluence A/S
> >>
> >>
>
>


Re: Solr boost function taking precedence over relevance boosting

2017-10-05 Thread alessandro.benedetti
I would try to use an additive boost and the ^= boost operator:
- name_property:(test^=2) will assign a fixed score of 2 if the match
happens (it is a constant-score query)
- additive boost will be 0
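
A minimal sketch with edismax (the field name and value are only
placeholders, not taken from your setup):

  q=...&defType=edismax&bq=name_property:test^=2

so a match on "test" adds a constant 2 to the score rather than rescaling
the whole score.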

Re: tf function query

2017-10-05 Thread Erik Hatcher
How about the query() function?  Just be clever about the query you specify ;)
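
One way to sketch it (the embedded query and field names below are just
placeholders):

  query({!dismax qf='title^2 body' v='red shoes'}, 0)

query(...) returns, per document, the score of the embedded sub-query (0
when it does not match), so an arbitrary multi-term or boolean query can
be fed into a custom ranking function.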

> On Oct 5, 2017, at 06:14, Dmitry Kan  wrote:
> 
> Hi,
> 
> According to
> https://lucene.apache.org/solr/guide/6_6/function-queries.html#FunctionQueries-AvailableFunctions
> 
> tf(field, term) requires a term as a second parameter. Is there a
> possibility to pass in an entire input query (multiterm and boolean) to the
> function?
> 
> The context here is that we don't use edismax parser to apply multifield
> boosts, but instead use a custom ranking function.
> 
> Would appreciate any thoughts,
> 
> Dmitry
> 
> -- 
> Dmitry Kan
> Luke Toolbox: http://github.com/DmitryKey/luke
> Blog: http://dmitrykan.blogspot.com
> Twitter: http://twitter.com/dmitrykan
> SemanticAnalyzer: https://semanticanalyzer.info


RE: tf function query

2017-10-05 Thread Junte Zhang
I am afraid this is not possible, since getting frequencies for phrases is not 
possible, unless the phrases are created as tokens (i.e. using n-grams or 
shingles) and indexed. If someone has a solution for this, then I am interested 
as well.
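
(A rough sketch of what that could look like in schema.xml -- the field type
name and shingle sizes are only an illustration:

  <!-- illustration only: index 2- and 3-word shingles alongside single terms -->
  <fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ShingleFilterFactory" minShingleSize="2"
              maxShingleSize="3" outputUnigrams="true"/>
    </analyzer>
  </fieldType>

With something like that, a two-word phrase is indexed as a single token and
becomes addressable as a plain term again.)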

/JZ

-Original Message-
From: Dmitry Kan [mailto:solrexp...@gmail.com] 
Sent: Thursday, October 5, 2017 12:15 PM
To: solr-user@lucene.apache.org
Subject: tf function query

Hi,

According to
https://lucene.apache.org/solr/guide/6_6/function-queries.html#FunctionQueries-AvailableFunctions

tf(field, term) requires a term as a second parameter. Is there a possibility 
to pass in an entire input query (multiterm and boolean) to the function?

The context here is that we don't use edismax parser to apply multifield 
boosts, but instead use a custom ranking function.

Would appreciate any thoughts,

Dmitry

--
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: https://semanticanalyzer.info


Question regarding Upgrading to SolrCloud

2017-10-05 Thread Gopesh Sharma
Hello Guys,

As of now we are running Solr 3.4 with Master Slave Configuration. We are 
planning to upgrade it to the latest version (6.6 or 7). Questions I have
before upgrading:


  1.  Since we do not have a lot of data, is it required to move to SolrCloud,
or can we continue using Master Slave?
  2.  Will support for Master Slave still be there in future releases, or do
you plan to remove it?
  3.  Can we configure master-slave replication in SolrCloud, and if yes, do
we need ZooKeeper as well?

Thanks,
Gopesh Sharma


Re: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Bjarke Buur Mortensen
2017-10-05 11:29 GMT+02:00 Emir Arnautović :

> Hi Bjarke,
> You are right - I jumped into wrong/old conclusion as the simplest answer
> to your question.


 No problem :-)

I guess looking at the code could give you an answer.
>

This is what I would like to avoid out of fear that my head would explode
;-)


>
> Thanks,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 5 Oct 2017, at 10:44, Bjarke Buur Mortensen 
> wrote:
> >
> > Well, according to
> > https://lucidworks.com/2011/11/29/whats-with-lowercasing-
> wildcard-multiterm-queries-in-solr/
> > multiterm means
> >
> > wildcard
> > range
> > prefix
> >
> > so it is that way i'm using the word. That same article explains how
> > analysis will be performed with wildcards if the analyzers are multi-term
> > aware.
> > Furthermore, both lucene and dismax do the correct analysis, so I don't
> > think you are right in your statement about the majority of QPs skipping
> > analysis for wildcards.
> >
> > So I'm still confused as to why complexphrase does things differently.
> >
> > Thanks,
> > /Bjarke
> >
> > 2017-10-05 10:16 GMT+02:00 Emir Arnautović  >:
> >
> >> Hi Bjarke,
> >> It is not multiterm that is causing query parser to skip analysis chain
> >> but wildcard. The majority of query parsers do not analyse query string
> if
> >> there are wildcards.
> >>
> >> HTH
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>
> >>
> >>> On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen 
> >> wrote:
> >>>
> >>> Hi list,
> >>>
> >>> I'm trying to search for the term funktionsnedsättning*
> >>> In my analyzer chain I use a MappingCharFilterFactory to change ä to a.
> >>> So I would expect that funktionsnedsättning* would translate to
> >>> funktionsnedsattning*.
> >>>
> >>> If I use e.g. the lucene query parser, this is indeed what happens:
> >>> ...debugQuery=on=lucene=funktionsneds%C3%A4ttning* gives me
> >>> "rawquerystring":"funktionsnedsättning*", "querystring":
> >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> >> funktionsnedsattning*"
> >>> and 15 documents returned.
> >>>
> >>> Trying the same with complexphrase gives me:
> >>> ...debugQuery=on=complexphrase=funktionsneds%C3%A4ttning*
> >> gives me
> >>> "rawquerystring":"funktionsnedsättning*", "querystring":
> >>> "funktionsnedsättning*", "parsedquery":"content_ol:
> >> funktionsnedsättning*"
> >>> and 0 documents. Notice how ä has not been changed to a.
> >>>
> >>> How can this be? Is complexphrase somehow skipping the analysis chain
> for
> >>> multiterms, even though components and in particular
> >>> MappingCharFilterFactory are Multi-term aware
> >>>
> >>> Are there any configuration gotchas that I'm not aware of?
> >>>
> >>> Thanks for the help,
> >>> Bjarke Buur Mortensen
> >>> Senior Software Engineer, Eluence A/S
> >>
> >>
>
>


tf function query

2017-10-05 Thread Dmitry Kan
Hi,

According to
https://lucene.apache.org/solr/guide/6_6/function-queries.html#FunctionQueries-AvailableFunctions

tf(field, term) requires a term as a second parameter. Is there a
possibility to pass in an entire input query (multiterm and boolean) to the
function?

The context here is that we don't use edismax parser to apply multifield
boosts, but instead use a custom ranking function.

Would appreciate any thoughts,

Dmitry

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: https://semanticanalyzer.info


Re: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Emir Arnautović
Hi Bjarke,
You are right - I jumped to a wrong/old conclusion as the simplest answer to
your question. I guess looking at the code could give you an answer.

Thanks,
Emir 
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 5 Oct 2017, at 10:44, Bjarke Buur Mortensen  wrote:
> 
> Well, according to
> https://lucidworks.com/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
> multiterm means
> 
> wildcard
> range
> prefix
> 
> so it is that way i'm using the word. That same article explains how
> analysis will be performed with wildcards if the analyzers are multi-term
> aware.
> Furthermore, both lucene and dismax do the correct analysis, so I don't
> think you are right in your statement about the majority of QPs skipping
> analysis for wildcards.
> 
> So I'm still confused as to why complexphrase does things differently.
> 
> Thanks,
> /Bjarke
> 
> 2017-10-05 10:16 GMT+02:00 Emir Arnautović :
> 
>> Hi Bjarke,
>> It is not multiterm that is causing query parser to skip analysis chain
>> but wildcard. The majority of query parsers do not analyse query string if
>> there are wildcards.
>> 
>> HTH
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen 
>> wrote:
>>> 
>>> Hi list,
>>> 
>>> I'm trying to search for the term funktionsnedsättning*
>>> In my analyzer chain I use a MappingCharFilterFactory to change ä to a.
>>> So I would expect that funktionsnedsättning* would translate to
>>> funktionsnedsattning*.
>>> 
>>> If I use e.g. the lucene query parser, this is indeed what happens:
>>> ...debugQuery=on=lucene=funktionsneds%C3%A4ttning* gives me
>>> "rawquerystring":"funktionsnedsättning*", "querystring":
>>> "funktionsnedsättning*", "parsedquery":"content_ol:
>> funktionsnedsattning*"
>>> and 15 documents returned.
>>> 
>>> Trying the same with complexphrase gives me:
>>> ...debugQuery=on=complexphrase=funktionsneds%C3%A4ttning*
>> gives me
>>> "rawquerystring":"funktionsnedsättning*", "querystring":
>>> "funktionsnedsättning*", "parsedquery":"content_ol:
>> funktionsnedsättning*"
>>> and 0 documents. Notice how ä has not been changed to a.
>>> 
>>> How can this be? Is complexphrase somehow skipping the analysis chain for
>>> multiterms, even though components and in particular
>>> MappingCharFilterFactory are Multi-term aware
>>> 
>>> Are there any configuration gotchas that I'm not aware of?
>>> 
>>> Thanks for the help,
>>> Bjarke Buur Mortensen
>>> Senior Software Engineer, Eluence A/S
>> 
>> 



Re: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Bjarke Buur Mortensen
Well, according to
https://lucidworks.com/2011/11/29/whats-with-lowercasing-wildcard-multiterm-queries-in-solr/
multiterm means

wildcard
range
prefix

so that is the way I'm using the word. That same article explains how
analysis will be performed with wildcards if the analyzers are multi-term
aware.
Furthermore, both lucene and dismax do the correct analysis, so I don't
think you are right in your statement about the majority of QPs skipping
analysis for wildcards.

So I'm still confused as to why complexphrase does things differently.

Thanks,
/Bjarke

2017-10-05 10:16 GMT+02:00 Emir Arnautović :

> Hi Bjarke,
> It is not multiterm that is causing query parser to skip analysis chain
> but wildcard. The majority of query parsers do not analyse query string if
> there are wildcards.
>
> HTH
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen 
> wrote:
> >
> > Hi list,
> >
> > I'm trying to search for the term funktionsnedsättning*
> > In my analyzer chain I use a MappingCharFilterFactory to change ä to a.
> > So I would expect that funktionsnedsättning* would translate to
> > funktionsnedsattning*.
> >
> > If I use e.g. the lucene query parser, this is indeed what happens:
> > ...debugQuery=on&defType=lucene&q=funktionsneds%C3%A4ttning* gives me
> > "rawquerystring":"funktionsnedsättning*", "querystring":
> > "funktionsnedsättning*", "parsedquery":"content_ol:
> funktionsnedsattning*"
> > and 15 documents returned.
> >
> > Trying the same with complexphrase gives me:
> > ...debugQuery=on&defType=complexphrase&q=funktionsneds%C3%A4ttning*
> gives me
> > "rawquerystring":"funktionsnedsättning*", "querystring":
> > "funktionsnedsättning*", "parsedquery":"content_ol:
> funktionsnedsättning*"
> > and 0 documents. Notice how ä has not been changed to a.
> >
> > How can this be? Is complexphrase somehow skipping the analysis chain for
> > multiterms, even though components and in particular
> > MappingCharFilterFactory are Multi-term aware
> >
> > Are there any configuration gotchas that I'm not aware of?
> >
> > Thanks for the help,
> > Bjarke Buur Mortensen
> > Senior Software Engineer, Eluence A/S
>
>


Re: Question regarding Upgrading to SolrCloud

2017-10-05 Thread Emir Arnautović
Hi Sharma,
Please see inline answers.

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 5 Oct 2017, at 09:00, Gopesh Sharma  wrote:
> 
> Hello Guys,
> 
> As of now we are running Solr 3.4 with Master Slave Configuration. We are 
> planning to upgrade it to the lastest version (6.6 or 7). Questions I have 
> before upgrading
> 
> 
>  1.  Since we do not have a lot of data, is it required to move to SolrCloud 
> or continue using it Master Slave
It is not required to move to SolrCloud if you are OK with MS limitations. The
main drivers to move to SC are:
- data volume that requires sharding
- NRT requirements that cannot be met with the MS model
- FT requirements - with MS the master node is a SPOF and can block updates,
but if your NRT requirements are not strict and you can tolerate longer
periods without updates, this can be ignored

>  2.  Is the support for Master Slave will be there in the future release or 
> do you plan to remove it.
SolrCloud also uses replication as a backup mechanism, so it is there to stay.

>  3.  Can we configure master-slave replication in Solr Cloud, if yes then do 
> we need zookeeper as well.
SolrCloud requires ZK - that is where it keeps the cluster state. As mentioned
above, SolrCloud has replication handlers enabled, so you could run some hybrid
model, but it will not make your system simpler.

> 
> Thanks,
> Gopesh Sharma
> 



Re: Complexphrase treats wildcards differently than other query parsers

2017-10-05 Thread Emir Arnautović
Hi Bjarke,
It is not the multiterm handling that causes the query parser to skip the
analysis chain but the wildcard. The majority of query parsers do not analyse
the query string if there are wildcards.

HTH
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 4 Oct 2017, at 22:08, Bjarke Buur Mortensen  wrote:
> 
> Hi list,
> 
> I'm trying to search for the term funktionsnedsättning*
> In my analyzer chain I use a MappingCharFilterFactory to change ä to a.
> So I would expect that funktionsnedsättning* would translate to
> funktionsnedsattning*.
> 
> If I use e.g. the lucene query parser, this is indeed what happens:
> ...debugQuery=on&defType=lucene&q=funktionsneds%C3%A4ttning* gives me
> "rawquerystring":"funktionsnedsättning*", "querystring":
> "funktionsnedsättning*", "parsedquery":"content_ol:funktionsnedsattning*"
> and 15 documents returned.
> 
> Trying the same with complexphrase gives me:
> ...debugQuery=on&defType=complexphrase&q=funktionsneds%C3%A4ttning* gives me
> "rawquerystring":"funktionsnedsättning*", "querystring":
> "funktionsnedsättning*", "parsedquery":"content_ol:funktionsnedsättning*"
> and 0 documents. Notice how ä has not been changed to a.
> 
> How can this be? Is complexphrase somehow skipping the analysis chain for
> multiterms, even though components and in particular
> MappingCharFilterFactory are Multi-term aware
> 
> Are there any configuration gotchas that I'm not aware of?
> 
> Thanks for the help,
> Bjarke Buur Mortensen
> Senior Software Engineer, Eluence A/S



Re: FilterCache size should reduce as index grows?

2017-10-05 Thread Toke Eskildsen
On Wed, 2017-10-04 at 21:42 -0700, S G wrote:
> The bit-vectors in filterCache are as long as the maximum number of
> documents in a core. If there are a billion docs per core, every bit
> vector will have a billion bits making its size as 10^9 / 8 = 128 mb

The tricky part here is that there are sparse (aka few-hits) entries that
take up less space. The 1 bit/hit is the worst case.

This is both good and bad. The good part is of course that it saves
memory. The bad part is that it often means that people set the
filterCache size to a high number and that it works well, right until
a series of filters with many hits.

It seems that the memory limit option maxSizeMB was added in Solr 5.2:
https://issues.apache.org/jira/browse/SOLR-7372
I am not sure if it works with all caches in Solr, but in my world it
is way better to define the caches by memory instead of count.

> With such a big cache-value per entry,  the default value of 128
> values in will become 128x128mb = 16gb and would not be very good for
> a system running below 32 gb of memory.

Sure. The default values are just that. For an index with 1M documents
and a lot of different filters, 128 would probably be too low.

If someone were to create a well-researched set of config files for
different scenarios, it would be a welcome addition to our shared
knowledge pool.

> If such a use-case is anticipated, either the JVM's max memory be
> increased to beyond 40 gb or the filterCache size be reduced to 32.

Best solution: Use maxSizeMB (if it works)
Second best solution: Reduce to 32 or less
Third best, but often used, solution: Hope that most of the entries are
sparse and will remain so

- Toke Eskildsen, Royal Danish Library


