Re: Congratulations to the new Apache Solr PMC Chair, Jan Høydahl!

2021-02-20 Thread Yonik Seeley
Congrats Jan! Go Solr!
-Yonik


On Thu, Feb 18, 2021 at 1:56 PM Anshum Gupta  wrote:

> Hi everyone,
>
> I’d like to inform everyone that the newly formed Apache Solr PMC nominated
> and elected Jan Høydahl for the position of the Solr PMC Chair and Vice
> President. This decision was approved by the board in its February 2021
> meeting.
>
> Congratulations Jan!
>
> --
> Anshum Gupta
>


Re: Help using Noggit for streaming JSON data

2020-09-17 Thread Yonik Seeley
See this method:

  /** Reads a JSON string into the output, decoding any escaped characters.
*/
  public void getString(CharArr output) throws IOException

And then the idea is to create a subclass of CharArr to incrementally
handle the string that is written to it.
You could overload write methods, or perhaps reserve() to flush/handle the
buffer when it reaches a certain size.
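
For example, something along these lines (an untested sketch that assumes
CharArr's protected buf/start/end fields and its reset() method):

import java.io.IOException;
import java.io.Writer;
import org.noggit.CharArr;

// A CharArr that flushes to a Writer instead of growing its buffer.
class StreamingCharArr extends CharArr {
  private final Writer sink;  // e.g. wraps your database stream

  StreamingCharArr(Writer sink, int bufSize) {
    super(bufSize);
    this.sink = sink;
  }

  @Override
  public void reserve(int num) {
    // flush rather than grow when the buffer would overflow
    if (end + num > buf.length) flushToSink();
    super.reserve(num);  // still grows if a single write exceeds bufSize
  }

  private void flushToSink() {
    try {
      sink.write(buf, start, size());
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    reset();  // empty the buffer so the array is reused
  }
}

Then parser.getString(new StreamingCharArr(myWriter, 8192)) would hand the
decoded string off in chunks instead of accumulating the whole value.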

-Yonik


On Thu, Sep 17, 2020 at 11:48 AM Christopher Schultz <
ch...@christopherschultz.net> wrote:

> All,
>
> Is this an appropriate forum for asking questions about how to use
> Noggit? The GitHub repo doesn't have any discussions available, and filing an
> "issue" to ask a question is kinda silly. I'm happy to be redirected to
> the right place if this isn't appropriate.
>
> I've been able to figure out most things in Noggit by reading the code,
> but I have a new use-case where I expect that I'll have very large
> values (base64-encoded binary) and I'd like to stream those rather than
> calling parser.getString() and getting a potentially huge string coming
> back. I'm streaming into a database so I never need the whole string in
> one place at one time.
>
> I was thinking something like this:
>
> JSONParser p = ...;
>
> int evt = p.nextEvent();
> if(JSONParser.STRING == evt) {
>   // Start streaming
>   boolean eos = false;
>   while(!eos) {
> char c = p.getChar();
> if(c == '"') {
>   eos = true;
> } else {
>   append to stream
> }
>   }
> }
>
> But getChar() is not public. The only "documentation" I've really been
> able to find for Noggit is this post from Yonik back in 2014:
>
> http://yonik.com/noggit-json-parser/
>
> It mostly says "Noggit is great!" and specifically mentions huge, long
> strings but does not actually show any Java code to consume the JSON
> data in any kind of streaming way.
>
> The ObjectBuilder class is a great user of JSONParser, but it just
> builds standard objects and would consume tons of memory in my case.
>
> I know for sure that Solr consumes huge JSON documents and I'm assuming
> that Noggit is being used in that situation, though I have not looked at
> the code used to do that.
>
> Any suggestions?
>
> -chris
>


Re: Solr admin interface freezes on Chrome

2019-10-02 Thread Yonik Seeley
Can someone open a JIRA to track this problem?
-Yonik

On Wed, Oct 2, 2019 at 7:04 PM Solr User  wrote:

> > Works fine on Firefox, and I
> > haven't made any changes to our Solr instance (v8.1.1) in a while.
>
> Had a co-worker with a similar issue. He had a pop-up blocker enabled in
> chrome that was preventing some resource call (or something similar). When
> switching to Firefox everything worked without issue.  Any chance something
> is showing in the developer tools console?
>


Re: Optimizing fq query performance

2019-04-13 Thread Yonik Seeley
A filter that is more constrained but matches the same set of documents just
guarantees that there is more information to evaluate per matched document.
For your specific case, you can optimize fq = 'field1:* AND field2:value'
into two separate filters: fq=field1:*&fq=field2:value
This will at least cause field1:* to be cached and reused if it's a common
pattern.
field1:* is slow in general for indexed fields because all terms for the
field need to be iterated (e.g. does term1 match doc1, does term2 match
doc1, etc)
One can optimize this by indexing a term in a different field to turn it
into a single term query (i.e. exists:field1)
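
For example (a sketch; the "exists" field name is made up): if at index time
every document that has field1 also gets the term exists:field1, the filter
becomes

fq=exists:field1&fq=field2:value

i.e. a single cheap term lookup plus an independently cached filter.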

-Yonik

On Sat, Apr 13, 2019 at 2:58 PM John Davis 
wrote:

> Hi there,
>
> We noticed a sizable performance degradation when we add certain fq filters
> to the query even though the result set does not change between the two
> queries. I would've expected solr to optimize internally by picking the
> most constrained fq filter first, but maybe my understanding is wrong.
> Here's an example:
>
> query1: fq = 'field1:* AND field2:value'
> query2: fq = 'field2:value'
>
> If we assume that the result set is identical between the two queries and
> field1 is in general more frequent in the index, we noticed query1 takes
> 100x longer than query2. In case it matters field1 is of type tlongs while
> field2 is a string.
>
> Any tips for optimizing this?
>
> John
>


Re: Problem with white space or special characters in function queries

2019-03-29 Thread Yonik Seeley
On Thu, Mar 28, 2019 at 6:05 PM Jan Høydahl  wrote:

> Functions can never contain spaces.


Spaces work fine in functions in general.
The issue is the "bf" parameter as it uses whitespace to delimit multiple
functions IIRC.

-Yonik



> Try to substitute the term with a variable, i.e. a request parameter, e.g.
>
>
> bf=if(termfreq(ADSKFeature,$myTerm),log(CaseCount),sqrt(CaseCount))=CUI+(Command)
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 28. mar. 2019 kl. 18:51 skrev shamik :
> >
> > Ahemad, I don't think its related to the field definition, rather looks
> like
> > an inherent bug. For the time being, I created a copyfield which uses a
> > custom regex to remove whitespace and special characters and use it in
> the
> > function. I'll debug the source code and confirm if it's bug, will raise
> a
> > JIRA if needed.
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
>


Re: Solr 7.X negative filter not working

2018-09-20 Thread Yonik Seeley
I just tried the master branch quickly, and I can't reproduce this.

"params":{
  "q":"*:*",
  "debug":"true",
  "fq":"title_t:(NOT Kings)"}},
 [...]
"QParser":"LuceneQParser",
"filter_queries":["title_t:(NOT Kings)"],
"parsed_filter_queries":["-title_t:kings"],

Knowing how the parser works for myfield:(stuff), this is expected (it just
sets the default field to myfield and then parses what's inside the parens
as normal).
So in my example above, all the following parse to the same expression:
-title_t:Kings
NOT title_t:Kings
title_t:(NOT Kings)

Handling pure negative queries currently works at a higher level in Solr
(not the query parser).
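
(Concretely: a top-level fq of -title_t:kings is effectively executed as
(*:* -title_t:kings), while a negative clause nested inside a larger boolean
query does not get that fix-up and matches nothing by itself.)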

Try turning on debug=true and looking at the relevant part of what the
debug info returns.

-Yonik

On Thu, Sep 20, 2018 at 4:05 AM damian.pawski  wrote:

> Hi
> On Solr 5.4.x the query below works fine:
>*
>   
>   "q": "*:*",
>   "_": "1537429094299",
>   "wt": "json",
>   "fq": "JobTitle:(NOT programmer)"
>   ...
> *
>
> , however the same query returns 0 results (I have checked and the index
> contains correct data) in Solr 7.4.1.
>
> I couldn't find anything about this issue in the Solr upgrade pages.
>
> I have tried the query below on Solr 7.4.1:
>  *
>...
>"q":"*:*",
>"fq":"-JobTitle:programmer"
>...*
> and I am getting correct results.
>
> The problematic search "JobTitle:(NOT programmer)" is constructed via C#
> code, so I cannot easily update it to "-JobTitle".
>
>
> Why has the *NOT* query stopped working in Solr 7.4.1? Is there any setting
> that I can switch on to make it work?
>
>
> Thank you
> Damian
>
>
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: CACHE -> fieldValueCache usage

2018-09-20 Thread Yonik Seeley
On Wed, Sep 19, 2018 at 9:44 AM Vincenzo D'Amore  wrote:

> Looking at Solr Admin Panel I've found the CACHE -> fieldValueCache tab
> where all the values are 0.
>
> [...]
>
> what do you think, is that normal?


Yep, that's completely normal.
That cache is only used by certain operations on multi-valued indexed
fields that don't have docValues.  If it ends up not being used, it's fine.

-Yonik


Re: 7.3 appears to leak

2018-06-28 Thread Yonik Seeley
> * SortedIntDocSet instances and ConcurrentLRUCache$CacheEntry instances are
> both leaked on commit;

If these are actually filterCache entries being leaked, it stands to
reason that a whole searcher is being leaked somewhere.

-Yonik


Re: Retrieving json.facet from a search

2018-06-28 Thread Yonik Seeley
There isn't typed support, but you can use the generic support like so:

.getResponse().get("facets")
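
A rough SolrJ sketch (the "categories" facet and "cat" field are made-up
examples; the casts are unchecked by nature of NamedList):

import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.util.NamedList;

SolrQuery q = new SolrQuery("*:*");
q.setRows(0);
q.add("json.facet", "{categories:{type:terms, field:cat}}");
QueryResponse rsp = client.query(q);  // client is your SolrClient

NamedList<Object> facets = (NamedList<Object>) rsp.getResponse().get("facets");
NamedList<Object> cats = (NamedList<Object>) facets.get("categories");
List<NamedList<Object>> buckets = (List<NamedList<Object>>) cats.get("buckets");
for (NamedList<Object> bucket : buckets) {
  System.out.println(bucket.get("val") + " -> " + bucket.get("count"));
}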

-Yonik

On Thu, Jun 28, 2018 at 2:31 PM, Webster Homer  wrote:
> I have a fairly large existing code base for querying Solr. It is
> architected where common code calls solr and returns a solrj QueryResponse
> object.
>
> I'm currently using Solr 7.2; the code interacts with solr using the SolrJ
> client api.
>
> I have a need that would be very easily met by using the json.facet api.
> The problem is that I don't see how to get the json.facet out of a
> QueryResponse object.
>
> There doesn't seem to be a lot of discussion on line about this.
> Is there a way to get the Json object out of the QueryResponse?
>


Re: Solr 7.3, FunctionScoreQuery no longer displays debug output

2018-05-17 Thread Yonik Seeley
If this used to work, I wonder if it's something to do with changes to boost:
https://issues.apache.org/jira/browse/LUCENE-8099

-Yonik


On Thu, May 17, 2018 at 5:48 PM, Markus Jelsma
 wrote:
> Hello,
>
> Sorry to disturb. Is there anyone here able to reproduce and verify this 
> issue?
>
> Many thanks,
> Markus
>
>
>
> -Original message-
>> From:Markus Jelsma 
>> Sent: Wednesday 9th May 2018 18:25
>> To: solr-user 
>> Subject: Solr 7.3, FunctionScoreQuery no longer displays debug output
>>
>> Hi,
>>
>> Is this a known problem? For example, the following query:
>> q=australia&debug=true&boost=if(exists(query($bqlang)),2,1)&bqlang=lang:en&defType=edismax&qf=content_en content_ro
>>
>> returns the following toString for 7.2.1:
>> boost(+(Synonym(content_en:australia content_en:australia) | 
>> Synonym(content_ro:austral 
>> content_ro:australia)),if(exists(query(lang:en,def=0.0)),const(2),const(1)))
>>
>> 7.3:
>> FunctionScoreQuery(+(Synonym(content_en:australia content_en:australia) | 
>> Synonym(content_ro:austral content_ro:australia)), scored by 
>> boost(if(exists(query(lang:en,def=0.0)),const(2),const(1
>>
>> and the following debug output for 7.2.1:
>>
>> 11.226025 = boost((Synonym(content_en:australia content_en:australia) | 
>> Synonym(content_ro:austral 
>> content_ro:australia)),if(exists(query(lang:en,def=0.0)),const(2),const(1))),
>>  product of:
>>   11.226025 = max of:
>> 11.226025 = weight(Synonym(content_ro:austral content_ro:australia) in 
>> 6761) [SchemaSimilarity], result of:
>>   11.226025 = score(doc=6761,freq=18.0 = termFreq=18.0
>> ), product of:
>> 5.442921 = idf(docFreq=193, docCount=44720)
>> 2.0625 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:
>>   18.0 = termFreq=18.0
>>   1.2 = parameter k1
>>   0.0 = parameter b (norms omitted for field)
>>   1.0 = if(exists(query(lang:en,def=0.0)=0.0),const(2),const(1))
>>
>> but for 7.3 i get only:
>>
>> 11.226025 = product of:
>>   1.0 = boost
>>   11.226025 = boost(if(exists(query(lang:en,def=0.0)),const(2),const(1)))
>>
>> The scores are still the same, but the debug output is useless. Removing the 
>> boost fixes the problem of debug output immediately.
>>
>> Thanks,
>> Markus
>>
>>


Re: Error using multiple terms in function query

2018-05-15 Thread Yonik Seeley
Problems like this are usually caused by the whole query not even
making it to Solr due to bad HTTP param encoding.
For example, if you're using curl with request parameters in the URL,
you need to manually encode spaces as either "+" or "%20"
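
For example, either encode the value yourself:

curl 'http://localhost:8983/solr/mycollection/select?q=*:*&bf=if(termfreq(ProductLine,%27Test%20Product%27),5,0)'

or let curl do the encoding by POSTing the parameters (which Solr also
accepts):

curl http://localhost:8983/solr/mycollection/select --data-urlencode 'q=*:*' \
  --data-urlencode "bf=if(termfreq(ProductLine,'Test Product'),5,0)"

(Sketches only; "mycollection" is a placeholder.)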

-Yonik


On Tue, May 15, 2018 at 7:41 PM, Shamik Bandopadhyay  wrote:
> Hi,
>
>   I'm having issues using multiple terms in Solr function queries, e.g.
> I'm trying to use the following bf function using termfreq
>
> bf=if(termfreq(ProductLine,'Test Product'),5,0)
>
> This throws  org.apache.solr.search.SyntaxError: Missing end to unquoted
> value starting at 28 str='if(termfreq(ProductLine,Test'
>
> If I use only Test or Product i.e. a single term, the function works as
> expected. I'm seeing a similar problem with other functions like
> docfreq,ttf,tf,idf,etc. I'm using 6.6 but verified similar issue in 5 as
> well.
>
> Just wondering if this is an existing issue or something not supported by
> Solr. Is there an alternate way to check multiple terms in a function?
>
> Any pointers will be appreciated.


Re: Solr Json Facet

2018-05-08 Thread Yonik Seeley
Looks like some sort of proxy server in between the python client and
solr server.
I would still check first if the output from the python client is
correctly escaped/encoded HTTP.

One easy way is to use netcat to pretend to be a server:
$ nc -l 8983
And then point the python client at that and send the request.

-Yonik


On Tue, May 8, 2018 at 9:17 PM, Kojo <rbsnk...@gmail.com> wrote:
> Thank you all. I tried escaping but still not working
>
> Yonik, I am using Python Requests. It works if my fq is a single word, even
> if I use double quotes on this single word without escaping.
>
> This is the HTTP response:
>
> response.content
> 
> '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>400 Bad
> Request</title>\n</head><body>\n<h1>Bad Request</h1>\n<p>Your browser sent
> a request that this server could not understand.<br />\n</p>\n<hr>\n<address>Apache/2.2.15 (Oracle) Server at leydenh Port
> 80</address>\n</body></html>\n'
>
>
> Thank you,
>
>
>
> 2018-05-08 18:46 GMT-03:00 Yonik Seeley <ysee...@gmail.com>:
>
>> On Tue, May 8, 2018 at 1:36 PM, Kojo <rbsnk...@gmail.com> wrote:
>> > If I tag the fq query and I query for a simple word it works fine too. But
>> > if I query a multi-word value with a space in the middle it breaks:
>>
>> Most likely the full query is not getting to Solr because of an HTTP
>> protocol error (i.e. the request is not encoded correctly).
>> How are you sending your request to Solr (with curl, or with some other
>> method?)
>>
>> -Yonik
>>


Re: Solr Json Facet

2018-05-08 Thread Yonik Seeley
On Tue, May 8, 2018 at 1:36 PM, Kojo  wrote:
> If I tag the fq query and I query for a simple word it works fine too. But
> if I query a multi-word value with a space in the middle it breaks:

Most likely the full query is not getting to Solr because of an HTTP
protocol error (i.e. the request is not encoded correctly).
How are you sending your request to Solr (with curl, or with some other method?)

-Yonik


Re: Error in indexing JSON with space in value

2018-03-22 Thread Yonik Seeley
Ah, there's the extra bit of context:
> PS C:\curl> .\curl '

You're using Windows perhaps?  If so, it's probably a shell issue
getting all of the data to the "curl" command.
Something like cygwin or WSL (Windows Subsystem for Linux) may make
your life easier.
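
Another option that sidesteps shell quoting entirely is to put the JSON in a
file and send the file instead, e.g. (a sketch reusing your URL):

curl 'http://localhost:8983/edm/emails6/update/json/docs?split=/|/orgs' -H 'Content-type:application/json' --data-binary @doc.json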

-Yonik


On Thu, Mar 22, 2018 at 6:45 PM, Zheng Lin Edwin Yeo
<edwinye...@gmail.com> wrote:
> Thanks for your reply.
>
>
>
> PS C:\curl> .\curl 'http://localhost:8983/edm/emails6/update/json/docs?split=/|/orgs'
> -H 'Content-type:application/json'
> -d ' {   "id":"1",   "name_s": "Joe Smith",   "phone_s": 876876687,
>   "orgs": [ {   "name1_s": "Microsoft",   "city_s": "Seattle",   "zip_s": 98052},
> {   "name1_s": "Apple",   "city_s": "Cupertino",   "zip_s": 95014}   ] }' --trace -
>
> == Info:   Trying ::1...
> == Info: TCP_NODELAY set
> == Info: Connected to localhost (::1) port 8983 (#0)
> => Send header, 172 bytes (0xac)
> : 50 4f 53 54 20 2f 65 64 6d 2f 65 6d 61 69 6c 73 POST /edm/emails
> 0010: 36 2f 75 70 64 61 74 65 2f 6a 73 6f 6e 2f 64 6f 6/update/json/do
> 0020: 63 73 3f 73 70 6c 69 74 3d 2f 7c 2f 6f 72 67 73 cs?split=/|/orgs
> 0030: 20 48 54 54 50 2f 31 2e 31 0d 0a 48 6f 73 74 3a  HTTP/1.1..Host:
> 0040: 20 6c 6f 63 61 6c 68 6f 73 74 3a 38 39 38 33 0d  localhost:8983.
> 0050: 0a 55 73 65 72 2d 41 67 65 6e 74 3a 20 63 75 72 .User-Agent: cur
> 0060: 6c 2f 37 2e 35 32 2e 31 0d 0a 41 63 63 65 70 74 l/7.52.1..Accept
> 0070: 3a 20 2a 2f 2a 0d 0a 43 6f 6e 74 65 6e 74 2d 74 : */*..Content-t
> 0080: 79 70 65 3a 61 70 70 6c 69 63 61 74 69 6f 6e 2f ype:application/
> 0090: 6a 73 6f 6e 0d 0a 43 6f 6e 74 65 6e 74 2d 4c 65 json..Content-Le
> 00a0: 6e 67 74 68 3a 20 32 34 0d 0a 0d 0a ngth: 24
> => Send data, 24 bytes (0x18)
> : 20 7b 20 20 20 69 64 3a 31 2c 20 20 20 6e 61 6d  {   id:1,   nam
> 0010: 65 5f 73 3a 20 4a 6f 65 e_s: Joe
> == Info: upload completely sent off: 24 out of 24 bytes
> <= Recv header, 26 bytes (0x1a)
> : 48 54 54 50 2f 31 2e 31 20 34 30 30 20 42 61 64 HTTP/1.1 400 Bad
> 0010: 20 52 65 71 75 65 73 74 0d 0aRequest..
> <= Recv header, 40 bytes (0x28)
> : 43 6f 6e 74 65 6e 74 2d 54 79 70 65 3a 20 74 65 Content-Type: te
> 0010: 78 74 2f 70 6c 61 69 6e 3b 63 68 61 72 73 65 74 xt/plain;charset
> 0020: 3d 75 74 66 2d 38 0d 0a =utf-8..
> <= Recv header, 21 bytes (0x15)
> : 43 6f 6e 74 65 6e 74 2d 4c 65 6e 67 74 68 3a 20 Content-Length:
> 0010: 33 32 33 0d 0a  323..
> <= Recv header, 2 bytes (0x2)
> : 0d 0a   ..
> <= Recv data, 323 bytes (0x143)
> : 7b 0a 20 20 22 72 65 73 70 6f 6e 73 65 48 65 61 {.  "responseHea
> 0010: 64 65 72 22 3a 7b 0a 20 20 20 20 22 73 74 61 74 der":{."stat
> 0020: 75 73 22 3a 34 30 30 2c 0a 20 20 20 20 22 51 54 us":400,."QT
> 0030: 69 6d 65 22 3a 30 7d 2c 0a 20 20 22 65 72 72 6f ime":0},.  "erro
> 0040: 72 22 3a 7b 0a 20 20 20 20 22 6d 65 74 61 64 61 r":{."metada
> 0050: 74 61 22 3a 5b 0a 20 20 20 20 20 20 22 65 72 72 ta":[.  "err
> 0060: 6f 72 2d 63 6c 61 73 73 22 2c 22 6f 72 67 2e 61 or-class","org.a
> 0070: 70 61 63 68 65 2e 73 6f 6c 72 2e 63 6f 6d 6d 6f pache.solr.commo
> 0080: 6e 2e 53 6f 6c 72 45 78 63 65 70 74 69 6f 6e 22 n.SolrException"
> 0090: 2c 0a 20 20 20 20 20 20 22 72 6f 6f 74 2d 65 72 ,.  "root-er
> 00a0: 72 6f 72 2d 63 6c 61 73 73 22 2c 22 6f 72 67 2e ror-class","org.
> 00b0: 61 70 61 63 68 65 2e 73 6f 6c 72 2e 63 6f 6d 6d apache.solr.comm
> 00c0: 6f 6e 2e 53 6f 6c 72 45 78 63 65 70 74 69 6f 6e on.SolrException
> 00d0: 22 5d 2c 0a 20 20 20 20 22 6d 73 67 22 3a 22 43 "],."msg":"C
> 00e0: 61 6e 6e 6f 74 20 70 61 72 73 65 20 70 72 6f 76 annot parse prov
> 00f0: 69 64 65 64 20 4a 53 4f 4e 3a 20 45 78 70 65 63 ided JSON: Expec
> 0100: 74 65 64 20 27 2c 27 20 6f 72 20 27 7d 27 3a 20 ted ',' or '}':
> 0110: 63 68 61 72 3d 28 45 4f 46 29 2c 70 6f 73 69 74 char=(EOF),posit
> 0120: 69 6f 6e 3d 32 34 20 41 46 54 45 52 3d 27 27 22 ion=24 AFTER=''"
> 0130: 2c 0a 20 20 20 20 22 63 6f 64 65 22 3a 34 30 30 ,."code":400
> 0140: 7d 7d 0a}}.
> {
>   "responseHeader":{
>     "status":400,
> "QTime":0},
>   "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "r

Re: Error in indexing JSON with space in value

2018-03-22 Thread Yonik Seeley
It looks like a curl globbing issue from the curl error message you included:
"curl: (3) [globbing] bad range specification in column 39"

You can try turning off curl globbing with the -g param.
That may not be the only issue though, as the command shown shouldn't
have triggered curl globbing.  Perhaps you simplified it or redacted
some info before posting?

-Yonik


Re: Issue Using JSON Facet API Buckets in Solr 6.6

2018-02-22 Thread Yonik Seeley
I've reproduced the issue and opened
https://issues.apache.org/jira/browse/SOLR-12020

-Yonik



On Thu, Feb 22, 2018 at 11:03 AM, Yonik Seeley <ysee...@gmail.com> wrote:
> Thanks Antelmo, I'm trying to reproduce this now.
> -Yonik
>
>
> On Mon, Feb 19, 2018 at 10:13 AM, Antelmo Aguilar <aagui...@nd.edu> wrote:
>> Hi all,
>>
>> I was wondering if the information I sent is sufficient to look into the
>> issue.  Let me know if you need anything else from me please.
>>
>> Thanks,
>> Antelmo
>>
>> On Thu, Feb 15, 2018 at 1:56 PM, Antelmo Aguilar <aagui...@nd.edu> wrote:
>>
>>> Hi,
>>>
>>> Here are two pastebins.  The first is the full complete response with the
>>> search parameters used.  The second is the stack trace from the logs:
>>>
>>> https://pastebin.com/rsHvKK63
>>>
>>> https://pastebin.com/8amxacAj
>>>
>>> I am not using any custom code or plugins with the Solr instance.
>>>
>>> Please let me know if you need anything else and thanks for looking into
>>> this.
>>>
>>> -Antelmo
>>>
>>> On Wed, Feb 14, 2018 at 12:56 PM, Yonik Seeley <ysee...@gmail.com> wrote:
>>>
>>>> Could you provide the full stack trace containing "Invalid Date
>>>> String"  and the full request that causes it?
>>>> Are you using any custom code/plugins in Solr?
>>>> -Yonik
>>>>
>>>>
>>>> On Mon, Feb 12, 2018 at 4:55 PM, Antelmo Aguilar <aagui...@nd.edu> wrote:
>>>> > Hi,
>>>> >
>>>> > I was using the following part of a query to get facet buckets so that I
>>>> > can use the information in the buckets for some post-processing:
>>>> >
>>>> > "json":
>>>> > "{\"filter\":[\"bundle:pop_sample\",\"has_abundance_data_b:true\",\"has_geodata:true\",\"${project}\"],\"facet\":{\"term\":{\"type\":\"terms\",\"limit\":-1,\"field\":\"${term:species_category}\",\"facet\":{\"collection_dates\":{\"type\":\"terms\",\"limit\":-1,\"field\":\"collection_date\",\"facet\":{\"collection\":
>>>> > {\"type\":\"terms\",\"field\":\"collection_assay_id_s\",\"facet\":{\"abnd\":\"sum(div(sample_size_i,
>>>> > collection_duration_days_i))\""
>>>> >
>>>> > Sorry if it is hard to read.  Basically what it was doing was getting
>>>> > the following buckets:
>>>> >
>>>> > First bucket will be categorized by "Species category" by default
>>>> > unless we pass in the request the "term" parameter, in which case we
>>>> > categorize the first bucket by whatever "term" is set to.  Then inside
>>>> > this first bucket, we create another set of buckets of the "Collection
>>>> > date" category.  Then inside the "Collection date" category buckets, we
>>>> > would use some functions to do some calculations and return those
>>>> > calculations inside the "Collection date" category buckets.
>>>> >
>>>> > This query is working fine in Solr 6.2, but I upgraded our instance of
>>>> > Solr 6.2 to the latest 6.6 version.  However it seems that upgrading to
>>>> > Solr 6.6 broke the above query.  Now it complains when trying to create
>>>> > the buckets of the "Collection date" category.  I get the following
>>>> > error:
>>>> >
>>>> > Invalid Date String:'Fri Aug 01 00:00:00 UTC 2014'
>>>> >
>>>> > It seems that when creating the buckets of a date field, it does some
>>>> > conversion of the way the date is stored and causes the error to
>>>> > appear.  Does anyone have an idea as to why this error is happening?  I
>>>> > would really appreciate any help.  Hopefully I was able to explain my
>>>> > issue well.
>>>> >
>>>> > Thanks,
>>>> > Antelmo
>>>>
>>>
>>>


Re: Issue Using JSON Facet API Buckets in Solr 6.6

2018-02-22 Thread Yonik Seeley
Thanks Antelmo, I'm trying to reproduce this now.
-Yonik


On Mon, Feb 19, 2018 at 10:13 AM, Antelmo Aguilar <aagui...@nd.edu> wrote:
> Hi all,
>
> I was wondering if the information I sent is sufficient to look into the
> issue.  Let me know if you need anything else from me please.
>
> Thanks,
> Antelmo
>
> On Thu, Feb 15, 2018 at 1:56 PM, Antelmo Aguilar <aagui...@nd.edu> wrote:
>
>> Hi,
>>
>> Here are two pastebins.  The first is the full complete response with the
>> search parameters used.  The second is the stack trace from the logs:
>>
>> https://pastebin.com/rsHvKK63
>>
>> https://pastebin.com/8amxacAj
>>
>> I am not using any custom code or plugins with the Solr instance.
>>
>> Please let me know if you need anything else and thanks for looking into
>> this.
>>
>> -Antelmo
>>
>> On Wed, Feb 14, 2018 at 12:56 PM, Yonik Seeley <ysee...@gmail.com> wrote:
>>
>>> Could you provide the full stack trace containing "Invalid Date
>>> String"  and the full request that causes it?
>>> Are you using any custom code/plugins in Solr?
>>> -Yonik
>>>
>>>
>>> On Mon, Feb 12, 2018 at 4:55 PM, Antelmo Aguilar <aagui...@nd.edu> wrote:
>>> > Hi,
>>> >
>>> > I was using the following part of a query to get facet buckets so that I
>>> > can use the information in the buckets for some post-processing:
>>> >
>>> > "json":
>>> > "{\"filter\":[\"bundle:pop_sample\",\"has_abundance_data_b:true\",\"has_geodata:true\",\"${project}\"],\"facet\":{\"term\":{\"type\":\"terms\",\"limit\":-1,\"field\":\"${term:species_category}\",\"facet\":{\"collection_dates\":{\"type\":\"terms\",\"limit\":-1,\"field\":\"collection_date\",\"facet\":{\"collection\":
>>> > {\"type\":\"terms\",\"field\":\"collection_assay_id_s\",\"facet\":{\"abnd\":\"sum(div(sample_size_i,
>>> > collection_duration_days_i))\""
>>> >
>>> > Sorry if it is hard to read.  Basically what it was doing was getting
>>> > the following buckets:
>>> >
>>> > First bucket will be categorized by "Species category" by default
>>> > unless we pass in the request the "term" parameter, in which case we
>>> > categorize the first bucket by whatever "term" is set to.  Then inside
>>> > this first bucket, we create another set of buckets of the "Collection
>>> > date" category.  Then inside the "Collection date" category buckets, we
>>> > would use some functions to do some calculations and return those
>>> > calculations inside the "Collection date" category buckets.
>>> >
>>> > This query is working fine in Solr 6.2, but I upgraded our instance of
>>> > Solr 6.2 to the latest 6.6 version.  However it seems that upgrading to
>>> > Solr 6.6 broke the above query.  Now it complains when trying to create
>>> > the buckets of the "Collection date" category.  I get the following
>>> > error:
>>> >
>>> > Invalid Date String:'Fri Aug 01 00:00:00 UTC 2014'
>>> >
>>> > It seems that when creating the buckets of a date field, it does some
>>> > conversion of the way the date is stored and causes the error to
>>> > appear.  Does anyone have an idea as to why this error is happening?  I
>>> > would really appreciate any help.  Hopefully I was able to explain my
>>> > issue well.
>>> >
>>> > Thanks,
>>> > Antelmo
>>>
>>
>>


Re: facet.method=uif not working in solr cloud?

2018-02-15 Thread Yonik Seeley
On Wed, Feb 14, 2018 at 7:24 PM, Wei <weiwan...@gmail.com> wrote:
> Thanks Yonik. If uif has a big upfront cost when it hits solr the first time,
> in solr cloud the same faceting request could hit different replicas in the
> same shard, so that cost will happen at least for the number of replicas?
> If we are doing frequent auto commits, fieldvaluecache will be invalidated
> and uif will have to pay the upfront cost again after each commit?

Right.  It's not good for frequently changing indexes.

-Yonik

>
>
> On Wed, Feb 14, 2018 at 11:51 AM, Yonik Seeley <ysee...@gmail.com> wrote:
>
>> On Wed, Feb 14, 2018 at 2:28 PM, Wei <weiwan...@gmail.com> wrote:
>> > Thanks all!   It's really great learning.  A bit off the topic, after I
>> > enabled facet.method = uif in solr cloud,  the faceting performance is
> actually much worse than the original fc (~1000 ms with uif vs ~200 ms
>> > with fc). My cloud has 8 shards with 6 replicas in each shard.  I do see
>> > that fieldValueCache is getting utilized.  Any reason uif could be so
>> > slow?
>>
>> I haven't seen that before.  Are you sure it's not the first time
>> faceting on a field?  uif has a big upfront cost, but is usually faster
>> once that cost has been paid.
>>
>>
>> -Yonik
>>
>> > On Tue, Feb 13, 2018 at 7:41 AM, Yonik Seeley <ysee...@gmail.com> wrote:
>> >
>> >> Great, thanks for tracking that down!
>> >> It's interesting that a mincount of 0 disables uif processing in the
>> >> first place.  IIRC, it's only the hash-based method (as opposed to
>> >> array-based) that can't return zero counts.
>> >>
>> >> -Yonik
>> >>
>> >>
>> >> On Tue, Feb 13, 2018 at 6:17 AM, Alessandro Benedetti
>> >> <a.benede...@sease.io> wrote:
>> >> > *Update* : This has been actually already solved by Hoss.
>> >> >
>> >> > https://issues.apache.org/jira/browse/SOLR-11711 and this is the Pull
>> >> > Request : https://github.com/apache/lucene-solr/pull/279/files
>> >> >
>> >> > This should go live with 7.3
>> >> >
>> >> > Cheers
>> >> >
>> >> >
>> >> >
>> >> > -
>> >> > ---
>> >> > Alessandro Benedetti
>> >> > Search Consultant, R&D Software Engineer, Director
>> >> > Sease Ltd. - www.sease.io
>> >> > --
>> >> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>> >>
>>


Re: facet.method=uif not working in solr cloud?

2018-02-14 Thread Yonik Seeley
On Wed, Feb 14, 2018 at 2:28 PM, Wei <weiwan...@gmail.com> wrote:
> Thanks all!   It's really great learning.  A bit off the topic, after I
> enabled facet.method = uif in solr cloud,  the faceting performance is
> actually much worse than the original fc (~1000 ms with uif vs ~200 ms
> with fc). My cloud has 8 shards with 6 replicas in each shard.  I do see
> that fieldValueCache is getting utilized.  Any reason uif could be so
> slow?

I haven't seen that before.  Are you sure it's not the first time
faceting on a field?  uif has a big upfront cost, but is usually faster
once that cost has been paid.


-Yonik

> On Tue, Feb 13, 2018 at 7:41 AM, Yonik Seeley <ysee...@gmail.com> wrote:
>
>> Great, thanks for tracking that down!
>> It's interesting that a mincount of 0 disables uif processing in the
>> first place.  IIRC, it's only the hash-based method (as opposed to
>> array-based) that can't return zero counts.
>>
>> -Yonik
>>
>>
>> On Tue, Feb 13, 2018 at 6:17 AM, Alessandro Benedetti
>> <a.benede...@sease.io> wrote:
>> > *Update* : This has been actually already solved by Hoss.
>> >
>> > https://issues.apache.org/jira/browse/SOLR-11711 and this is the Pull
>> > Request : https://github.com/apache/lucene-solr/pull/279/files
>> >
>> > This should go live with 7.3
>> >
>> > Cheers
>> >
>> >
>> >
>> > -
>> > ---
>> > Alessandro Benedetti
>> > Search Consultant, R&D Software Engineer, Director
>> > Sease Ltd. - www.sease.io
>> > --
>> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>


Re: Issue Using JSON Facet API Buckets in Solr 6.6

2018-02-14 Thread Yonik Seeley
Could you provide the full stack trace containing "Invalid Date
String"  and the full request that causes it?
Are you using any custom code/plugins in Solr?
-Yonik


On Mon, Feb 12, 2018 at 4:55 PM, Antelmo Aguilar  wrote:
> Hi,
>
> I was using the following part of a query to get facet buckets so that I
> can use the information in the buckets for some post-processing:
>
> "json":
> "{\"filter\":[\"bundle:pop_sample\",\"has_abundance_data_b:true\",\"has_geodata:true\",\"${project}\"],\"facet\":{\"term\":{\"type\":\"terms\",\"limit\":-1,\"field\":\"${term:species_category}\",\"facet\":{\"collection_dates\":{\"type\":\"terms\",\"limit\":-1,\"field\":\"collection_date\",\"facet\":{\"collection\":
> {\"type\":\"terms\",\"field\":\"collection_assay_id_s\",\"facet\":{\"abnd\":\"sum(div(sample_size_i,
> collection_duration_days_i))\""
>
> Sorry if it is hard to read.  Basically what it was doing was getting the
> following buckets:
>
> First bucket will be categorized by "Species category" by default unless we
> pass in the request the "term" parameter, in which case we categorize the
> first bucket by whatever "term" is set to.  Then inside this first bucket, we
> create another set of buckets of the "Collection date" category.  Then inside the
> "Collection date" category buckets, we would use some functions to do some
> calculations and return those calculations inside the "Collection date"
> category buckets.
>
> This query is working fine in Solr 6.2, but I upgraded our instance of Solr
> 6.2 to the latest 6.6 version.  However it seems that upgrading to Solr 6.6
> broke the above query.  Now it complains when trying to create the buckets
> of the "Collection date" category.  I get the following error:
>
> Invalid Date String:'Fri Aug 01 00:00:00 UTC 2014'
>
> It seems that when creating the buckets of a date field, it does some
> conversion of the way the date is stored and causes the error to appear.
> Does anyone have an idea as to why this error is happening?  I would really
> appreciate any help.  Hopefully I was able to explain my issue well.
>
> Thanks,
> Antelmo


Re: facet.method=uif not working in solr cloud?

2018-02-13 Thread Yonik Seeley
Great, thanks for tracking that down!
It's interesting that a mincount of 0 disables uif processing in the
first place.  IIRC, it's only the hash-based method (as opposed to
array-based) that can't return zero counts.

-Yonik


On Tue, Feb 13, 2018 at 6:17 AM, Alessandro Benedetti
 wrote:
> *Update* : This has been actually already solved by Hoss.
>
> https://issues.apache.org/jira/browse/SOLR-11711 and this is the Pull
> Request : https://github.com/apache/lucene-solr/pull/279/files
>
> This should go live with 7.3
>
> Cheers
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: facet.method=uif not working in solr cloud?

2018-02-12 Thread Yonik Seeley
Feels like we should open an issue for this (that facet.method=uif is
only respected if you specify another esoteric parameter...)
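
i.e. with the current behavior, a request needs something like this sketch
(params taken from this thread):

facet=true&facet.field=color&facet.method=uif&facet.mincount=1&facet.distrib.mco=true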

-Yonik


On Mon, Feb 12, 2018 at 8:34 PM, Wei  wrote:
> Adding facet.distrib.mco=true did the trick.  Thanks Toke and Alessandro!
>
> Cheers,
> Wei
>
> On Thu, Feb 8, 2018 at 1:23 AM, Toke Eskildsen  wrote:
>
>> On Fri, 2018-02-02 at 17:40 -0800, Wei wrote:
>> > I tried to debug a bit and see that when executing on a cloud solr
>> > server, although I put
>> > facet.field=color&q=*:*&facet.method=uif&facet.mincount=1 in
>> > the request url, at the point it reaches SimpleFacet inside
>> > req.params it somehow has been rewritten
>> > to f.color.facet.mincount=0, no wonder the
>> > method chosen becomes FC. So one myth solved; but the new myth is why
>> > the facet.mincount is overridden to 0 in the solr req?
>>
>> AFAIK, it is due to an attempt of optimisation for distributed
>> faceting. The relevant JIRA seems to be https://issues.apache.org/jira/
>> browse/SOLR-8988
>>
>> Try setting facet.distrib.mco=true
>>
>> - Toke Eskildsen, Royal Danish Library
>>
>>


Re: Solr4 To Solr6 CPU load issues

2018-02-12 Thread Yonik Seeley
On Sun, Feb 11, 2018 at 8:47 AM, ~$alpha`  wrote:
> I have upgraded Solr 4.0 Beta to Solr 6.6. The cache results look awesome, but
> overall the CPU load on Solr 6.6 is double the load on Solr 4.0, and hence I am
> not able to roll Solr 6.6 out to 100% of my traffic.
>
> *Some Key Stats In Performance of Solr6 Vs Solr4*
> Document cache usage increased to .98 from .14
> Query Result cache usage changed to .10 from .24
> Filter cache stayed the same at .94
> Field Value cache was 0.99 in solr4 but n/a in solr6 (I guess because the
> multivalued field concept changed from solr4 to solr6)

It could be faceting... "uif" was the default faceting method in Solr
4 (which used fieldValueCache), which was good for facet performance
but bad for heap usage and index turnaround time (NRT).
It was removed, but then later re-added in the JSON Facet API.  Access
to that was hooked into the older facet API via facet.method=uif
https://issues.apache.org/jira/browse/SOLR-8466

-Yonik


Re: Solr 7.2.1 - cursorMark and elevateIds

2018-01-25 Thread Yonik Seeley
Yes, please open a JIRA issue.
The elevate component modifies the sort parameter, and it looks like
that doesn't play well with cursorMark, which needs to
serialize/deserialize sort values.
We can either fix the issue, or at a minimum provide a better error
message if cursorMark is limited to sorting on "normal" fields only.

-Yonik


On Wed, Jan 24, 2018 at 3:19 PM, Greg Roodt  wrote:
> Given the technical nature of this problem? Do you think I should try
> raising this on the developer group or raising a bug?
>
>
>
> On 24 January 2018 at 12:36, Greg Roodt  wrote:
>
>> Hi
>>
>> I'm trying to use the Query Eleveation Component in conjunction with
>> CursorMark pagination. It doesn't seem to work. I get an exception. Are
>> these components meant to work together?
>>
>> This works:
>> enableElevation=true=true=MAAMNqFV1dg
>>
>> This fails:
>> cursorMark=*=true=true&
>> elevateIds=MAAMNqFV1dg
>>
>> Here is the stacktrace:
>>
>> """
>> 'trace'=>'java.lang.ClassCastException: java.lang.Integer cannot be cast
>> to org.apache.lucene.util.BytesRef at org.apache.solr.schema.FieldType.
>> marshalStringSortValue(FieldType.java:1127) at org.apache.solr.schema.
>> StrField.marshalSortValue(StrField.java:100) at org.apache.solr.search.
>> CursorMark.getSerializedTotem(CursorMark.java:250) at
>> org.apache.solr.handler.component.QueryComponent.doProcessUngroupedSearch(QueryComponent.java:1445)
>> at org.apache.solr.handler.component.QueryComponent.
>> process(QueryComponent.java:375) at org.apache.solr.handler.
>> component.SearchHandler.handleRequestBody(SearchHandler.java:303) at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503) at
>> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710) at
>> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516) at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
>> at 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
>> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.
>> doFilter(ServletHandler.java:1751) at org.eclipse.jetty.servlet.
>> ServletHandler.doHandle(ServletHandler.java:582) at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>> at 
>> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>> at org.eclipse.jetty.server.session.SessionHandler.
>> doHandle(SessionHandler.java:226) at org.eclipse.jetty.server.
>> handler.ContextHandler.doHandle(ContextHandler.java:1180) at
>> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>> at org.eclipse.jetty.server.session.SessionHandler.
>> doScope(SessionHandler.java:185) at org.eclipse.jetty.server.
>> handler.ContextHandler.doScope(ContextHandler.java:1112) at
>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(
>> ContextHandlerCollection.java:213) at org.eclipse.jetty.server.
>> handler.HandlerCollection.handle(HandlerCollection.java:119) at
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>> at 
>> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
>> at 
>> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>> at org.eclipse.jetty.server.Server.handle(Server.java:534) at
>> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320) at
>> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>> at 
>> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108) at
>> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>> at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.
>> executeProduceConsume(ExecuteProduceConsume.java:303) at
>> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.
>> produceConsume(ExecuteProduceConsume.java:148) at
>> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(
>> ExecuteProduceConsume.java:136) at org.eclipse.jetty.util.thread.
>> QueuedThreadPool.runJob(QueuedThreadPool.java:671) at
>> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>> at java.lang.Thread.run(Thread.java:748)
>> """
>>
>> Any idea what's going wrong?
>>
>> Greg
>>
>>


Re: Json Facet Query Stripping Field Name with Hyphen

2018-01-04 Thread Yonik Seeley
The JSON Facet API uses the function query parser for something like
sum(week_-91) so you'll probably have problems with any function that
uses these fields as well.
As Erick says, you're better off renaming the fields.  There is a
workaround for wonky field names via the "field" function:
sum(field(week_-91))
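
So the failing request below should work as something like:

http://localhost:8983/solr/collection1/query?q=*:*&json.facet={week_-91_sum:'sum(field(week_-91))'}&rows=0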

-Yonik


On Thu, Jan 4, 2018 at 10:02 AM, RAUNAK AGRAWAL
 wrote:
> Hi Guys,
>
> I am facing an issue while trying to follow the JSON Facet API. I have
> data in my collection and field names are like "week_0", "week_-1" which
> means current week and previous week respectively.
>
> When I am querying for week_0 summation using the following query I am able
> to get the result.
>
> http://localhost:8983/solr/collection1/query?q=*:*&json.facet={week_0_sum:'sum(week_0)'}&rows=0
>
>
> But when I am trying to do the same for any field "week_-*", it breaks.
>
> For example when I am trying:
> http://localhost:8983/solr/collection1/query?q=*:*&json.facet={week_-91_sum:%27sum(week_-91)%27}&rows=0
>
>
> I am getting the exception as* "msg": "undefined field: \"week_\"''*
>
>
> That means solr is stripping the field name after the hyphen (-). Is there a
> workaround to fix this? I tried adding an escape character (\) but it is of no
> help.
>
> With escape:
> http://localhost:8983/solr/collection1/query?q=*:*&json.facet={week_-91_sum:%27sum(week_\-91)%27}&rows=0
>
>
> Please help me regarding this.
>
> Thanks


Re: Solr Aggregation queries are way slower than Elastic Search

2017-12-12 Thread Yonik Seeley
On Tue, Dec 12, 2017 at 9:17 AM, RAUNAK AGRAWAL
<agrawal.rau...@gmail.com> wrote:
> Hi Yonik,
>
> So if the query is fine then I guess even using JSON Facet API will not
> help me here.

As Joel mentioned, it's completely different code than the old stats API.
This is a very simple use-case, so if we're slower than ES for some
reason, it should be very easy to fix.

-Yonik


> On Tue, Dec 12, 2017 at 7:27 PM, Yonik Seeley <ysee...@gmail.com> wrote:
>
>> OK great, so it's definitely not the main query (which is just a
>> single term query in this example!)
>>
>> > Also I have looked into the JSON Facet API. If I have to use facets, I
>> > will have to define 3600 facets in a single query, and I guess that would
>> > also be slow.
>>
>> You can ask for any number of stats for a given facet (even the root
>> facet bucket w/o faceting on any fields):
>>
>> curl 'http://localhost:8983/solr/collection1/query?q=variable1:290&rows=0&json.facet={
>>   s1:"sum(metric_1)",
>>   s2:"sum(metric_2)",
>>   s3:"sum(metric_3)"
>> }'
>>
>> -Yonik
>>
>>
>> On Tue, Dec 12, 2017 at 5:40 AM, RAUNAK AGRAWAL
>> <agrawal.rau...@gmail.com> wrote:
>> > Hi Yonik,
>> >
>> > As you asked here is the code snippet and the actual solr query. Please
>> > have a look. I have included only 104 metrics but like this we can go
>> upto
>> > 3600 rollups.
>> >
>> > Also I have looked into the JSON Facet API. If I have to use facets, I
>> > will have to define 3600 facets in a single query, and I guess that would
>> > also be slow. Also, is there any max limit on the number of facets we can
>> > define in a single query?
>> >
>> > Code snippet:
>> >
>> > private SolrQuery buildQuery(Integer variable1, List<String> metrics) {
>> > SolrQuery query = new SolrQuery();
>> > query.set("q", "variable1:" + variable1);
>> > query.setRows(0);
>> > metrics.forEach(
>> > metric -> query.setGetFieldStatistics("{!sum=true }" + metric)
>> > );
>> > return query;
>> > }
>> >
>> >
>> > The generated query:
>> >
>> > {! q=variable1:290 rows=0 stats=true stats.field='{!sum=true
>> > }metric_1' stats.field='{!sum=true }metric_2' stats.field='{!sum=true
>> > }metric_3' stats.field='{!sum=true }metric_4' stats.field='{!sum=true
>> > }metric_5' stats.field='{!sum=true }metric_6' stats.field='{!sum=true
>> > }metric_7' stats.field='{!sum=true }metric_8' stats.field='{!sum=true
>> > }metric_9' stats.field='{!sum=true }metric_10' stats.field='{!sum=true
>> > }metric_11' stats.field='{!sum=true }metric_12'
>> > stats.field='{!sum=true }metric_13' stats.field='{!sum=true
>> > }metric_14' stats.field='{!sum=true }metric_15'
>> > stats.field='{!sum=true }metric_16' stats.field='{!sum=true
>> > }metric_17' stats.field='{!sum=true }metric_18'
>> > stats.field='{!sum=true }metric_19' stats.field='{!sum=true
>> > }metric_20' stats.field='{!sum=true }metric_21'
>> > stats.field='{!sum=true }metric_22' stats.field='{!sum=true
>> > }metric_23' stats.field='{!sum=true }metric_24'
>> > stats.field='{!sum=true }metric_25' stats.field='{!sum=true
>> > }metric_26' stats.field='{!sum=true }metric_27'
>> > stats.field='{!sum=true }metric_28' stats.field='{!sum=true
>> > }metric_29' stats.field='{!sum=true }metric_30'
>> > stats.field='{!sum=true }metric_31' stats.field='{!sum=true
>> > }metric_32' stats.field='{!sum=true }metric_33'
>> > stats.field='{!sum=true }metric_34' stats.field='{!sum=true
>> > }metric_35' stats.field='{!sum=true }metric_36'
>> > stats.field='{!sum=true }metric_37' stats.field='{!sum=true
>> > }metric_38' stats.field='{!sum=true }metric_39'
>> > stats.field='{!sum=true }metric_40' stats.field='{!sum=true
>> > }metric_41' stats.field='{!sum=true }metric_42'
>> > stats.field='{!sum=true }metric_43' stats.field='{!sum=true
>> > }metric_44' stats.field='{!sum=true }metric_45'
>> > stats.field='{!sum=true }metric_46' stats.field='{!sum=true
>> > }metric_47' stats.field='{!sum=true }metric_48'
>> > stats.field='{!sum=true }metric_49' stats.field='{!sum=true
>> > }metric_50' stats.field='{!sum=true }metric_51'
>> > stats.field='{!sum=true }metric_52' stats.field='{!sum=true
>> > }metric_53' stats.field='{!sum=true }metric_54'

Re: Solr Aggregation queries are way slower than Elastic Search

2017-12-12 Thread Yonik Seeley
stats.field='{!sum=true }metric_87'
> stats.field='{!sum=true }metric_88' stats.field='{!sum=true
> }metric_89' stats.field='{!sum=true }metric_90'
> stats.field='{!sum=true }metric_91' stats.field='{!sum=true
> }metric_92' stats.field='{!sum=true }metric_93'
> stats.field='{!sum=true }metric_94' stats.field='{!sum=true
> }metric_95' stats.field='{!sum=true }metric_96'
> stats.field='{!sum=true }metric_97' stats.field='{!sum=true
> }metric_98' stats.field='{!sum=true }metric_99'
> stats.field='{!sum=true }metric_100' stats.field='{!sum=true
> }metric_101' stats.field='{!sum=true }metric_102'
> stats.field='{!sum=true }metric_103' stats.field='{!sum=true
> }metric_104'}
>
>
>
>
> On Tue, Dec 12, 2017 at 10:21 AM, RAUNAK AGRAWAL <agrawal.rau...@gmail.com>
> wrote:
>
>> Hi Yonik,
>>
>> I will try the JSON Facet API and update here, but my hunch is that the
>> querying mechanism is not the problem. Rather, the problem lies with the
>> solr aggregations.
>>
>> Thanks
>>
>> On Tue, Dec 12, 2017 at 6:31 AM, Yonik Seeley <ysee...@gmail.com> wrote:
>>
>>> I think the SolrJ below uses the old stats component.
>>> Hopefully the JSON Facet API would be faster for this, but it's not
>>> completely clear what the main query here looks like, and if it's the
>>> source of any bottleneck rather than the aggregations.
>>> What does the generated query string actually look like? (It may be
>>> easiest just to pull this from the logs.)
>>>
>>> -Yonik
>>>
>>>
>>> On Mon, Dec 11, 2017 at 7:32 PM, RAUNAK AGRAWAL
>>> <agrawal.rau...@gmail.com> wrote:
>>> > Hi,
>>> >
>>> > We have a use case where there are 4-5 dimensions and around 3500 metrics
>>> > in a single document. We have indexed only 2 dimensions and made all the
>>> > metrics as doc_values so that we can run the aggregation queries.
>>> >
>>> > We have 6 million such documents and we are using solr cloud(6.6) on a 6
>>> > node cluster with 8 Vcores and 24 GB RAM each.
>>> >
>>> > On the same set of hardware in Elasticsearch we were getting the response
>>> > in about 10ms, whereas in solr we are getting the response in around
>>> > 300-400 ms.
>>> >
>>> > This is how I am querying the data.
>>> >
>>> > private SolrQuery buildQuery(Integer variable1, List<String> groups,
>>> > List<String> metrics) {
>>> > SolrQuery query = new SolrQuery();
>>> > String groupQuery = " (" + groups.stream().map(g -> "group:" + g)
>>> > .collect(Collectors.joining(" OR ")) + ")";
>>> > String finalQuery = "variable1:" + variable1 + " AND " + groupQuery;
>>> > query.set("q", finalQuery);
>>> > query.setRows(0);
>>> > metrics.forEach(
>>> > metric -> query.setGetFieldStatistics("{!sum=true }" + metric)
>>> > );
>>> > return query;
>>> > }
>>> >
>>> > Any help will be appreciated regarding this.
>>> >
>>> >
>>> > Thanks,
>>> >
>>> > Raunak
>>>
>>
>>


Re: Solr Aggregation queries are way slower than Elastic Search

2017-12-11 Thread Yonik Seeley
I think the SolrJ below uses the old stats component.
Hopefully the JSON Facet API would be faster for this, but it's not
completely clear what the main query here looks like, and if it's the
source of any bottleneck rather than the aggregations.
What does the generated query string actually look like? (It may be
easiest just to pull this from the logs.)
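
For reference, a rough SolrJ sketch of the same rollups through the JSON
Facet API (untested; names are from your snippet):

SolrQuery query = new SolrQuery("variable1:" + variable1);
query.setRows(0);
StringBuilder json = new StringBuilder("{");
for (int i = 0; i < metrics.size(); i++) {
  if (i > 0) json.append(',');
  // one sum aggregation per metric, all in a single facet request
  json.append("s").append(i).append(":'sum(").append(metrics.get(i)).append(")'");
}
json.append('}');
query.add("json.facet", json.toString());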

-Yonik


On Mon, Dec 11, 2017 at 7:32 PM, RAUNAK AGRAWAL
 wrote:
> Hi,
>
> We have a use case where there are 4-5 dimensions and around 3500 metrics
> in a single document. We have indexed only 2 dimensions and made all the
> metrics as doc_values so that we can run the aggregation queries.
>
> We have 6 million such documents and we are using solr cloud(6.6) on a 6
> node cluster with 8 Vcores and 24 GB RAM each.
>
> On the same set of hardware in Elasticsearch we were getting the response
> in about 10ms, whereas in solr we are getting the response in around 300-400 ms.
>
> This is how I am querying the data.
>
> private SolrQuery buildQuery(Integer variable1, List<String> groups,
> List<String> metrics) {
> SolrQuery query = new SolrQuery();
> String groupQuery = " (" + groups.stream().map(g -> "group:" + g).collect
> (Collectors.joining(" OR ")) + ")";
> String finalQuery = "variable1:" + variable1 + " AND " + groupQuery;
> query.set("q", finalQuery);
> query.setRows(0);
> metrics.forEach(
> metric -> query.setGetFieldStatistics("{!sum=true }" + metric)
> );
> return query;
> }
>
> Any help will be appreciated regarding this.
>
>
> Thanks,
>
> Raunak


Re: Skewed IDF in multi lingual index, again

2017-12-05 Thread Yonik Seeley
On Tue, Dec 5, 2017 at 5:15 AM, alessandro.benedetti
 wrote:
> "Lucene/Solr doesn't actually delete documents when you delete them, it
> just marks them as deleted.  I'm pretty sure that the difference between
> docCount and maxDoc is deleted documents.  Maybe I don't understand what
> I'm talking about, but that is the best I can come up with. "
>
> Thanks Shawn, yes, that is correct and I was aware of it.
> I was curious of another difference :
> I think we confirmed that docCount is local to the field ( thanks Yonik for
> that) so :
>
> docCount(index,field1)= # of documents in the index that currently have
> value(s) for field1
>
> My question is :
>
> maxDocs(index,field1)= max # of documents in the index that had value(s) for
> field1
>
> OR
>
> maxDocs(index)= max # of documents that appeared in the index ( field
> independent)

The latter.
I imagine that's why docCount was introduced (to avoid changing the
meaning of an existing term).
FWIW, the scoring change was made in
https://issues.apache.org/jira/browse/LUCENE-6711 for Lucene/Solr 6.0
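
(For reference, BM25's idf in Lucene is computed per field as
log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)), so the docCount-based
formula skews far less on sparse fields than the old maxDoc-based one.)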

-Yonik


Re: Skewed IDF in multi lingual index, again

2017-12-04 Thread Yonik Seeley
On Mon, Dec 4, 2017 at 1:35 PM, Shawn Heisey  wrote:
> I'm pretty sure that the difference between docCount and maxDoc is deleted 
> documents.

docCount (not the best name) here is the number of documents with the
field being searched.  docFreq (df) is the number of documents
actually containing the term in that field.
In the past, maxDoc was used instead of docCount.

-Yonik


Re: JVM GC Issue

2017-12-03 Thread Yonik Seeley
On Sat, Dec 2, 2017 at 8:59 PM, S G  wrote:
> I am a bit curious about the docValues implementation.
> I understand that docValues do not use JVM memory and
> they make use of OS cache - that is why they are more performant.
>
> But to return any response from the docValues, the values in the
> docValues' column-oriented-structures would need to be brought
> into the JVM's memory. And that will then increase the pressure
> on the JVM's memory anyways. So how do docValues actually
> help from memory perspective?

docValues are not more performant than the on-heap fieldCache once a
fieldCache entry has been populated.
docValues do help with memory in a number of ways:
1) off-heap memory use degrades much more gracefully... heap memory
will just cause an OOM exception when heap size is exceeded
2) off-heap memory (memory used by OS cache) can be dynamically shared
with other processes running on the system (the OS allocates as
needed)
3) easier to configure (as opposed to max-heap size) since the OS will
just automatically use all free memory to cache relevant parts of
memory mapped disk files
4) off-heap memory does not need to be garbage collected (helps with
people that have huge GC pauses)
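
(For context, docValues are enabled per field in the schema, e.g. something
like:

<field name="category" type="string" indexed="true" stored="false" docValues="true"/>

where the field name is just an example.)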

-Yonik


Re: Solr 7.x: Issues with unique()/hll() function on a string field nested in a range facet

2017-11-21 Thread Yonik Seeley
I opened https://issues.apache.org/jira/browse/SOLR-11664 to track this.
I should be able to look into this shortly if no one else does.

-Yonik


On Tue, Nov 21, 2017 at 6:02 PM, Yonik Seeley <ysee...@gmail.com> wrote:
> Thanks for the complete info that allowed me to easily reproduce this!
> The bug seems to extend beyond hll/unique... I tried min(string_s) and
> got wonky results as well.
>
> -Yonik
>
>
> On Tue, Nov 21, 2017 at 7:47 AM, Volodymyr Rudniev <vmrudn...@gmail.com> 
> wrote:
>> Hello,
>>
>> I've encountered 2 issues while trying to apply the unique()/hll() function to a
>> string field inside a range facet:
>>
>> Results are incorrect for a single-valued string field.
>> I’m getting ArrayIndexOutOfBoundsException for a multi-valued string field.
>>
>>
>> How to reproduce:
>>
>> Create a core based on the default configSet.
>> Add several simple documents to the core, like these:
>>
>> [
>>   {
>> "id": "14790",
>> "int_i": 2010,
>> "date_dt": "2010-01-01T00:00:00Z",
>> "string_s": "a",
>> "string_ss": ["a", "b"]
>>   },
>>   {
>> "id": "12254",
>> "int_i": 2014,
>> "date_dt": "2014-01-01T00:00:00Z",
>> "string_s": "e",
>> "string_ss": ["b", "c"]
>>   },
>>   {
>> "id": "12937",
>> "int_i": 2008,
>> "date_dt": "2008-01-01T00:00:00Z",
>> "string_s": "c",
>> "string_ss": ["c", "d"]
>>   },
>>   {
>> "id": "10575",
>> "int_i": 2008,
>> "date_dt": "2008-01-01T00:00:00Z",
>> "string_s": "b",
>> "string_ss": ["d", "e"]
>>   },
>>   {
>> "id": "13644",
>> "int_i": 2014,
>> "date_dt": "2014-01-01T00:00:00Z",
>> "string_s": "e",
>> "string_ss": ["e", "a"]
>>   },
>>   {
>> "id": "8405",
>> "int_i": 2014,
>> "date_dt": "2014-01-01T00:00:00Z",
>> "string_s": "d",
>> "string_ss": ["a", "b"]
>>   },
>>   {
>> "id": "6128",
>> "int_i": 2008,
>> "date_dt": "2008-01-01T00:00:00Z",
>> "string_s": "a",
>> "string_ss": ["b", "c"]
>>   },
>>   {
>> "id": "5220",
>> "int_i": 2015,
>> "date_dt": "2015-01-01T00:00:00Z",
>> "string_s": "d",
>> "string_ss": ["c", "d"]
>>   },
>>   {
>> "id": "6850",
>> "int_i": 2012,
>> "date_dt": "2012-01-01T00:00:00Z",
>> "string_s": "b",
>> "string_ss": ["d", "e"]
>>   },
>>   {
>> "id": "5748",
>> "int_i": 2014,
>> "date_dt": "2014-01-01T00:00:00Z",
>> "string_s": "e",
>> "string_ss": ["e", "a"]
>>   }
>> ]
>>
>> 3. Try queries like the following for a single-valued string field:
>>
>> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"int_i","gap":1,"missing":false,"start":2008,"end":2016,"type":"range","facet":{"distinct_count":"unique(string_s)"}}}}
>>
>> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"date_dt","gap":"%2B1YEAR","missing":false,"start":"2008-01-01T00:00:00Z","end":"2016-01-01T00:00:00Z","type":"range","facet":{"distinct_count":"unique(string_s)"}}}}
>>
>> Distinct counts returned are incorrect in general. For example, for the set
>> of documents above, the response will contain:
>>
>> {
>> "val": 2010,
>> "count": 1,
>> "distinct_count": 0
>> }
>>
>> and
>>
>> "between": {
>> "count": 10,
>> "distinct_count": 1
>> }
>>
>> (there should be 5 distinct values).
>>
>> Note, the result depends on the order in which the documents are added.
>>
>> 4. Try queries like the following for a multi-valued string field:
>>
>> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"int_i","gap":1,"missing":false,"start":2008,"end":2016,"type":"range","facet":{"distinct_count":"unique(string_ss)"}}}}
>>
>> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"date_dt","gap":"%2B1YEAR","missing":false,"start":"2008-01-01T00:00:00Z","end":"2016-01-01T00:00:00Z","type":"range","facet":{"distinct_count":"unique(string_ss)"}}}}
>>
>> I’m getting ArrayIndexOutOfBoundsException for such queries.
>>
>> Note, everything looks Ok for other field types (I tried single- and
>> multi-valued ints, doubles and dates) or when the enclosing facet is a terms
>> facet or there is no enclosing facet at all.
>>
>> I can reproduce these issues both for Solr 7.0.1 and 7.1.0. Solr 6.x and
>> 5.x, as it seems, do not have such issues.
>>
>> Is it a bug? Or maybe I've missed something?
>>
>> Thanks,
>>
>> Volodymyr
>>


Re: Solr 7.x: Issues with unique()/hll() function on a string field nested in a range facet

2017-11-21 Thread Yonik Seeley
Thanks for the complete info that allowed me to easily reproduce this!
The bug seems to extend beyond hll/unique... I tried min(string_s) and
got wonky results as well.

-Yonik


On Tue, Nov 21, 2017 at 7:47 AM, Volodymyr Rudniev  wrote:
> Hello,
>
> I've encountered 2 issues while trying to apply unique()/hll() function to a
> string field inside a range facet:
>
> Results are incorrect for a single-valued string field.
> I’m getting ArrayIndexOutOfBoundsException for a multi-valued string field.
>
>
> How to reproduce:
>
> 1. Create a core based on the default configSet.
> 2. Add several simple documents to the core, like these:
>
> [
>   {
> "id": "14790",
> "int_i": 2010,
> "date_dt": "2010-01-01T00:00:00Z",
> "string_s": "a",
> "string_ss": ["a", "b"]
>   },
>   {
> "id": "12254",
> "int_i": 2014,
> "date_dt": "2014-01-01T00:00:00Z",
> "string_s": "e",
> "string_ss": ["b", "c"]
>   },
>   {
> "id": "12937",
> "int_i": 2008,
> "date_dt": "2008-01-01T00:00:00Z",
> "string_s": "c",
> "string_ss": ["c", "d"]
>   },
>   {
> "id": "10575",
> "int_i": 2008,
> "date_dt": "2008-01-01T00:00:00Z",
> "string_s": "b",
> "string_ss": ["d", "e"]
>   },
>   {
> "id": "13644",
> "int_i": 2014,
> "date_dt": "2014-01-01T00:00:00Z",
> "string_s": "e",
> "string_ss": ["e", "a"]
>   },
>   {
> "id": "8405",
> "int_i": 2014,
> "date_dt": "2014-01-01T00:00:00Z",
> "string_s": "d",
> "string_ss": ["a", "b"]
>   },
>   {
> "id": "6128",
> "int_i": 2008,
> "date_dt": "2008-01-01T00:00:00Z",
> "string_s": "a",
> "string_ss": ["b", "c"]
>   },
>   {
> "id": "5220",
> "int_i": 2015,
> "date_dt": "2015-01-01T00:00:00Z",
> "string_s": "d",
> "string_ss": ["c", "d"]
>   },
>   {
> "id": "6850",
> "int_i": 2012,
> "date_dt": "2012-01-01T00:00:00Z",
> "string_s": "b",
> "string_ss": ["d", "e"]
>   },
>   {
> "id": "5748",
> "int_i": 2014,
> "date_dt": "2014-01-01T00:00:00Z",
> "string_s": "e",
> "string_ss": ["e", "a"]
>   }
> ]
>
> 3. Try queries like the following for a single-valued string field:
>
> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"int_i","gap":1,"missing":false,"start":2008,"end":2016,"type":"range","facet":{"distinct_count":"unique(string_s)"}}}}
>
> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"date_dt","gap":"%2B1YEAR","missing":false,"start":"2008-01-01T00:00:00Z","end":"2016-01-01T00:00:00Z","type":"range","facet":{"distinct_count":"unique(string_s)"}}}}
>
> Distinct counts returned are incorrect in general. For example, for the set
> of documents above, the response will contain:
>
> {
> "val": 2010,
> "count": 1,
> "distinct_count": 0
> }
>
> and
>
> "between": {
> "count": 10,
> "distinct_count": 1
> }
>
> (there should be 5 distinct values).
>
> Note, the result depends on the order in which the documents are added.
>
> 4. Try queries like the following for a multi-valued string field:
>
> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"int_i","gap":1,"missing":false,"start":2008,"end":2016,"type":"range","facet":{"distinct_count":"unique(string_ss)"}}}}
>
> q=*:*&rows=0&json={"facet":{"histogram":{"include":"lower,edge","other":"all","field":"date_dt","gap":"%2B1YEAR","missing":false,"start":"2008-01-01T00:00:00Z","end":"2016-01-01T00:00:00Z","type":"range","facet":{"distinct_count":"unique(string_ss)"}}}}
>
> I’m getting ArrayIndexOutOfBoundsException for such queries.
>
> Note, everything looks Ok for other field types (I tried single- and
> multi-valued ints, doubles and dates) or when the enclosing facet is a terms
> facet or there is no enclosing facet at all.
>
> I can reproduce these issues both for Solr 7.0.1 and 7.1.0. Solr 6.x and
> 5.x, as it seems, do not have such issues.
>
> Is it a bug? Or maybe I've missed something?
>
> Thanks,
>
> Volodymyr
>


Re: Nested facet complete wrong counts

2017-11-11 Thread Yonik Seeley
Also, if you're looking at all constraints, you shouldn't need refine:true.
But if you do need it, note that it was only added in Solr 7.0 (and I see
you're using 6.6).

-Yonik


On Sat, Nov 11, 2017 at 9:48 AM, Yonik Seeley <ysee...@gmail.com> wrote:
> On Sat, Nov 11, 2017 at 9:18 AM, Kenny Knecht <ke...@ontoforce.com> wrote:
>> Hi Yonik,
>>
>> I am aware of the estimate on the hll. But we don't use the hll as a
>> baseline for comparison. We ask the values for one facet (for example
>> Gender). We store these counts for each bucket. Next we do another request.
>> This time for a facet and a subfacet (for example Gender x Type). We sum
>> all the values of Type with the same Gender and compare these sums with the
>> numbers of the previous request. These numbers differ by 60%, which is quite
>> worrying. Not always; it depends on the facet, but still.
>> Did you get any reports like this?
>
> Nope.  The counts for the scenario you describe should add up exactly
> for single-valued fields.  Are you sure you're adding in the "missing"
> bucket?
>
> When you sum up the sub-facets on Type, do you get a value under or
> over the counts on the parent facet?
> Verify that Type is single-valued.  One would not expect facets on a
> multi-valued field to add up in the same way.
> Verify that you're getting all of the Type constraints by using a
> limit of -1 on that sub-facet.
>
> -Yonik


Re: Nested facet complete wrong counts

2017-11-11 Thread Yonik Seeley
On Sat, Nov 11, 2017 at 9:18 AM, Kenny Knecht  wrote:
> Hi Yonik,
>
> I am aware of the estimate on the hll. But we don't use the hll as a
> baseline for comparison. We ask the values for one facet (for example
> Gender). We store these counts for each bucket. Next we do another request.
> This time for a facet and a subfacet (for example Gender x Type). We sum
> all the values of Type with the same Gender and compare these sums with the
> numbers of the previous request. These numbers differ by 60%, which is quite
> worrying. Not always; it depends on the facet, but still.
> Did you get any reports like this?

Nope.  The counts for the scenario you describe should add up exactly
for single-valued fields.  Are you sure you're adding in the "missing"
bucket?

When you sum up the sub-facets on Type, do you get a value under or
over the counts on the parent facet?
Verify that Type is single-valued.  One would not expect facets on a
multi-valued field to add up in the same way.
Verify that you're getting all of the Type constraints by using a
limit of -1 on that sub-facet.
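
For example, a sanity check along these lines (a json.facet sketch using the
field names from this thread; limit:-1 returns all constraints, so no
refinement is needed) should make the Status_sf counts under each Gender_sf
bucket sum exactly to that bucket's count for single-valued fields:

json.facet={
  Gender_sf : { type:terms, field:Gender_sf, missing:true, limit:-1,
    facet : { Status_sf : { type:terms, field:Status_sf, missing:true, limit:-1 } } }
}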

-Yonik


Re: Nested facet complete wrong counts

2017-11-10 Thread Yonik Seeley
I do notice you are using hll (hyper-log-log) which is a distributed
cardinality *estimate* : https://en.wikipedia.org/wiki/HyperLogLog

-Yonik


On Fri, Nov 10, 2017 at 11:32 AM, kenny  wrote:
> Hi all,
>
> We are doing some tests in solr 6.6 with json facet api and we get
> completely wrong counts for some combination of  facets
>
> Setting: We have a set of fields for 376k documents in our query (total 120M
> documents). We work with 2 shards. When doing first a faceting over the
> first facet and keeping these numbers, we subsequently do a nested faceting
> over both facets.
>
> Then we add the numbers of the sub-facet and expect to get (approximately)
> the same numbers back. Sometimes we get rounding errors of about 1%
> difference. But on other occasions it seems to be way off
>
> for example
>
> Gender (3 values) Country (211 values)
> 16226 - 18424 = -2198 (-13.5461604832%)
> 282854 - 464387 = -181533 (-64.1790464338%)
> 40489 - 47902 = -7413 (-18.3086764306%)
> 36672 - 49749 = -13077 (-35.6593586387%)
>
> Gender (3 values)  Status (17 Values)
> 16226 - 16273 = -47 (-0.289658572661%)
> 282854 - 435974 = -153120 (-54.1339348215%)
> 40489 - 49925 = -9436 (-23.305095211%)
> 36672 - 54019 = -17347 (-47.3031195462%)
>
> ...
>
> These are the typical requests we submit. So note that we have refine and an
> overrequest, but in the case of Gender vs Request we should query all the
> buckets anyway.
>
> {"wt":"json","rows":0,"json.facet":"{\"Status_sfhll\":\"hll(Status_sf)\",\"Status_sf\":{\"type\":\"terms\",\"field\":\"Status_sf\",\"missing\":true,\"refine\":true,\"overrequest\":50,\"limit\":50,\"offset\":0}}","q":"*:*","fq":["type:\"something\""]}
>
> {"wt":"json","rows":0,"json.facet":"{\"Gender_sf\":{\"type\":\"terms\",\"field\":\"Gender_sf\",\"missing\":true,\"refine\":true,\"overrequest\":10,\"limit\":10,\"offset\":0,\"facet\":{\"Status_sf\":{\"type\":\"terms\",\"field\":\"Status_sf\",\"missing\":true,\"refine\":true,\"overrequest\":50,\"limit\":50,\"offset\":0}}},\"Gender_sfhll\":\"hll(Gender_sf)\"}","q":"*:*","fq":["type:\"something\""]}
>
> Is this a known bug? Would switching to old facet api resolve this? Are
> there other parameters we miss?
>
>
> Thanks
>
>
> kenny
>
>


Re: Upgrade path from 5.4.1

2017-11-01 Thread Yonik Seeley
On Wed, Nov 1, 2017 at 2:36 PM, Erick Erickson  wrote:
> I _always_ prefer to reindex if possible. Additionally, as of Solr 7
> all the numeric types are deprecated in favor of points-based types
> which are faster on all fronts and use less memory.

They are a good step forward in general, and faster for range queries
(and multiple-dimensions), but looking at the design I'd guess that
they may be slower for exact-match queries?
Has anyone tested this?

-Yonik


Re: Really slow facet performance in 6.6

2017-10-25 Thread Yonik Seeley
On Mon, Oct 23, 2017 at 3:06 PM, John Davis  wrote:
> Hello,
>
> We are seeing really slow facet performance with new solr release. This is
> on an index of 2M documents. A few things we've tried:

What happens when you run this facet request again?
The first time a UIF faceting method runs for a field on a changed
index, the data structure needs to be rebuilt (i.e. it's not good for
NRT).  Maybe that build time is being included.  Otherwise, I've never
seen faceting this slow, and something else must be going on here.

-Yonik


Re: Jetty maxThreads

2017-10-20 Thread Yonik Seeley
The high number of maxThreads is to avoid distributed deadlock.
The fix is multiple thread pools, depending on request type:
https://issues.apache.org/jira/browse/SOLR-7344
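
For reference, the pool is configured in Solr's bundled server/etc/jetty.xml;
as a sketch (exact contents vary by Solr version), the relevant line looks
roughly like:

  <Set name="maxThreads"><Property name="solr.jetty.threads.max" default="10000"/></Set>

so it can typically be changed at startup with -Dsolr.jetty.threads.max=...,
keeping in mind the deadlock risk above if it's set too low.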

-Yonik


On Wed, Oct 18, 2017 at 4:41 PM, Walter Underwood  wrote:
> Jetty maxThreads is set to 10,000 which seems way too big.
>
> The comment suggests 5X the number of CPUs. We have 36 CPUs, which would mean 
> 180 threads, which seems more reasonable.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>


Re: Solr facets counts deep paged returns inconsistent counts

2017-10-20 Thread Yonik Seeley
On Fri, Oct 20, 2017 at 2:22 PM, kenny <ke...@ontoforce.com> wrote:

> Thanks for the clear explanation. A couple of follow up questions
>
> - can we tune overrequesting in json API?
>

Yes, I still need to document it, but you can specify a specific number of
documents to overrequest:
{
  type : field,
  field : cat,
  overrequest : 500
}

Also note that the JSON facet API does not do refinement by default (it's
not always desired).
Add refine:true to the field facet if you do want it.


> - we do see conflicting counts but that's when we have offsets different
> from 0. We have now already tested it in solr 6.6 with json api. We
> sometimes get the same value in different offsets: for example the range of
> constraints [0,500] and [500,1000] might contain the same constraint.
>

That can happen with both regular faceting and with the JSON Facet API
> (deeper paging "discovers" a new constraint which ranks higher).
Regular faceting does more overrequest by default, and does refinement by
default.  So adding refine:true and a deeper overrequest for json facets
should perform equivalently.

 -Yonik

Kenny
>
> On 20-10-17 17:12, Yonik Seeley wrote:
>
> Facet refinement in Solr guarantees that counts for returned
> constraints are correct, but does not guarantee that the top N
> returned isn't missing a constraint.
>
> Consider the following shard counts (3 shards) for the following
> constraints (aka facet values):
> constraintA: 2 0 0
> constraintB: 0 2 0
> constraintC: 0 0 2
> constraintD: 1 1 1
>
> Now for simplicity consider facet.limit=1:
> Phase 1: retrieve the top 1 facet counts from all 3 shards (this gets
> back A=2,B=2,C=2)
> Phase 2: refinement: retrieve counts for A,B,C for any shard that did
> not contribute to the count in Phase 1: (for example we ask shard2 and
> shard3 for the count of A)
> The counts are all correct, but we missed "D" because it never
> appeared in Phase #1
>
> Solr actually has overrequesting in the first phase to reduce the
> chances of this happening (i.e. it won't actually happen with the
> exact scenario above), but it can still happen.
>
> You can increase the overrequest amount 
> (see https://lucene.apache.org/solr/guide/6_6/faceting.html)
> Or use streaming expressions or the SQL that goes on top of that in
> the latest Solr releases.
>
> -Yonik
>
>
> On Fri, Oct 20, 2017 at 10:19 AM, kenny <ke...@ontoforce.com> wrote:
>
> Hi all,
>
> When we run some 'deep' facet counts (eg facet values from 0 to 500 and then
> from 500 to 1000), we see small but disturbing difference in counts between
> the two (for example last count on first batch 165, first count on second
> batch 167)
> We run this on solr 5.3.1 in cloud mode (3 shards) in non-json facet module
> Any-one seen ths before? I could not find any bug reported like this.
>
> Thanks
>
> Kenny
>
>
>
> --
>
> Kenny Knecht, PhD
> CTO and technical lead
> +32 486 75 66 16 <00324756616>
> ke...@ontoforce.com
> www.ontoforce.com
> Meetdistrict, Ottergemsesteenweg-Zuid 808, 9000 Gent, Belgium
> CIC, One Broadway, MA 02142 Cambridge, United States
>


Re: Solr facets counts deep paged returns inconsistent counts

2017-10-20 Thread Yonik Seeley
Facet refinement in Solr guarantees that counts for returned
constraints are correct, but does not guarantee that the top N
returned isn't missing a constraint.

Consider the following shard counts (3 shards) for the following
constraints (aka facet values):
constraintA: 2 0 0
constraintB: 0 2 0
constraintC: 0 0 2
constraintD: 1 1 1

Now for simplicity consider facet.limit=1:
Phase 1: retrieve the top 1 facet counts from all 3 shards (this gets
back A=2,B=2,C=2)
Phase 2: refinement: retrieve counts for A,B,C for any shard that did
not contribute to the count in Phase 1: (for example we ask shard2 and
shard3 for the count of A)
The counts are all correct, but we missed "D" because it never
appeared in Phase #1

Solr actually has overrequesting in the first phase to reduce the
chances of this happening (i.e. it won't actually happen with the
exact scenario above), but it can still happen.

You can increase the overrequest amount (see
https://lucene.apache.org/solr/guide/6_6/faceting.html)
Or use streaming expressions or the SQL that goes on top of that in
the latest Solr releases.
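
For classic faceting, the overrequest knobs are facet.overrequest.count and
facet.overrequest.ratio; a sketch with an illustrative field name:

  facet=true&facet.field=cat&facet.limit=500&facet.overrequest.count=200

(the default overrequest is roughly limit * 1.5 + 10).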

-Yonik


On Fri, Oct 20, 2017 at 10:19 AM, kenny  wrote:
> Hi all,
>
> When we run some 'deep' facet counts (eg facet values from 0 to 500 and then
> from 500 to 1000), we see small but disturbing difference in counts between
> the two (for example last count on first batch 165, first count on second
> batch 167)
> We run this on solr 5.3.1 in cloud mode (3 shards) in non-json facet module
> Any-one seen ths before? I could not find any bug reported like this.
>
> Thanks
>
> Kenny


Re: Trying to fix Too Many Boolean Clauses Exception

2017-10-18 Thread Yonik Seeley
On Wed, Oct 18, 2017 at 12:23 PM, Erick Erickson
 wrote:
> What have you tried? And what is the current setting?
>
> This usually occurs when you are assembling very large OR clauses,
> sometimes for ACL calculations.
>
> So if you have a query of the form
> q=field:(A OR B OR C OR ...)
> or
> fq=field:(A OR B OR C OR ...)
>
> change it to use TermsQueryParser, see
> https://lucene.apache.org/solr/guide/6_6/other-parsers.html it doesn't
> suffer this limitation.
>
> In recent versions of Solr this is automatic.

Yeah, that was implemented in 6.4 for cases where we know we don't need
scoring.  So it will be automatic for things like "fq" parameters, but
not "q" unless you wrap it in a filter().

-Yonik


>
> Best,
> Erick
>
> On Wed, Oct 18, 2017 at 7:44 AM, Patrick R. TOKOUO  wrote:
>> Hello,
>> Please, I have unsuccessfully tried to fix this error on Solr 6.4.
>> I have increased the <maxBooleanClauses> value to some max, but the same error
>> appears.
>> Please, could you help me.
>>
>> Regards,
>> Patrick R. TOKOUO
>> Mob: (+237) 6 90 08 55 95
>> Skype: ptokouo
>> In: www.linkedin.com/in/patricktokouo
>>


Re: Concern on solr commit

2017-10-18 Thread Yonik Seeley
On Wed, Oct 18, 2017 at 5:09 AM, Leo Prince
 wrote:
> Are there any known negative impacts in setting autoSoftCommit to 1
> second other than RAM usage..?

Briefly:
Don't use autowarming (but keep caches enabled!)
Use docValues for fields you will facet and sort on (this will avoid
using FieldCache)
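
As a sketch, that just means autowarmCount="0" on the cache entries in
solrconfig.xml while leaving the caches themselves in place:

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="0"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>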

-Yonik


Re: [ANNOUNCE] Apache Solr 7.1.0 released

2017-10-17 Thread Yonik Seeley
It pointed to 7.1.0 for me; perhaps a browser cache issue?
Anyway, you can go directly as well:
http://www.apache.org/dyn/closer.lua/lucene/solr/7.1.0

-Yonik


On Tue, Oct 17, 2017 at 11:25 AM, Susheel Kumar  wrote:
> Thanks, Shalin.
>
> But the download mirror still has 7.0.1 not 7.1.0.
>
> http://www.apache.org/dyn/closer.lua/lucene/solr/7.0.1
>
>
>
>
> On Tue, Oct 17, 2017 at 5:28 AM, Shalin Shekhar Mangar
>  wrote:
>>
>> 17 October 2017, Apache Solr™ 7.1.0 available
>>
>> The Lucene PMC is pleased to announce the release of Apache Solr 7.1.0
>>
>> Solr is the popular, blazing fast, open source NoSQL search platform
>> from the Apache Lucene project. Its major features include powerful
>> full-text search, hit highlighting, faceted search, dynamic
>> clustering, database integration, rich document (e.g., Word, PDF)
>> handling, and geospatial search. Solr is highly scalable, providing
>> fault tolerant distributed search and indexing, and powers the search
>> and navigation features of many of the world's largest internet sites.
>>
>> Solr 7.1.0 is available for immediate download at:
>>
>> http://lucene.apache.org/solr/mirrors-solr-latest-redir.html
>>
>> See http://lucene.apache.org/solr/7_1_0/changes/Changes.html for a
>> full list of details.
>>
>> Solr 7.1.0 Release Highlights:
>>
>> * Critical Security Update: Fix for CVE-2017-12629 which is a working
>> 0-day exploit reported on the public mailing list. See
>> https://s.apache.org/FJDl
>>
>> * Auto-scaling: Solr can now move replicas automatically when a new
>> node is added or an existing node is removed using the auto scaling
>> policy framework introduced in 7.0
>>
>> * Auto-scaling: The 'autoAddReplicas' feature which was limited to
>> shared file systems is now available for all file systems. It has been
>> ported to use the new autoscaling framework internally.
>>
>> * Auto-scaling: New set-trigger, remove-trigger, set-listener,
>> remove-listener, suspend-trigger, resume-trigger APIs
>>
>> * Auto-scaling: New /autoscaling/history API to show past autoscaling
>> actions and cluster events
>>
>> * New JSON based Query DSL for Solr that extends JSON Request API to
>> also support all query parsers and their nested parameters
>>
>> * JSON Facet API: min/max aggregations are now supported on
>> single-valued date fields
>>
>> * Lucene's Geo3D (surface of sphere & ellipsoid) is now supported on
>> spatial RPT fields by setting spatialContextFactory="Geo3D".
>> Furthermore, this is the first time Solr has out of the box support
>> for polygons
>>
>> * Expanded support for statistical stream evaluators such as various
>> distributions, rank correlations, distances and more.
>>
>> * Multiple other optimizations and bug fixes
>>
>> You are encouraged to thoroughly read the "Upgrade Notes" at
>> http://lucene.apache.org/solr/7_1_0/changes/Changes.html or in the
>> CHANGES.txt file accompanying the release.
>>
>> Solr 7.1 also includes many other new features as well as numerous
>> optimizations and bugfixes of the corresponding Apache Lucene release.
>>
>> Please report any feedback to the mailing lists
>> (http://lucene.apache.org/solr/discussion.html)
>>
>> Note: The Apache Software Foundation uses an extensive mirroring
>> network for distributing releases. It is possible that the mirror you
>> are using may not have replicated the release yet. If that is the
>> case, please try another mirror. This also goes for Maven access.
>>
>> --
>> Regards,
>> Shalin Shekhar Mangar.
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>
>


Re: Concern on solr commit

2017-10-17 Thread Yonik Seeley
Related: maxWarmingSearchers behavior was fixed (block for another
commit to succeed first rather than fail) in Solr 6.4 and later.
https://issues.apache.org/jira/browse/SOLR-9712

Also, if any of your "realtime" search requests only involve
retrieving certain documents by ID, then you can use "realtime get"
without opening a new searcher.
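
For example (collection name and ids illustrative), the /get handler returns
the latest version of a document, consulting the update log if necessary,
without opening a searcher:

  http://localhost:8983/solr/mycollection/get?id=doc1
  http://localhost:8983/solr/mycollection/get?ids=doc1,doc2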

-Yonik


On Tue, Oct 17, 2017 at 9:45 AM, Leo Prince
 wrote:
> Hi,
>
> Thank you Emir, Erick and Shawn for your inputs.
>
> I am currently using SolrCloud and planning to try out commitWithin
> parameter to reduce hard commits as per your advice. Though, I just wanted to
> double check whether commitWithin has any
> negative impacts in SolrCloud environment like lag to search from other
> nodes in SolrCloud.
>
> Thanks,
> Leo Prince.
>
> On Tue, Oct 17, 2017 at 4:01 AM, Shawn Heisey  wrote:
>
>> I'm supplementing the other replies you've already gotten.  See inline:
>>
>>
>> On 10/13/2017 2:30 AM, Leo Prince wrote:
>> > I am getting the following errors/warnings from Solr
>> >
>> > 1, ERROR: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> > Error opening new searcher. exceeded limit of maxWarmingSearchers=2, try again later.
>> > 2, PERFORMANCE WARNING: Overlapping onDeckSearchers=2
>> > 3, WARN: DistributedUpdateProcessor error sending
>> See this FAQ entry:
>>
>> https://wiki.apache.org/solr/FAQ?highlight=%28ondecksearchers%29#What_does_.22exceeded_limit_of_maxWarmingSearchers.3DX.22_mean.3F
>>
>> > So my concern is, is there any chance of performance issues when
>> > number of commits are high at a particular point of time. In our
>> > application, we are approximating like 100-500 commits can happen
>> > simultaneously from application and autocommit too for those
>> > individual requests which are not committing individually after the
>> > write.
>> >
>> > Autocommit is configured as follows,
>> >
>> > <autoCommit>
>> >   <maxTime>15000</maxTime>
>> >   <openSearcher>false</openSearcher>
>> > </autoCommit>
>> The commits generated by this configuration are not opening new
>> searchers, so they are not connected in any way to the error messages
>> you're getting, which are about new searchers.  Note that this
>> particular kind of configuration is strongly recommended for ALL Solr
>> installs using Solr 4.0 and later, so that transaction logs do not grow
>> out of control.  I would personally use a value of at least 60000 for
>> autoCommit, but there is nothing wrong with a 15 second interval.
>>
>> The initial reply you got on this thread mentioned that commits from the
>> application are discouraged.  I don't agree with this statement, but I
>> will say that the way that people *use* commits from the application is
>> frequently very wrong, and because of that, switching to fully automatic
>> soft commits is often the best solution, because they are somewhat
>> easier to control.
>>
>> We have no way of knowing how long it will take to open a new searcher
>> on your index.  It depends on a lot of factors.  Whatever that time is,
>> commits should not be happening on a more frequent basis than that
>> interval.  They should happen *less* frequently than that interval if at
>> all possible.  Depending on exactly how Solr is configured, it might be
>> possible to reduce the amount of time that a commit with a new searcher
>> takes to complete.
>>
>> Definitely avoid sending a commit after every document.  It is generally
>> also a bad idea to send a commit with every update request.  If you want
>> to do commits manually, then you should index a bunch of data and then
>> send one commit to make all those changes visible, and not do another
>> commit until you do another batch of indexing.
>>
>> Thanks,
>> Shawn
>>
>>


Re: FieldValueCache in solr 6.6

2017-10-06 Thread Yonik Seeley
On Fri, Oct 6, 2017 at 12:45 PM, sile  wrote:
> Hi Yonik,
>
> Thanks for your answer :).
>
> It works.
>
> Another question:
>
> What is recommended to be used in solr 6.6 for faceting (docValues or
> UnInvertedField), because UnInvertedField performs better for subsequent
> requests?
>
> I assume that docValues is more beneficial in terms of heap memory use, but
> should I use fieldValueCache instead if hit ratio is good?

docValues is a safer default.  UIF needs to be rebuilt every time the
index is changed (which is also one reason why it's faster once it is
built).
If one uses real docValues (at index time), then there will be even
less heap memory used.
A note to those trying out UIF: method=uif is an execution hint only
and will be ignored if you've indexed docValues.  We should probably
add some way of forcing it at some point.

-Yonik


Re: FieldValueCache in solr 6.6

2017-10-06 Thread Yonik Seeley
If you're using regular faceting (as opposed to the JSON Facet API),
you can try facet.method=uif
https://issues.apache.org/jira/browse/SOLR-8466
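
For example (illustrative field name):

  q=*:*&rows=0&facet=true&facet.field=category&facet.method=uif

or per-field: f.category.facet.method=uif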

Background:
UIF (UnInvertedField, whose instances are the entries in the FieldValueCache) was
completely removed from use at some point in the 5.x timeframe.
It was part of the JSON Facet API though, and so later SOLR-8466 added
back support to regular faceting by calling JSON Faceting when
facet.method=uif

-Yonik


On Fri, Oct 6, 2017 at 11:22 AM, sile  wrote:
> Hi,
>
> I'm new to solr, and I'm using solr 6.6.
>
> I did some testing with solr 4.9 and 6.6 on the same index with the same
> faceting queries on the multivalued fields.
>
> In first run (with empty cache) solr 6.6 performs much better, but when I
> run same queries couple more times solr 4.9 is a little bit faster than solr
> 6.6.
>
> FieldValueCache is empty in solr 6.6, and solr 4.9 uses this cache with a
> good hit ratio (0.9).
>
> I have specified this cache inside solrconfig.xml for both solr 6.6 and 4.9.
>
> I have also tried same thing by reindexing documents with docValues set to
> false for the faceting fields and run queries again and FieldValueCache is
> still empty.
>
> Is it possible to use FieldValueCache in solr 6.6?
>
> Thanks in advance.
>
> Regards,
>
> Sile
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: FilterCache size should reduce as index grows?

2017-10-06 Thread Yonik Seeley
On Fri, Oct 6, 2017 at 6:50 AM, Toke Eskildsen  wrote:
> Letting the default use maxSizeMB would be better IMO. But I assume
> that FastLRUCache is used for a reason, so that would have to be
> extended to support that parameter first.

FastLRUCache is the default on the filter cache because it was shown
(and developed for the purpose) of being better under concurrency.
But I don't know if anyone has analyzed / thought about what is the
better default since the size option was added to LRUCache.

Would be nice to try something based on Caffeine + LFU + size-based limits.

-Yonik


Re: FilterCache size should reduce as index grows?

2017-10-05 Thread Yonik Seeley
On Thu, Oct 5, 2017 at 3:20 AM, Toke Eskildsen  wrote:
> On Wed, 2017-10-04 at 21:42 -0700, S G wrote:
>
> It seems that the memory limit option maxSizeMB was added in Solr 5.2:
> https://issues.apache.org/jira/browse/SOLR-7372
> I am not sure if it works with all caches in Solr, but in my world it
> is way better to define the caches by memory instead of count.

Yes, that will work with the filterCache, but one needs to change the
cache type as well (maxSizeMB is only an option on LRUCache, and
filterCache uses FastLRUCache in the default solrconfig.xml)
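
Note that the attribute is actually spelled maxRamMB in solrconfig.xml (per
SOLR-7372); a sketch of a RAM-bounded filter cache:

  <filterCache class="solr.LRUCache" size="4096" maxRamMB="128" autowarmCount="0"/>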

-Yonik


Re: FilterCache size should reduce as index grows?

2017-10-05 Thread Yonik Seeley
On Thu, Oct 5, 2017 at 10:07 AM, Erick Erickson  wrote:
> The other thing I'd point out is that if your hit ratio is low, you
> might as well disable it entirely.

I'd normally recommend against turning it off entirely, except in
*very* custom cases.  Even if the user doesn't reuse filter queries,
Solr itself can internally in many different ways.  One way is 2-phase
distributed search for example.  Another is big terms in UIF faceting.
Some of these things were designed with the presence of a filter cache
in mind.

-Yonik


Re: SOLR 6.1 | Continuous hits coming for unwanted URL pattern

2017-09-26 Thread Yonik Seeley
Looks like it's some sort of ping (liveness) query, probably from a
load balancer?
Actually, it looks like it's a SolrJ client... here's the code that
sets up that exact query:
https://github.com/apache/lucene-solr/blob/master/solr/solrj/src/java/org/apache/solr/client/solrj/impl/LBHttpSolrClient.java#L135-L145

Solr does use LBHttpSolrClient internally sometimes, so it could
either be Solr itself, or an external client using
SolrJ/LBHttpSolrClient.
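
A paraphrased sketch of what that code builds via SolrJ (not a verbatim copy
of the linked source):

  SolrQuery q = new SolrQuery("*:*");                // match all documents...
  q.setRows(0);                                      // ...but return none of them
  q.setSort(SolrQuery.SortClause.asc("_docid_"));    // sort by index order, skipping scoring
  q.setDistrib(false);                               // query only the node being checked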

bq. As a result of this URL, the memory on slaves goes high
consistently leading to out of memory eventually.

That should not be the case.

-Yonik


On Tue, Sep 26, 2017 at 6:55 AM, saurabhagrawal
 wrote:
> Hi,
>
> We are using SOLR 6.1 with hybris. On our production environment the setup is
> as follows:
>
> Master for replication using AFTER_COMMIT and 2 slaves which servers the
> query response. The replication was initially set to 60 seconds as we wanted
> latest data on slaves ASAP.
>
> On production environment we have started to see hits coming from all our 8
> hybris servers to SOLR slave servers of following pattern:
>
> 10.131.74.45 - - [26/Sep/2017:15:09:39 +0530] "GET
> /solr/select?q=*%3A*&rows=0&sort=_docid_+asc&distrib=false&wt=javabin&version=2
> HTTP/1.1" 404 1081
> 10.131.74.45 - - [26/Sep/2017:15:09:44 +0530] "GET
> /solr/select?q=*%3A*&rows=0&sort=_docid_+asc&distrib=false&wt=javabin&version=2
> HTTP/1.1" 404 1081
> 10.131.74.45 - - [26/Sep/2017:15:09:49 +0530] "GET
> /solr/select?q=*%3A*&rows=0&sort=_docid_+asc&distrib=false&wt=javabin&version=2
> HTTP/1.1" 404 1081
> 10.131.74.45 - - [26/Sep/2017:15:09:54 +0530] "GET
> /solr/select?q=*%3A*&rows=0&sort=_docid_+asc&distrib=false&wt=javabin&version=2
> HTTP/1.1" 404 1081
> 10.131.74.45 - - [26/Sep/2017:15:09:59 +0530] "GET
> /solr/select?q=*%3A*&rows=0&sort=_docid_+asc&distrib=false&wt=javabin&version=2
> HTTP/1.1" 404 1081
> 10.131.74.45 - - [26/Sep/2017:15:10:04 +0530] "GET
> /solr/select?q=*%3A*&rows=0&sort=_docid_+asc&distrib=false&wt=javabin&version=2
> HTTP/1.1" 404 1081
> 10.131.74.45 - - [26/Sep/2017:15:10:09 +0530] "GET
> /solr/select?q=*%3A*&rows=0&sort=_docid_+asc&distrib=false&wt=javabin&version=2
> HTTP/1.1" 404 1081
> 10.131.74.45 - - [26/Sep/2017:15:10:14 +0530] "GET
> /solr/select?q=*%3A*&rows=0&sort=_docid_+asc&distrib=false&wt=javabin&version=2
> HTTP/1.1" 404 1081
>
>
> These URLs are getting triggered from hybris nodes and just getting
> triggered on production. On lower environments, we don't see this URL.
>
> Note: We are using Tomcat 8.5.14 which is hosting SOLR.
>
> Completely puzzled to see this URL and unable to figure out what is
> triggering this URL. As a result of this URL, the memory on slaves goes high
> consistently leading to out of memory eventually.
>
> Any idea of what this URL is and what component must be firing this ?
>
> Thanks in advance,
> Saurabh
>
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: When will be solr 7.1 released?

2017-09-26 Thread Yonik Seeley
On Tue, Sep 26, 2017 at 2:02 PM, Nawab Zada Asad Iqbal  wrote:
> Thanks Yonik and Erick.
>
> That is helpful.
> I am slightly confused about the branch name conventions. I expected 7x to
> be named as branch_7_0

branch_7x is the main branch for all 7.x releases.  When it's time for
7.1 to be released, a branch_7_1 will be created from branch_7x.
But branch_7x  will continue on and any new changes on branch_7x after
that point will only be in releases after 7.1

So branch_7x always represents the next 7.x release currently in development.

-Yonik


Re: When will be solr 7.1 released?

2017-09-26 Thread Yonik Seeley
One can also use a nightly snapshot build to try out the latest stuff:
7.x: 
https://builds.apache.org/job/Solr-Artifacts-7.x/lastSuccessfulBuild/artifact/solr/package/
8.0: 
https://builds.apache.org/job/Solr-Artifacts-master/lastSuccessfulBuild/artifact/solr/package/

-Yonik


On Tue, Sep 26, 2017 at 11:50 AM, Erick Erickson
 wrote:
> There's nothing preventing you from getting/compiling the latest Solr
> 7x (what will be 7.1) for your own use. There's information here:
> https://wiki.apache.org/solr/HowToContribute
>
> Basically, you get the code from Git (instructions provided at the
> link above) and execute the "ant package" command from the solr
> directory. After things churn for a while you should have the tgz and
> zip files just as though you have downloaded them from the Apache
> Wiki. You need Java 1.8 JDK and ant installed, and the first time you
> try to compile you may see instructions to execute an ant target that
> downloads ivy.
>
> One note, there was a comment recently that you may have to get
> ivy-2.4.0.jar to have the "ant package" complete successfully.
>
> Best,
> Erick
>
> On Tue, Sep 26, 2017 at 8:38 AM, Steve Rowe  wrote:
>> Hi Nawab,
>>
>> Committership is a prerequisite for the Lucene/Solr release manager role.
>>
>> Some info here about the release process: 
>> 
>>
>> --
>> Steve
>> www.lucidworks.com
>>
>>> On Sep 26, 2017, at 11:28 AM, Nawab Zada Asad Iqbal  
>>> wrote:
>>>
>>> Where can I learn more about this process? I am not a committer but I am
>>> wondering if I know enough to do it.
>>>
>>>
>>> Thanks
>>> Nawab
>>>
>>>
>>> On Mon, Sep 25, 2017 at 9:23 PM, Erick Erickson 
>>> wrote:
>>>
 In a word "no". Basically whenever a committer feels like there are
 enough changes to warrant spinning a new version, they volunteer.
 Nobody has stepped up to do that yet, although I expect it to be in
 the next 2-3 months, but that's only a guess.

 Best,
 Erick

 On Mon, Sep 25, 2017 at 5:21 PM, Nawab Zada Asad Iqbal 
 wrote:
> Hi,
>
> How are the release dates decided for new versions, are they known in
> advance?
>
> Thanks
> Nawab

>>


Re: Consecutive calls to a query give different results

2017-09-07 Thread Yonik Seeley
On Thu, Sep 7, 2017 at 12:47 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> bq: and deleted documents are irrelevant to term statistics...
>
> Did you mean "relevant"? Or do I have to adjust my thinking _again_?

One can make it work either way ;-)
Whether a document is marked as deleted or not has no effect on term
statistics (i.e. irrelevant)
OR documents marked for deletion still count in term statistics (i.e. relevant)

I guess I used the former because we don't go out of our way to still
include deleted documents... it's just a side effect of the index
structure that we don't (and can't easily) update statistics when a
document is marked as deleted.

-Yonik


> Erick
>
> On Wed, Sep 6, 2017 at 7:48 PM, Yonik Seeley <ysee...@gmail.com> wrote:
>> Different replicas of the same shard can have different numbers of
>> deleted documents (really just marked as deleted), and deleted
>> documents are irrelevant to term statistics (like the number of
>> documents a term appears in).  Documents marked for deletion stop
>> contributing to corpus statistics when they are actually removed (via
>> expunge deletes, merges, optimizes).
>> -Yonik
>>
>>
>> On Wed, Sep 6, 2017 at 5:51 PM, Webster Homer <webster.ho...@sial.com> wrote:
>>> I am using Solr 6.2.0 configured as a solr cloud with 2 shards and 4
>>> replicas (total of 4 nodes).
>>>
>>> If I run the query multiple times I see the three different top scoring
>>> results.
>>> No data load is running, all data has been committed
>>>
>>> I get these three different hits with their scores:
>>> copperiinitratehemipentahydrate2325919004194430.61722
>>> copperiinitrateoncelite1234598765   432.44238
>>> copperiinitratehydrate18756anhydrousbasis13778319 428.24185
>>>
>>> How is it that the same search against the same data can give different
>>> responses?
>>> I looked at the specific cores and they look OK; the numdocs for the replicas
>>> in a shard match
>>>
>>> This is the query:
>>> http://ae1c-ecomdev-msc01.sial.com:8983/solr/sial-catalog-product/select?defType=edismax=searchmv_en_keywords,%20searchmv_keywords,searchmv_pno,%20searchmv_en_s_pri_name,%20search_en_p_pri_name,%20search_pno%20[explain%20style=nl]=id_s=30=true=sort_ds%20asc=on=2%3C-25%25=OR=copper%20nitrate=search_pid
>>> ^500%20search_concat_pno^400%20searchmv_concat_sku^400%20searchmv_pno^300%20search_concat_pno_genr^100%20searchmv_pno_genr%20searchmv_p_skus_genr%20searchmv_user_term^200%20search_lform^190%20searchmv_en_acronym^180%20search_en_root_name^170%20searchmv_en_s_pri_name^160%20search_en_p_pri_name^150%20searchmv_en_synonyms^145%20searchmv_en_keywords^140%20search_en_sortkey^120%20searchmv_p_skus^100%20searchmv_chem_comp^90%20searchmv_en_name_suf%20searchmv_cas_number^80%20searchmv_component_cas^70%20search_beilstein^50%20search_color_idx^40%20search_ecnumber^30%20search_egecnumber^30%20search_femanumber^20%20searchmv_isbn^10%20search_mdl_number%20searchmv_en_page_title%20searchmv_en_descriptions%20searchmv_en_attributes%20searchmv_rtecs%20searchmv_lookahead_terms%20searchmv_xref_comparable_pno%20searchmv_xref_comparable_sku%20searchmv_xref_equivalent_pno%20searchmv_xref_exact_pno%20searchmv_xref_exact_sku%20searchmv_component_molform=30=score%20desc,sort_en_name%20asc,sort_ds%20asc,search_pid%20asc=json
>>>
>>> --
>>>
>>>


Re: Consecutive calls to a query give different results

2017-09-06 Thread Yonik Seeley
Different replicas of the same shard can have different numbers of
deleted documents (really just marked as deleted), and deleted
documents are irrelevant to term statistics (like the number of
documents a term appears in).  Documents marked for deletion stop
contributing to corpus statistics when they are actually removed (via
expunge deletes, merges, optimizes).
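
For testing, one way to bring replicas back into agreement is to force the
deleted documents out, e.g. (host and collection illustrative; expungeDeletes
is expensive, so it's not a routine fix):

  curl 'http://localhost:8983/solr/mycollection/update?commit=true&expungeDeletes=true'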
-Yonik


On Wed, Sep 6, 2017 at 5:51 PM, Webster Homer  wrote:
> I am using Solr 6.2.0 configured as a solr cloud with 2 shards and 4
> replicas (total of 4 nodes).
>
> If I run the query multiple times I see the three different top scoring
> results.
> No data load is running, all data has been committed
>
> I get these three different hits with their scores:
> copperiinitratehemipentahydrate2325919004194430.61722
> copperiinitrateoncelite1234598765   432.44238
> copperiinitratehydrate18756anhydrousbasis13778319 428.24185
>
> How is it that the same search against the same data can give different
> responses?
> I looked at the specific cores and they look OK; the numdocs for the replicas
> in a shard match
>
> This is the query:
> http://ae1c-ecomdev-msc01.sial.com:8983/solr/sial-catalog-product/select?defType=edismax=searchmv_en_keywords,%20searchmv_keywords,searchmv_pno,%20searchmv_en_s_pri_name,%20search_en_p_pri_name,%20search_pno%20[explain%20style=nl]=id_s=30=true=sort_ds%20asc=on=2%3C-25%25=OR=copper%20nitrate=search_pid
> ^500%20search_concat_pno^400%20searchmv_concat_sku^400%20searchmv_pno^300%20search_concat_pno_genr^100%20searchmv_pno_genr%20searchmv_p_skus_genr%20searchmv_user_term^200%20search_lform^190%20searchmv_en_acronym^180%20search_en_root_name^170%20searchmv_en_s_pri_name^160%20search_en_p_pri_name^150%20searchmv_en_synonyms^145%20searchmv_en_keywords^140%20search_en_sortkey^120%20searchmv_p_skus^100%20searchmv_chem_comp^90%20searchmv_en_name_suf%20searchmv_cas_number^80%20searchmv_component_cas^70%20search_beilstein^50%20search_color_idx^40%20search_ecnumber^30%20search_egecnumber^30%20search_femanumber^20%20searchmv_isbn^10%20search_mdl_number%20searchmv_en_page_title%20searchmv_en_descriptions%20searchmv_en_attributes%20searchmv_rtecs%20searchmv_lookahead_terms%20searchmv_xref_comparable_pno%20searchmv_xref_comparable_sku%20searchmv_xref_equivalent_pno%20searchmv_xref_exact_pno%20searchmv_xref_exact_sku%20searchmv_component_molform=30=score%20desc,sort_en_name%20asc,sort_ds%20asc,search_pid%20asc=json
>
> --
>
>


Re: NumberFormatException for multvalue, pint

2017-09-06 Thread Yonik Seeley
On Wed, Sep 6, 2017 at 4:09 PM, Steve Pruitt  wrote:
> Can't get a multi-valued pint field to update.
>
> The schema defines the field:  <field name="dnis" type="pints" multiValued="true" required="false" docValues="true" stored="true"/>
>
> I get the exception on this input:  <field name="dnis">7780386, 7313483</field>
>
> Caused by: java.lang.NumberFormatException: For input string: "7780386, 
> 7313483"

Try two separate values:
 <field name="dnis">7780386</field>
 <field name="dnis">7313483</field>

Or in JSON you can do: dnis:[7780386,7313483]

-Yonik


> at 
> java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> at java.lang.Integer.parseInt(Integer.java:580)
> at java.lang.Integer.parseInt(Integer.java:615)
> at 
> org.apache.solr.schema.IntPointField.createField(IntPointField.java:181)
> at org.apache.solr.schema.PointField.createFields(PointField.java:216)
> at 
> org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:72)
> at 
> org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:179)
>
> Not sure why the parser thinks the values are strings.  I don't see any 
> non-numeric extraneous characters.
>
> Do I need docValues and multivalued in my field definition, since they are 
> defined on the pints field type?
>
> Thanks.
>
> -Steve


Re: slow solr facet processing

2017-09-05 Thread Yonik Seeley
The number-of-segments noise probably swamps this... but one
optimization around deep-facet-paging that didn't get carried forward
is
https://issues.apache.org/jira/browse/SOLR-2092

-Yonik


On Tue, Sep 5, 2017 at 6:49 AM, Toke Eskildsen <t...@kb.dk> wrote:
> On Mon, 2017-09-04 at 11:03 -0400, Yonik Seeley wrote:
>> It's due to this (see comments in UnInvertedField):
>
> I have read that. What I don't understand is the difference between 4.x
> and 6.x. But as you say, Ere seems to be in the process of verifying
> whether this is simply due to more segments in 6.x.
>
>> There's probably a number of ways we can speed this up somewhat:
>> - optimize how much memory is used to store the term index and use
>> the savings to store more than every 128th term
>> - store the terms contiguously in block(s)
>
> I'm considering taking a shot at that. A fairly easy optimization would
> be to replace the BytesRef[] indexedTermsArray with a BytesRefArray.
>
> - Toke Eskildsen, Royal Danish Library
>


Re: slow solr facet processing

2017-09-04 Thread Yonik Seeley
On Mon, Sep 4, 2017 at 6:38 AM, Toke Eskildsen  wrote:
> On Mon, 2017-09-04 at 13:21 +0300, Ere Maijala wrote:
>> Thanks for the insight, Yonik. I can confirm that #2 is true. I ran
>>
>> 
>>
>> and after it completed I was able to retrieve 2000 values in 17ms.
>
> Very interesting. Is this on spinning disks or SSD? Is your index data
> cached in memory? What I am aiming at is if this is primarily a "many
> relatively slow random access"-thing or more due to the way DocValues
> are represented in the segments (the codec).

It's due to this (see comments in UnInvertedField):
*   To further save memory, the terms (the actual string values) are
not all stored in
*   memory, but a TermIndex is used to convert term numbers to term values only
*   for the terms needed after faceting has completed.  Only every
128th term value
*   is stored, along with its corresponding term number, and this is used as an
*   index to find the closest term and iterate until the desired number is hit

There's probably a number of ways we can speed this up somewhat:
- optimize how much memory is used to store the term index and use the
savings to store more than every 128th term
- store the terms contiguously in block(s)
- don't store the whole term, only store what's needed to seek to the
Nth term correctly
- when retrieving many terms, sort them first and convert from ord->str in order

-Yonik


Re: slow solr facet processing

2017-09-01 Thread Yonik Seeley
On Fri, Sep 1, 2017 at 9:17 AM, Ere Maijala <ere.maij...@helsinki.fi> wrote:
> I spoke a bit too soon. Now I see why I didn't see any improvement from
> facet.method=uif before: its performance seems to depend heavily on how many
> facets are returned. With an index of 6 million records and the facet having
> 1960 buckets:
>
> facet.limit=20 takes 4ms
> facet.limit=200 takes ~100ms
> facet.limit=2000 takes ~1300ms
>
> So, for some uses it provides a nice boost, but if you need to fetch more
> than a few top items, it doesn't perform properly.

Another thought on this one:
If it does slow down more than 4.x when requesting many items, it's either
1) a bug introduced at some point
2) not actually slower, but due to the 6.6 index having more segments
(ord->string conversion needs to merge multiple term enumerators, so
more segments == slower)

If you could check #2, that would be great!  If it doesn't seem to be
the problem, could you open up a new JIRA issue for this?

-Yonik


> Query used was:
>
> q=*:*=0=true=building=1=2000=true=uif
>
> --Ere
>
>
> Ere Maijala kirjoitti 1.9.2017 klo 13.10:
>>
>> I can confirm that we're seeing the same issue as Günter. For a collection
>> of 57 million bibliographic records, Solr 4.10.2 (without docValues) can
>> consistently return a facet in about 20ms, while Solr 6.6.0 with docValues
>> takes around 2600ms. I've tested some versions between those two too, but I
>> don't have comparable numbers for them.
>>
>> I thought I had tried all different combinations of docValues="true/false"
>> and facet.method=fc/uif/enum, but now that I checked it again, it seems that
>> I may have missed a test, as an 6.6.0 index with docValues="false" and
>> facet.method=uif is markedly faster than other methods. At around 700ms it's
>> still not nowhere near as fast as 4.10.2, but a whole lot better. It seems
>> that docValues needs to be disabled for facet.method=uif to have effect
>> though, which is unfortunate. Otherwise it reports that applied method is
>> UIF, but the performance is actually much worse than with FC. I'll do just
>> another round of testing to verify all this. I can report to SOLR-8096 when
>> I have something conclusive.
>>
>> --Ere
>>
>> Yonik Seeley kirjoitti 31.8.2017 klo 20.04:
>>>
>>> A possible improvement for some multiValued fields might be to use the
>>> "uif" facet method (UnInvertedField was the default method for
>>> multiValued fields in 4.x)
>>> I'm not sure if you would need to reindex without docValues on that
>>> field to try it though.
>>>
>>> Example: to enable on the "union" field, add f.union.facet.method=uif
>>>
>>> Support for this was added in
>>> https://issues.apache.org/jira/browse/SOLR-8466
>>>
>>> -Yonik
>>>
>>>
>>> On Thu, Aug 31, 2017 at 10:41 AM, Günter Hipler
>>> <guenter.hip...@unibas.ch> wrote:
>>>>
>>>> Hi,
>>>>
>>>> in the meantime I came across the reason for the slow facet processing
>>>> capacities of SOLR since version 5.x
>>>>
>>>>   https://issues.apache.org/jira/browse/SOLR-8096
>>>> https://issues.apache.org/jira/browse/LUCENE-5666
>>>>
>>>> compared to version 4.x
>>>>
>>>> Various library networks across the world are suffering from the same
>>>> symptoms:
>>>>
>>>> Facet processing is one of the most important features of a search
>>>> server
>>>> (for us) and it seems (at least IMHO) there is no solution for the issue
>>>> since March 2015 (release date for the last SOLR 4 version)
>>>>
>>>> What are the plans / ideas of the solr developers for a possible future
>>>> solution? Or maybe there is already a solution I haven't seen so far.
>>>>
>>>> Thanks for a feedback
>>>>
>>>> Günter
>>>>
>>>>
>>>>
>>>> On 21.08.2017 15:35, guenterh.li...@bluewin.ch wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I can't figure out the reason why the facet processing in version 6
>>>>> needs
>>>>> significantly more time compared to version 4.
>>>>>
>>>>> The debugging response (for 30 million documents)
>>>>>
>>>>> solr 4
>>>>> timing: total 280.0 ms (query: 0.0 ms, facet: 280.0 ms)

Re: slow solr facet processing

2017-09-01 Thread Yonik Seeley
On Fri, Sep 1, 2017 at 9:17 AM, Ere Maijala <ere.maij...@helsinki.fi> wrote:
> I spoke a bit too soon. Now I see why I didn't see any improvement from
> facet.method=uif before: its performance seems to depend heavily on how many
> facets are returned. With an index of 6 million records and the facet having
> 1960 buckets:
>
> facet.limit=20 takes 4ms
> facet.limit=200 takes ~100ms
> facet.limit=2000 takes ~1300ms
>
> So, for some uses it provides a nice boost, but if you need to fetch more
> than a few top items, it doesn't perform properly.

Yes, this should be the same performance tradeoff that 4.x had.  It's
optimized for retrieving the top N values, where N is small and most
of the time is spent finding the top ordinals.
To save memory, we don't load all string values into memory.  This
makes ord->string conversion more costly.

-Yonik



> Query used was:
>
> q=*:*=0=true=building=1=2000=true=uif
>
> --Ere
>
>
> Ere Maijala kirjoitti 1.9.2017 klo 13.10:
>>
>> I can confirm that we're seeing the same issue as Günter. For a collection
>> of 57 million bibliographic records, Solr 4.10.2 (without docValues) can
>> consistently return a facet in about 20ms, while Solr 6.6.0 with docValues
>> takes around 2600ms. I've tested some versions between those two too, but I
>> don't have comparable numbers for them.
>>
>> I thought I had tried all different combinations of docValues="true/false"
>> and facet.method=fc/uif/enum, but now that I checked it again, it seems that
>> I may have missed a test, as an 6.6.0 index with docValues="false" and
>> facet.method=uif is markedly faster than other methods. At around 700ms it's
>> still not nowhere near as fast as 4.10.2, but a whole lot better. It seems
>> that docValues needs to be disabled for facet.method=uif to have effect
>> though, which is unfortunate. Otherwise it reports that applied method is
>> UIF, but the performance is actually much worse than with FC. I'll do just
>> another round of testing to verify all this. I can report to SOLR-8096 when
>> I have something conclusive.
>>
>> --Ere
>>
>> Yonik Seeley kirjoitti 31.8.2017 klo 20.04:
>>>
>>> A possible improvement for some multiValued fields might be to use the
>>> "uif" facet method (UnInvertedField was the default method for
>>> multiValued fields in 4.x)
>>> I'm not sure if you would need to reindex without docValues on that
>>> field to try it though.
>>>
>>> Example: to enable on the "union" field, add f.union.facet.method=uif
>>>
>>> Support for this was added in
>>> https://issues.apache.org/jira/browse/SOLR-8466
>>>
>>> -Yonik
>>>
>>>
>>> On Thu, Aug 31, 2017 at 10:41 AM, Günter Hipler
>>> <guenter.hip...@unibas.ch> wrote:
>>>>
>>>> Hi,
>>>>
>>>> in the meantime I came across the reason for the slow facet processing
>>>> capacities of SOLR since version 5.x
>>>>
>>>>   https://issues.apache.org/jira/browse/SOLR-8096
>>>> https://issues.apache.org/jira/browse/LUCENE-5666
>>>>
>>>> compared to version 4.x
>>>>
>>>> Various library networks across the world are suffering from the same
>>>> symptoms:
>>>>
>>>> Facet processing is one of the most important features of a search
>>>> server
>>>> (for us) and it seems (at least IMHO) there is no solution for the issue
>>>> since March 2015 (release date for the last SOLR 4 version)
>>>>
>>>> What are the plans / ideas of the solr developers for a possible future
>>>> solution? Or maybe there is already a solution I haven't seen so far.
>>>>
>>>> Thanks for a feedback
>>>>
>>>> Günter
>>>>
>>>>
>>>>
>>>> On 21.08.2017 15:35, guenterh.li...@bluewin.ch wrote:
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>> I can't figure out the reason why the facet processing in version 6
>>>>> needs
>>>>> significantly more time compared to version 4.
>>>>>
>>>>> The debugging response (for 30 million documents)
>>>>>
>>>>> solr 4
>>>>> timing: total 280.0 ms (query: 0.0 ms, facet: 280.0 ms)
>>>>> (once the query is cached)
>>>>> before caching: between 1.5 and 2 sec
>

Re: slow solr facet processing

2017-08-31 Thread Yonik Seeley
A possible improvement for some multiValued fields might be to use the
"uif" facet method (UnInvertedField was the default method for
multiValued fields in 4.x)
I'm not sure if you would need to reindex without docValues on that
field to try it though.

Example: to enable on the "union" field, add f.union.facet.method=uif

Support for this was added in https://issues.apache.org/jira/browse/SOLR-8466

-Yonik


On Thu, Aug 31, 2017 at 10:41 AM, Günter Hipler
 wrote:
> Hi,
>
> in the meantime I came across the reason for the slow facet processing
> capacities of SOLR since version 5.x
>
>  https://issues.apache.org/jira/browse/SOLR-8096
> https://issues.apache.org/jira/browse/LUCENE-5666
>
> compared to version 4.x
>
> Various library networks across the world are suffering from the same
> symptoms:
>
> Facet processing is one of the most important features of a search server
> (for us) and it seems (at least IMHO) there is no solution for the issue
> since March 2015 (release date for the last SOLR 4 version)
>
> What are the plans / ideas of the solr developers for a possible future
> solution? Or maybe there is already a solution I haven't seen so far.
>
> Thanks for a feedback
>
> Günter
>
>
>
> On 21.08.2017 15:35, guenterh.li...@bluewin.ch wrote:
>>
>> Hi,
>>
>> I can't figure out the reason why the facet processing in version 6 needs
>> significantly more time compared to version 4.
>>
>> The debugging response (for 30 million documents)
>>
>> solr 4
>> timing: total 280.0 ms (query: 0.0 ms, facet: 280.0 ms)
>> (once the query is cached)
>> before caching: between 1.5 and 2 sec
>>
>>
>> solr 6.x (my last try was with 6.6)
>> without docvalues for faceting fields (same schema as version 4)
>> timing: total 5874.0 ms (query: 0.0 ms, facet: 5873.0 ms, other: 0.0 ms)
>> the time is not getting better even after repeating the query several
>> times
>>
>>
>> solr 6.6 with docvalues for faceting fields
>> timing: total 9837.0 ms (query: 0.0 ms, facet: 9837.0 ms, other: 0.0 ms)
>>
>> used query (our productive system with version 4)
>>
>> http://search.swissbib.ch/solr/sb-biblio/select?debugQuery=true=*:*=true=union=navAuthor_full=format=language=navSub_green=navSubform=publishDate=edismax=2=arrarr=recip(abs(ms(NOW/DAY,freshness)),3.16e-10,100,100)=*,score=250=0=AND=score+desc=0=START_HILITE=100=END_HILITE=false=title_short^1000+title_alt^200+title_sub^200+title_old^200+title_new^200+author^750+author_additional^100+author_additional_dsv11_txt_mv^100+title_additional_dsv11_txt_mv^100+series^200+topic^500+addfields_txt_mv^50+publplace_txt_mv^25+publplace_dsv11_txt_mv^25+fulltext+callnumber^1000+ctrlnum^1000+publishDate+isbn+variant_isbn_isn_mv+issn+localcode+id=title_short^1000=1=fulltext&=xml=count
>>
>>
>> Running the queries on smaller indices (8 million docs) the difference is
>> similar although the absolute figures for processing time are smaller
>>
>>
>> Any hints why the differences are so huge?
>>
>> Günter
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
> --
> Universität Basel
> Universitätsbibliothek
> Günter Hipler
> Projekt SwissBib
> Schoenbeinstrasse 18-20
> 4056 Basel, Schweiz
> Tel.: + 41 (0)61 267 31 12 Fax: ++41 61 267 3103
> E-Mail guenter.hip...@unibas.ch
> URL: www.swissbib.org  / http://www.ub.unibas.ch/
>


Re: Huge Facets and Streaming

2017-08-21 Thread Yonik Seeley
On Mon, Aug 21, 2017 at 6:01 AM, Mikhail Khludnev  wrote:
> Hello!
>
> I need to count a really wide facet on a 30-shard index with roughly 100M
> docs; the facet response is about 100M values and takes 0.5G as a text file.
>
> So far I've experimented with the old facets. It calculates per-shard facets
> fine, but then a node which attempts to merge such 30 responses fails due
> to OOM. It's reasonable.
>
> I suppose I'll get pretty much the same with json.facet, or does it scale
> better?
>
> I want to experiment with Streaming Expressions, which I've never tried yet.
> I've found the facet() expression and select() with partitionKeys; they'll try
> to merge facet values in FacetComponent/Module anyway.
> Is there a way to merge per-shard facet responses with Streaming?

Yeah, I think I've mentioned before that this is the way it should be
implemented (per-shard distrib=false facet request merged by streaming
expression).
The JSON Facet "stream" method does stream (i.e. does not build up the
response all in memory first), but only at the shard level and not at
the distrib/merge level.  This could then be fed into streaming to get
exact facets (and streaming facets).  But I don't think this has been
done yet.
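For reference, the per-shard half already works today: send each shard a
plain facet request with distrib=false, e.g. (host, core and field names
here are made up):

  http://shard1:8983/solr/coll_shard1_replica1/select?q=*:*&rows=0&distrib=false&facet=true&facet.field=myfield&facet.limit=-1

The missing piece is a streaming expression that merges those per-shard
responses instead of FacetComponent/FacetModule doing the merge.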

-Yonik


Re: QueryParser changes query by itself

2017-08-16 Thread Yonik Seeley
The queryCache shouldn't be involved, this is somehow an issue in
parsing (and Solr doesn't currently cache parsing).
Perhaps there is something shared in your SynonymQParser instances
that isn't quite thread safe?
It could also be something in the text analysis in lucene as well
(related to the new graph stuff?)

-Yonik


On Wed, Aug 16, 2017 at 7:32 AM, Bernd Fehling
 wrote:
> My class SynonymQParser which calls SolrQueryParserBase.parse :
>
> class SynonymQParser extends QParser {
> protected SolrQueryParser sqparser;
> ...
> @Override
> public Query parse() throws SyntaxError {
> ...
> sqparser = new SolrQueryParser(this, defaultField);
> sqparser.setEnableGraphQueries(false);
> sqparser.setEnablePositionIncrements(false);
> ...
> Query synquery = sqparser.parse(qstr);
> ...
>
> And this is SolrQueryParserBase with method parse:
>
> public abstract class SolrQueryParserBase extends QueryBuilder {
> ...
> public Query parse(String query) throws SyntaxError {
> ReInit(new FastCharStream(new StringReader(query)));
> try {
>   // TopLevelQuery is a Query followed by the end-of-input (EOF)
>   Query res = TopLevelQuery(null);  // pass null so we can tell later 
> if an explicit field was provided or not
>   return res!=null ? res : newBooleanQuery().build();
> }
> ...
>
>
> The String variable "query" going into the parse method is always 
> "textth:waffenhandel" !!!
> Having a breakpoint at "return", the Query variable "res" changes sometimes to
> TermQuery with term "textth:rss" instead of being a SynonymQuery.
>
> This is strange!!!
>
> What is ReInit, right before the try, doing? Is that a cache lookup?
>
> Or is the problem in TopLevelQuery?
>
> Regards
> Bernd
>
>
> On 16.08.2017 at 09:06, Bernd Fehling wrote:
>> Hi Ahmet,
>>
>> thank you for your reply. I was also targeting towards QueryCache but
>> with your hint about LUCENE-3758 I have a better point to start with.
>>
>> If the system is under high load and the QueryCache is filled I have
>> a higher rate of changed queries.
>> In debug mode the "timing-->process-->query" of changed queries is always 
>> "0" zero.
>>
>> The query parser "SynonymQParser" is self-developed and uses QParserPlugin.
>> There is no caching inside, and it has worked for years.
>> Only compiled against recent Lucene/Solr and some modifications like
>> using Builder with newer Lucene versions.
>>
>> I will test without query cache.
>> Which one should be disabled, the Query Result Cache?
>>
>> Regards
>> Bernd
>>
>>
>> On 15.08.2017 at 19:07, Ahmet Arslan wrote:
>>> Hi Bernd,
>>>
>>> In LUCENE-3758, a new member field added into ComplexPhraseQuery class. But 
>>> we didn't change its hashCode method accordingly. This caused anomalies in 
>>> Solr, and Yonik found the bug and fixed hashCode. Your e-mail somehow 
>>> reminded me this.
>>> Could it be the QueryCache and hashCode method/implementation of Query 
>>> subclasses.
>>> Maybe your good and bad examples are producing the same hashCode? And this is 
>>> confusing the query cache in Solr?
>>> Can you disable the query cache, to test it?
>>> By the way, which query parser are you using? I believe SynonymQuery is 
>>> produced by BM25 similarity, right?
>>>
>>> Ahmet
>>>
>>>
>>> On Friday, August 11, 2017, 2:48:07 PM GMT+3, Bernd Fehling 
>>>  wrote:
>>>
>>>
>>> We just noticed a very strange problem with Solr 6.4.2 QueryParser.
>>> The QueryParser changes the query by itself from time to time.
>>> This happens when reloading a search request several times at a higher rate.
>>>
>>> Good example:
>>> ...
>>> textth:waffenhandel
>>>   
>>> ...
>>> textth:waffenhandel
>>> textth:waffenhandel
>>>   +SynonymQuery(Synonym(textth:"arms sales" 
>>> textth:"arms trade"...
>>>   +Synonym(textth:"arms sales" 
>>> textth:"arms trade"...
>>>
>>>
>>> Bad example:
>>> ...
>>> textth:waffenhandel
>>>   
>>> ...
>>> textth:waffenhandel
>>> textth:waffenhandel
>>>   +textth:rss
>>>   +textth:rss
>>>
>>> As you can see in the bad example after several reloads the parsedquery 
>>> changed to term "rss".
>>> But the original querystring has no "rss" substring at all. That is really 
>>> strange.
>>>
>>> Anyone seen this before?
>>>
>>> Single index, Solr 6.4.2.
>>>
>>> Regards
>>> Bernd
>>>


Re: JSON facet SUM precision and accuracy is incorrect

2017-08-08 Thread Yonik Seeley
This is due to function queries currently lacking type information
(this problem will occur anywhere function queries are used and is not
unique to JSON Facet).
Function queries were originally only used in lucene scoring (which
only uses float).
The inner sum(amount1_d,amount2_d) uses SumFloatFunction, hence the
loss of precision.
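The effect is easy to reproduce outside Solr; a minimal Java sketch
using the value from this thread:

  public class FloatPrecision {
    public static void main(String[] args) {
      double d = 69446961.2;      // the original double value
      float f = (float) d;        // SumFloatFunction accumulates in float
      System.out.println(f);      // prints 6.944696E7 -- the fraction is lost
      System.out.println(d - f);  // roughly 1.2: the rounding error
    }
  }

A float carries only about 7 significant decimal digits, so values near
6.9e7 are rounded to the nearest multiple of 8.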

One should see the same loss of precision using pseudo-fields with
function queries, for example:
q=my_document&fl=id,amount1_d,amount2_d,sum(amount1_d,amount2_d)

The JIRA for this issue: https://issues.apache.org/jira/browse/SOLR-6575

-Yonik


On Tue, Aug 8, 2017 at 4:48 AM, Patrick Chan  wrote:
> I'd appreciate it if anyone could help raise an issue for the JSON facet sum
> error that my staff member Edwin raised earlier
>
> but have not gotten any response from the Solr community and developers.
>
> Our production operation urgently needs this accuracy to proceed, as it
> impacts audit issues.
>
>
> Best regards,
>
> Dr.Patrick
>
>
> On Tue, Jul 25, 2017 at 6:27 PM, Zheng Lin Edwin Yeo 
>
> wrote:
>
>> This is the way which I put my JSON facet.
>>
>> totalAmount:"sum(sum(amount1_d,amount2_d))"
>>
>> amount1_d: 69446961.2
>> amount2_d: 0
>>
>> Result I get: 69446959.27
>>
>>
>> Regards,
>> Edwin
>>
>>
>> On 25 July 2017 at 20:44, Zheng Lin Edwin Yeo 
>> wrote:
>>
>> > Hi,
>> >
>> > I'm trying to do a sum of two double fields in JSON Facet. One of the
>> > fields has a value of 69446961.2, while the other is 0. However, when I
>> > get the result, I'm getting a value of 69446959.27. This is 1.93 less than
>> > the original value.
>> >
>> > What could be the reason?
>> >
>> > I'm using Solr 6.5.1.
>> >
>> > Regards,
>> > Edwin
>> >


Re: _version_ as LongPointField returns error

2017-06-12 Thread Yonik Seeley
On Mon, Jun 12, 2017 at 12:24 PM, Shawn Feldman <shawn.feld...@gmail.com> wrote:
> Why do you need doc values though?  I'm never going to sort by version

Solr needs a quick lookup from docid->_version_
If you don't have docValues, Solr tries to create an in-memory version
(via the FieldCache).  That's not yet supported for Point* fields.

-Yonik

> On Mon, Jun 12, 2017 at 10:13 AM Yonik Seeley <ysee...@gmail.com> wrote:
>
>> I think the _version_ field should be
>>  - indexed="false"
>>  - stored="false"
>>  - docValues="true"
>>
>> -Yonik
>>
>>
>> On Mon, Jun 12, 2017 at 12:08 PM, Shawn Feldman <shawn.feld...@gmail.com>
>> wrote:
>> > I changed all my TrieLong fields to Point fields.  _version_ always
>> > returns an error unless I turn on docValues
>> >
>> >   
>> >   
>> >
>> > Getting this error when I index.  Any ideas?
>> >
>> >
>> >  Remote error message: Point fields can't use FieldCache. Use
>> > docValues=true for field: _version_
>> > solr2_1|at
>> >
>> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:973)
>> > solr2_1|at
>> >
>> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1912)
>> > solr2_1|at
>> >
>> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:182)
>> > solr2_1|at
>> >
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:78)
>> > solr2_1|at
>> >
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
>> > solr2_1|at
>> org.apache.solr.core.SolrCore.execute(SolrCore.java:2440)
>> > solr2_1|at
>> > org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)
>>


Re: _version_ as LongPointField returns error

2017-06-12 Thread Yonik Seeley
I think the _version_ field should be
 - indexed="false"
 - stored="false"
 - docValues="true"
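In schema.xml terms that would look something like this (a sketch,
assuming a point-based long fieldType named "plong"):

  <fieldType name="plong" class="solr.LongPointField" docValues="true"/>
  <field name="_version_" type="plong" indexed="false" stored="false" docValues="true"/>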

-Yonik


On Mon, Jun 12, 2017 at 12:08 PM, Shawn Feldman  wrote:
> I changed all my TrieLong Fields to Point fields.  _version_ always returns
> an error unless I turn on docValues
>
>   
>   
>
> Getting this error when I index.  Any ideas?
>
>
>  Remote error message: Point fields can't use FieldCache. Use
> docValues=true for field: _version_
> solr2_1|at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:973)
> solr2_1|at
> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1912)
> solr2_1|at
> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.finish(LogUpdateProcessorFactory.java:182)
> solr2_1|at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:78)
> solr2_1|at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:173)
> solr2_1|at org.apache.solr.core.SolrCore.execute(SolrCore.java:2440)
> solr2_1|at
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:723)


Re: JSON facet performance for aggregations

2017-05-24 Thread Yonik Seeley
On Mon, May 8, 2017 at 11:27 AM, Yonik Seeley <ysee...@gmail.com> wrote:
> I opened https://issues.apache.org/jira/browse/SOLR-10634 to address
> this performance issue.

OK, this has been committed.
A quick test shows about a 30x speedup when faceting on a
string/numeric docvalues field with 100K unique values and doing a
simple aggregation on another numeric field (and when the limit:-1).
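That is, for requests of the general shape discussed in this thread
(field names here are placeholders):

  json.facet={cats:{type:terms, field:cat_s, limit:-1, facet:{x:"sum(num_d)"}}}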

-Yonik


Re: JSON facet performance for aggregations

2017-05-08 Thread Yonik Seeley
On Mon, May 8, 2017 at 3:55 AM, Mikhail Ibraheem
<mikhail.ibrah...@oracle.com> wrote:
> Thanks Yonik.
> It is double because our use case allows to group by any field of any type.

Grouping in Solr does not require a double type, so I'm not sure how
that logically follows.  Perhaps it's a limitation in the system using
Solr?

> According to your valuable explanation below, is it better in this case to
> use flat faceting instead of JSON faceting?

I don't think it would help.

I opened https://issues.apache.org/jira/browse/SOLR-10634 to address
this performance issue.

> Indexing the field should give us better performance than flat faceting?

Indexing the studentId field should give better performance wherever
you need to search for or filter by specific student ids.

-Yonik


> Indexing the field should give us better performance than flat faceting?
> Do you recommend streaming at that case?
>
> Please advise.
>
> Thanks
> Mikhail
>
> -Original Message-
> From: Yonik Seeley [mailto:ysee...@gmail.com]
> Sent: Sunday, May 07, 2017 6:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: JSON facet performance for aggregations
>
> OK, so I think I know what's going on.
>
> The current code is more optimized for finding the top K buckets from a total 
> of N.
> When one asks to return the top 10 buckets when there are potentially 
> millions of buckets, it makes sense to defer calculating other metrics for 
> those buckets until we know which ones they are.  After we identify the top 
> 10 buckets, we calculate the domain for that bucket and use that to calculate 
> the remaining metrics.
>
> The current method is obviously much slower when one is requesting
> *all* buckets.  We might as well just calculate all metrics in the first pass 
> rather than trying to defer them.
>
> This inefficiency is compounded by the fact that the fields are not indexed.  
> In the second phase, finding the domain for a bucket is a field query.  For 
> an indexed field, this would involve a single term lookup.  For a non-indexed 
> docValues field, this involves a full column scan.
>
> If you ever want to do quick lookups on studentId, it would make sense for it 
> to be indexed (and why is it a double, anyway?)
>
> I'll open up a JIRA issue for the first problem (don't defer metrics if we're 
> going to return all buckets anyway)
>
> -Yonik
>
>
> On Sun, Apr 30, 2017 at 8:58 AM, Mikhail Ibraheem 
> <mikhail.ibrah...@oracle.com> wrote:
>> Hi Yonik,
>> We are using Solr 6.5
>> Both studentId and grades are double:
>>   <fieldType name="double" class="solr.TrieDoubleField" indexed="false"
>> stored="true" docValues="true" multiValued="false" required="false"/>
>>
>> We have 1.5 million records.
>>
>> Thanks
>> Mikhail
>>
>> -Original Message-
>> From: Yonik Seeley [mailto:ysee...@gmail.com]
>> Sent: Sunday, April 30, 2017 1:04 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: JSON facet performance for aggregations
>>
>> It is odd there would be quite such a big performance delta.
>> What version of solr are you using?
>> What is the fieldType of "grades"?
>> -Yonik
>>
>>
>> On Sun, Apr 30, 2017 at 5:15 AM, Mikhail Ibraheem 
>> <mikhail.ibrah...@oracle.com> wrote:
>>> 1-
>>> studentId has docValues = true. It is of type double, which is
>>> <fieldType name="double" class="solr.TrieDoubleField" indexed="false"
>>> stored="true" docValues="true" multiValued="false" required="false"/>
>>>
>>>
>>> 2- If we just facet without aggregation it finishes in good time 60ms:
>>>
>>> json.facet={
>>>studentId:{
>>>   type:terms,
>>>   limit:-1,
>>>   field:" studentId "
>>>
>>>}
>>> }
>>>
>>>
>>> Thanks
>>>
>>>
>>> -Original Message-
>>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>>> Sent: Sunday, April 30, 2017 10:44 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: RE: JSON facet performance for aggregations
>>>
>>> Please enable doc values and try.
>>> There is a bug in the source code which causes json facet on string field 
>>> to run very slow. On numeric fields it runs fine with doc value enabled.
>>>
>>> On Apr 30, 2017 1:41 PM, "Mikhail Ibraheem"
>>> <mikhail.ibrah...@oracle.com>
>>> wrote:
>>>
>>>> Hi Vijay,
>>>> It is already a numeric field.
>>>> It is a huge difference between json and flat here. Do you know the
>>>> reason for this? Is there a way to improve it?

Re: JSON facet performance for aggregations

2017-05-07 Thread Yonik Seeley
OK, so I think I know what's going on.

The current code is more optimized for finding the top K buckets from
a total of N.
When one asks to return the top 10 buckets when there are potentially
millions of buckets, it makes sense to defer calculating other metrics
for those buckets until we know which ones they are.  After we
identify the top 10 buckets, we calculate the domain for that bucket
and use that to calculate the remaining metrics.

The current method is obviously much slower when one is requesting
*all* buckets.  We might as well just calculate all metrics in the
first pass rather than trying to defer them.

This inefficiency is compounded by the fact that the fields are not
indexed.  In the second phase, finding the domain for a bucket is a
field query.  For an indexed field, this would involve a single term
lookup.  For a non-indexed docValues field, this involves a full
column scan.

If you ever want to do quick lookups on studentId, it would make sense
for it to be indexed (and why is it a double, anyway?)
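Concretely that would mean flipping indexed to true for the field, e.g.
(a sketch only, with the attribute set borrowed from the fieldType
definition quoted below):

  <field name="studentId" type="double" indexed="true" stored="true" docValues="true" multiValued="false" required="false"/>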

I'll open up a JIRA issue for the first problem (don't defer metrics
if we're going to return all buckets anyway)

-Yonik


On Sun, Apr 30, 2017 at 8:58 AM, Mikhail Ibraheem
<mikhail.ibrah...@oracle.com> wrote:
> Hi Yonik,
> We are using Solr 6.5
> Both studentId and grades are double:
>   <fieldType name="double" class="solr.TrieDoubleField" indexed="false" stored="true" docValues="true" multiValued="false" required="false"/>
>
> We have 1.5 million records.
>
> Thanks
> Mikhail
>
> -Original Message-
> From: Yonik Seeley [mailto:ysee...@gmail.com]
> Sent: Sunday, April 30, 2017 1:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: JSON facet performance for aggregations
>
> It is odd there would be quite such a big performance delta.
> What version of solr are you using?
> What is the fieldType of "grades"?
> -Yonik
>
>
> On Sun, Apr 30, 2017 at 5:15 AM, Mikhail Ibraheem 
> <mikhail.ibrah...@oracle.com> wrote:
>> 1-
>> studentId has docValues = true. It is of type double, which is
>> <fieldType name="double" class="solr.TrieDoubleField" indexed="false"
>> stored="true" docValues="true" multiValued="false" required="false"/>
>>
>>
>> 2- If we just facet without aggregation it finishes in good time 60ms:
>>
>> json.facet={
>>studentId:{
>>   type:terms,
>>   limit:-1,
>>   field:" studentId "
>>
>>}
>> }
>>
>>
>> Thanks
>>
>>
>> -Original Message-
>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>> Sent: Sunday, April 30, 2017 10:44 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: JSON facet performance for aggregations
>>
>> Please enable doc values and try.
>> There is a bug in the source code which causes json facet on string field to 
>> run very slow. On numeric fields it runs fine with doc value enabled.
>>
>> On Apr 30, 2017 1:41 PM, "Mikhail Ibraheem"
>> <mikhail.ibrah...@oracle.com>
>> wrote:
>>
>>> Hi Vijay,
>>> It is already a numeric field.
>>> It is a huge difference between json and flat here. Do you know the
>>> reason for this? Is there a way to improve it?
>>>
>>> -Original Message-
>>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>>> Sent: Sunday, April 30, 2017 9:58 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: JSON facet performance for aggregations
>>>
>>> Json facet on string fields run lot slower than on numeric fields.
>>> Try and see if you can represent studentid as a numeric field.
>>>
>>> On Apr 30, 2017 1:19 PM, "Mikhail Ibraheem"
>>> <mikhail.ibrah...@oracle.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > I am trying to do aggregation with JSON faceting but performance is
>>> > very bad for one of the requests:
>>> >
>>> > json.facet={
>>> >
>>> >studentId:{
>>> >
>>> >   type:terms,
>>> >
>>> >   limit:-1,
>>> >
>>> >   field:"studentId",
>>> >
>>> >   facet:{
>>> >
>>> >   x:"sum(grades)"
>>> >
>>> >   }
>>> >
>>> >}
>>> >
>>> > }
>>> >
>>> >
>>> >
>>> > This request finishes in 250 seconds, and we can't paginate for
>>> > this service for functional reason so we have to use limit:-1, and
>>> > the cardinality of the studentId is 7500.
>>> >
>>> >
>>> >
>>> > If I try the same with flat facet it finishes in 3 seconds :
>>> > stats=true=true={!tag=piv1
>>> > sum=true}grades={!stats=piv1}studentId
>>> >
>>> >
>>> >
>>> > We are hoping to use one approach json or flat for all our services.
>>> > JSON facet performance is better for many case.
>>> >
>>> >
>>> >
>>> > Please advise on why the performance for this is so bad and if we
>>> > can improve it. Also what is the default algorithm used for json facet.
>>> >
>>> >
>>> >
>>> > Thanks
>>> >
>>> > Mikhail
>>> >
>>>


Re: Poll: Master-Slave or SolrCloud?

2017-04-30 Thread Yonik Seeley
On Tue, Apr 25, 2017 at 1:33 PM, Otis Gospodnetić
 wrote:
> I think I saw mentions (maybe on user or dev MLs or JIRA) about
> potentially, in the future, there only being SolrCloud mode (and dropping
> SolrCloud name in favour of Solr).

I personally never saw this actually happening, and not because of any
complexity issues with "getting started with SolrCloud", although I
think continuing improvements there are a good thing.

Many times, I see these two things conflated:
1) how easy it is to get SolrCloud set up
2) the inherent internal complexity of a system

We can always improve #1, but that does not imply improvement in #2
(and may actually increase internal complexity).

A system where you can just fire up a node pointed at a directory and
not worry about any shared state is very easy to understand, debug,
hack around, and build very complex custom systems around.

-Yonik


Re: JSON facet performance for aggregations

2017-04-30 Thread Yonik Seeley
It is odd there would be quite such a big performance delta.
What version of solr are you using?
What is the fieldType of "grades"?
-Yonik


On Sun, Apr 30, 2017 at 5:15 AM, Mikhail Ibraheem
 wrote:
> 1-
> studentId has docValues = true. It is of type double, which is <fieldType
> name="double" class="solr.TrieDoubleField" indexed="false" stored="true"
> docValues="true" multiValued="false" required="false"/>
>
>
> 2- If we just facet without aggregation it finishes in good time 60ms:
>
> json.facet={
>studentId:{
>   type:terms,
>   limit:-1,
>   field:" studentId "
>
>}
> }
>
>
> Thanks
>
>
> -Original Message-
> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
> Sent: Sunday, April 30, 2017 10:44 AM
> To: solr-user@lucene.apache.org
> Subject: RE: JSON facet performance for aggregations
>
> Please enable doc values and try.
> There is a bug in the source code which causes json facet on string field to 
> run very slow. On numeric fields it runs fine with doc value enabled.
>
> On Apr 30, 2017 1:41 PM, "Mikhail Ibraheem" 
> wrote:
>
>> Hi Vijay,
>> It is already a numeric field.
>> It is a huge difference between json and flat here. Do you know the
>> reason for this? Is there a way to improve it?
>>
>> -Original Message-
>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>> Sent: Sunday, April 30, 2017 9:58 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: JSON facet performance for aggregations
>>
>> Json facet on string fields run lot slower than on numeric fields. Try
>> and see if you can represent studentid as a numeric field.
>>
>> On Apr 30, 2017 1:19 PM, "Mikhail Ibraheem"
>> 
>> wrote:
>>
>> > Hi,
>> >
>> > I am trying to do aggregation with JSON faceting but performance is
>> > very bad for one of the requests:
>> >
>> > json.facet={
>> >
>> >studentId:{
>> >
>> >   type:terms,
>> >
>> >   limit:-1,
>> >
>> >   field:"studentId",
>> >
>> >   facet:{
>> >
>> >   x:"sum(grades)"
>> >
>> >   }
>> >
>> >}
>> >
>> > }
>> >
>> >
>> >
>> > This request finishes in 250 seconds, and we can't paginate for this
>> > service for functional reason so we have to use limit:-1, and the
>> > cardinality of the studentId is 7500.
>> >
>> >
>> >
>> > If I try the same with flat facet it finishes in 3 seconds :
>> > stats=true=true={!tag=piv1
>> > sum=true}grades={!stats=piv1}studentId
>> >
>> >
>> >
>> > We are hoping to use one approach json or flat for all our services.
>> > JSON facet performance is better for many case.
>> >
>> >
>> >
>> > Please advise on why the performance for this is so bad and if we
>> > can improve it. Also what is the default algorithm used for json facet.
>> >
>> >
>> >
>> > Thanks
>> >
>> > Mikhail
>> >
>>


Re: prefix facet performance

2017-04-24 Thread Yonik Seeley
In SimpleFacets.getFacetTermEnumCounts, we seek to the first term
matching the prefix using the index and then for each term after
compare the prefix until it no longer matches.
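In Lucene terms the iteration is roughly the following (a simplified
sketch of the idea, not the actual Solr code; assumes "terms" is the
field's Terms instance and uses TermsEnum, BytesRef and StringHelper
from org.apache.lucene):

  TermsEnum te = terms.iterator();
  BytesRef prefix = new BytesRef("A/");
  if (te.seekCeil(prefix) != TermsEnum.SeekStatus.END) {   // jump to first term >= prefix
    for (BytesRef t = te.term(); t != null; t = te.next()) {
      if (!StringHelper.startsWith(t, prefix)) break;      // left the prefix range
      // intersect this term's postings with the facet domain and count
    }
  }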

-Yonik


On Mon, Apr 24, 2017 at 5:04 AM, alessandro.benedetti
 wrote:
> Thanks Yonik and Maria.
> It makes sense: if we reduce the number of terms, term enum becomes a very
> good solution.
> @Yonik : do we still check the prefix on the term dictionary one by one, or
> is an FST used to identify the set of candidate terms?
>
> I will check the code later,
>
> Regards
>
>
>
> -
> ---
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331553.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: prefix facet performance

2017-04-21 Thread Yonik Seeley
On Fri, Apr 21, 2017 at 4:25 PM, Maria Muslea  wrote:
> The field is:
>
> 
>
> and using unique() I found that it has 700K+ unique values.
>
> The query before (that takes ~10s):
>
> wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/
>
> the query after (that is almost instant):
>
> wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/&facet.method=enum

Ah, the fact that you specify a facet.prefix makes this perfectly
aligned for the "enum" method, which can skip directly to the first
term on-or-after "A/"
facet.method=enum goes term-by-term, calculating the intersection with
the facet domain.
In this case, it's the number of terms that start with "A/" that
matters, not the number of terms in the entire field (hence the
speedup).

-Yonik


Re: prefix facet performance

2017-04-18 Thread Yonik Seeley
How many unique values in the index?
You could try facet.method=enum

-Yonik


On Tue, Apr 18, 2017 at 8:16 PM, Maria Muslea  wrote:
> Hi,
>
> I have ~40K documents in SOLR (not many) and a multivalued facet field that
> contains at least 2K values per document.
>
> The values of the facet field look like: A/B, A/C, A/D, C/E, M/F, etc, and
> I use facet.prefix.
>
> q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/
>
>
> with "concept" defined as:
>
>
> 
>
>
> This generates the output that I am looking for, but it takes more than 10
> seconds per query.
>
>
> Is there any way that I could improve the facet query performance for this
> example?
>
>
> Thank you,
>
> Maria


Re: Disable All kind of caching in Solr/Lucene

2017-03-31 Thread Yonik Seeley
On Fri, Mar 31, 2017 at 1:53 PM, Nilesh Kamani  wrote:
> @Alexandre - Could you please point me to the reference doc for removing
> default cache settings?
>
> @Yonik - The code change is in Solr Indexer to sort the results.

OK, so to test indexing performance, there are no caches to worry
about (as long as you have autowarmCount=0 on all caches, as is the
case with the Solr example configs).

To test sorted query performance (I assume you're sorting the index to
accelerate certain sorted queries), if you can't make the queries
unique, then add
{!cache=false} to the query
example: q={!cache=false}*:*
You could also add a random term on a non-existent field to change the
query and prevent unwanted caching...
example: q=*:* does_not_exist_s:149475394

-Yonik


Re: Disable All kind of caching in Solr/Lucene

2017-03-31 Thread Yonik Seeley
On Fri, Mar 31, 2017 at 9:44 AM, Nilesh Kamani  wrote:
> I am planning to do load testing for some of my code changes and I need to
> disable all kinds of caching.

Perhaps you should be aiming to either:
1) seek a config + query load that maximizes time spent in your code
in order to optimize it
2) seek a realistic query load for acceptance testing of your use case

Attempting to disable or work around *some* caching can help for #1,
but attempting to disable *all* kinds of caching sounds misguided.

If you share what your code changes are, people may be able to suggest
ways to better isolate the performance of those changes.

-Yonik


Re: JSON Facet API Virtual Field Support

2017-03-24 Thread Yonik Seeley
On Fri, Mar 24, 2017 at 7:52 PM, Furkan KAMACI  wrote:
> Hi,
>
> I'm testing the JSON Facet API of Solr. Is it possible to create a virtual
> field which is generated from existing fields in the response and supports
> elementary arithmetic operations?
>
> Example:
>
> Schema fields:
>
> products,
> sold_products,
> date
>
> I want to run a date range facet and add another field to the response which is
> the percentage of sold products (ratio will be calculated as sold_products
> * 100 / products)

Currently only half supported.  By this I mean we can do math on
fields and aggregate them per bucket.
Basically sum(div(sold_products,products)), assuming products and
sold_products exist on each document.

What we can't do yet is do math on aggregations:
  div(sum(sold_products),sum(products))

If the former works for you, simply place that in a facet block within
a parent facet (like your range facet).
http://yonik.com/solr-facet-functions/
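For example, the supported per-document form could be dropped into a
range facet something like this (field names and range bounds are
invented for illustration):

  json.facet={
    by_month:{
      type:range, field:date,
      start:"2016-01-01T00:00:00Z", end:"2017-01-01T00:00:00Z", gap:"+1MONTH",
      facet:{
        sold_pct:"sum(div(mul(sold_products,100),products))"
      }
    }
  }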

-Yonik


Re: fq performance

2017-03-17 Thread Yonik Seeley
On Fri, Mar 17, 2017 at 2:17 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 3/17/2017 8:11 AM, Yonik Seeley wrote:
>> For Solr 6.4, we've managed to circumvent this for filter queries and
>> other contexts where scoring isn't needed.
>> http://yonik.com/solr-6-4/  "More efficient filter queries"
>
> Nice!
>
> If the filter looks like the following (because q.op=AND), does it still
> use TermsQuery?
>
> fq=id:(id1 OR id2 OR id3 OR ... id2000)

Yep, that works as well.  As does fq=id:id1 OR id:id2 OR id:id3 ...
Was implemented here: https://issues.apache.org/jira/browse/SOLR-9786
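For long id lists there's also the explicit terms query parser, which
skips boolean query parsing entirely:

  fq={!terms f=id}id1,id2,id3,...,id2000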

-Yonik


Re: fq performance

2017-03-17 Thread Yonik Seeley
On Fri, Mar 17, 2017 at 9:09 AM, Shawn Heisey  wrote:
[...]
> Lucene has a global configuration called "maxBooleanClauses" which
> defaults to 1024.

For Solr 6.4, we've managed to circumvent this for filter queries and
other contexts where scoring isn't needed.
http://yonik.com/solr-6-4/  "More efficient filter queries"

-Yonik


Re: Get handler not working

2017-03-16 Thread Yonik Seeley
Ah, yeah, if you're using a different route field it's highly likely
that's the issue.
I was always against that "feature", and this thread demonstrates part
of the problem (complicating clients, including us human clients
trying to make sense of what's going on).

-Yonik


On Thu, Mar 16, 2017 at 10:31 AM, Chris Ulicny <culicny@iq.media> wrote:
> Speaking of routing, I realized I completely forgot to add the routing
> setup to the test cloud, so it probably has something to do with the issue.
> I'll add that in and report back.
>
> So the routing and uniqueKey setup is as follows:
>
> Schema setup:
> <uniqueKey>iqdocid</uniqueKey>
> <field name="iqdocid" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
>
> I don't think it's mentioned in the documentation about using routerField
> for the compositeId router, but based on the resolution of SOLR-5017
> <https://issues.apache.org/jira/browse/SOLR-5017>, we decided to use the
> compositeId router with routerField set to 'iqroutingkey' which is using
> the "!" notation. In general, the iqroutingkey field is of the form:
> <part1>!<part2>!<part3>
>
> Unless I misunderstood what was changed with that patch, that form should
> still route appropriately, and it seems that it has distributed the
> documents appropriately from our basic testing.
>
> On Thu, Mar 16, 2017 at 9:42 AM David Hastings <hastings.recurs...@gmail.com>
> wrote:
>
I still would like to see an experiment where you change the field to id
> instead of iqdocid,
>
> On Thu, Mar 16, 2017 at 9:33 AM, Yonik Seeley <ysee...@gmail.com> wrote:
>
>> Something to do with routing perhaps? (the mapping of ids to shards,
>> by default is based on hashes of the id)
>> -Yonik
>>
>>
>> On Thu, Mar 16, 2017 at 9:16 AM, Chris Ulicny <culicny@iq.media> wrote:
>> > iqdocid is already set to be the uniqueKey value.
>> >
>> > I tried reindexing a few documents back into the problematic cloud and
> am
>> > getting the same behavior of no document found for get handler.
>> >
>> > I've also done some testing on standalone instances as well as some
> quick
>> > cloud setups (with embedded zk), and I cannot seem to replicate the
>> > problem. For each test, I used the exact same configset that is causing
>> the
>> > issue for us and indexed a document from that instance as well. I can
>> > provide more details if that would be useful in anyway.
>> >
>> > Standalone instance worked
>> > Cloud mode worked regardless of the use of the security plugin
>> > Cloud mode worked regardless of explicit get handler definition
>> > Cloud mode consistently worked with explicitly defining the get handler,
>> > then removing it and reloading the collection
>> >
>> > The only differences that I know of between the tests and the
> problematic
>> > cloud is that solr is running as a different user and using an external
>> > zookeeper ensemble. The running user has ownership of the solr
>> > installation, log, and data directories.
>> >
>> > I'm going to keep trying different setups to see if I can replicate the
>> > issue, but if anyone has any ideas on what direction might make the most
>> > sense, please let me know.
>> >
>> > Thanks again
>> >
>> > On Wed, Mar 15, 2017 at 5:49 PM Erick Erickson <erickerick...@gmail.com>
>> > wrote:
>> >
>> > Wait... Is iqdocid set to the  in your schema? That might
>> > be the missing thing.
>> >
>> >
>> >
>> > On Wed, Mar 15, 2017 at 11:20 AM, Chris Ulicny <culicny@iq.media> wrote:
>> >> Unless the behavior's changed on the way to version 6.3.0, the get
>> handler
>> >> used to use whatever field is set to be the uniqueKey. We have
>> > successfully
>> >> been using get on a 4.9.0 standalone core with no explicit "id" field
>> >> defined by passing in the value for the uniqueKey field to the get
>> > handler.
>> >> We tend to have a bunch of id fields floating around from different
>> >> sources, so we avoid keeping any of them named as "id"
>> >>
>> >> iqdocid is just a basic string type
>> >> <field name="iqdocid" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
>> >>
>> >> I'll do some more testing on standalone versions,

Re: Get handler not working

2017-03-16 Thread Yonik Seeley
Something to do with routing perhaps? (the mapping of ids to shards,
by default is based on hashes of the id)
-Yonik


On Thu, Mar 16, 2017 at 9:16 AM, Chris Ulicny  wrote:
> iqdocid is already set to be the uniqueKey value.
>
> I tried reindexing a few documents back into the problematic cloud and am
> getting the same behavior of no document found for get handler.
>
> I've also done some testing on standalone instances as well as some quick
> cloud setups (with embedded zk), and I cannot seem to replicate the
> problem. For each test, I used the exact same configset that is causing the
> issue for us and indexed a document from that instance as well. I can
> provide more details if that would be useful in anyway.
>
> Standalone instance worked
> Cloud mode worked regardless of the use of the security plugin
> Cloud mode worked regardless of explicit get handler definition
> Cloud mode consistently worked with explicitly defining the get handler,
> then removing it and reloading the collection
>
> The only differences that I know of between the tests and the problematic
> cloud is that solr is running as a different user and using an external
> zookeeper ensemble. The running user has ownership of the solr
> installation, log, and data directories.
>
> I'm going to keep trying different setups to see if I can replicate the
> issue, but if anyone has any ideas on what direction might make the most
> sense, please let me know.
>
> Thanks again
>
> On Wed, Mar 15, 2017 at 5:49 PM Erick Erickson 
> wrote:
>
> Wait... Is iqdocid set to the  in your schema? That might
> be the missing thing.
>
>
>
> On Wed, Mar 15, 2017 at 11:20 AM, Chris Ulicny  wrote:
>> Unless the behavior's changed on the way to version 6.3.0, the get handler
>> used to use whatever field is set to be the uniqueKey. We have
> successfully
>> been using get on a 4.9.0 standalone core with no explicit "id" field
>> defined by passing in the value for the uniqueKey field to the get
> handler.
>> We tend to have a bunch of id fields floating around from different
>> sources, so we avoid keeping any of them named as "id"
>>
>> iqdocid is just a basic string type
>> <field name="iqdocid" type="string" multiValued="false" indexed="true" required="true" stored="true"/>
>>
>> I'll do some more testing on standalone versions, and see how that goes.
>>
>> On Wed, Mar 15, 2017 at 1:52 PM David Hastings <
> hastings.recurs...@gmail.com>
>> wrote:
>>
>>> from your previous email:
>>> "There is no "id"
>>> field defined in the schema."
>>>
>>> you need an id field to use the get handler
>>>
>>> On Wed, Mar 15, 2017 at 1:45 PM, Chris Ulicny  wrote:
>>>
>>> > I thought that "id" and "ids" were fixed parameters for the get
> handler,
>>> > but I never remember, so I've already tried both. Each time it comes
> back
>>> > with the same response of no document.
>>> >
>>> > On Wed, Mar 15, 2017 at 1:31 PM Alexandre Rafalovitch <
>>> arafa...@gmail.com>
>>> > wrote:
>>> >
>>> > > Actually.
>>> > >
>>> > > I think Real Time Get handler has "id" as a magical parameter, not as
>>> > > a field name. It maps to the real id field via the uniqueKey
>>> > > definition:
>>> > > https://cwiki.apache.org/confluence/display/solr/RealTime+Get
>>> > >
>>> > > So, if you have not, could you try the way you originally wrote it.
>>> > >
>>> > > Regards,
>>> > >Alex.
>>> > > 
>>> > > http://www.solr-start.com/ - Resources for Solr users, new and
>>> > experienced
>>> > >
>>> > >
>>> > > On 15 March 2017 at 13:22, Chris Ulicny  wrote:
>>> > > > Sorry, that is a typo. The get is using the iqdocid field. There is
>>> no
>>> > > "id"
>>> > > > field defined in the schema.
>>> > > >
>>> > > > solr/TestCollection/get?iqdocid=2957-TV-201604141900
>>> > > >
>>> > > > solr/TestCollection/select?q=*:*&fq=iqdocid:2957-TV-201604141900
>>> > > >
>>> > > > On Wed, Mar 15, 2017 at 1:15 PM Erick Erickson <
>>> > erickerick...@gmail.com>
>>> > > > wrote:
>>> > > >
>>> > > >> Is this a typo or are you trying to use get with an "id" field and
>>> > > >> your filter query uses "iqdocid"?
>>> > > >>
>>> > > >> Best,
>>> > > >> Erick
>>> > > >>
>>> > > >> On Wed, Mar 15, 2017 at 8:31 AM, Chris Ulicny 
>>> > wrote:
>>> > > >> > Yes, we're using a fixed schema with the iqdocid field set as
> the
>>> > > >> uniqueKey.
>>> > > >> >
>>> > > >> > On Wed, Mar 15, 2017 at 11:28 AM Alexandre Rafalovitch <
>>> > > >> arafa...@gmail.com>
>>> > > >> > wrote:
>>> > > >> >
>>> > > >> >> What is your uniqueKey? Is it iqdocid?
>>> > > >> >>
>>> > > >> >> Regards,
>>> > > >> >>Alex.
>>> > > >> >> 
>>> > > >> >> http://www.solr-start.com/ - Resources for Solr users, new and
>>> > > >> experienced
>>> > > >> >>
>>> > > >> >>
>>> > > >> >> On 15 March 2017 at 11:24, Chris Ulicny 
>>> wrote:
>>> > > >> >> > Hi,
>>> > > >> >> >
>>> > > >> >> > I've been trying to use the get handler for a new solr cloud
>>> > > >> collection
>>> > > >> >> we

Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Yonik Seeley
FYI, I just opened https://issues.apache.org/jira/browse/SOLR-10122 for this
-Yonik

On Fri, Feb 10, 2017 at 4:32 PM, Yonik Seeley <ysee...@gmail.com> wrote:
> On Thu, Feb 9, 2017 at 6:58 AM, Bryant, Michael
> <michael.bry...@kcl.ac.uk> wrote:
>> Hi all,
>>
>> I'm converting my legacy facets to JSON facets and am seeing much better 
>> performance, especially with high cardinality facet fields. However, the one 
>> issue I can't seem to resolve is excessive memory usage (and OOM errors) 
>> when trying to simulate the effect of "group.facet" to sort facets according 
>> to a grouping field.
>
> Yeah, I sort of expected this... but haven't gotten around to
> implementing something that takes less memory yet.
> If you're faceting on A and sorting by unique(B), then memory use is
> O(cardinality(A)*cardinality(B))
> We can definitely do a lot better.
>
> -Yonik


Re: Simulating group.facet for JSON facets, high mem usage w/ sorting on aggregation...

2017-02-10 Thread Yonik Seeley
On Thu, Feb 9, 2017 at 6:58 AM, Bryant, Michael
 wrote:
> Hi all,
>
> I'm converting my legacy facets to JSON facets and am seeing much better 
> performance, especially with high cardinality facet fields. However, the one 
> issue I can't seem to resolve is excessive memory usage (and OOM errors) when 
> trying to simulate the effect of "group.facet" to sort facets according to a 
> grouping field.

Yeah, I sort of expected this... but haven't gotten around to
implementing something that takes less memory yet.
If you're faceting on A and sorting by unique(B), then memory use is
O(cardinality(A)*cardinality(B))
We can definitely do a lot better.
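For reference, the kind of request in question looks roughly like this
(field names are placeholders):

  json.facet={a:{type:terms, field:A, sort:"ub desc", facet:{ub:"unique(B)"}}}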

-Yonik


Re: ClassCastException: BasicResultContext cannot be cast to SolrDocumentList

2016-12-20 Thread Yonik Seeley
This is a bug (that code should no longer be expecting a SolrDocumentList)
Can you open a JIRA issue?

-Yonik


On Tue, Dec 20, 2016 at 12:02 PM, Yago Riveiro  wrote:
> I'm hitting this exception in 6.3.0, any ideas?
>
> null:java.lang.ClassCastException:
> org.apache.solr.response.BasicResultContext cannot be cast to
> org.apache.solr.common.SolrDocumentList
> at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:315)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:153)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2213)
> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at
> org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:169)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:518)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
> at
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> -
> Best regards
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/ClassCastException-BasicResultContext-cannot-be-cast-to-SolrDocumentList-tp4310523.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Nested JSON Facets (Subfacets)

2016-12-15 Thread Yonik Seeley
Interesting; I don't recall a bug like that being fixed.
Anyway, glad it works for you now!
-Yonik


On Thu, Dec 15, 2016 at 11:01 AM, Chantal Ackermann
 wrote:
> Hi Yonik,
>
> after upgrading to Solr 6.3.0, the nested function works as expected! (Both 
> with and without docValues.)
>
> "facets":{
> "count":3179500,
> "all_pop":1.5901646171168616E8,
> "shop_cat":{
>   "buckets":[{
>   "val":"Kontaktlinsen > Torische Linsen",
>   "count":75168,
>   "cat_sum":3752665.0497611803},
>
>
> Thanks,
> Chantal
>
>
>> On 15.12.2016 at 16:00, Chantal Ackermann wrote:
>>
>> Hi Yonik,
>>
>> are you certain that nesting a function works as documented on 
>> http://yonik.com/solr-subfacets/?
>>
>> top_authors:{
>>type: terms,
>>field: author,
>>limit: 7,
>>sort: "revenue desc",
>>facet:{
>>  revenue: "sum(sales)"
>>}
>>  }
>>
>>
>> I’m getting the feeling that the function is never really executed because, 
>> for my index, calling sum() with a non-number field (e.g. a multi-valued 
>> string field) throws an error when *not nested* but does *not throw an 
>> error* when nested:
>>
>>    json.facet={all_pop: "sum(gtin)"}
>>
>>    "error":{
>>    "trace":"java.lang.UnsupportedOperationException
>>   at 
>> org.apache.lucene.queries.function.FunctionValues.doubleVal(FunctionValues.java:47)
>>
>> And the following does not throw an error but definitely should if the 
>> function would be executed:
>>
>>json.facet={all_pop:"sum(popularity)",shop_cat: {type:terms, 
>> field:shop_cat, facet: {cat_pop:"sum(gtin)"}}}
>>
>> returns:
>>
>> "facets":{
>>"count":2815500,
>>"all_pop":1.4065865823321116E8,
>>"shop_cat":{
>>  "buckets":[{
>>  "val":"Kontaktlinsen > Torische Linsen",
>>  "count":75168,
>>  "cat_pop":0.0},
>>{
>>  "val":"Damen-Mode/Inspirationen",
>>  "count":47053,
>>  "cat_pop":0.0},
>>
>> For completeness: here is the field directive for „gtin“ with 
>> „text_noleadzero“ based on „solr.TextField“:
>>
>>> required="false" multiValued="true“/>
>>
>>
>> Is this a bug or is the documentation a glimpse of the future? I will try 
>> version 6.3.0, now. I was still on 6.1.0 for the above tests.
>> (I have also tried with the „avg“ function, just to make sure, but same 
>> there.)
>>
>> Cheers,
>> Chantal
>


Re: Nested JSON Facets (Subfacets)

2016-12-14 Thread Yonik Seeley
That should work... what version of Solr are you using?  Did you
change the type of the popularity field w/o completely reindexing?

You can try to verify the number of documents in each bucket that have
the popularity field by adding another sub-facet next to cat_pop:
num_pop:{query:"popularity:[* TO *]"}
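Dropped into your original request, that would look something like:

  json.facet={shop_cat:{type:terms, field:shop_cat,
    facet:{cat_pop:"sum(popularity)", num_pop:{query:"popularity:[* TO *]"}}}}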

> A quick check with this json.facet parameter:
>
> json.facet: {cat_pop:"sum(popularity)"}
>
> returns:
>
> "facets": {
> "count":2508,
> "cat_pop":21.0},

That looks like a pretty low sum for all those documents; perhaps
most of them are missing "popularity" (or have a 0 popularity).
To test one of the buckets at the top-level this way, you could add
fq=shop_cat:"Men > Clothing > Jumpers & Cardigans"
and see if you get anything.

-Yonik


On Wed, Dec 14, 2016 at 12:46 PM, CA  wrote:
> Hi all,
>
> this is about using a function in nested facets, specifically the "sum()"
> function inside a "terms" facet using the json.facet API.
>
> My json.facet parameter looks like this:
>
> json.facet={shop_cat: {type:terms, field:shop_cat, facet: 
> {cat_pop:"sum(popularity)"}}}
>
> A snippet of the result:
>
> "facets“: {
> "count":2508,
> "shop_cat“: {
> "buckets“: [{
> "val“: "Men > Clothing > Jumpers & Cardigans",
> "count":252,
> "cat_pop“:0.0
>  }, {
>"val":"Men > Clothing > Jackets & Coats",
>"count":157,
>"cat_pop“:0.0
>  }, // and more
>
> This looks fine all over, but it turns out that "cat_pop", the result of
> "sum(popularity)", is always 0.0 even if the documents for this facet value
> have popularities > 0.
>
> A quick check with this json.facet parameter:
>
> json.facet: {cat_pop:"sum(popularity)"}
>
> returns:
>
> "facets“: {
> "count":2508,
> "cat_pop":21.0},
>
> To me, it seems it works fine on the base level but not when nested. Still, 
> Yonik’s documentation and the Jira issues indicate that it is possible to use 
> functions in nested facets so I might just be using the wrong structure? I 
> have a hard time finding any other examples on the i-net and I had no luck 
> changing the structure around.
> Could someone shed some light on this for me? It would also help to know if 
> it is not possible to sum the values up this way.
>
> Thanks a lot!
> Chantal
>
>


Re: Rollback w/ Atomic Update

2016-12-13 Thread Yonik Seeley
On Tue, Dec 13, 2016 at 10:36 AM, Todd Long  wrote:
> We've noticed that partial updates are not rolling back with subsequent
> commits based on the same document id. Our only success in mitigating this
> issue has been to issue an empty commit immediately following the rollback.

"rollback" is a lucene-level operation that isn't really supported at
the solr level:
https://issues.apache.org/jira/browse/SOLR-4733

-Yonik

> I've included an example below showing the partial updates unexpected
> results. We are currently using SolrJ 4.8.1 with the default deletion policy
> and auto commits disabled in the configuration. Any help would be greatly
> appreciated in better understanding this scenario.


Re: empty result set for a sort query

2016-12-12 Thread Yonik Seeley
Ah, 2-phase distributed search is the most likely answer (and
currently classified as more of a limitation than a bug)...
Phase 1 collects the top N ids from each shard (and merges them to
find the global top N)
Phase 2 retrieves the stored fields for the global top N

If any of the ids have been deleted between Phase 1 and Phase 2, then
you can get less than N docs back.

-Yonik


On Mon, Dec 12, 2016 at 4:26 AM, moscovig  wrote:
> I am not sure that it's related,
> but with local tests we got to a scenario where we
> Add doc that somehow has * empty key* and then, when querying with sort over
> creationTime with rows=1, we get empty result set.
> When specifying the recent doc shard with shards=shard2 we do have results.
>
> I don't think we have empty keys in our production schema but maybe it can
> give a clue.
>
> Thanks
> Gilad


Re: empty result set for a sort query

2016-12-11 Thread Yonik Seeley
On Sun, Dec 11, 2016 at 11:22 AM, moscovig  wrote:
> Hi
> In solr 6.2.1 as server and solr 6.2.0 for client
> It's a 2 shards index, 3 replicas for each shard.
>
> We are fetching the latest document with sorting over creationTime desc and
> rows=1.
>
> At the same time we are committing sanity tests that insert documents and
> delete them immediately.
>
> The weird thing is that sometimes we get an empty result set from the sort
> by creation time desc and rows=1,
> even though we have lots of documents in the index.
>
> It seems like at some point, the latest document is the sanity document that
> gets deleted, and we are trying to fetch that document, but it then gets
> deleted and we get an empty result set. We would expect Solr to send that
> document back or any other non deleted document.
> What could be the problem?
> Is this some kind of a bug in solr?

If there are documents in the index that should match the query, then
it would be a bug.
What query do you use?  If you use q=*:*&sort=creationTime desc&rows=1
then you should always get a document (since you indicate there are
many documents in the index).

If you don't, then first look to see if you have any custom plugins,
custom queries, or search processors that can change the result list.

-Yonik


Re: "on deck" searcher vs warming searcher

2016-12-09 Thread Yonik Seeley
We've got a patch to prevent the exceptions:
https://issues.apache.org/jira/browse/SOLR-9712

-Yonik


On Fri, Dec 9, 2016 at 7:45 PM, Joel Bernstein  wrote:
> The question about allowing more than one on-deck searcher is a good one.
> The current behavior with maxWarmingSearcher config is to throw an
> exception if searchers are being opened too frequently. There is probably a
> good reason why it was done this way but I'm not sure the history behind it.
>
> Currently I'm adding code to Alfresco's version of Solr that guards against
> having more than one on-deck searcher. This allows users to set the commit
> intervals low without having to worry about getting overlapping searchers.
> Something like this might useful in the standard Solr as well, if people
> don't like exceptions being thrown when searchers are opened too frequently.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Fri, Dec 9, 2016 at 5:42 PM, Trey Grainger  wrote:
>
>> Shawn and Joel both answered the question with seemingly opposite answers,
>> but Joel's should be right. On Deck, as an idiom, means "getting ready to
>> go next". I think it has it's history in military / naval terminology (a
>> plane being "on deck" of an aircraft carrier was the next one to take off),
>> and was later used heavily in baseball (the "on deck" batter was the one
>> warming up to go next) and probably elsewhere.
>>
>> I've always understood the "on deck" searcher(s) being the same as the
>> warming searcher(s). So you have the "active" searcher and then the warming
>> or on deck searchers.
>>
>> -Trey
>>
>>
>> On Fri, Dec 9, 2016 at 11:54 AM, Erick Erickson 
>> wrote:
>>
>> > Jihwan:
>> >
>> > Correct. Do note that there are two distinct warnings here:
>> > 1> "Error opening new searcher. exceeded limit of
>> maxWarmingSearchers"
>> > 2> "PERFORMANCE WARNING: Overlapping onDeckSearchers=..."
>> >
>> > in <1>, the new searcher is _not_ opened.
>> > in <2>, the new searcher _is_ opened.
>> >
>> > In practice, getting either warning is an indication of
>> > mis-configuration. Consider a very large filterCache with large
>> > autowarm values. Every new searcher will then allocate space for the
>> > filterCache so having <1> is there to prevent runaway situations that
>> > lead to OOM errors.
>> >
>> > <2> is just letting you know that you should look at your usage of
>> > commit so you can avoid <1>.
>> >
>> > Best,
>> > Erick
>> >
>> > On Fri, Dec 9, 2016 at 8:44 AM, Jihwan Kim  wrote:
>> > > why is there a setting (maxWarmingSearchers) that even lets you have
>> > > more than one:
>> > > Isn't it also for the case of (frequent) updates? For example, one update
>> > > is committed.  During the warming up for this commit, another update is
>> > > made.  In this case the new commit also goes through another warming.  If
>> > > the value is 1, the second warming will fail.  A larger number of
>> > > concurrent warm-ups requires more memory.
>> > >
>> > >
>> > > On Fri, Dec 9, 2016 at 9:14 AM, Erick Erickson <
>> erickerick...@gmail.com>
>> > > wrote:
>> > >
>> > >> bq: because shouldn't there only be one active
>> > >> searcher at a time?
>> > >>
>> > >> Kind of. This is a total nit, but there can be multiple
>> > >> searchers serving queries briefly (one hopes at least).
>> > >> S1 is serving some query when S2 becomes
>> > >> active and starts getting new queries. Until the last
>> > >> query S1 is serving is complete, they both are active.
>> > >>
>> > >> bq: why is there a setting
>> > >> (maxWarmingSearchers) that even lets
>> > >> you have more than one
>> > >>
>> > >> The contract is that when you commit (assuming
>> > >> you're opening a new searcher), then all docs
>> > >> indexed up to that point are visible. Therefore you
>> > >> _must_ open a new searcher even if one is currently
>> > >> warming or that contract would be violated. Since
>> > >> warming can take minutes, not opening a new
>> > >> searcher if one was currently warming could cause
>> > >> quite a gap.
>> > >>
>> > >>
>> > >> Best,
>> > >> Erick
>> > >>
>> > >> On Fri, Dec 9, 2016 at 7:30 AM, Brent  wrote:
>> > >> > Hmmm, conflicting answers. Given the infamous "PERFORMANCE WARNING:
>> > >> > Overlapping onDeckSearchers" log message, it seems like the "they're
>> > >> > the same" answer is probably correct, because shouldn't there only be
>> > >> > one active searcher at a time?
>> > >> >
>> > >> > Although it makes me curious, if there's a warning about having
>> > >> > multiple (overlapping) warming searchers, why is there a setting
>> > >> > (maxWarmingSearchers) that even lets you have more than one, or at
>> > >> > least, why ever set it to anything other than 1?
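
For reference, both settings discussed above live in the <query> section of
solrconfig.xml. A minimal sketch (the values are illustrative only, not
recommendations):

    <query>
      <!-- large autowarmCount values make every new searcher more
           expensive to open -->
      <filterCache class="solr.FastLRUCache"
                   size="512"
                   initialSize="512"
                   autowarmCount="32"/>

      <!-- warning <1> above: error out rather than warm more than
           2 searchers at once -->
      <maxWarmingSearchers>2</maxWarmingSearchers>
    </query>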

Re: Solr 6 Performance Suggestions

2016-11-22 Thread Yonik Seeley
It depends highly on what your requests look like, and which ones are slower.
If your request mix is heterogeneous, find the types of requests
that seem to have the largest slowdown and let us know what they look
like.

-Yonik


On Tue, Nov 22, 2016 at 8:54 AM, Max Bridgewater
 wrote:
> I migrated an application from Solr 4 to Solr 6.  solrconfig.xml and
> schema.xml are essentially the same. The JVM params are also pretty much
> the same.  The indices each have about 2 million documents. No particular
> tuning was done to Solr 6 beyond the default settings. Solr 4 is running in
> Tomcat 7.
>
> Early results seem to show Solr 4 outperforming Solr 6. The first shows an
> average response time of 280 ms while the second averages 430 ms. The
> test cases were exactly the same, the machines were exactly the same, and the
> heap settings were exactly the same (Xms24g, Xmx24g). Requests were sent with
> JMeter with 50 concurrent threads for 2h.
>
> I know that this is not enough information to claim that Solr 4 generally
> outperforms Solr 6. I also know that this pretty much depends on what the
> application does. So I am not claiming anything general. All I want to do
> is get some input before I start digging.
>
> What are some things I could tune to improve the numbers for Solr 6? Have
> you guys experienced such discrepancies?
>
> Thanks,
> Max.
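
One way to find the request types with the largest slowdown is Solr's
slow-request logging, configured in the <query> section of solrconfig.xml.
A sketch (the 1000 ms threshold is an arbitrary example):

    <query>
      <!-- requests slower than this are logged at WARN by the
           org.apache.solr.core.SolrCore.SlowRequest logger -->
      <slowQueryThresholdMillis>1000</slowQueryThresholdMillis>
    </query>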


Re: How to get "max(date)" from a facet field? (Solr 6.3)

2016-11-21 Thread Yonik Seeley
On Mon, Nov 21, 2016 at 3:42 PM, Michael Joyner  wrote:
> Help,
>
> (Solr 6.3)
>
> Trying to do a "sub-facet" using the new json faceting API, but can't seem
> to figure out how to get the "max" date in the subfacet?
>
> I've tried a couple of different ways:
>
> == query ==
>
> json.facet={
> code_s:{
> limit:-1,
> type:terms,field:code_s,facet:{
> issuedate_tdt:"max(issuedate_tdt)"
> }
> }
> }
>
> == partial response ==
>
> facets":{
> "count":1310359,
> "code_s":{
>   "buckets":[{
>   "val":"5W",
>   "count":255437,
>   "issuedate_tdt":1.4794452E12},
> {
>   "val":"LS",
>   "count":201407,
>   "issuedate_tdt":1.479186E12},
>
> -- the date values seem to come back out as longs converted to float/double
> which are then truncated and put into scientific notation? --

Hmmm, yeah... min/max are currently only implemented in terms of
function queries (so you can do stuff like max(add(field1,field2))).
The downside is that by default function queries use float/double, so
we need to add support for other types.

As a temporary workaround, the ms() function subtracts two dates and
gives the result in milliseconds, and may work better since it loses less
information:
min(ms(NOW, issuedate_tdt))

-Yonik
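
Folding that workaround into the original request might look like this (a
sketch; "age_ms" is just an illustrative name). Because ms(NOW, date) shrinks
as the date grows, the minimum age identifies the maximum issue date
(roughly NOW minus age_ms):

    json.facet={
      code_s:{
        limit:-1,
        type:terms, field:code_s,
        facet:{
          age_ms : "min(ms(NOW,issuedate_tdt))"
        }
      }
    }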


Re: SolrJ optimize method -- not returning immediately when the "wait" options are false

2016-11-08 Thread Yonik Seeley
https://issues.apache.org/jira/browse/SOLR-2018
There used to be a waitFlush parameter (wait until the IndexWriter has
written all the changes) as well as a waitSearcher parameter (wait
until a new searcher has been registered... i.e. whatever changes you
made will be guaranteed to be visible).
The waitFlush parameter was removed because it was never implemented
(we always wait until IW has flushed).  So no, you should not expect
to see an immediate return with waitSearcher=false since it only
represents the open-and-register-searcher part.

-Yonik


On Tue, Nov 8, 2016 at 5:55 PM, Shawn Heisey  wrote:
> I have this code in my SolrJ program:
>
>   LOG.info("{}: background optimizing", logPrefix);
>   myOptimizeSolrClient.optimize(myName, false, false);
>   elapsedMillis = (System.nanoTime() - startNanos) / 1000000;
>   LOG.info("{}: Background optimize completed, elapsed={}", logPrefix,
> elapsedMillis);
>
> This is what I get when this code runs.  I expected it to return
> immediately, but it took 49 seconds:
>
> INFO  - 2016-11-08 15:10:56.316;   409; shard.c.inc.inclive.optimize;
> shard.c.inc.inclive: Background optimize completed, elapsed=49339
>
> I'm using SolrJ 5.5.3, and the SolrClient object is HttpSolrClient.  I
> have not tried 6.x versions.  The server that this is talking to is
> 5.3.2-SNAPSHOT.
>
> I found this in solr.log:
>
> 2016-11-08 15:10:56.315 INFO  (qtp1164175787-708968) [   x:inclive]
> org.apache.solr.update.processor.LogUpdateProcessor [inclive]
> webapp=/solr path=/update
> params={optimize=true&maxSegments=1&waitSearcher=true&wt=javabin&version=2}
> {optimize=} 0 49338
>
> It looks like waitSearcher is not being set properly by the SolrJ code.
> I could not see any obvious problem in the master branch, which I
> realize is not the same as the 5.5 code I'm running.
>
> I did try the request manually, both with waitSearcher set to true and
> to false, and in both cases, the request DID wait until the optimize was
> finished before it returned a response.  So even if the SolrJ problem is
> fixed, Solr itself will not work the way I'm expecting.  Is it correct
> to expect an immediate return for optimize when waitSearcher is false?
>
> I am not in a position to try this in 6.x versions.  Is there anyone out
> there who does have a 6.x index they can try it on, see if it's still a
> problem?
>
> Thanks,
> Shawn
>


Re: Parallelize Cursor approach

2016-11-04 Thread Yonik Seeley
No, you can't get cursor-marks ahead of time.
They are the serialized representation of the last sort values
encountered (hence not known ahead of time).

-Yonik


On Fri, Nov 4, 2016 at 8:48 PM, Chetas Joshi  wrote:
> Hi,
>
> I am using the cursor approach to fetch results from Solr (5.5.0). Most of
> my queries return millions of results. Is there a way I can read the pages
> in parallel? Is there a way I can get all the cursors well in advance?
>
> Let's say my query returns 2M documents and I have set rows=100,000.
> Can I have multiple threads iterating over different pages like
> Thread1 -> docs 1 to 100K
> Thread2 -> docs 101K to 200K
> ..
> ..
>
> for this to happen, can I get all the cursorMarks for a given query so that
> I can leverage the following code in parallel
>
> cursorQ.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark)
> val rsp: QueryResponse = c.query(cursorQ)
>
> Thank you,
> Chetas.
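
For reference, the sequential loop the cursorMark contract implies looks
roughly like this in SolrJ (a sketch; "client" is an assumed SolrClient
instance). Each iteration needs the mark returned by the previous one, which
is exactly why the pages cannot be fetched in parallel:

    SolrQuery q = new SolrQuery("*:*");
    q.setRows(100000);
    q.setSort(SolrQuery.SortClause.asc("id"));  // sort must include the uniqueKey
    String cursorMark = CursorMarkParams.CURSOR_MARK_START;
    boolean done = false;
    while (!done) {
        q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
        QueryResponse rsp = client.query(q);
        String nextMark = rsp.getNextCursorMark();
        // process rsp.getResults() here
        done = cursorMark.equals(nextMark);
        cursorMark = nextMark;
    }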


Re: Facets based on sampling

2016-11-04 Thread Yonik Seeley
Sampling has been on my TODO list for the JSON Facet API.
How much it would help depends on where the bottlenecks are, but that
in conjunction with a hashing approach to collection (assuming field
cardinality is high) should definitely help.

-Yonik


On Fri, Nov 4, 2016 at 3:02 PM, John Davis  wrote:
> Hi,
> I am trying to improve the performance of queries with facets. I understand
> that for queries with high facet cardinality and a large number of results the
> current facet computation algorithms can be slow, as they are trying to loop
> across all docs and facet values.
>
> Does there exist an option to compute facets by just looking at the top-n
> results instead of all of them or a sample of results based on some query
> parameters? I couldn't find one and if it does not exist, has this come up
> before? This would definitely not be a precise facet count but using
> reasonable sampling algorithms we should be able to extrapolate well.
>
> Thank you in advance for any advice!
>
> John


Re: Aggregate Values Inside a Facet Range

2016-11-04 Thread Yonik Seeley
On Fri, Nov 4, 2016 at 2:25 PM, Furkan KAMACI  wrote:
> I mean, I have to facet by dates and aggregate values inside that facet
> range. Is it possible to do that without multiple queries to Solr?

This (old) blog shows a percentiles calculation under a range facet:
http://yonik.com/percentiles-for-solr-faceting/

-Yonik
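
Applied to the date case above, a range facet with a nested aggregation
might look like this (a sketch; issuedate_dt and amount are assumed field
names):

    json.facet={
      by_month:{
        type : range,
        field : issuedate_dt,
        start : "2016-01-01T00:00:00Z",
        end   : "2017-01-01T00:00:00Z",
        gap   : "+1MONTH",
        facet : {
          total : "sum(amount)"
        }
      }
    }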


Re: Merge policy

2016-10-27 Thread Yonik Seeley
On Thu, Oct 27, 2016 at 9:56 AM, Arkadi Colson  wrote:

> Thanks for the answer!
> Do you know if there is a way to trigger an optimize for only 1 shard and
> not the whole collection at once?
>

Adding a "distrib=false" parameter should work I think.

-Yonik
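
For example, hitting one core directly (a sketch; the host and core name are
made up) optimizes just that shard's index:

    curl "http://localhost:8983/solr/collection1_shard1_replica1/update?optimize=true&distrib=false"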


Re: JSON Facet Syntax Sorting

2016-10-26 Thread Yonik Seeley
On Wed, Oct 26, 2016 at 3:16 AM, Zheng Lin Edwin Yeo
 wrote:
> Hi,
>
> I'm using Solr 6.2.1.
>
> For the JSON Facet Syntax, are we able to sort on multiple values at one go?
>
> For example, if I want to sort by count, followed by the average price,
> is this the correct way to do it?

Sorting by multiple metrics isn't yet supported.

-Yonik

>  json.facet={
>categories:{
>  type : terms,
>  field : cat,
>  sort : { count : desc},
>  sort : { x : desc},
>  facet:{
>x : "avg(price)",
>y : "sum(price)"
>  }
>}
>  }
>
>
> Regards,
> Edwin
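
Until multi-metric sorting is supported, the request has to pick a single
sort key. A sketch of the supported form, sorting the buckets by the
avg(price) metric alone:

    json.facet={
      categories:{
        type : terms,
        field : cat,
        sort : { x : desc },
        facet:{
          x : "avg(price)",
          y : "sum(price)"
        }
      }
    }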


Re: Graph Traversal Question

2016-10-26 Thread Yonik Seeley
On Wed, Oct 26, 2016 at 7:13 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> On Tue, Oct 25, 2016 at 6:26 PM Yonik Seeley <ysee...@gmail.com> wrote:
>
> In your example below it would be akin to injecting the rating onto those
> responses as well, not just in the 'fq'.

Gotcha... Yeah, I remember wondering how to do that myself.

-Yonik


Re: Does _version_ field in schema need to be indexed and/or stored?

2016-10-25 Thread Yonik Seeley
On Tue, Oct 25, 2016 at 6:41 PM, Brent  wrote:
> I know that in the sample config sets, the _version_ field is indexed and not
> stored, like so:
>
> <field name="_version_" type="long" indexed="true" stored="false"/>
>
> Is there any reason it needs to be indexed?

It may depend on your solr version, but the starting configsets
currently only have docvalues:

./solr/server/solr/configsets/basic_configs/conf/managed-schema:
<field name="_version_" type="long" indexed="false" stored="false" docValues="true"/>

-Yonik
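
To see what a deployed core actually has for the field, the Schema API can
be queried directly (a sketch; adjust the host and core name):

    curl "http://localhost:8983/solr/mycore/schema/fields/_version_?showDefaults=true"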

