Re: Enabling/disabling docValues

2019-06-10 Thread John Davis
You have made many assumptions which might not always be realistic: a)
TextField is always tokenized, b) users care about precise counts, and c)
users have the luxury or ability to do a full re-index at any time. These are
real issues and there is no black-and-white solution. I will ask the Lucene
folks about the actual implementation.

On Mon, Jun 10, 2019 at 10:55 AM Erick Erickson 
wrote:

> bq. Does lucene look at %docs in each state, or the first doc or something
> else?
>
> Frankly I don’t care since no matter what, the results of faceting mixed
> definitions is not useful.
>
> tl;dr;
>
> “When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it
> means just what I choose it to mean — neither more nor less.’
>
> So “undefined" in this case means “I don’t see any value at all in chasing
> that info down” ;).
>
> Changing from regular text to SortableText means that the results will be
> inaccurate no matter what. For example, I have a doc with the value “my dog
> has fleas”. When NOT using SortableText, there are multiple tokens so facet
> counts would be:
>
> my (1)
> dog (1)
> has (1)
> fleas (1)
>
> But for SortableText it will be:
>
> my dog has fleas (1)
>
> Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”.
> doc1 was  indexed before switching to SortableText and doc2 after.
> Presumably  the output you want is:
>
> my dog has fleas (1)
> my cat has fleas (1)
>
> But you can’t get that output.  There are three cases:
>
> 1> Lucene treats all documents as SortableText, faceting on the docValues
> parts. No facets on doc1
>
> my  cat has fleas (1)
>
> 2> Lucene treats all documents as tokenized, faceting on each individual
> token. Faceting is performed on the tokenized content of both,  docValues
> in doc2  ignored
>
> my  (2)
> dog (1)
> has (2)
> fleas (2)
> cat (1)
>
>
> 3> Lucene does the best it can, faceting on the tokens for docs without
> SortableText and docValues if the doc was indexed with Sortable text. doc1
> faceted on tokenized, doc2 on docValues
>
> my  (1)
> dog (1)
> has (1)
> fleas (1)
> my cat has fleas (1)
>
> Since none of those cases is what I want, there’s no point I can see in
> chasing down what actually happens….
>
> Best,
> Erick
>
> P.S. I _think_ Lucene tries to use the definition from the first segment,
> but since the lists of segments to be merged don't take the field
> definitions into account at all, whether the first segment in the list has
> SortableText or not will not be predictable in a general way even within a
> single run.
>
>
> > On Jun 9, 2019, at 6:53 PM, John Davis 
> wrote:
> >
> > Understood, however code is rarely random/undefined. Does lucene look at
> %
> > docs in each state, or the first doc or something else?
> >
> > On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson 
> > wrote:
> >
> >> It’s basically undefined. When segments are merged that have dissimilar
> >> definitions like this what can Lucene do? Consider:
> >>
> >> Faceting on a text (not sortable) means that each individual token in
> the
> >> index is uninverted on the Java heap and the facets are computed for
> each
> >> individual term.
> >>
> >> Faceting on a SortableText field just has a single term per document,
> and
> >> that in the docValues structures as opposed to the inverted index.
> >>
> >> Now you change the value and start indexing. At some point a segment
> >> containing no docValues is merged with a segment containing docValues
> for
> >> the field. The resulting mixed segment is in this state. If you facet on
> >> the field, should the docs without docValues have each individual term
> >> counted? Or just the SortableText values in the docValues structure?
> >> Neither one is right.
> >>
> >> Also remember that Lucene has no notion of schema. That’s entirely
> imposed
> >> on Lucene by Solr carefully constructing low-level analysis chains.
> >>
> >> So I’d _strongly_ recommend you re-index your corpus to a new collection
> >> with the current definition, then perhaps use CREATEALIAS to seamlessly
> >> switch.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 9, 2019, at 12:50 PM, John Davis 
> >> wrote:
> >>>
> >>> Hi there,
> >>> We recently changed a field from TextField + no docValues to
> >>> SortableTextField which has docValues enabled by default. Once I did
> >> this I
> >>> do not see any facet values for the field. I know that once all the
> docs
> >>> are re-indexed facets should work again, however can someone clarify
> the
> >>> current logic of lucene/solr how facets will be computed when schema is
> >>> changed from no docValues to docValues and vice-versa?
> >>>
> >>> 1. Until ALL the docs are re-indexed, no facets will be returned?
> >>> 2. Once certain fraction of docs are re-indexed, those facets will be
> >>> returned?
> >>> 3. Something else?
> >>>
> >>>
> >>> Varun
> >>
> >>
>
>


How to increase maximum size of files allowed in configuration for MiniSolrCloudCluster

2019-06-10 Thread Pratik Patel
Hi,

I am trying to upload a configuration to "MiniSolrCloudCluster" in my unit
test. This configuration has some binary files for NLP related
functionality. Some of these binary files are bigger than 5 MB. If I try to
upload configuration with these files then it doesn't work. I can set up
the cluster fine if I remove all binary files bigger than 5 MB.

I have noticed the same issue when I try to restore a backup having
configuration files bigger than 5 MB.

Does jetty have some limit on the size of configuration files? Is there a
way to override this?

Thanks,
Pratik
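
P.S. One thing I still plan to try (a guess on my part, not yet verified against
the 5 MB threshold above): configsets uploaded to MiniSolrCloudCluster are stored
as znodes in the embedded ZooKeeper, and ZooKeeper caps znode size via the
jute.maxbuffer system property (roughly 1 MB by default). A minimal sketch, set
before the cluster is built:

    // assumption: the limit being hit is ZooKeeper's znode size cap (jute.maxbuffer)
    System.setProperty("jute.maxbuffer", Integer.toString(32 * 1024 * 1024));
    // ... then build the MiniSolrCloudCluster and upload the configset as usual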


Re: Sort date stored in text field?

2019-06-10 Thread Shawn Heisey

On 6/10/2019 3:26 PM, Dave Beckstrom wrote:

I have a field called metatag.date that is field-type
org.apache.solr.schema.TextField. The field is being populated by NUTCH,
which grabs the date from the HTML:





I'm trying to sort by date   (metatag.date desc) passed on the URL and it's
not working.


In general, sorting by a field using the TextField class is probably not 
going to do what you think it will.


Can you share the full field definition in the schema, and the fieldType 
definition for the type used in the field definition?


Probably what you'll need to do is copyField that to another field 
that's using the StrField class (usually defined as "string" in most 
schemas), and ensure it has docValues turned on.  For sorting, you could 
make it indexed="false" and stored="false" -- just enable docValues.
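
A sketch of what that could look like (the sort field name here is made up; 
adjust to your own schema):

  <field name="metatag.date" type="text_general" indexed="true" stored="true"/>
  <field name="metatag.date_sort" type="string" indexed="false" stored="false" docValues="true"/>
  <copyField source="metatag.date" dest="metatag.date_sort"/>

Then sort on the new field instead, e.g. sort=metatag.date_sort desc.  Keep in 
mind a string sort is purely lexicographic, so it only orders correctly if the 
date text itself sorts in the order you want.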


Thanks,
Shawn


Sort date stored in text field?

2019-06-10 Thread Dave Beckstrom
Hi Everyone,

Running SOLR 7.3.1

I have a field called metatag.date that is field-type
org.apache.solr.schema.TextField. The field is being populated by NUTCH,
which grabs the date from the HTML:



and stores it in the  metatag.date field in SOLR.

I'm trying to sort by date   (metatag.date desc) passed on the URL and it's
not working.

I don't think I have an option of using the field-type  DatePointField
because I have no way of formatting the date in the format:
1995-12-31T23:59:59Z

Anyone have any suggestions?

-- 
Fig Leaf Software is now Collective FLS, Inc.
Collective FLS, Inc.
https://www.collectivefls.com/






Scoring to Synonym in query

2019-06-10 Thread Rathor, Piyush
Hi Team

Do we have a mechanism to provide a score in a query for synonym search?
My synonym field is called - "First_syn"

Thanks



Re: Collections API timeout

2019-06-10 Thread Софія Строчик
Yes, I've checked them and all nodes are pointing to the same IP and the
same port (2181). Also all of them are visible in the SolrCloud Graph
section so this would mean they are part of the same cloud.

Largest file is solrconfig which is 58K so this shouldn't be a problem
either.
The potential problem I see is that one of the nodes is loading cloud info
(specifically the *Tree* section) more slowly than the others.
It reaches the admin interface timeout and displays the "Connection to Solr
lost" message when accessed from UI.
But the same request
(http://ip:port/solr/admin/zookeeper?_=1560198364377&wt=json)
works if issued from the command line.

I've checked the logs of the corresponding instance and can see entries
like this on startup:
2019-06-10 20:26:17.061 ERROR (qtp1335503880-18) [   ]
o.a.s.c.c.ZkStateReader Collection collection2 is not lazy or watched!
The other instances don't have these messages so maybe it is related to the
loading issue, but I'm also not sure about this because I can't find any
further information on this error.

Mon, Jun 10, 2019 at 22:05, Erick Erickson wrote:

> Hmmm, I didn’t really look carefully at the end of your e-mail. There not
> being an /overseer znode _looks_ like one or more of your Solr nodes isn’t
> connecting to the proper ZooKeeper ensemble.
>
> bq. All of the instances are able to talk to zookeeper (they are
> >
> >>> displayed as active in the SolrCloud view, so they must be able to
> >> connect,
> >>> right?).
>
> Well, maybe or maybe not. The particular Solr node that you’re working on
> can see ZK, true. But are all of them looking  at the _same_ ensemble? Are
> any of the Solr nodes somehow  running with embedded ZooKeeper through a
> typo or something? And since that’s in the  ZooKeeper log, is the ensemble
> properly configured?  For troubleshooting _only_, I might go back to a
> single ZK instance just long enough to eliminate that possibility.
>
> bq. o.a.s.s.SolrDispatchFilter Could not consume full client request
> >>> org.eclipse.jetty.io.EofException: Early EOF
>
> This usually indicates either massive requests or a mis-configured jetty
> such that the request size exceeds the max allowed. There are  a few
> settings that can be extended, but this is pretty unusual. Unless you have
> lots and lots and lots of nodes, the request size should be reasonably
> small.
>
> Hmmm, do you  have any massive files in your config (schema, solrconfig,
> synonym files, etc?)? There is a 1M default limit on the size of files,
> perhaps you’re exceeding that. One test would be to use a minimal configset
> to see if that encounters the same issue.
>
> Best,
> Erick
>
>
> > On Jun 10, 2019, at 11:51 AM, Софія Строчик  wrote:
> >
> > Hi Erick, thanks for your reply!
> >
> > I didn't mention it but we have tried async requests. Then it does not
> time
> > out of course, but instead appears to run indefinitely, with
> REQUESTSTATUS
> > response like this:
> > {
> >  "responseHeader":{
> >"status":0,
> >"QTime":1},
> >  "status":{
> >"state":"submitted",
> >"msg":"found [123] in submitted tasks"}}
> >
> > These requests then pile up in zookeeper's collection-queue-work without
> > ever moving to the completed or failed status.
> >
> > While I guess some operations are expensive and can run for a long time,
> it
> > doesn't seem likely that all of these have to take hours (without high
> load
> > on any of the servers!)
> >
> > Maybe you have any other suggestions because this one doesn't seem to be
> > the case :(
> >
> > Mon, Jun 10, 2019 at 21:14, Erick Erickson wrote:
> >
> >> Certainly at times  some things  just  take a  long time. The 180
> >> second timeout is fairly arbitrary.
> >> GC pauses, creating a zillion replicas etc. can cause timeouts like
> >> this to be exceeded.
> >>
> >> Rather than rely on lengthening some magic timeout value and hoping, I
> >> suggest you use
> >> the async option, see:
> >> https://lucene.apache.org/solr/guide/7_3/collections-api.html
> >>
> >> Then you need to periodically check the status of that job to see the
> >> completion status.
> >>
> >> Do note  this bit in particular:
> >>
> >> As of now, REQUESTSTATUS does not automatically clean up the tracking
> >> data structures...
> >>
> >> in the  link above.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Jun 10, 2019 at 11:07 AM Софія Строчик 
> wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> recently when trying to delete a collection we have noticed that all
> >> calls
> >>> to the Collections API time out after 180s.
> >>> Something similar is described here
> >>> <
> >>
> http://lucene.472066.n3.nabble.com/Can-t-create-collection-td4314225.html>
> >>> however
> >>> restarting the instance or the server does not help.
> >>>
> >>> *This is what the response to the API call looks like:*
> >>> {
> >>>  "responseHeader":{
> >>>"status":500,
> >>>"QTime":180163},
> >>>  "error":{
> >>>"metadata":[
> >>>  "error-class","org.apache.solr.common.SolrException",
> >>>   

Re: Collections API timeout

2019-06-10 Thread Erick Erickson
Hmmm, I didn’t really look carefully at the end of your e-mail. There not being 
an /overseer znode _looks_ like one or more of your Solr nodes isn’t connecting 
to the proper ZooKeeper ensemble.

bq. All of the instances are able to talk to zookeeper (they are displayed as
active in the SolrCloud view, so they must be able to connect, right?).

Well, maybe or maybe not. The particular Solr node that you’re working on can 
see ZK, true. But are all of them looking  at the _same_ ensemble? Are any of 
the Solr nodes somehow  running with embedded ZooKeeper through a typo or 
something? And since that’s in the  ZooKeeper log, is the ensemble properly 
configured?  For troubleshooting _only_, I might go back to a single ZK 
instance just long enough to eliminate that possibility.

bq. o.a.s.s.SolrDispatchFilter Could not consume full client request
>>> org.eclipse.jetty.io.EofException: Early EOF

This usually indicates either massive requests or a mis-configured jetty such 
that the request size exceeds the max allowed. There are  a few settings that 
can be extended, but this is pretty unusual. Unless you have lots and lots and 
lots of nodes, the request size should be reasonably small.

Hmmm, do you  have any massive files in your config (schema, solrconfig, 
synonym files, etc?)? There is a 1M default limit on the size of files, perhaps 
you’re exceeding that. One test would be to use a minimal configset to see if 
that encounters the same issue.

Best,
Erick


> On Jun 10, 2019, at 11:51 AM, Софія Строчик  wrote:
> 
> Hi Erick, thanks for your reply!
> 
> I didn't mention it but we have tried async requests. Then it does not time
> out of course, but instead appears to run indefinitely, with REQUESTSTATUS
> response like this:
> {
>  "responseHeader":{
>"status":0,
>"QTime":1},
>  "status":{
>"state":"submitted",
>"msg":"found [123] in submitted tasks"}}
> 
> These requests then pile up in zookeeper's collection-queue-work without
> ever moving to the completed or failed status.
> 
> While I guess some operations are expensive and can run for a long time, it
> doesn't seem likely that all of these have to take hours (without high load
> on any of the servers!)
> 
> Maybe you have any other suggestions because this one doesn't seem to be
> the case :(
> 
> Mon, Jun 10, 2019 at 21:14, Erick Erickson wrote:
> 
>> Certainly at times  some things  just  take a  long time. The 180
>> second timeout is fairly arbitrary.
>> GC pauses, creating a zillion replicas etc. can cause timeouts like
>> this to be exceeded.
>> 
>> Rather than rely on lengthening some magic timeout value and hoping, I
>> suggest you use
>> the async option, see:
>> https://lucene.apache.org/solr/guide/7_3/collections-api.html
>> 
>> Then you need to periodically check the status of that job to see the
>> completion status.
>> 
>> Do note  this bit in particular:
>> 
>> As of now, REQUESTSTATUS does not automatically clean up the tracking
>> data structures...
>> 
>> in the  link above.
>> 
>> Best,
>> Erick
>> 
>> On Mon, Jun 10, 2019 at 11:07 AM Софія Строчик  wrote:
>>> 
>>> Hi everyone,
>>> 
>>> recently when trying to delete a collection we have noticed that all
>> calls
>>> to the Collections API time out after 180s.
>>> Something similar is described here
>>> <
>> http://lucene.472066.n3.nabble.com/Can-t-create-collection-td4314225.html>
>>> however
>>> restarting the instance or the server does not help.
>>> 
>>> *This is what the response to the API call looks like:*
>>> {
>>>  "responseHeader":{
>>>"status":500,
>>>"QTime":180163},
>>>  "error":{
>>>"metadata":[
>>>  "error-class","org.apache.solr.common.SolrException",
>>>  "root-error-class","org.apache.solr.common.SolrException"],
>>>"msg":"overseerstatus the collection time out:180s",
>>>"trace":"org.apache.solr.common.SolrException: overseerstatus the
>>> collection time out:180s\n\tat
>>> 
>> org.apache.solr.handler.admin.CollectionsHandler.sendToOCPQueue(CollectionsHandler.java:367)\n\tat
>>> 
>> org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:272)\n\tat
>>> 
>> org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:246)\n\tat
>>> 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat
>>> 
>> org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)\n\tat
>>> 
>> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)\n\tat
>>> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)\n\tat
>>> 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat
>>> 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat
>>> 
>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat
>>> 
>> 

[ANNOUNCE] Apache Solr Reference Guide for 8.0 released

2019-06-10 Thread Cassandra Targett
The Lucene PMC is pleased to announce that the Solr Reference Guide for 8.0
is available.

This 1,452 page PDF is the definitive guide to Apache Solr, the search
server built on Apache Lucene.

The PDF can be downloaded from:
https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/apache-solr-ref-guide-8.0.pdf

The Guide is also available online, at
http://lucene.apache.org/solr/guide/8_0/.

While the Guide for 8.0 was delayed quite a bit after the release of 8.0
binaries, we don't anticipate the same delay for the 8.1 Guide, and are
working to make it available as soon as possible.

Regards,
The Lucene PMC


Re: Collections API timeout

2019-06-10 Thread Софія Строчик
Hi Erick, thanks for your reply!

I didn't mention it but we have tried async requests. Then it does not time
out of course, but instead appears to run indefinitely, with REQUESTSTATUS
response like this:
{
  "responseHeader":{
"status":0,
"QTime":1},
  "status":{
"state":"submitted",
"msg":"found [123] in submitted tasks"}}

These requests then pile up in zookeeper's collection-queue-work without
ever moving to the completed or failed status.

While I guess some operations are expensive and can run for a long time, it
doesn't seem likely that all of these have to take hours (without high load
on any of the servers!)

Maybe you have any other suggestions because this one doesn't seem to be
the case :(

Mon, Jun 10, 2019 at 21:14, Erick Erickson wrote:

> Certainly at times  some things  just  take a  long time. The 180
> second timeout is fairly arbitrary.
> GC pauses, creating a zillion replicas etc. can cause timeouts like
> this to be exceeded.
>
> Rather than rely on lengthening some magic timeout value and hoping, I
> suggest you use
> the async option, see:
> https://lucene.apache.org/solr/guide/7_3/collections-api.html
>
> Then you need to periodically check the status of that job to see the
> completion status.
>
> Do note  this bit in particular:
>
> As of now, REQUESTSTATUS does not automatically clean up the tracking
> data structures...
>
> in the  link above.
>
> Best,
> Erick
>
> On Mon, Jun 10, 2019 at 11:07 AM Софія Строчик  wrote:
> >
> > Hi everyone,
> >
> > recently when trying to delete a collection we have noticed that all
> calls
> > to the Collections API time out after 180s.
> > Something similar is described here
> > <
> http://lucene.472066.n3.nabble.com/Can-t-create-collection-td4314225.html>
> > however
> > restarting the instance or the server does not help.
> >
> > *This is what the response to the API call looks like:*
> > {
> >   "responseHeader":{
> > "status":500,
> > "QTime":180163},
> >   "error":{
> > "metadata":[
> >   "error-class","org.apache.solr.common.SolrException",
> >   "root-error-class","org.apache.solr.common.SolrException"],
> > "msg":"overseerstatus the collection time out:180s",
> > "trace":"org.apache.solr.common.SolrException: overseerstatus the
> > collection time out:180s\n\tat
> >
> org.apache.solr.handler.admin.CollectionsHandler.sendToOCPQueue(CollectionsHandler.java:367)\n\tat
> >
> org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:272)\n\tat
> >
> org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:246)\n\tat
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat
> >
> org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)\n\tat
> >
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)\n\tat
> > org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)\n\tat
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat
> >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat
> >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat
> >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
> >
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat
> >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
> >
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat
> >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)\n\tat
> >
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat
> >
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat
> >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
> >
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat
> >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)\n\tat
> >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat
> >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
> >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
> >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
> >
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
> >
> 

Re: Collections API timeout

2019-06-10 Thread Erick Erickson
Certainly at times  some things  just  take a  long time. The 180
second timeout is fairly arbitrary.
GC pauses, creating a zillion replicas etc. can cause timeouts like
this to be exceeded.

Rather than rely on lengthening some magic timeout value and hoping, I
suggest you use
the async option, see:
https://lucene.apache.org/solr/guide/7_3/collections-api.html

Then you need to periodically check the status of that job to see the
completion status.

Do note  this bit in particular:

As of now, REQUESTSTATUS does not automatically clean up the tracking
data structures...

in the  link above.
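
As a concrete example of the flow (collection name and request id below are
just placeholders):

  /admin/collections?action=DELETE&name=myCollection&async=del-req-1
  /admin/collections?action=REQUESTSTATUS&requestid=del-req-1
  /admin/collections?action=DELETESTATUS&requestid=del-req-1

The last call clears the tracking entry once you are done polling.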

Best,
Erick

On Mon, Jun 10, 2019 at 11:07 AM Софія Строчик  wrote:
>
> Hi everyone,
>
> recently when trying to delete a collection we have noticed that all calls
> to the Collections API time out after 180s.
> > Something similar is described here
> > (http://lucene.472066.n3.nabble.com/Can-t-create-collection-td4314225.html),
> > however restarting the instance or the server does not help.
>
> *This is what the response to the API call looks like:*
> {
>   "responseHeader":{
> "status":500,
> "QTime":180163},
>   "error":{
> "metadata":[
>   "error-class","org.apache.solr.common.SolrException",
>   "root-error-class","org.apache.solr.common.SolrException"],
> "msg":"overseerstatus the collection time out:180s",
> "trace":"org.apache.solr.common.SolrException: overseerstatus the
> collection time out:180s\n\tat
> org.apache.solr.handler.admin.CollectionsHandler.sendToOCPQueue(CollectionsHandler.java:367)\n\tat
> org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:272)\n\tat
> org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:246)\n\tat
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat
> org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)\n\tat
> org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)\n\tat
> org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)\n\tat
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)\n\tat
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
> org.eclipse.jetty.server.Server.handle(Server.java:531)\n\tat
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)\n\tat
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)\n\tat
> org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat
> org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)\n\tat
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)\n\tat
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)\n\tat
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)\n\tat
> 

Collections API timeout

2019-06-10 Thread Софія Строчик
Hi everyone,

recently when trying to delete a collection we have noticed that all calls
to the Collections API time out after 180s.
Something similar is described here
(http://lucene.472066.n3.nabble.com/Can-t-create-collection-td4314225.html),
however restarting the instance or the server does not help.

*This is what the response to the API call looks like:*
{
  "responseHeader":{
"status":500,
"QTime":180163},
  "error":{
"metadata":[
  "error-class","org.apache.solr.common.SolrException",
  "root-error-class","org.apache.solr.common.SolrException"],
"msg":"overseerstatus the collection time out:180s",
"trace":"org.apache.solr.common.SolrException: overseerstatus the
collection time out:180s\n\tat
org.apache.solr.handler.admin.CollectionsHandler.sendToOCPQueue(CollectionsHandler.java:367)\n\tat
org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:272)\n\tat
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:246)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:734)\n\tat
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:715)\n\tat
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1317)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1219)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:531)\n\tat
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)\n\tat
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)\n\tat
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)\n\tat
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)\n\tat
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:762)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:680)\n\tat
java.lang.Thread.run(Thread.java:745)\n",
"code":500}}

*The errors look like this in the logs:*

2019-06-10 15:37:19.446 ERROR (qtp315932542-5748) [   ]
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: reload the
collection time out:180s
at
org.apache.solr.handler.admin.CollectionsHandler.sendToOCPQueue(CollectionsHandler.java:367)
at
org.apache.solr.handler.admin.CollectionsHandler.invokeAction(CollectionsHandler.java:272)
at
org.apache.solr.handler.admin.CollectionsHandler.handleRequestBody(CollectionsHandler.java:246)
at

Re: Enabling/disabling docValues

2019-06-10 Thread Erick Erickson
bq. Does lucene look at %docs in each state, or the first doc or something else?

Frankly I don’t care since no matter what, the results of faceting mixed 
definitions is not useful.

tl;dr;

“When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it means 
just what I choose it to mean — neither more nor less.’

So “undefined" in this case means “I don’t see any value at all in chasing that 
info down” ;).

Changing from regular text to SortableText means that the results will be 
inaccurate no matter what. For example, I have a doc with the value “my dog has 
fleas”. When NOT using SortableText, there are multiple tokens so facet counts 
would be:

my (1)
dog (1)
has (1)
fleas (1)

But for SortableText it will be:

my dog has fleas (1)

Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”. doc1 
was  indexed before switching to SortableText and doc2 after. Presumably  the 
output you want is:

my dog has fleas (1)
my cat has fleas (1)

But you can’t get that output.  There are three cases:

1> Lucene treats all documents as SortableText, faceting on the docValues 
parts. No facets on doc1

my  cat has fleas (1) 

2> Lucene treats all documents as tokenized, faceting on each individual token. 
Faceting is performed on the tokenized content of both,  docValues in doc2  
ignored

my  (2)
dog (1)
has (2)
fleas (2)
cat (1)


3> Lucene does the best it can, faceting on the tokens for docs without 
SortableText and docValues if the doc was indexed with Sortable text. doc1 
faceted on tokenized, doc2 on docValues

my  (1)
dog (1)
has (1)
fleas (1)
my cat has fleas (1)

Since none of those cases is what I want, there’s no point I can see in chasing 
down what actually happens….

Best,
Erick

P.S. I _think_ Lucene tries to use the definition from the first segment, but 
since the lists of segments to be merged don't take the field definitions into 
account at all, whether the first segment in the list has SortableText or not 
will not be predictable in a general way even within a single run.
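
For concreteness, the kind of schema change being discussed looks roughly like 
the sketch below. It is only an illustration; the field and type names are 
assumed, not taken from your actual schema:

  <!-- before: plain TextField, no docValues; faceting uninverts the individual tokens -->
  <field name="title" type="text_general" indexed="true" stored="true"/>

  <!-- after: SortableTextField, which enables docValues by default; faceting then
       sees a single term per document -->
  <fieldType name="text_sortable" class="solr.SortableTextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="title" type="text_sortable" indexed="true" stored="true"/>

Documents indexed under the first definition and documents indexed under the 
second carry incompatible per-field data, which is why none of the three merged 
outcomes above is useful.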


> On Jun 9, 2019, at 6:53 PM, John Davis  wrote:
> 
> Understood, however code is rarely random/undefined. Does lucene look at %
> docs in each state, or the first doc or something else?
> 
> On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson 
> wrote:
> 
>> It’s basically undefined. When segments are merged that have dissimilar
>> definitions like this what can Lucene do? Consider:
>> 
>> Faceting on a text (not sortable) means that each individual token in the
>> index is uninverted on the Java heap and the facets are computed for each
>> individual term.
>> 
>> Faceting on a SortableText field just has a single term per document, and
>> that in the docValues structures as opposed to the inverted index.
>> 
>> Now you change the value and start indexing. At some point a segment
>> containing no docValues is merged with a segment containing docValues for
>> the field. The resulting mixed segment is in this state. If you facet on
>> the field, should the docs without docValues have each individual term
>> counted? Or just the SortableText values in the docValues structure?
>> Neither one is right.
>> 
>> Also remember that Lucene has no notion of schema. That’s entirely imposed
>> on Lucene by Solr carefully constructing low-level analysis chains.
>> 
>> So I’d _strongly_ recommend you re-index your corpus to a new collection
>> with the current definition, then perhaps use CREATEALIAS to seamlessly
>> switch.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 9, 2019, at 12:50 PM, John Davis 
>> wrote:
>>> 
>>> Hi there,
>>> We recently changed a field from TextField + no docValues to
>>> SortableTextField which has docValues enabled by default. Once I did
>> this I
>>> do not see any facet values for the field. I know that once all the docs
>>> are re-indexed facets should work again, however can someone clarify the
>>> current logic of lucene/solr how facets will be computed when schema is
>>> changed from no docValues to docValues and vice-versa?
>>> 
>>> 1. Until ALL the docs are re-indexed, no facets will be returned?
>>> 2. Once certain fraction of docs are re-indexed, those facets will be
>>> returned?
>>> 3. Something else?
>>> 
>>> 
>>> Varun
>> 
>> 



Re: Loading pre created index files into MiniSolrCloudCluster of test framework

2019-06-10 Thread Pratik Patel
So, I found a way to programmatically restore a collection from a backup.
I thought that I could create a backup of a collection, put it on the
classpath, restore it during unit test set up, and run the queries against
the newly created collection produced by the restore.
Theoretically, it sounded like it would work.

I have following code doing the restore.

CollectionAdminRequest.Restore restore =
    CollectionAdminRequest.restoreCollection( newCollectionName, backupName )
                          .setLocation( pathToBackup );

CollectionAdminResponse resp = restore.process( cluster.getSolrClient() );

AbstractDistribZkTestBase.waitForRecoveriesToFinish(
    newCollectionName, cluster.getSolrClient().getZkStateReader(), true, true, 30 );


However, any query I run against this new collection returns zero
documents. I have tried queries which should match many documents but they
all return zero documents. It seems like the data is not really loaded
during the restore operation.
I stepped through the "doRestore()" method of the class RestoreCore.java, which
internally performs the restore; it runs with no errors or exceptions
and the restore operation status is successful, but in reality there is no
data in the new collection. I see that the new collection is created, but it
seems to be without any data.

Am I missing something here? Any idea what could be the cause of this?

Thanks!
Pratik








On Thu, Jun 6, 2019 at 11:18 AM Pratik Patel  wrote:

> Thanks for the reply Alexandre, only special thing about JSON/XML is that
> in order to export the data in that form, I need to have "docValues"
> enabled for all the fields which are to be retrieved. I need to retrieve
> all the fields and I can not enable docValues on all fields.
> If there was a way to export data in JSON format without having to change
> schema and index then I would have no issues with JSON.
> I can not use "select" handler as it does not include parent/child
> relationships.
>
> The options I have are following I guess. I am not sure if they are real
> possibilities though.
>
> 1. Find a way to load pre-created index files either through
> SolrCloudClient or directly to ZK
> 2. Find a way to export the data in JSON format without having to make all
> fields docValues enabled.
> 3. Use the Merge Index tool with an empty index and a real index. I don't
> know if it is possible to do this through SolrJ though.
>
> Please let me know if there is better way available, it would really help.
> Just so you know, I am trying to do this for unit tests related to solr
> queries. Ultimately I want to load some pre-created data into
> MiniSolrCloudCluster.
>
> Thanks a lot,
> Pratik
>
>
> On Wed, Jun 5, 2019 at 6:56 PM Alexandre Rafalovitch 
> wrote:
>
>> Is there something special about parent/child blocks you cannot do through
>> JSON? Or XML?
>>
>> Both Solr XML and Solr JSON support it.
>>
>> New style parent/child mapping is also supported in latest Solr but I
>> think
>> it is done differently.
>>
>> Regards,
>> Alex
>>
>> On Wed, Jun 5, 2019, 6:29 PM Pratik Patel,  wrote:
>>
>> > Hello Everyone,
>> >
>> > I am trying to write some unit tests for solr queries which requires
>> some
>> > data in specific state. There is a way to load this data through json
>> files
>> > but the problem is that the required data needs to have parent-child
>> blocks
>> > to be present.
>> > Because of this, I would prefer if there is a way to load pre-created
>> index
>> > files into the cluster.
>> > I checked the solr test framework and related examples but couldn't find
>> > any example of index files being loaded in cloud mode.
>> >
>> > Is there a way to load index files into solr running in cloud mode?
>> >
>> > Thanks!
>> > Pratik
>> >
>>
>


Re: highlighting not working as expected

2019-06-10 Thread David Smiley
Please try hl.method=unified and tell us if that helps.
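
For example, taking the request from your first mail and just adding the
parameter (the other parameter names are reconstructed here, since the archived
URL lost its ampersands):

http://localhost/solr/mytest/select?q=rotte&hl=on&hl.fl=Sagstitel&hl.method=unified&fl=id,doc.Type,Journalnummer,Sagstitel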

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


On Mon, Jun 3, 2019 at 4:06 AM Martin Frank Hansen (MHQ)  wrote:

> Hi,
>
> I am having some difficulties making highlighting work. For some reason
> the highlighting feature only works on some fields but not on other fields
> even though these fields are stored.
>
> An example of a request looks like this:
> http://localhost/solr/mytest/select?fl=id,doc.Type,Journalnummer,Sagstitel&hl.fl=Sagstitel&hl.simple.post=%3C/b%3E&hl.simple.pre=%3Cb%3E&hl=on&q=rotte
>
> It simply returns an empty set, for all documents even though I can see
> several documents which have “Sagstitel” containing the word “rotte”
> (rotte=rat).  What am I missing here?
>
> I am using the standard highlighter as below.
>
>
> <searchComponent class="solr.HighlightComponent" name="highlight">
>   <highlighting>
>     <!-- Configure the standard fragmenter -->
>     <fragmenter name="gap"
>                 default="true"
>                 class="solr.highlight.GapFragmenter">
>       <lst name="defaults">
>         <int name="hl.fragsize">100</int>
>       </lst>
>     </fragmenter>
>
>     <fragmenter name="regex"
>                 class="solr.highlight.RegexFragmenter">
>       <lst name="defaults">
>         <int name="hl.fragsize">70</int>
>         <float name="hl.regex.slop">0.5</float>
>         <str name="hl.regex.pattern">[-\w ,/\n&quot;&apos;]{20,200}</str>
>       </lst>
>     </fragmenter>
>
>     <formatter name="html"
>                default="true"
>                class="solr.highlight.HtmlFormatter">
>       <lst name="defaults">
>         <str name="hl.simple.pre"><![CDATA[<b>]]></str>
>         <str name="hl.simple.post"><![CDATA[</b>]]></str>
>       </lst>
>     </formatter>
>
>     <encoder name="html"
>              class="solr.highlight.HtmlEncoder" />
>
>     <fragListBuilder name="simple"
>                      class="solr.highlight.SimpleFragListBuilder"/>
>
>     <fragListBuilder name="single"
>                      class="solr.highlight.SingleFragListBuilder"/>
>
>     <fragListBuilder name="weighted"
>                      default="true"
>                      class="solr.highlight.WeightedFragListBuilder"/>
>
>     <fragmentsBuilder name="default"
>                       default="true"
>                       class="solr.highlight.ScoreOrderFragmentsBuilder">
>     </fragmentsBuilder>
>
>     <fragmentsBuilder name="colored"
>                       class="solr.highlight.ScoreOrderFragmentsBuilder">
>       <lst name="defaults">
>       </lst>
>     </fragmentsBuilder>
>
>     <boundaryScanner name="default"
>                      default="true"
>                      class="solr.highlight.SimpleBoundaryScanner">
>       <lst name="defaults">
>         <str name="hl.bs.maxScan">10</str>
>         <str name="hl.bs.chars">.,!? &#9;&#10;&#13;</str>
>       </lst>
>     </boundaryScanner>
>
>     <boundaryScanner name="breakIterator"
>                      class="solr.highlight.BreakIteratorBoundaryScanner">
>       <lst name="defaults">
>         <str name="hl.bs.type">WORD</str>
>         <str name="hl.bs.language">da</str>
>       </lst>
>     </boundaryScanner>
>   </highlighting>
> </searchComponent>
>
> Hope that someone can help, thanks in advance.
>
> Best regards
> Martin
>
>
>


Re: Streaming expression function which can give parent document along with its child documents ?

2019-06-10 Thread Pratik Patel
If your child documents have a link to their parent documents (like a parent id
or something) then you can use graph traversal to do this.
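
A rough sketch with the gatherNodes() stream source (collection and field names
here are placeholders, and it assumes each child document stores its parent's id
in a parent_id field):

  gatherNodes(myCollection,
              walk="PARENT_DOC_ID->parent_id",
              gather="id",
              scatter="branches,leaves")

That should emit the root node plus the id of every child whose parent_id
matches; wrap it in fetch() if you need more fields from the matched documents.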

On Mon, Jun 10, 2019 at 8:01 AM Jai Jamba 
wrote:

> Can anyone help me in this ?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Query takes a long time Solr 6.1.0

2019-06-10 Thread Shawn Heisey

On 6/10/2019 3:24 AM, vishal patel wrote:

We have 27 collections and each collection has many schema fields. In the live 
environment, many search and index requests come in, and most of the search 
requests involve sorting, faceting, grouping, and long queries.
So on average approximately 40GB of heap is used, which is why we gave it 80GB of memory.


Unless you've been watching an actual *graph* of heap usage over a 
significant amount of time, you can't learn anything useful from it.


And it's very possible that you can't get anything useful even from a 
graph, unless that graph is generated by analyzing a lengthy garbage 
collection log.



our directory in solrconfig.xml:

<directoryFactory name="DirectoryFactory" class="solr.MMapDirectoryFactory"/>

When using MMAP, one of the memory columns should show a total that's 
approximately equal to the max heap plus the size of all indexes being 
handled by Solr.  None of the columns in your Resource Monitor memory 
screenshot show numbers over 400GB, which is what I would expect based 
on what you said about the index size.


MMapDirectoryFactory is a decent choice, but Solr's default of 
NRTCachingDirectoryFactory is probably better.  Switching to NRT will 
not help whatever is causing your performance problems, though.



Here our schema file and solrconfig XML and GC log, please verify it. is it 
anything wrong or suggestions for improvement?
https://drive.google.com/drive/folders/1wV9bdQ5-pP4s4yc8jrYNz77YYVRmT7FG


That GC log covers a grand total of three and a half minutes.  It's 
useless.  Heap usage is nearly constant for the full time at about 30GB. 
 Without a much more comprehensive log, I cannot offer any useful 
advice.  I'm looking for logs that lasts several hours, and a few DAYS 
would be better.


Your caches are commented out, so that is not contributing to heap 
usage.  Another reason to drop the heap size, maybe.



2019-06-06T11:55:53.456+0100: 1053797.556: Total time for which application 
threads were stopped: 42.4594545 seconds, Stopping threads took: 26.7301882 
seconds


Part of the problem here is that stopping threads took 26 seconds.  I 
have never seen anything that high before.  It should only take a 
*small* fraction of a second to stop all threads.  Something seems to be 
going very wrong here.  One thing that it *might* be is something called 
"the four month bug", which is fixed by adding -XX:+PerfDisableSharedMem 
to the JVM options.  Here's a link to the blog post about that problem:


https://www.evanjones.ca/jvm-mmap-pause.html
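
If you want to try that flag, the usual place to add it is solr.in.sh (solr.in.cmd 
on Windows); a minimal sketch, assuming the standard start scripts:

  SOLR_OPTS="$SOLR_OPTS -XX:+PerfDisableSharedMem"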

It's not clear whether the 42 seconds *includes* the 26 seconds, or 
whether there was 42 seconds of pause AFTER the threads were stopped.  I 
would imagine that the larger number includes the smaller number.  Might 
need to ask Oracle engineers.  Pause times like this do not surprise me 
with a heap this big, but 26 seconds to stop threads sounds like a major 
issue, and I am not sure about what might be causing it.  My guess about 
the four month bug above is a shot in the dark that might be completely 
wrong.


Thanks,
Shawn


RE: [SPAM] Re: query parsed in different ways in two identical solr instances

2019-06-10 Thread Danilo Tomasoni
Yes, I said identical because the configuration (solrconfig.xml etc.) is identical;
just some fields changed.
Sorry I was not so precise in the description of the environment.

Nice to know it's already fixed.

Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for 
Computational and Systems Biology (COSBI)
Piazza Manifattura 1,  38068 Rovereto (TN), Italy
tomas...@cosbi.eu
http://www.cosbi.eu



From: Alexandre Rafalovitch [arafa...@gmail.com]
Sent: 10 June 2019 15:32
To: solr-user
Subject: Re: [SPAM] Re: query parsed in different ways in two identical solr 
instances

Ok, great.

We now moved from "identical setup breaks things in a bugfix version"
to "strange behavior when field does not exist". The "identical" part
was actually throwing us off the trail.

And all this leads us to
https://issues.apache.org/jira/browse/SOLR-5163 , fixed in 8.0.

Hope it helps,
Alex.

On Mon, 10 Jun 2019 at 09:19, Danilo Tomasoni  wrote:
>
> Hello I was able to reproduce this behaviour in an isolated environment,
> and performed some differential analysis between the two versions (that has 
> different schemas, diff of schemas attached)
>
> With the schema of solr1, the query is parsed as +(+() +())
> while with the schema of solr-test, the same query is parsed as +(() 
> ())
>
> The query is
>
> "q":"(f1:PUBMEDPMID12159614 AND (_query_:\"{!edismax 
> qf='medline_chemical_terms medline_mesh_terms' q.op=OR mm=1 v=$subquery1}\"))"
>
> in solr1 and also in solr test f1 equals
> "f.f1.qf":"id pmid pmc source_id other_id doi manuscript_id publication_id 
> secondary_ids"}}
>
> And then I suddenly remembered that the field secondary_ids was renamed to 
> external_data in solr-test (before the bulk import).
>
> So I changed f1 definition removing secondary_ids and adding external_data..
> and now the behaviour is the same!
>
> How is that possible? why the schema (and in this case a non-existing field) 
> can influence in such a profound way the behaviour of the query parser?
>
> I think that this is a subtle bug and an error should be raised instead of 
> performing an unexpected query.
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for 
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu
> http://www.cosbi.eu
>
>
> 
> From: Alexandre Rafalovitch [arafa...@gmail.com]
> Sent: 10 June 2019 12:49
> To: solr-user
> Subject: [SPAM] Re: query parsed in different ways in two identical solr 
> instances
>
> Were you able to simplify it to the simplest use case showing the issue? Or
> reproduce it on the stock Solr with stock example? Because otherwise, we
> would be just as stuck in a Jira as now. It is the same people helping
>
> For example, is the _query_ part significant?
>
> Also, did you try running both queries with echoParams=all just to
> eliminate stray differences? I know you looked at the debug line, but
> perhaps this is worth a check too.
>
> Regards,
> Alex
>
>
>
> On Mon, Jun 10, 2019, 5:46 AM Danilo Tomasoni,  wrote:
>
> > Hello all,
> > maybe I should consider this as a bug and open an issue?
> >
> > Danilo Tomasoni
> >
> > Fondazione The Microsoft Research - University of Trento Centre for
> > Computational and Systems Biology (COSBI)
> > Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> > tomas...@cosbi.eu
> > http://www.cosbi.eu
> >

RE: No files to download for index generation

2019-06-10 Thread Oakley, Craig (NIH/NLM/NCBI) [C]
Does anyone yet have any insight on interpreting the severity of this message?

-Original Message-
From: Oakley, Craig (NIH/NLM/NCBI) [C]  
Sent: Tuesday, June 04, 2019 4:07 PM
To: solr-user@lucene.apache.org
Subject: No files to download for index generation

We have occasionally been seeing an error such as the following:
2019-06-03 23:32:45.583 INFO  (indexFetcher-45-thread-1) [   ] 
o.a.s.h.IndexFetcher Master's generation: 1424625
2019-06-03 23:32:45.583 INFO  (indexFetcher-45-thread-1) [   ] 
o.a.s.h.IndexFetcher Master's version: 1559619115480
2019-06-03 23:32:45.583 INFO  (indexFetcher-45-thread-1) [   ] 
o.a.s.h.IndexFetcher Slave's generation: 1424624
2019-06-03 23:32:45.583 INFO  (indexFetcher-45-thread-1) [   ] 
o.a.s.h.IndexFetcher Slave's version: 1559619050130
2019-06-03 23:32:45.583 INFO  (indexFetcher-45-thread-1) [   ] 
o.a.s.h.IndexFetcher Starting replication process
2019-06-03 23:32:45.587 ERROR (indexFetcher-45-thread-1) [   ] 
o.a.s.h.IndexFetcher No files to download for index generation: 1424625

Is that last line actually an error as in "there SHOULD be files to download, 
but there are none"?

Or is it simply informative as in "there are no files to download, so we are 
all done here"?


Re: [SPAM] Re: query parsed in different ways in two identical solr instances

2019-06-10 Thread Alexandre Rafalovitch
Ok, great.

We now moved from "identical setup breaks things in a bugfix version"
to "strange behavior when field does not exist". The "identical" part
was actually throwing us off the trail.

And all this leads us to
https://issues.apache.org/jira/browse/SOLR-5163 , fixed in 8.0.

Hope it helps,
Alex.

On Mon, 10 Jun 2019 at 09:19, Danilo Tomasoni  wrote:
>
> Hello I was able to reproduce this behaviour in an isolated environment,
> and performed some differential analysis between the two versions (that has 
> different schemas, diff of schemas attached)
>
> With the schema of solr1, the query is parsed as +(+() +())
> while with the schema of solr-test, the same query is parsed as +(() 
> ())
>
> The query is
>
> "q":"(f1:PUBMEDPMID12159614 AND (_query_:\"{!edismax 
> qf='medline_chemical_terms medline_mesh_terms' q.op=OR mm=1 v=$subquery1}\"))"
>
> in solr1 and also in solr test f1 equals
> "f.f1.qf":"id pmid pmc source_id other_id doi manuscript_id publication_id 
> secondary_ids"}}
>
> And then I suddenly remembered that the field secondary_ids was renamed to 
> external_data in solr-test (before the bulk import).
>
> So I changed f1 definition removing secondary_ids and adding external_data..
> and now the behaviour is the same!
>
> How is that possible? why the schema (and in this case a non-existing field) 
> can influence in such a profound way the behaviour of the query parser?
>
> I think that this is a subtle bug and an error should be raised instead of 
> performing an unexpected query.
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for 
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu
> http://www.cosbi.eu
>
>
> 
> From: Alexandre Rafalovitch [arafa...@gmail.com]
> Sent: 10 June 2019 12:49
> To: solr-user
> Subject: [SPAM] Re: query parsed in different ways in two identical solr 
> instances
>
> Were you able to simplify it to the simplest use case showing the issue? Or
> reproduce it on stock Solr with the stock example? Because otherwise, we
> would be just as stuck in a Jira as we are now. It is the same people helping.
>
> For example, is the _query_ part significant?
>
> Also, did you try running both queries with echoParams=all just to
> eliminate stray differences? I know you looked at the debug line, but
> perhaps this is worth a check too.
>
> Regards,
> Alex
>
>
>
> On Mon, Jun 10, 2019, 5:46 AM Danilo Tomasoni,  wrote:
>
> > Hello all,
> > maybe I should consider this as a bug and open an issue?
> >
> > Danilo Tomasoni
> >
> > Fondazione The Microsoft Research - University of Trento Centre for
> > Computational and Systems Biology (COSBI)
> > Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> > tomas...@cosbi.eu
> > http://www.cosbi.eu
> >
> >
> > 
> > From: Danilo Tomasoni
> > Sent: 07 June 2019 11:47
> > To: solr-user@lucene.apache.org
> > Subject: RE: query parsed in different ways in two identical solr instances
> >
> > Any thoughts on that difference in the Solr parsing? Is it correct that
> > the first looks like an AND while the second looks like an OR?
> > Thank you
> >
> > Danilo Tomasoni
> >
> > Fondazione The Microsoft Research - University of Trento Centre for
> > Computational and Systems Biology (COSBI)
> > Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> > tomas...@cosbi.eu
> > http://www.cosbi.eu
> >

RE: [SPAM] Re: query parsed in different ways in two identical solr instances

2019-06-10 Thread Danilo Tomasoni
Hello, I was able to reproduce this behaviour in an isolated environment
and performed some differential analysis between the two versions (which have
different schemas; a diff of the schemas is attached).

With the schema of solr1, the query is parsed as +(+() +())
while with the schema of solr-test, the same query is parsed as +(() ())

The query is

"q":"(f1:PUBMEDPMID12159614 AND (_query_:\"{!edismax qf='medline_chemical_terms 
medline_mesh_terms' q.op=OR mm=1 v=$subquery1}\"))"

in both solr1 and solr-test, and f1 equals
"f.f1.qf":"id pmid pmc source_id other_id doi manuscript_id publication_id 
secondary_ids"}}

And then I suddenly remembered that the field secondary_ids was renamed to 
external_data in solr-test (before the bulk import).

So I changed the f1 definition, removing secondary_ids and adding external_data,
and now the behaviour is the same!

How is that possible? Why can the schema (in this case a non-existent field)
influence the behaviour of the query parser so profoundly?

I think this is a subtle bug; an error should be raised instead of silently
executing an unexpected query.

Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for 
Computational and Systems Biology (COSBI)
Piazza Manifattura 1,  38068 Rovereto (TN), Italy
tomas...@cosbi.eu
http://www.cosbi.eu



From: Alexandre Rafalovitch [arafa...@gmail.com]
Sent: 10 June 2019 12:49
To: solr-user
Subject: [SPAM] Re: query parsed in different ways in two identical solr 
instances

Were you able to simplify it to the simplest use case showing the issue? Or
reproduce it on stock Solr with the stock example? Because otherwise, we
would be just as stuck in a Jira as we are now. It is the same people helping.

For example, is the _query_ part significant?

Also, did you try running both queries with echoParams=all just to
eliminate stray differences? I know you looked at the debug line, but
perhaps this is worth a check too.

Regards,
Alex



On Mon, Jun 10, 2019, 5:46 AM Danilo Tomasoni,  wrote:

> Hello all,
> maybe I should consider this as a bug and open an issue?
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu
> http://www.cosbi.eu
>
>
> 
> From: Danilo Tomasoni
> Sent: 07 June 2019 11:47
> To: solr-user@lucene.apache.org
> Subject: RE: query parsed in different ways in two identical solr instances
>
> Any thoughts on that difference in the Solr parsing? Is it correct that
> the first looks like an AND while the second looks like an OR?
> Thank you
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu
> http://www.cosbi.eu
>

Re: Basic Authentication in Standalone Configuration ?

2019-06-10 Thread Colvin Cowie
Hello,

You need to use the set command in Windows cmd files to set values. The
example solr.in.cmd has commented-out examples, e.g.





REM Settings for authentication
REM Please configure only one of SOLR_AUTHENTICATION_CLIENT_BUILDER or SOLR_AUTH_TYPE parameters
REM set SOLR_AUTHENTICATION_CLIENT_BUILDER=org.apache.solr.client.solrj.impl.PreemptiveBasicAuthClientBuilderFactory
REM set SOLR_AUTH_TYPE=basic
REM set SOLR_AUTHENTICATION_OPTS="-Dbasicauth=solr:SolrRocks"

So that would be


set SOLR_AUTH_TYPE=basic
set SOLR_AUTHENTICATION_OPTS="-Dbasicauth=solr:SolrRocks"
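
As far as I know, those two settings only give the bin\solr scripts (and the
node's internal HTTP client) the credentials to send; to actually turn Basic
Authentication on for a standalone node, the reference guide has you place a
security.json in $SOLR_HOME, next to solr.xml. A minimal sketch of its shape is
below; the credentials value is a placeholder and should be replaced with the
salted SHA-256 hash from the ref guide's solr/SolrRocks example, or one you
generate yourself:

    {
      "authentication": {
        "blockUnknown": true,
        "class": "solr.BasicAuthPlugin",
        "credentials": { "solr": "<salted-sha256-hash> <salt>" }
      },
      "authorization": {
        "class": "solr.RuleBasedAuthorizationPlugin",
        "user-role": { "solr": "admin" },
        "permissions": [ { "name": "security-edit", "role": "admin" } ]
      }
    }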

Hope that helps

On Mon, 10 Jun 2019 at 13:01, Paul  wrote:

> Hi,
>
> I am not sure if Basic Authentication is possible in SOLR standalone
> configuration (version 7.6). I have a working SOLR installation using SSL.
> When following the docs I add options into solr.in.cmd, as in:
>
> SOLR_AUTH_TYPE="basic"
> SOLR_AUTHENTICATION_OPTS="-Dbasicauth=solr:SolrRocks"
>
> When I go to start SOLR I get:
>
> 'SOLR_AUTH_TYPE' is not recognized as an internal or external command,
> operable program or batch file.
> 'SOLR_AUTHENTICATION_OPTS' is not recognized as an internal or external
> command, operable program or batch file.
>
> This is as per
> https://www.apache.si/lucene/solr/ref-guide/apache-solr-ref-guide-7.7.pdf
> and in there it refers to '*If you are using SolrCloud*, you must upload
> security.json to ZooKeeper. You can use this example command, ensuring that
> the ZooKeeper port is correct '.
>
> I am not using SolrCloud 
>
>
>
>
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


ContentStreamUpdateRequest no longer closes stream

2019-06-10 Thread Colvin Cowie
Hello, I'm in the process of moving from Solr 6 to Solr 8.
We have a client application that streams CSV files to Solr using
ContentStreamUpdateRequest and then deletes the CSV file once the data is
indexed. That worked fine in Solr 6, but when using 8, the file is locked
and can't be deleted. (This is on Windows)
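
For context, a rough sketch of that client pattern (the names, handler path and
collection are illustrative, not our actual application code):

    import java.io.File;
    import java.nio.file.Files;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    // Stream a CSV file to Solr, then delete it once the data is indexed.
    void indexCsvAndDelete(SolrClient client, File csv) throws Exception {
      ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update");
      req.addFile(csv, "text/csv");           // the CSV is sent as a content stream
      req.setParam("commit", "true");
      client.request(req, "mycollection");
      Files.delete(csv.toPath());             // fails on Windows with Solr 8: the handle is still open
    }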

This seems to be because of the changes made in
https://issues.apache.org/jira/browse/SOLR-12142
--
  @Override
  public RequestWriter.ContentWriter getContentWriter(String expectedType) {
    if (contentStreams == null || contentStreams.isEmpty() || contentStreams.size() > 1) return null;
    ContentStream stream = contentStreams.get(0);
    return new RequestWriter.ContentWriter() {
      @Override
      public void write(OutputStream os) throws IOException {
        IOUtils.copy(stream.getStream(), os);
      }

      @Override
      public String getContentType() {
        return stream.getContentType();
      }
    };
  }
--
As far as I know, IOUtils.copy will not close the stream.

Adding a close to it is enough to "fix" it for me:




    try {
      IOUtils.copy(innerStream, os);
    } finally {
      IOUtils.closeQuietly(innerStream);
    }
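
(An equivalent sketch, assuming innerStream is the InputStream returned by
stream.getStream(), would be to use try-with-resources instead:)

    try (InputStream innerStream = stream.getStream()) {
      IOUtils.copy(innerStream, os);
    }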

I've attached a simple test case. It passes with the change above and fails
without it.

So, is this a bug, or is there something I'm supposed to be doing elsewhere
to close the stream?

Thanks,
Colvin


Re: Streaming expression function which can give parent document along with its child documents ?

2019-06-10 Thread Jai Jamba
Can anyone help me with this?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Basic Authentication in Standalone Configuration ?

2019-06-10 Thread Paul
Hi,

I am not sure if Basic Authentication is possible in SOLR standalone
configuration (version 7.6). I have a working SOLR installation using SSL.
When following the docs I add options into solr.in.cmd, as in:

SOLR_AUTH_TYPE="basic"
SOLR_AUTHENTICATION_OPTS="-Dbasicauth=solr:SolrRocks"

When I go to start SOLR I get:

'SOLR_AUTH_TYPE' is not recognized as an internal or external command,
operable program or batch file.
'SOLR_AUTHENTICATION_OPTS' is not recognized as an internal or external
command, operable program or batch file.

This is as per
https://www.apache.si/lucene/solr/ref-guide/apache-solr-ref-guide-7.7.pdf
and in there it refers to '*If you are using SolrCloud*, you must upload
security.json to ZooKeeper. You can use this example command, ensuring that
the ZooKeeper port is correct '.

I am not using SolrCloud   








--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: query parsed in different ways in two identical solr instances

2019-06-10 Thread Alexandre Rafalovitch
Were you able to simplify it to the simplest use case showing the issue? Or
reproduce it on stock Solr with the stock example? Because otherwise, we
would be just as stuck in a Jira as we are now. It is the same people helping.

For example, is the _query_ part significant?

Also, did you try running both queries with echoParams=all just to
eliminate stray differences? I know you looked at the debug line, but
perhaps this is worth a check too.
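
(A quick way to do that, as a sketch only -- substitute your own host, collection
and query string:)

    http://localhost:8983/solr/<collection>/select?q=<your query>&echoParams=all&debug=query

Comparing the responseHeader params and the parsedquery from both instances side
by side usually shows which parameter or field actually differs.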

Regards,
Alex



On Mon, Jun 10, 2019, 5:46 AM Danilo Tomasoni,  wrote:

> Hello all,
> maybe I should consider this as a bug and open an issue?
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu
> http://www.cosbi.eu
>
>
> 
> From: Danilo Tomasoni
> Sent: 07 June 2019 11:47
> To: solr-user@lucene.apache.org
> Subject: RE: query parsed in different ways in two identical solr instances
>
> Any thoughts on that difference in the Solr parsing? Is it correct that
> the first looks like an AND while the second looks like an OR?
> Thank you
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu
> http://www.cosbi.eu
>
>
> 
> From: Danilo Tomasoni [tomas...@cosbi.eu]
> Sent: 06 June 2019 16:21
> To: solr-user@lucene.apache.org
> Subject: RE: query parsed in different ways in two identical solr instances
>
> The two collections are not identical: many documents overlap, but some
> field names differ (solr-test also has extra fields that solr1 didn't have).
> Actually we have 42.000.000 docs in solr1, and 40.000.000 in solr-test,
> but I think this shouldn't be relevant because the query is basically like
>
> id=x AND mesh=list of phrase queries
>
> where the second part of the and is handled through a nested query
> (_query_ magic keyword).
>
> I expect that a query like this one would return 1 document (x) or 0
> documents.
>
> The thing that puzzles me is that on solr1 the engine is returning 1
> document (x)
> while on test the engine is returning 68.000 documents..
> If you look at my first e-mail you will notice that in the correct engine
> the parsed query is like
>
> +(+(...) +(...))
>
> That is correct for an AND
>
> while in the test engine the query is parsed like
>
> +((...) (...))
>
> which is more like an OR...
>
>
> Danilo Tomasoni
>
> Fondazione The Microsoft Research - University of Trento Centre for
> Computational and Systems Biology (COSBI)
> Piazza Manifattura 1,  38068 Rovereto (TN), Italy
> tomas...@cosbi.eu
> http://www.cosbi.eu
>
>
> 
> From: 

RE: query parsed in different ways in two identical solr instances

2019-06-10 Thread Danilo Tomasoni
Hello all,
maybe I should consider this as a bug and open an issue?

Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for 
Computational and Systems Biology (COSBI)
Piazza Manifattura 1,  38068 Rovereto (TN), Italy
tomas...@cosbi.eu
http://www.cosbi.eu



From: Danilo Tomasoni
Sent: 07 June 2019 11:47
To: solr-user@lucene.apache.org
Subject: RE: query parsed in different ways in two identical solr instances

Any thoughts on that difference in the Solr parsing? Is it correct that the
first looks like an AND while the second looks like an OR?
Thank you

Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for 
Computational and Systems Biology (COSBI)
Piazza Manifattura 1,  38068 Rovereto (TN), Italy
tomas...@cosbi.eu
http://www.cosbi.eu



From: Danilo Tomasoni [tomas...@cosbi.eu]
Sent: 06 June 2019 16:21
To: solr-user@lucene.apache.org
Subject: RE: query parsed in different ways in two identical solr instances

The two collections are not identical: many documents overlap, but some field
names differ (solr-test also has extra fields that solr1 didn't have).
Actually we have 42.000.000 docs in solr1, and 40.000.000 in solr-test, but I
think this shouldn't be relevant because the query is basically like

id=x AND mesh=list of phrase queries

where the second part of the and is handled through a nested query (_query_ 
magic keyword).

I expect that a query like this one would return 1 document (x) or 0 documents.

The thing that puzzles me is that on solr1 the engine is returning 1 document 
(x)
while on test the engine is returning 68.000 documents..
If you look at my first e-mail you will notice that in the correct engine the 
parsed query is like

+(+(...) +(...))

That is correct for an AND

while in the test engine the query is parsed like

+((...) (...))

which is more like an OR...
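
(To spell out the notation: in Lucene's printed query form a leading "+" marks a
mandatory clause, so roughly:

    +(+(clause1) +(clause2))   both clauses must match          -> AND-like
    +((clause1) (clause2))     at least one clause must match   -> OR-like

The clause names are only placeholders for the f1 and _query_ parts of the query
quoted earlier.)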


Danilo Tomasoni

Fondazione The Microsoft Research - University of Trento Centre for 
Computational and Systems Biology (COSBI)
Piazza Manifattura 1,  38068 Rovereto (TN), Italy
tomas...@cosbi.eu
http://www.cosbi.eu



From: Alexandre Rafalovitch [arafa...@gmail.com]
Sent: 06 June 2019 15:53
To: solr-user
Subject: Re: query parsed in different ways in two identical solr instances

Those two queries look the same after sorting the parameters, yet the
results are clearly different. That means the difference is deeper.

1) Have you checked that both collections have the same number of
documents (e.g. a mismatched final commit)? Does a basic "query=*:*"
return the same counts in the same initial order?
2) Are you absolutely sure you are comparing 7.3.0 with 7.3.1? There
was SOLR-11501 that may be relevant, but it was fixed in 7.2:
https://issues.apache.org/jira/browse/SOLR-11501

Regards,
   Alex.

Are you 

Re: Query takes a long time Solr 6.1.0

2019-06-10 Thread vishal patel
> An 80GB heap is ENORMOUS.  And you have two of those per server.  Do you
> *know* that you need a heap that large?  You only have 50 million
> documents total, two instances that each have 80GB seems completely
> unnecessary.  I would think that one instance with a much smaller heap
> would handle just about anything you could throw at 50 million documents.

> With 160GB taken by heaps, you're leaving less than 100GB of memory to
> cache over 700GB of index.  This is not going to work well, especially
> if your index doesn't have many fields that are stored.  It will cause a
> lot of disk I/O.

We have 27 collections, and each collection has many schema fields. In production
a large number of search and indexing requests come in, and most of the search
requests involve sorting, faceting, grouping, and long queries.
On average roughly 40GB of heap is in use, so we allocated 80GB.

> Unless you have changed the DirectoryFactory to something that's not
> default, your process listing does not reflect over 700GB of index data.
> If you have changed the DirectoryFactory, then I would strongly
> recommend removing that part of your config and letting Solr use its
> default.

Our directoryFactory setting in solrconfig.xml:



Here are our schema file, solrconfig.xml, and GC log; please review them. Is
anything wrong, or do you have any suggestions for improvement?
https://drive.google.com/drive/folders/1wV9bdQ5-pP4s4yc8jrYNz77YYVRmT7FG


GC log ::
2019-06-06T11:55:37.729+0100: 1053781.828: [GC (Allocation Failure) 
1053781.828: [ParNew
Desired survivor size 3221205808 bytes, new threshold 8 (max 8)
- age   1:  268310312 bytes,  268310312 total
- age   2:  220271984 bytes,  488582296 total
- age   3:   75942632 bytes,  564524928 total
- age   4:   76397104 bytes,  640922032 total
- age   5:  126931768 bytes,  767853800 total
- age   6:   92672080 bytes,  860525880 total
- age   7:2810048 bytes,  863335928 total
- age   8:   11755104 bytes,  875091032 total
: 15126407K->1103229K(17476288K), 15.7272287 secs] 
45423308K->31414239K(80390848K), 15.7274518 secs] [Times: user=212.05 
sys=16.08, real=15.73 secs]
Heap after GC invocations=68829 (full 187):
 par new generation   total 17476288K, used 1103229K [0x8000, 
0x00058000, 0x00058000)
  eden space 13981056K,   0% used [0x8000, 0x8000, 
0x0003d556)
  from space 3495232K,  31% used [0x0004aaab, 0x0004ee00f508, 
0x00058000)
  to   space 3495232K,   0% used [0x0003d556, 0x0003d556, 
0x0004aaab)
 concurrent mark-sweep generation total 62914560K, used 30311010K 
[0x00058000, 0x00148000, 0x00148000)
 Metaspace   used 50033K, capacity 50805K, committed 53700K, reserved 55296K
}
2019-06-06T11:55:53.456+0100: 1053797.556: Total time for which application 
threads were stopped: 42.4594545 seconds, Stopping threads took: 26.7301882 
seconds

What could cause a GC pause of 42 seconds?

There is heavy searching and indexing (creates and updates) in our SolrCloud.
So, should we divide the cloud between the 27 collections? Should we add one more
shard?

Sent from Outlook

From: Shawn Heisey 
Sent: Friday, June 7, 2019 9:00 PM
To: solr-user@lucene.apache.org
Subject: Re: Query takes a long time Solr 6.1.0

On 6/6/2019 5:45 AM, vishal patel wrote:
> One server(256GB RAM) has two below Solr instance and other application also
> 1) shards1 (80GB heap ,790GB Storage, 449GB Indexed data)
> 2) replica of shard2 (80GB heap, 895GB Storage, 337GB Indexed data)
>
> The second server(256GB RAM and 1 TB storage) has two below Solr instance and 
> other application also
> 1) shards2 (80GB heap, 790GB Storage, 338GB Indexed data)
> 2) replica of shard1 (80GB heap, 895GB Storage, 448GB Indexed data)

An 80GB heap is ENORMOUS.  And you have two of those per server.  Do you
*know* that you need a heap that large?  You only have 50 million
documents total, two instances that each have 80GB seems completely
unnecessary.  I would think that one instance with a much smaller heap
would handle just about anything you could throw at 50 million documents.
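
(As a hedged pointer on the mechanics: the heap size is controlled by the
SOLR_HEAP variable in solr.in.sh / solr.in.cmd. The value below is purely
illustrative and should be sized against the heap usage you actually observe.)

    # solr.in.sh -- illustrative value only, tune against measured usage
    SOLR_HEAP="16g"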

With 160GB taken by heaps, you're leaving less than 100GB of memory to
cache over 700GB of index.  This is not going to work well, especially
if your index doesn't have many fields that are stored.  It will cause a
lot of disk I/O.

> Both server memory and disk usage:
> https://drive.google.com/drive/folders/11GoZy8C0i-qUGH-ranPD8PCoPWCxeS-5

Unless you have changed the DirectoryFactory to something that's not
default, your process listing does not reflect over 700GB of index data.
  If you have changed the DirectoryFactory, then I would strongly
recommend removing that part of your config and letting Solr use its
default.

> Note: around 40GB of heap is normally used in each Solr instance. When a replica
> goes down, disk I/O is high and GC pause times go above 15
> seconds. We cannot identify the exact issue of the replica