Re: clearing document cache || solr 6.6

2019-01-29 Thread Shawn Heisey

On 1/29/2019 11:27 PM, sachin gk wrote:

Is there a way to clear the *document cache* after we commit to the indexer?


All Solr caches are invalidated when you issue a commit with 
openSearcher set to true.  The default setting is true, and normally it 
doesn't get set to false unless you explicitly set it.  Most of the 
time, autoCommit has openSearcher set to false.
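
For example, from SolrJ (a minimal sketch; solrClient is an existing
SolrClient and the collection name is a placeholder):

    // explicit commit -- openSearcher defaults to true, so all Solr caches
    // are invalidated and a new searcher is opened
    solrClient.commit("mycollection");

The autoCommit case, with openSearcher set to false, flushes to disk
without opening a new searcher, so the caches are left alone.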


The documentCache cannot be warmed directly, but it does get items added 
to it if there are any warming queries, which may come from autowarming 
queryResultCache.


Thanks,
Shawn


Re: Solrcloud TimeoutException: Idle timeout expired

2019-01-29 Thread Deepak Goel
Document is not being passed. It has zero content.

It could be due to the heap running out of memory. To check this, please look at the GC logs.
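
(With the default bin/solr start scripts, the GC log is usually written
alongside solr.log, e.g. server/logs/solr_gc.log, though the exact file
name and location vary by version and platform.)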

On Tue, 29 Jan 2019, 08:54 Schaum Mallik wrote:

> I am seeing this error in our logs. Our Solr heap is set to more than 10G.
> Any clues which anyone can provide will be very helpful.
>
> Thank you
>
> null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
> timeout expired: 12/12 ms
> at
> org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1075)
> at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:313)
> at
> org.apache.solr.servlet.ServletInputStreamWrapper.read(ServletInputStreamWrapper.java:74)
> at
> org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:100)
> at
> org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
> at
> org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:88)
> at
> org.apache.solr.common.util.FastInputStream.peek(FastInputStream.java:60)
> at
> org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:107)
> at
> org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55)
> at
> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
> at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2539)
> at
> org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)
> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at
> org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
> at org.eclipse.jetty.server.Server.handle(Server.java:531)
> at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
> at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
> at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
> at
> org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
> at
> org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:760)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:678)
> at java.lang.Thread.run(Thread.java:748)

Re: Number of segments in collection is more than what is set in TieredMergePolicyFactory

2019-01-29 Thread Shawn Heisey

On 1/28/2019 10:14 AM, Zheng Lin Edwin Yeo wrote:

We have the following TieredMergePolicyFactory configuration in our
solrconfig.xml


   <int name="maxMergeAtOnce">10</int>
   <int name="maxMergeAtOnceExplicit">10</int>
   <int name="segmentsPerTier">10</int>


These three settings are the really important ones.  Except for 
maxMergeAtOnceExplicit, you have these at the default settings.  The 
default for maxMergeAtOnceExplicit is 30 ... and you shouldn't lower it 
without a really good reason.  It mostly comes into play during an 
optimize ... when you lower it, optimizes may take longer than normal. 
It won't be able to merge as many segments at the same time, so the 
number of passes required to complete the optimize could increase.


The most important setting here is segmentsPerTier ... this does not 
mean you will never have more than 10 total segments, it means that at 
each tier, Lucene will try to keep the number of segments below 10. 
With a large index, you are likely to have 3 or 4 tiers, possibly more.
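
As a rough illustration (simplified numbers, not the exact
TieredMergePolicy math): with a 10 MB floor segment size and
segmentsPerTier=10, tier one can hold up to ten ~10 MB segments, tier two
up to ten ~100 MB segments, and tier three up to ten ~1 GB segments, so a
multi-gigabyte index carrying 20 or 30 segments is entirely normal.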


On an index where I spent a lot of time, my settings were, respective to 
yours, 35, 105, and 35.  I often had more than 100 segments in those 
indexes.  It was behaving correctly.



What could be the reason that it is not able to merge the segments to 3,
with each segment sized at 5 GB?


It is working as designed, just not as you expected.

Thanks,
Shawn


Re: Error using collapse parser with /export

2019-01-29 Thread Rahul Goswami
I checked again and looks like all documents with the same "id_field"
reside on the same shard, in which case I would expect collapse parser to
work. Here is my complete query:

http://localhost:8983/solr/mycollection/stream/?expr=search(mycollection,
sort="field1 asc,field2 asc",fl="field1,field2,field3",qt="/export",
q="*:*",fq="((field4:1) OR (field4:2))",fq="{!collapse field=id_field sort='field3 desc'}")

The same query with "select" handler does return the collapse result fine.
Looks like this might be a bug after all (while working with /export)?

Thanks,
Rahul


On Sun, Jan 27, 2019 at 9:55 PM Rahul Goswami  wrote:

> Hi Joel,
>
> Thanks for responding to the query.
>
> Answers to your questions:
> 1) After collapsing is it not possible to use the /select handler?  - The
> collapsing itself is causing the failure (or did I not understand your
> question right?)
> 2) After exporting is it possible to unique the records using the
> unique  Streaming Expression?   (This can't be done since we require the
> unique document in a group subject to a sort order as in the query above.
> Looking at the Streaming API, 'unique' streaming expression doesn't give
> the capability to sort within a group. Or is there a way to do this?)
>
> I re-read the documentation:
> "In order to use these features with SolrCloud, the documents must be
> located on the same shard."
>
> Looks like the "id_field"  in the collapse criteria above is coming from
> documents not present in the same shard. I'll verify this tomorrow and
> update the thread.
>
> Thanks,
> Rahul
>
> On Mon, Jan 21, 2019 at 2:26 PM Joel Bernstein  wrote:
>
>> I haven't had time to look into the details of this issue, but it's not
>> clear that these two features will be able to be used together, although
>> it would be nice if they could.
>>
>> A couple of questions about your use case:
>>
>> 1) After collapsing is it not possible to use the /select handler?
>> 2) After exporting is it possible to unique the records using the unique
>> Streaming Expression?
>>
>> Either of those cases would be the typical uses of these features.
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>>
>> On Sun, Jan 20, 2019 at 10:13 PM Rahul Goswami 
>> wrote:
>>
>> > Hello,
>> >
>> > Following up on my query. I know this might be too specific an issue.
>> But I
>> > just want to know that it's a legitimate bug and the supported
>> operation is
>> > allowed with the /export handler. If someone has an idea about this and
>> > could confirm, that would be great.
>> >
>> > Thanks,
>> > Rahul
>> >
>> > On Thu, Jan 17, 2019 at 4:58 PM Rahul Goswami 
>> > wrote:
>> >
>> > > Hello,
>> > >
>> > > I am using SolrCloud on Solr 7.2.1.
>> > > I get the NullPointerException in the Solr logs (in ExportWriter.java)
>> > > when the /stream handler is invoked with a search() streaming
>> expression
>> > > with qt="/export" containing fq="{!collapse field=id_field sort="time
>> > > desc"} (among other fq's. I tried eliminating one fq at a time to find
>> > the
>> > > problematic one. The one with collapse parser is what makes it fail).
>> > >
>> > >
>> > > I see an open JIRA for this issue (with a submitted patch which has
>> not
>> > > yet been accepted):
>> > >
>> > > https://issues.apache.org/jira/browse/SOLR-8291
>> > >
>> > >
>> > >
>> > > In my case useFilterForSortedQuery=false
>> > >
>> > > org.apache.solr.servlet.HttpSolrCall
>> null:java.lang.NullPointerException
>> > > at
>> org.apache.lucene.util.BitSetIterator.<init>(BitSetIterator.java:61)
>> > > at
>> org.apache.solr.handler.ExportWriter.writeDocs(ExportWriter.java:243)
>> > > at
>> > >
>> org.apache.solr.handler.ExportWriter.lambda$null$1(ExportWriter.java:222)
>> > > at
>> > >
>> >
>> org.apache.solr.response.JSONWriter.writeIterator(JSONResponseWriter.java:523)
>> > > at
>> > >
>> >
>> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:180)
>> > > at
>> org.apache.solr.response.JSONWriter$2.put(JSONResponseWriter.java:559)
>> > > at
>> > >
>> org.apache.solr.handler.ExportWriter.lambda$null$2(ExportWriter.java:222)
>> > > at
>> > >
>> org.apache.solr.response.JSONWriter.writeMap(JSONResponseWriter.java:547)
>> > > at
>> > >
>> >
>> org.apache.solr.response.TextResponseWriter.writeVal(TextResponseWriter.java:198)
>> > > at
>> org.apache.solr.response.JSONWriter$2.put(JSONResponseWriter.java:559)
>> > > at
>> > >
>> >
>> org.apache.solr.handler.ExportWriter.lambda$write$3(ExportWriter.java:220)
>> > > at
>> > >
>> org.apache.solr.response.JSONWriter.writeMap(JSONResponseWriter.java:547)
>> > > at org.apache.solr.handler.ExportWriter.write(ExportWriter.java:218)
>> > > at org.apache.solr.core.SolrCore$3.write(SolrCore.java:2627)
>> > > at
>> > >
>> >
>> org.apache.solr.response.QueryResponseWriterUtil.writeQueryResponse(QueryResponseWriterUtil.java:49)
>> > > at
>> > >
>> 

clearing document cache || solr 6.6

2019-01-29 Thread sachin gk
Hi All,

Is there a way to clear the *document cache* after we commit to the indexer?

-- 
Regards,
Sachin


Re: SPLITSHARD not working as expected

2019-01-29 Thread Rahul Goswami
Thanks for the reply Jan. I have been referring to the documentation for
SPLITSHARD in 7.2.1, which seems to be missing some important information
that is present in 7.6. Especially these two pieces of information:
"When using splitMethod=rewrite (default) you must ensure that the node
running the leader of the parent shard has enough free disk space i.e.,
more than twice the index size, for the split to succeed "

"The first replicas of resulting sub-shards will always be placed on the
shard leader node"

The idea of having an entire shard (both the replicas of it) present on the
same node did come across as an unexpected behavior at the beginning.
Anyway, I guess I am going to have to take care of the rebalancing with
MOVEREPLICA following a SPLITSHARD.
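
For anyone else hitting this, the rebalancing step would look roughly like
this in SolrJ (a hedged sketch; the replica name and target node are
placeholders for whatever CLUSTERSTATUS reports):

    // move one sub-shard replica off the overloaded leader node
    CollectionAdminRequest.MoveReplica move = new CollectionAdminRequest.MoveReplica(
        "gettingstarted",           // collection
        "core_node10",              // replica name from cluster state (placeholder)
        "192.168.1.2:8983_solr");   // target node (placeholder)
    move.process(solrClient);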

Thanks for the clarification.


On Mon, Jan 28, 2019 at 3:40 AM Jan Høydahl  wrote:

> This is normal. Please read
> https://lucene.apache.org/solr/guide/7_6/collections-api.html#splitshard
> PS: Images won't make it to the list, but don't think you need a
> screenshot here, what you describe is the default behaviour.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 28. jan. 2019 kl. 09:05 skrev Rahul Goswami :
> >
> > Hello,
> > I am using Solr 7.2.1. I created a two node example collection on the
> same machine. Two shards with two replicas each. I then called SPLITSHARD
> on shard2 and expected the split shards to have one replica on each node.
> However I see that for shard2_1, both replicas reside on the same node. Is
> this a valid behavior?  Unless I am missing something, this could be
> potentially fatal.
> >
> > Here's the query and the cluster state post split:
> >
> http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=gettingstarted&shard=shard2&waitForFinalState=true
>
> >
> >
> >
> > Thanks,
> > Rahul
>
>


Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Zheng Lin Edwin Yeo
Hi everyone,

We have tried to do the setup and indexing on the latest Solr 7.6.0.

However, we are facing exactly the same issue as in Solr 7.5.0, in which
the search on the customers collection slowed down once we indexed the
policies collection.

Regards,
Edwin

On Wed, 30 Jan 2019 at 01:19, Zheng Lin Edwin Yeo 
wrote:

> Hi Paul,
>
> Thanks for the reply and suggestion
>
> Yes, we have installed RamMap, and are analyzing the results from there.
> The problem we are facing is that once the query for that collection
> becomes slow, it will not be fast again even after we restart Solr or the
> entire machine.
>
> Regards,
> Edwin
>
> On Tue, 29 Jan 2019 at 20:30,  wrote:
>
>> Hi
>>
>> If the reason for the difference in speed is that the index is being read
>> from disk, I would expect that the first query would be slow, but
>> subsequent queries on the same collection should speed up. A query on the
>> other collection could then be slower. In this case I would say that this
>> is normal behavior. The OS file cache cannot be relied upon to give the
>> same results in different circumstances, including different software
>> versions.
>>
>> You may wish to install the RamMap tool[1], [2], although you may be
>> having the inverse problem to that described in [1]. You can then see how
>> much space is used by the cache and other demands.
>>
>> If subsequent queries are fast, then to me it does not seem like a
>> problem for a development machine.  For production you may wish to store
>> the indices in RAM and/or change from Windows to Linux, if it is important
>> that all queries including the first are very fast.
>>
>> Have a nice day
>> Paul
>>
>> -Ursprüngliche Nachricht-
>> Von: Shawn Heisey 
>> Gesendet: Dienstag, 29. Januar 2019 13:25
>> An: solr-user@lucene.apache.org
>> Betreff: Re: Indexing in one collection affect index in another collection
>>
>> On 1/29/2019 5:06 AM, Zheng Lin Edwin Yeo wrote:
>> > My guess is after we change our searchFields_tcs schema which is:
>> >
>> > *From*:
>> > <dynamicField ... stored="true" multiValued="true" termVectors="true" termPositions="true"
>> > termOffsets="true"/>
>> >
>> > *To:*
>> > <dynamicField ... stored="true" multiValued="true" storeOffsetsWithPositions="true"
>> > termVectors="true" termPositions="false" termOffsets="false"/>
>>
>> Adding termVectors will make the index bigger.  Potentially much bigger.
>>   This will increase the overall RAM requirement of the server,
>> especially if the server is handling software other than Solr.  Anything
>> that makes the index bigger can affect performance.
>>
>> > The above change was done in order to use the Solr recommended unified
>> > highlighter (Posting with light term vectors) with Solr's
>> > documentation claimed it is the fastest.
>> >
>> > My best guess is Solr 7.5.0 has some bugs that slowed down the whole
>> > index and queries with the new approach (above new dynamicField
>> > schema), which it affects the index OS filecaching or any other issues.
>> >
>> > So I kindly suggest you look deeper and see whether such bugs are
>> exists?
>>
>> I know almost nothing about highlighting.  I wouldn't be able to look for
>> bugs.
>>
>> Thanks,
>> Shawn
>>
>


Re: Number of segments in collection is more than what is set in TieredMergePolicyFactory

2019-01-29 Thread Zheng Lin Edwin Yeo
Hi,

Anyone has any insights of this?

Thank you in advance.

Regards,
Edwin

On Tue, 29 Jan 2019 at 01:14, Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> We have the following TieredMergePolicyFactory configuration in our
> solrconfig.xml
>
> <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
>   <int name="maxMergeAtOnce">10</int>
>   <int name="maxMergeAtOnceExplicit">10</int>
>   <int name="segmentsPerTier">10</int>
>   <int name="floorSegmentMB">10</int>
>   <int name="maxMergedSegmentMB">5120</int>
>   <double name="noCFSRatio">0.1</double>
>   <int name="maxCFSSegmentSizeMB">2048</int>
>   <double name="forceMergeDeletesPctAllowed">10.0</double>
> </mergePolicyFactory>
>
> However, when we index data to the collection, the number of segments that
> we are getting does not match what we configured.
> For example, our collection size is 13.7 GB. With the above
> TieredMergePolicyFactory configuration, we should expect to have 3 segments
> (since 13.7 / 5 = 2.74, which rounds up to 3). But we are getting 24
> segments in our collection, which we have attached the screenshot in the
> link below.
>
> https://drive.google.com/file/d/1hjIQVk_L2Bn9MYOmCdf2wKD_f_D2DNV6/view?usp=sharing
>
> What could be the reason that it is not able to merge the segments to 3,
> with each segment sized at 5 GB?
>
> Regards,
> Edwin
>
>
>
>


Re: The parent shard will never be delete/clean?

2019-01-29 Thread zhenyuan wei
That is cool~, I'll try it. Thanks!



Andrzej Białecki wrote on Wednesday, 23 January 2019 at 20:53:

> Solr 7.4.0 added a periodic maintenance task that cleans up old inactive
> parent shards left after the split. “Old” means 2 days by default.
>
> > On 22 Jan 2019, at 15:31, Jason Gerlowski  wrote:
> >
> > Hi,
> >
> > You might want to check out the documentation, which goes over
> > split-shard in a bit more detail:
> >
> https://lucene.apache.org/solr/guide/7_6/collections-api.html#CollectionsAPI-splitshard
> >
> > To answer your question directly though, no.  Split-shard creates two
> > new subshards, but it doesn't do anything to remove or cleanup the
> > original shard.  The original shard remains with its data and will
> > delegate future requests to the result shards.
> >
> > Hope that helps,
> >
> > Jason
> >
> > On Tue, Jan 22, 2019 at 4:17 AM zhenyuan wei  wrote:
> >>
> >> Hi,
> >>   If I split shard1 to shard1_0,shard1_1, Is the parent shard1 will
> >> never be clean up?
> >>
> >>
> >> Best,
> >> Tinswzy
> >
>
>


Creating shard with core.properties

2019-01-29 Thread Bharath Kumar
Hi All,

I am trying to create a shard on Solr 7.6.0 using just a core.properties
file (like auto-discovering the shard) with legacyCloud set to false. But I
am getting an error message like the one below, even though I specify the
coreNodeName in the core.properties file:

"coreNodeName " + coreNodeName + " does not exist in shard " +
cloudDesc.getShardId() +
", ignore the exception if the replica was deleted");

Please note my ZooKeeper state is new and does not have any state
registered earlier. Can you please help? The reason I need this is that we
are trying to migrate from 6.1 to 7.6.0, and I have a single shard with 2
replicas created using core.properties and not the Collections API.
-- 
Thanks & Regards,
Bharath MV Kumar

"Life is short, enjoy every moment of it"


Solrcloud TimeoutException: Idle timeout expired

2019-01-29 Thread Schaum Mallik
I am seeing this error in our logs. Our Solr heap is set to more than 10G.
Any clues which anyone can provide will be very helpful.

Thank you

null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
timeout expired: 12/12 ms
at 
org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1075)
at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:313)
at 
org.apache.solr.servlet.ServletInputStreamWrapper.read(ServletInputStreamWrapper.java:74)
at 
org.apache.commons.io.input.ProxyInputStream.read(ProxyInputStream.java:100)
at 
org.apache.solr.common.util.FastInputStream.readWrappedStream(FastInputStream.java:79)
at 
org.apache.solr.common.util.FastInputStream.refill(FastInputStream.java:88)
at 
org.apache.solr.common.util.FastInputStream.peek(FastInputStream.java:60)
at 
org.apache.solr.handler.loader.JavabinLoader.parseAndLoadDocs(JavabinLoader.java:107)
at 
org.apache.solr.handler.loader.JavabinLoader.load(JavabinLoader.java:55)
at 
org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:97)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2539)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
at 
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at 
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
at org.eclipse.jetty.server.Server.handle(Server.java:531)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
at 
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
at 
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:760)
at 
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:678)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Idle timeout
expired: 12/12 ms
at 

Re: Solr 5.2.1 replication hangs possibly during segment merge after a delete operation

2019-01-29 Thread Ravi Prakash
Thanks.

I am not explicitly asking solr to optimize. I do send -commit yes in the POST 
command when I execute the delete query.

In the master-slave node where replication is hung I see this:

On the master:
-bash-4.1$ ls -al data/index/segments_*
-rw-rw-r--. 1 u g 1269 Jan 29 16:23 data/index/segments_13w4 
 
On the Slave:
-bash-4.1$ ls -al data/index.20181027004100961/segment*
-rw-rw-r--. 1 u g 1594 Jan 28 22:23 data/index.20181027004100961/segments_13no

And the slave replication admin page is spinning on:
Current File: segments_13no
0 bytes / 1.56 KB [0%]

Usually I have seen that when it is unable to download a specific file from
the master, the current replication fails (not sure if it times out) and
triggers a full copy, or at least I am able to abort replication from the
UI. In this case, where it cannot find the segments file itself, possibly
because a merge happened on the master and the segments file was recreated
there, it is not able to find the old segments file that the slave is
looking for, and instead of triggering a full copy it just sits there.
Abort Replication does nothing. I tried the abortfetch API as well, which
returns OK, but the UI continues to spin...

Service solr stop/start makes the next poll for replication succeed.

Not all master-slave setups hang this way. They are also running the
delete cron job once daily. That makes me believe that once deletes cross
some threshold, a merge kicks off that creates a new segments file, and
that blocks the replication after that...

Ravi

On 1/29/19, 4:44 AM, "Shawn Heisey"  wrote:

Sent by an external sender
--

On 1/28/2019 5:39 PM, Ravi Prakash wrote:
> I have a situation where I am trying to setup a once daily cron job on 
the master node to delete old documents from the index based on our retention 
policy.

This reply may not do you any good.  Just wanted you to know up front 
that I might not be helpful.

> The cron job basically does this (min and max are a day range):
>  DELETE="\"started:[${MINDATE} TO ${MAXDELDATE}]\""
>   /opt/solr/bin/post -c  -type application/json -out yes 
-commit yes -d {delete:{query:"$DELETE"}}

That is a delete by query.

Are you possibly asking Solr to background optimize the index before you 
do the deleteByQuery?  Because if you do something that begins a merge, 
then issue a deleteByQuery while the merge is happening, the delete and 
all further changes to the index will pause until the merge is done.  An 
optimize is a forced merge and can take a very long time to complete. 
Getting around that problem involves using deleteById instead of 
deleteByQuery.

I have no idea whether replication would be affected by the blocking 
that deleteByQuery causes.  I wouldn't expect it to be affected, but 
I've been surprised by Solr's behavior before.

Thanks,
Shawn




Re: HttpParser URI is too large

2019-01-29 Thread levtannen
Thank you Jan. This solution worked. The warning message "URI is too large
>81920" disappeared. But this fix unleashed another problem: the INFO
message that was suppressed by the previous error is now displayed in its
full length. And it is way too long because it lists all 100 collections. I
do not need this message in the log, and I hope it can be prevented by
setting an appropriate logger level (e.g. WARN or ERROR) in log4j2.xml, but
I do not know the name of this logger. Could you or somebody in the
community figure this out from the following fragment of the log?
Regards
Lev Tannen

32610 2019-01-29 18:07:01.870 INFO  (qtp817348612-19) [   ]
o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/metrics
params={wt=javabin=2=solr.core.ME-B-cases.shard1.replica_n4:UPDATE./update.requests=s
 
olr.core.TXS-A-documents.shard1.replica_n1:INDEX.sizeInBytes=solr.core.LAW-A-documents.shard1.replica_n4:UPDATE./update.requests=solr.core.CA1-A-documents.shard1.replica_n1:INDEX.sizeInBytes=solr.core.NYE
 
-A-documents.shard1.replica_n4:UPDATE./update.requests=solr.core.TXS-B-cases.shard1.replica_n4:QUERY./select.requests=solr.core.VI-B-documents.shard1.replica_n4:UPDATE./update.requests=solr.core.GU-A-docu
 
ments.shard1.replica_n2:QUERY./select.requests=solr.core.GAN-B-documents.shard1.replica_n4:QUERY./select.requests=solr.core.VT-B-documents.shard1.replica_n1:QUERY./select.requests=solr.core.CA7-B-document
 
s.shard1.replica_n5:INDEX.sizeInBytes=solr.core.MIW-B-cases.shard1.replica_n4:INDEX.sizeInBytes=solr.core.MSS-A-documents.shard1.replica_n2:UPDATE./update.requests=solr.core.CA2-A-documents.shard1.replica
 
_n4:UPDATE./update.requests=solr.core.NYS-A-cases.shard1.replica_n4:UPDATE./update.requests=solr.core.VAE-B-cases.shard1.replica_n4:INDEX.sizeInBytes=solr.core.NYS-B-documents.shard1.replica_n1:INDEX.size
 
InBytes=solr.core.NCE-B-cases.shard1.replica_n4:QUERY./select.requests=solr.core.FLN-B-cases.shard1.replica_n4:INDEX.sizeInBytes=solr.core.TNM-A-cases.shard1.replica_n1:UPDATE./update.requests=solr.co
 
re.OKW-A-cases.shard1.replica_n2:INDEX.sizeInBytes=solr.core.PR-B-cases.shard1.replica_n1:UPDATE./update.requests=solr.core.INS-B-documents.shard1.replica_n4:INDEX.sizeInBytes=solr.core.CA2-B-documents.sh
 
ard1.replica_n3:QUERY./select.requests=solr.core.GU-A-cases.shard1.replica_n1:QUERY./select.requests=solr.core.FLM-B-cases.shard1.replica_n4:UPDATE./update.requests=solr.core.ARE-B-documents.shard1.replic
 
a_n4:QUERY./select.requests=solr.core.NYN-B-documents.shard1.replica_n4:INDEX.sizeInBytes=solr.core.GAN-B-documents.shard1.replica_n4:UPDATE./update.requests=solr.core.GAN-A-cases.shard1.replica_n1:UPDATE
 
./update.requests=solr.core.CAN-B-cases.shard1.replica_n4:INDEX.sizeInBytes=solr.core.NCE-B-documents.shard1.replica_n4:UPDATE./update.requests=solr.core.WIE-B-cases.shard1.replica_n2:UPDATE./update.reque
 
sts=solr.core.VAE-B-cases.shard1.replica_n4:QUERY./select.requests=solr.core.CAS-A-documents.shard1.replica_n2:UPDATE./update.requests=solr.core.PR-A-cases.shard1.replica_n4:UPDATE./update.requests=so
 
lr.core.CAS-A-documents.shard1.replica_n2:INDEX.sizeInBytes=solr.core.ALM-B-documents.shard1.replica_n2:UPDATE./update.requests=solr.core.VAW-A-cases.shard1.replica_n4:QUERY./select.requests=solr.core.FLM

...



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr relevancy score different on replicated nodes

2019-01-29 Thread Walter Underwood
Is this a sharded Solr Cloud collection? If so, you can try using global IDF.
That should make the scores more similar on different nodes.

https://lucene.apache.org/solr/guide/6_6/distributed-requests.html#DistributedRequests-ConfiguringstatsCache_DistributedIDF_
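
If I remember right, that is a one-line solrconfig.xml change, something
like <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
(check the linked page for the exact form in your version).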

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Jan 29, 2019, at 10:38 AM, David Hastings  
> wrote:
> 
> Maybe instead of using the solr score in your metrics, find a way to use
> the document's location in the results? You can never trust the score to
> be consistent; it's constantly changing as the index changes.
> 
> On Tue, Jan 29, 2019 at 1:29 PM Ashish Bisht 
> wrote:
> 
>> Hi Erick,
>> 
>> Our business wanted the score not to be totally based on the default
>> relevancy algorithm, but instead on a mix of Solr relevancy + user
>> metrics (80% + 20%).
>>
>> Each result doc is scored against the max score as a fraction of 80.
>> The remaining 20 comes from user metrics.
>>
>> Finally the sort happens on the new score.
>>
>> But say we got the first page correctly, and for the second page the
>> request goes to the other replica where the max score is different. The
>> UI may then show a wrong sort compared to the first page, e.g. the last
>> value of page 1 is 70 and the first value of the second page is 72,
>> i.e. distorted sorting.
>>
>> On top of it we are not using pagination but an infinite scroll, which
>> makes it more noticeable.
>> 
>> Please suggest.
>> 
>> Regards
>> Ashish
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>> 



Re: Solr relevancy score different on replicated nodes

2019-01-29 Thread David Hastings
Maybe instead of using the solr score in your metrics, find a way to use
the document's location in the results? You can never trust the score to
be consistent; it's constantly changing as the index changes.

On Tue, Jan 29, 2019 at 1:29 PM Ashish Bisht 
wrote:

> Hi Erick,
>
> Our business wanted the score not to be totally based on the default
> relevancy algorithm, but instead on a mix of Solr relevancy + user
> metrics (80% + 20%).
>
> Each result doc is scored against the max score as a fraction of 80.
> The remaining 20 comes from user metrics.
>
> Finally the sort happens on the new score.
>
> But say we got the first page correctly, and for the second page the
> request goes to the other replica where the max score is different. The
> UI may then show a wrong sort compared to the first page, e.g. the last
> value of page 1 is 70 and the first value of the second page is 72,
> i.e. distorted sorting.
>
> On top of it we are not using pagination but an infinite scroll, which
> makes it more noticeable.
>
> Please suggest.
>
> Regards
> Ashish
>
>
>
>
>
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Solr relevancy score different on replicated nodes

2019-01-29 Thread Ashish Bisht
Hi Erick, 

Our business wanted the score not to be totally based on the default
relevancy algorithm, but instead on a mix of Solr relevancy + user metrics
(80% + 20%).

Each result doc is scored against the max score as a fraction of 80. The
remaining 20 comes from user metrics. Finally the sort happens on the new
score.
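
In code terms the blend is roughly this (a sketch with placeholder names):

    // normalize against the max score of the result set, then blend
    double blended = 80.0 * (solrScore / maxScore) + 20.0 * userMetric;  // userMetric in [0,1]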

But say we got the first page correctly, and for the second page the
request goes to the other replica where the max score is different. The UI
may then show a wrong sort compared to the first page, e.g. the last value
of page 1 is 70 and the first value of the second page is 72, i.e.
distorted sorting.

On top of it we are not using pagination but an infinite scroll, which
makes it more noticeable.

Please suggest. 

Regards
Ashish








--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Zheng Lin Edwin Yeo
Hi Paul,

Thanks for the reply and suggestion

Yes, we have installed RamMap, and are analyzing the results from there.
The problem we are facing is that once the query for that collection
becomes slow, it will not be fast again even after we restart Solr or the
entire machine.

Regards,
Edwin

On Tue, 29 Jan 2019 at 20:30,  wrote:

> Hi
>
> If the reason for the difference in speed is that the index is being read
> from disk, I would expect that the first query would be slow, but
> subsequent queries on the same collection should speed up. A query on the
> other collection could then be slower. In this case I would say that this
> is normal behavior. The OS file cache cannot be relied upon to give the
> same results in different circumstances, including different software
> versions.
>
> You may wish to install the RamMap tool[1], [2], although you may be
> having the inverse problem to that described in [1]. You can then see how
> much space is used by the cache and other demands.
>
> If subsequent queries are fast, then to me it does not seem like a problem
> for a development machine.  For production you may wish to store the
> indices in RAM and/or change from Windows to Linux, if it is important that
> all queries including the first are very fast.
>
> Have a nice day
> Paul
>
> -Ursprüngliche Nachricht-
> Von: Shawn Heisey 
> Gesendet: Dienstag, 29. Januar 2019 13:25
> An: solr-user@lucene.apache.org
> Betreff: Re: Indexing in one collection affect index in another collection
>
> On 1/29/2019 5:06 AM, Zheng Lin Edwin Yeo wrote:
> > My guess is after we change our searchFields_tcs schema which is:
> >
> > *From*:
> > <dynamicField ... stored="true" multiValued="true" termVectors="true" termPositions="true"
> > termOffsets="true"/>
> >
> > *To:*
> > <dynamicField ... stored="true" multiValued="true" storeOffsetsWithPositions="true"
> > termVectors="true" termPositions="false" termOffsets="false"/>
>
> Adding termVectors will make the index bigger.  Potentially much bigger.
>   This will increase the overall RAM requirement of the server, especially
> if the server is handling software other than Solr.  Anything that makes
> the index bigger can affect performance.
>
> > The above change was done in order to use the Solr recommended unified
> > highlighter (Posting with light term vectors) with Solr's
> > documentation claimed it is the fastest.
> >
> > My best guess is Solr 7.5.0 has some bugs that slowed down the whole
> > index and queries with the new approach (above new dynamicField
> > schema), which it affects the index OS filecaching or any other issues.
> >
> > So I kindly suggest you look deeper and see whether such bugs are exists?
>
> I know almost nothing about highlighting.  I wouldn't be able to look for
> bugs.
>
> Thanks,
> Shawn
>


Re: How to specify custom update chain in a SolrJ request

2019-01-29 Thread Chris Wareham

Answering myself, the solution is to update my code as follows:

UpdateRequest request = new UpdateRequest();
request.setParam("update.chain", "skipexisting");

for (Map.Entry<?, ?> user : users.entrySet()) {
    SolrInputDocument document = new SolrInputDocument();
    document.addField("id", user.getKey().toString());
    document.addField("applications", Collections.singletonMap("set", user.getValue()));

    request.add(document);
}

// process once, so all documents go through the custom chain in one request
request.process(solrClient);
solrClient.commit();

On 29/01/2019 16:27, Chris Wareham wrote:
I'm trying to update records in my Solr core, and have configured a 
custom update chain that skips updates to records that don't exist:

<updateRequestProcessorChain name="skipexisting">
  <processor class="solr.SkipExistingDocumentsProcessorFactory">
    <bool name="skipInsertIfExists">true</bool>
    <bool name="skipUpdateIfMissing">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

My SolrJ update code is currently:

for (Map.Entry<?, ?> user : users.entrySet()) {
    SolrInputDocument document = new SolrInputDocument();
    document.addField("id", user.getKey().toString());
    document.addField("applications", Collections.singletonMap("set", user.getValue()));

    solrClient.add(document);
}

solrClient.commit();

I can't seem to specify the update chain to use and I assume I need to 
use the UpdateRequest class. However, it's not clear how I go about 
setting a parameter on the UpdateRequest in order to specify the update 
chain.


Chris


Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Zheng Lin Edwin Yeo
Hi Shawn,

No worries, and thanks for your clarification.

We make these changes in order to use the Unified Highlighter, with
hl.offsetSource = POSTING, and add "light" term vectors.

The settings comes from what is written in the Solr guide on highlighting,
which says the following:

*Postings*: Supported by the Unified Highlighter. Set
storeOffsetsWithPositions to true. This adds a moderate amount of extra
data to the index but it speeds up highlighting tremendously, especially
compared to analysis with longer text fields.

However, wildcard queries will fall back to analysis unless "light" term
vectors are added.

- *with Term Vectors (light)*: Supported only by the Unified Highlighter.
  To enable this mode set termVectors to true but no other term vector
  related options on the field being highlighted.

  This adds even more data to the index than just storeOffsetsWithPositions,
  but not as much as enabling all the extra term vector options. Term Vectors
  are only accessed by the highlighter when a wildcard query is used and will
  prevent a fall back to analysis of the stored text.

  This is definitely the fastest option for highlighting wildcard queries
  on large text fields.


Below is the link to the guide:
https://lucene.apache.org/solr/guide/7_5/highlighting.html
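
For reference, this is roughly how we request it at query time (a minimal
SolrJ sketch; the field name is from our schema):

    // ask for the Unified Highlighter explicitly
    SolrQuery query = new SolrQuery("searchFields_tcs:keyword");
    query.set("hl", true);
    query.set("hl.method", "unified");
    query.set("hl.fl", "searchFields_tcs");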

Regards,
Edwin


On Tue, 29 Jan 2019 at 20:39, Shawn Heisey  wrote:

> On 1/29/2019 5:25 AM, Shawn Heisey wrote:
> > Adding termVectors will make the index bigger.  Potentially much bigger.
> > This will increase the overall RAM requirement of the server,
> > especially if the server is handling software other than Solr.  Anything
> > that makes the index bigger can affect performance.
>
> I misread the change.  Apologies.  Both definitions have termVectors.  I
> didn't notice it in the second definition because it was on a different
> line than in the first one.
>
> After figuring out what you changed, I cannot figure out what it is
> you're trying to do, and I'm not sure that the settings make sense.
> You've added/changed these three settings:
>
> storeOffsetsWithPositions="true"
> termPositions="false"
> termOffsets="false"
>
> It seems to me that the first new setting is directly contrary to the
> other two new settings.  I really have no idea what the outcome of the
> changes will be.
>
> Thanks,
> Shawn
>


Re: MLT - unexpected design choice

2019-01-29 Thread Maria Mestre
Hi Alessandro and Matt,

Thanks so much for your help!

@Alessandro: I will do so, thank you :-)



> On 29 Jan 2019, at 12:26, Alessandro Benedetti  wrote:
> 
> Hi Maria,
> this is actually a great catch!
> I have been working a lot on the More Like This and this mistake never
> caught my attention.
> 
> I agree with you, feel free to open a Jira Issue.
> 
> First of all what you say, makes sense.
> Secondly, it is the standard way used in the Lucene similarity
> calculations:
> 
> public Explanation idfExplain(CollectionStatistics collectionStats,
>     TermStatistics termStats) {
>   final long df = termStats.docFreq();
>   final long docCount = collectionStats.docCount();
>   final float idf = idf(df, docCount);
>   return Explanation.match(idf, "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
>       Explanation.match(df, "docFreq, number of documents containing term"),
>       Explanation.match(docCount, "docCount, total number of documents with field"));
> }
> 
> 
> Indeed the "int numDocs = ir.numDocs();" should actually be allocated
> per term in the for loop, using the field stats, something like:
>
> numDocs = ir.getDocCount(fieldName)
> 
> Feel free to open the Jira issue and attach a patch with at least a
> testCase that shows the bugfix.
> 
> I will be available for doing the review.
> 
> 
> Cheers
> 
> --
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> http://www.sease.io
> 
> 
> 
> On Tue, Jan 29, 2019 at 11:41 AM Matt Pearce wrote:
> 
>> Hi Maria,
>> 
>> Would it help to add a filter to your query to restrict the results to
>> just those where the description field is populated? Eg. add
>> 
>> fq=description:[* TO *]
>> 
>> to your query parameters.
>> 
>> Apologies if I'm misunderstanding the problem!
>> 
>> Best,
>> 
>> Matt
>> 
>> 
>> On 28/01/2019 16:29, Maria Mestre wrote:
>>> Hi all,
>>> 
>>> First of all, I’m not a Java developer, and a SolR newbie. I have worked
>> with Elasticsearch for some years (not contributing, just as a user), so I
>> think I have the basics of text search engines covered. I am always
>> learning new things though!
>>> 
>>> I created an index in SolR and used more-like-this on it, by passing a
>> document_id. My data has a special feature, which is that one of the fields
>> is called “description” but is only populated about 10% of the time. Most
>> of the time it is empty. I am using that field to query similar documents.
>>> 
>>> So I query the /mlt endpoint using these parameters (for example):
>>> 
>>> {q=id:"0c7c4d74-0f37-44ea-8933-cd2ee7964457”,
>>> mlt=true,
>>> mlt.fl=description,
>>> mlt.mindf=1,
>>> mlt.mintf=1,
>>> mlt.maxqt=5,
>>> wt=json,
>>> mlt.interestingTerms=details}
>>> 
>>> The issue I have is that when retrieving the key scored terms
>> (interestingTerms), the code uses the total number of documents in the
>> index, not the total number of documents with populated “description”
>> field. This is where it’s done in the code:
>> https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651
>>> 
>>> The effect of this choice is that the “idf” does not vary much, given
>> that numDocs >> number of documents with “description”, so the key terms
>> end up being just the terms with the highest term frequencies.
>>> 
>>> It is inconsistent because the MLT-search then uses these extracted key
>> terms and scores all documents using an idf which is computed only on the
>> subset of documents with “description”. So one part of the MLT uses a
>> different numDocs than another part. This sounds like an odd choice, and
>> not expected at all, and I wonder if I’m missing something.
>>> 
>>> Best,
>>> Maria
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> --
>> Matt Pearce
>> Flax - Open Source Enterprise Search
>> 

How to specify custom update chain in a SolrJ request

2019-01-29 Thread Chris Wareham
I'm trying to update records in my Solr core, and have configured a 
custom update chain that skips updates to records that don't exist:

<updateRequestProcessorChain name="skipexisting">
  <processor class="solr.SkipExistingDocumentsProcessorFactory">
    <bool name="skipInsertIfExists">true</bool>
    <bool name="skipUpdateIfMissing">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

My SolrJ update code is currently:

for (Map.Entry<?, ?> user : users.entrySet()) {
    SolrInputDocument document = new SolrInputDocument();
    document.addField("id", user.getKey().toString());
    document.addField("applications", Collections.singletonMap("set", user.getValue()));

    solrClient.add(document);
}

solrClient.commit();

I can't seem to specify the update chain to use and I assume I need to 
use the UpdateRequest class. However, it's not clear how I go about 
setting a parameter on the UpdateRequest in order to specify the update 
chain.


Chris


SOLR 7.5 returns http response 304 to SOLR admin UI query - is this correct when httpCaching never304="true" is set?

2019-01-29 Thread Standen Guy
Hi All,
   I have recently upgraded to SOLR 7.5 from SOLR 4.10.3 and 
believe I have noticed a change in the way HTTP caching is operating.

I have installed the vanilla SOLR 7.5 on Windows 2012 R2

I have run the techproducts example, where the solrconfig includes:

<httpCaching never304="true"/>
I have opened the SOLR admin UI on IE11 and run a query (*:*) against the
techproducts core. If I re-execute exactly the same query from the UI by
re-pressing the "execute query" button, the results are exactly the same
(including the QTime value). Running IE11 in debug mode (F12) with "Always
Refresh from server" switched OFF, it appears the http response from SOLR
is 304 (use cached results).


If I change the IE11 "internet options / temporary internet files / check for
newer versions of stored pages" setting from "Automatically" to "Every time I
visit the webpage", then re-pressing the "execute query" button returns http
response 200 as expected.

I was expecting the never304="true" configuration to prevent 304s being
returned from SOLR - is this a mistaken assumption?

I need to ensure that a user will not get browser-cached results when doing
queries from the SOLR Admin UI, and I cannot change the IE11 browser internet
options. Is there a way to ensure that the server always returns the latest
results and does not allow browser caching to be used?


Many Thanks,

Guy


Unless otherwise stated, this email has been sent from Fujitsu Services Limited 
(registered in England No 96056); Fujitsu EMEA PLC (registered in England No 
2216100) both with registered offices at: 22 Baker Street, London W1U 3BW;  PFU 
(EMEA) Limited, (registered in England No 1578652) and Fujitsu Laboratories of 
Europe Limited (registered in England No. 4153469) both with registered offices 
at: Hayes Park Central, Hayes End Road, Hayes, Middlesex, UB4 8FE. 
This email is only for the use of its intended recipient. Its contents are 
subject to a duty of confidence and may be privileged. Fujitsu does not 
guarantee that this email has not been intercepted and amended or that it is 
virus-free.


Re: Solr relevancy score different on replicated nodes

2019-01-29 Thread Erick Erickson
No, this is not a bug but a consequence of the design. ExactStats can help,
but there is no guarantee that different replicas will compute the exact same
score. Scores should be very close however.

You haven't explained why you need the scores to match. 99% of the time,
worrying about scores at this level is misguided. So I'd really try to
figure out
whether they're necessary or not.

Best,
Erick

On Tue, Jan 29, 2019 at 1:51 AM Ashish Bisht  wrote:
>
> Hi Erick,
>
> To test this scenario I added replica again and from few days have been
> monitoring metrics like Num Docs, Max Doc, Deleted Docs from *Overview*
> section of core.Checked *Segments Info* section too.Everything looks in
> sync.
>
> http://:8983/solr/#/MyTestCollection_*shard1_replica_n7*/
> http://:8983/solr/#/MyTestCollection_*4_shard1_replica_n7*/
>
> If in future they go out of sync, I just wanted to confirm whether this is
> a bug, although you mentioned:
>
> *bq. Shouldn't both replica and leader come to same state
> after this much long period.
>
> No. After that long, the docs will be the same, all the docs
> present on one replica will be present and searchable on
> the other. However, they will be in different segments so the
> "stats skew" will remain. *
>
>
> We need these scores, so as a temporary solution we will monitor these
> metrics for any issues and take action (either optimize or delete and
> re-add the replica) accordingly. Does it make sense?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Solr 5.2.1 replication hangs possibly during segment merge after a delete operation

2019-01-29 Thread Shawn Heisey

On 1/28/2019 5:39 PM, Ravi Prakash wrote:

I have a situation where I am trying to setup a once daily cron job on the 
master node to delete old documents from the index based on our retention 
policy.


This reply may not do you any good.  Just wanted you to know up front 
that I might not be helpful.



The cron job basically does this (min and max are a day range):
 DELETE="\"started:[${MINDATE} TO ${MAXDELDATE}]\""
  /opt/solr/bin/post -c  -type application/json -out yes -commit yes -d 
{delete:{query:"$DELETE"}}


That is a delete by query.

Are you possibly asking Solr to background optimize the index before you 
do the deleteByQuery?  Because if you do something that begins a merge, 
then issue a deleteByQuery while the merge is happening, the delete and 
all further changes to the index will pause until the merge is done.  An 
optimize is a forced merge and can take a very long time to complete. 
Getting around that problem involves using deleteById instead of 
deleteByQuery.
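
For example, something along these lines with SolrJ (an untested sketch;
the core name, field and retention window are placeholders, and you would
repeat it until no documents match):

    String core = "mycore";               // placeholder core name
    String minDate = "NOW/DAY-30DAYS";    // placeholder retention window
    String maxDate = "NOW/DAY-29DAYS";
    // fetch a batch of IDs matching the window, then delete by ID
    SolrQuery q = new SolrQuery("started:[" + minDate + " TO " + maxDate + "]");
    q.setFields("id");
    q.setRows(1000);
    SolrDocumentList hits = solrClient.query(core, q).getResults();
    List<String> ids = new ArrayList<>();
    for (SolrDocument doc : hits) {
        ids.add(doc.getFieldValue("id").toString());
    }
    if (!ids.isEmpty()) {
        solrClient.deleteById(core, ids);
        solrClient.commit(core);
    }

Unlike deleteByQuery, those deleteById calls do not have to wait for a
running merge to finish.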


I have no idea whether replication would be affected by the blocking 
that deleteByQuery causes.  I wouldn't expect it to be affected, but 
I've been surprised by Solr's behavior before.


Thanks,
Shawn


Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Shawn Heisey

On 1/29/2019 5:25 AM, Shawn Heisey wrote:
Adding termVectors will make the index bigger.  Potentially much bigger. 
This will increase the overall RAM requirement of the server, 
especially if the server is handling software other than Solr.  Anything 
that makes the index bigger can affect performance.


I misread the change.  Apologies.  Both definitions have termVectors.  I 
didn't notice it in the second definition because it was on a different 
line than in the first one.


After figuring out what you changed, I cannot figure out what it is 
you're trying to do, and I'm not sure that the settings make sense. 
You've added/changed these three settings:


storeOffsetsWithPositions="true"
termPositions="false"
termOffsets="false"

It seems to me that the first new setting is directly contrary to the 
other two new settings.  I really have no idea what the outcome of the 
changes will be.


Thanks,
Shawn


Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Shawn Heisey

On 1/29/2019 5:06 AM, Zheng Lin Edwin Yeo wrote:

My guess is after we change our searchFields_tcs schema which is:

*From*:
<dynamicField ... stored="true" multiValued="true" termVectors="true"
termPositions="true" termOffsets="true"/>

*To:*
<dynamicField ... stored="true" multiValued="true" storeOffsetsWithPositions="true"
termVectors="true" termPositions="false" termOffsets="false"/>

Adding termVectors will make the index bigger.  Potentially much bigger. 
 This will increase the overall RAM requirement of the server, 
especially if the server is handling software other than Solr.  Anything 
that makes the index bigger can affect performance.



The above change was done in order to use the Solr recommended unified
highlighter (Posting with light term vectors) with Solr's documentation
claimed it is the fastest.

My best guess is Solr 7.5.0 has some bugs that slowed down the whole index
and queries with the new approach (above new dynamicField schema), which it
affects the index OS filecaching or any other issues.

So I kindly suggest you look deeper and see whether such bugs are exists?


I know almost nothing about highlighting.  I wouldn't be able to look 
for bugs.


Thanks,
Shawn


AW: Indexing in one collection affect index in another collection

2019-01-29 Thread paul.dodd
Hi

If the reason for the difference in speed is that the index is being read from 
disk, I would expect that the first query would be slow, but subsequent queries 
on the same collection should speed up. A query on the other collection could 
then be slower. In this case I would say that this is normal behavior. The OS 
file cache cannot be relied upon to give the same results in different 
circumstances, including different software  versions.

You may wish to install the RamMap tool[1], [2], although you may be having the 
inverse problem to that described in [1]. You can then see how much space is 
used by the cache and other demands.

If subsequent queries are fast, then to me it does not seem like a problem for 
a development machine.  For production you may wish to store the indices in 
RAM and/or change from Windows to Linux, if it is important that all queries, 
including the first, are very fast.

Have a nice day
Paul

-Ursprüngliche Nachricht-
Von: Shawn Heisey  
Gesendet: Dienstag, 29. Januar 2019 13:25
An: solr-user@lucene.apache.org
Betreff: Re: Indexing in one collection affect index in another collection

On 1/29/2019 5:06 AM, Zheng Lin Edwin Yeo wrote:
> My guess is it happened after we changed our searchFields_tcs schema:
> 
> *From*:
> <dynamicField ... stored="true" multiValued="true" termVectors="true"
> termPositions="true" termOffsets="true"/>
> 
> *To:*
> <dynamicField ... stored="true" multiValued="true" storeOffsetsWithPositions="true"
> termVectors="true" termPositions="false" termOffsets="false"/>

Adding termVectors will make the index bigger.  Potentially much bigger. 
  This will increase the overall RAM requirement of the server, especially if 
the server is handling software other than Solr.  Anything that makes the index 
bigger can affect performance.

> The above change was done in order to use the Solr recommended unified 
> highlighter ("postings with light term vectors"), which Solr's 
> documentation claims is the fastest.
> 
> My best guess is that Solr 7.5.0 has some bug that slows down the whole 
> index and queries with the new approach (the new dynamicField 
> schema above), perhaps affecting the index OS file caching or something else.
> 
> So I kindly suggest you look deeper and see whether such bugs exist.

I know almost nothing about highlighting.  I wouldn't be able to look for bugs.

Thanks,
Shawn


Re: MLT - unexpected design choice

2019-01-29 Thread Alessandro Benedetti
Hi Maria,
this is actually a great catch!
I have been working a lot on the More Like This and this mistake never
caught my attention.

I agree with you, feel free to open a Jira Issue.

First of all, what you say makes sense.
Secondly, it is the standard way used in the Lucene similarity calculations:

public Explanation idfExplain(CollectionStatistics collectionStats,
                              TermStatistics termStats) {
  final long df = termStats.docFreq();
  final long docCount = collectionStats.docCount();
  final float idf = idf(df, docCount);
  return Explanation.match(idf,
      "idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:",
      Explanation.match(df, "docFreq, number of documents containing term"),
      Explanation.match(docCount, "docCount, total number of documents with field"));
}


Indeed, the int numDocs = ir.numDocs(); should actually be computed per term
in the for loop, using the field stats, something like:

    numDocs = ir.getDocCount(fieldName);
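
A hedged sketch of that per-field idf, just to illustrate the idea (the
class and method here are made up for illustration, not the actual patch):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;

    final class PerFieldIdf {
        // Use the count of documents that actually have the field,
        // mirroring log((docCount+1)/(docFreq+1)) + 1 from the similarity.
        static float idf(IndexReader ir, String fieldName, long docFreq)
                throws IOException {
            long numDocs = ir.getDocCount(fieldName);
            return (float) (Math.log((numDocs + 1.0) / (docFreq + 1.0)) + 1.0);
        }
    }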

Feel free to open the Jira issue and attach a patch with at least a
testCase that shows the bugfix.

I will be available for doing the review.


Cheers

--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io


On Tue, Jan 29, 2019 at 11:41 AM Matt Pearce  wrote:

> Hi Maria,
>
> Would it help to add a filter to your query to restrict the results to
> just those where the description field is populated? Eg. add
>
> fq=description:[* TO *]
>
> to your query parameters.
>
> Apologies if I'm misunderstanding the problem!
>
> Best,
>
> Matt
>
>
> On 28/01/2019 16:29, Maria Mestre wrote:
> > Hi all,
> >
> > First of all, I’m not a Java developer, and I’m a Solr newbie. I have worked
> with Elasticsearch for some years (not contributing, just as a user), so I
> think I have the basics of text search engines covered. I am always
> learning new things though!
> >
> > I created an index in Solr and used more-like-this on it, by passing a
> document_id. My data has a special feature, which is that one of the fields
> is called “description” but is only populated about 10% of the time. Most
> of the time it is empty. I am using that field to query similar documents.
> >
> > So I query the /mlt endpoint using these parameters (for example):
> >
> > {q=id:"0c7c4d74-0f37-44ea-8933-cd2ee7964457”,
> > mlt=true,
> > mlt.fl=description,
> > mlt.mindf=1,
> > mlt.mintf=1,
> > mlt.maxqt=5,
> > wt=json,
> > mlt.interestingTerms=details}
> >
> > The issue I have is that when retrieving the key scored terms
> (interestingTerms), the code uses the total number of documents in the
> index, not the total number of documents with populated “description”
> field. This is where it’s done in the code:
> https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651
> >
> > The effect of this choice is that the “idf” does not vary much, given
> that numDocs >> number of documents with “description”, so the key terms
> end up being just the terms with the highest term frequencies.
> >
> > It is inconsistent because the MLT-search then uses these extracted key
> terms and scores all documents using an idf which is computed only on the
> subset of documents with “description”. So one part of the MLT uses a
> different numDocs than another part. This sounds like an odd choice, and
> not expected at all, and I wonder if I’m missing something.
> >
> > Best,
> > Maria
> >
> >
> >
> >
> >
> >
>
> --
> Matt Pearce
> Flax - Open Source Enterprise Search
> www.flax.co.uk
>


AW: Indexing in one collection affect index in another collection

2019-01-29 Thread paul.dodd
References, sorry:

[1] 
https://support.microsoft.com/en-ca/help/976618/you-experience-performance-issues-in-applications-and-services-when-th
[2] https://docs.microsoft.com/en-us/sysinternals/downloads/rammap

-Ursprüngliche Nachricht-
Von: Dodd, Paul Sutton (UB) 
Gesendet: Dienstag, 29. Januar 2019 13:31
An: 'solr-user@lucene.apache.org' 
Betreff: AW: Indexing in one collection affect index in another collection

Hi

If the reason for the difference in speed is that the index is being read from 
disk, I would expect that the first query would be slow, but subsequent queries 
on the same collection should speed up. A query on the other collection could 
then be slower. In this case I would say that this is normal behavior. The OS 
file cache cannot be relied upon to give the same results in different 
circumstances, including different software  versions.

You may wish to install the RamMap tool[1], [2], although you may be having the 
inverse problem to that described in [1]. You can then see how much space is 
used by the cache and other demands.

If subsequent queries are fast, then to me it does not seem like a problem for 
a development machine.  For production you may wish to store the indices in 
RAM and/or change from Windows to Linux, if it is important that all queries, 
including the first, are very fast.

Have a nice day
Paul

-Ursprüngliche Nachricht-
Von: Shawn Heisey 
Gesendet: Dienstag, 29. Januar 2019 13:25
An: solr-user@lucene.apache.org
Betreff: Re: Indexing in one collection affect index in another collection

On 1/29/2019 5:06 AM, Zheng Lin Edwin Yeo wrote:
> My guess is it happened after we changed our searchFields_tcs schema:
> 
> *From*:
> <dynamicField ... stored="true" multiValued="true" termVectors="true"
> termPositions="true" termOffsets="true"/>
> 
> *To:*
> <dynamicField ... stored="true" multiValued="true" storeOffsetsWithPositions="true"
> termVectors="true" termPositions="false" termOffsets="false"/>

Adding termVectors will make the index bigger.  Potentially much bigger. 
  This will increase the overall RAM requirement of the server, especially if 
the server is handling software other than Solr.  Anything that makes the index 
bigger can affect performance.

> The above change was done in order to use the Solr recommended unified 
> highlighter ("postings with light term vectors"), which Solr's 
> documentation claims is the fastest.
> 
> My best guess is that Solr 7.5.0 has some bug that slows down the whole 
> index and queries with the new approach (the new dynamicField 
> schema above), perhaps affecting the index OS file caching or something else.
> 
> So I kindly suggest you look deeper and see whether such bugs exist.

I know almost nothing about highlighting.  I wouldn't be able to look for bugs.

Thanks,
Shawn


Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Zheng Lin Edwin Yeo
Hi Shawn,

Thanks for your reply.

However, we did not delete our index when the screenshot was taken. All the
indexes are still in Solr.

My guess is it happened after we changed our searchFields_tcs schema:

*From*:
<dynamicField ... stored="true" multiValued="true" termVectors="true"
termPositions="true" termOffsets="true"/>

*To:*
<dynamicField ... stored="true" multiValued="true" storeOffsetsWithPositions="true"
termVectors="true" termPositions="false" termOffsets="false"/>
The above change was done in order to use the Solr recommended unified
highlighter ("postings with light term vectors"), which Solr's documentation
claims is the fastest.

My best guess is that Solr 7.5.0 has some bug that slows down the whole index
and queries with the new approach (the new dynamicField schema above), perhaps
affecting the index OS file caching or something else.

So I kindly suggest you look deeper and see whether such bugs exist.

Note: If you need my schema and configuration files, please refer to my
earlier correspondences in the same thread.

Regards,
Edwin

On Tue, 29 Jan 2019 at 18:38, Shawn Heisey  wrote:

> On 1/26/2019 4:48 PM, Zheng Lin Edwin Yeo wrote:
> > Thanks for your reply. Below are the replies to your email:
> >
> > 1) We have tried to set the heap size to be 8g previously when we faced
> the
> > same issue, and changing to 7g does not help too.
> >
> > 2) We are using standard disk at the moment.
> >
> > 3) In the link is the screenshot of the process list that is sort by
> Commit.
> >
> https://drive.google.com/file/d/1TzxaAqbDJwYO0aHo9GW34p2kncnylRkG/view?usp=sharing
>
> My original thought is still the best idea I have.  I think that the
> other software on the system is heavily using the disk cache and not
> leaving enough of it for Solr's data.
>
>  From what I can tell, the other software on the system is not using
> MMAP for disk access, so the large amount of disk cache usage is not
> reflected in the "Commit" number for those programs.
>
> In the last screenshot, the Solr instances appear to be handling very
> little index data -- the Commit number is actually *smaller* than the
> Working Set number, which will not be the case when there is a lot of
> index data.  I'm betting that at the point when that screenshot was
> taken, all the index data had been deleted, possibly in preparation for
> rebuilding the indexes.
>
> Thanks,
> Shawn
>


Limit facet terms based on a substring using the JSON facet API

2019-01-29 Thread Tom Van Cuyck
Hi

In the old Solr facet API there are the facet.contains and
facet.contains.ignoreCase parameters to limit the facet values to those
terms containing the specified substring.
Is there an equivalent option in the JSON facet API? Or is there a way to
obtain the same behavior with the JSON API? I can't find anything in the
official documentation.
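
For reference, a hedged example of the classic-API request I am trying to
translate (collection and field names are illustrative):

    curl "http://localhost:8983/solr/mycollection/select?q=*:*&rows=0&facet=true&facet.field=category&facet.contains=phone&facet.contains.ignoreCase=true"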

Kind regards, Tom
-- 

Tom Van Cuyck
Software Engineer

ONTOFORCE
WINNER of EY scale-up of the year 2018
@: tom.vancu...@ontoforce.com
T: +32 9 292 80 37
W: http://www.ontoforce.com
W: http://www.disqover.com
AA Tower, Technologiepark 122 (3/F), 9052 Gent, Belgium

CIC, One Broadway, MA 02142 Cambridge, United States


Re: MLT - unexpected design choice

2019-01-29 Thread Matt Pearce

Hi Maria,

Would it help to add a filter to your query to restrict the results to 
just those where the description field is populated? Eg. add


fq=description:[* TO *]

to your query parameters.

Apologies if I'm misunderstanding the problem!

Best,

Matt


On 28/01/2019 16:29, Maria Mestre wrote:

Hi all,

First of all, I’m not a Java developer, and I’m a Solr newbie. I have worked with 
Elasticsearch for some years (not contributing, just as a user), so I think I 
have the basics of text search engines covered. I am always learning new things 
though!

I created an index in Solr and used more-like-this on it, by passing a 
document_id. My data has a special feature, which is that one of the fields is 
called “description” but is only populated about 10% of the time. Most of the 
time it is empty. I am using that field to query similar documents.

So I query the /mlt endpoint using these parameters (for example):

{q=id:"0c7c4d74-0f37-44ea-8933-cd2ee7964457”,
mlt=true,
mlt.fl=description,
mlt.mindf=1,
mlt.mintf=1,
mlt.maxqt=5,
wt=json,
mlt.interestingTerms=details}

The issue I have is that when retrieving the key scored terms 
(interestingTerms), the code uses the total number of documents in the index, 
not the total number of documents with populated “description” field. This is 
where it’s done in the code: 
https://github.com/apache/lucene-solr/blob/master/lucene/queries/src/java/org/apache/lucene/queries/mlt/MoreLikeThis.java#L651

The effect of this choice is that the “idf” does not vary much, given that 
numDocs >> the number of documents with “description”, so the key terms end up 
being just the terms with the highest term frequencies.

It is inconsistent because the MLT-search then uses these extracted key terms 
and scores all documents using an idf which is computed only on the subset of 
documents with “description”. So one part of the MLT uses a different numDocs 
than another part. This sounds like an odd choice, and not expected at all, and 
I wonder if I’m missing something.

Best,
Maria




  



--
Matt Pearce
Flax - Open Source Enterprise Search
www.flax.co.uk


Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Shawn Heisey

On 1/26/2019 4:48 PM, Zheng Lin Edwin Yeo wrote:

Thanks for your reply. Below are the replies to your email:

1) We have tried to set the heap size to be 8g previously when we faced the
same issue, and changing to 7g does not help too.

2) We are using standard disk at the moment.

3) In the link is the screenshot of the process list that is sort by Commit.
https://drive.google.com/file/d/1TzxaAqbDJwYO0aHo9GW34p2kncnylRkG/view?usp=sharing


My original thought is still the best idea I have.  I think that the 
other software on the system is heavily using the disk cache and not 
leaving enough of it for Solr's data.


From what I can tell, the other software on the system is not using 
MMAP for disk access, so the large amount of disk cache usage is not 
reflected in the "Commit" number for those programs.


In the last screenshot, the Solr instances appear to be handling very 
little index data -- the Commit number is actually *smaller* than the 
Working Set number, which will not be the case when there is a lot of 
index data.  I'm betting that at the point when that screenshot was 
taken, all the index data had been deleted, possibly in preparation for 
rebuilding the indexes.


Thanks,
Shawn


Re: PatternReplaceFilterFactory problem

2019-01-29 Thread Chris Wareham
Thanks for the help - changing the field type of the destination for the 
copy fields to "text_en" solved the problem. I'd foolishly assumed that 
the analysis of the source fields was applied then the resulting tokens 
passed to the copy field, which doesn't really make sense now that I 
think about it!


So the indexing process is:

+-------------+     +------------------+     +-------------+
| companyName |     |   companyName    |     | companyName |
| input data  |--+->| text_en analysis |---->|    index    |
+-------------+  |  +------------------+     +-------------+
                 |
                 |  +------------------+     +-------------+
                 +->|       text       |---->|    text     |
                    | text_en analysis |     |    index    |
                    +------------------+     +-------------+

Rather than:

+-------------+     +------------------+     +-------------+
| companyName |     |   companyName    |     | companyName |
| input data  |---->| text_en analysis |---->|    index    |
+-------------+     +--------+---------+     +-------------+
                             |
                             v
                    +-----------------------+     +-------------+
                    |         text          |     |    text     |
                    | text_general analysis |---->|    index    |
                    +-----------------------+     +-------------+
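
For anyone finding this thread later, a hedged sketch of the corrected
wiring (field names match the diagrams above; the other attributes are
illustrative):

    <field name="companyName" type="text_en" indexed="true" stored="true"/>
    <field name="text" type="text_en" indexed="true" stored="false"
           multiValued="true"/>
    <copyField source="companyName" dest="text"/>

copyField copies the raw input value, so it is the destination field's own
analyzer (text_en here) that determines the indexed tokens.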


On 28/01/2019 12:37, Scott Stults wrote:

Hi Chris,

You've included the field definition of type text_en, but in your queries
you're searching the field "text", which is of type text_general. That may
be the source of your problem, but if looking into that doesn't help send
the definition of text_general as well.

Hope that helps!

-Scott

On Mon, Jan 28, 2019 at 6:02 AM Chris Wareham <
chris.ware...@graduate-jobs.com> wrote:


I'm trying to index some data which often includes domain names. I'd
like to remove the .com TLD, so I have modified the text_en field type
by adding a PatternReplaceFilterFactory filter. However, it doesn't
appear to be working as a search for "text:(mydomain.com)" matches
records but "text:(mydomain)" does not.


  








  
  








  

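
A hedged reconstruction of what such a fieldType might look like; the actual
pattern and the rest of the analysis chain are unknown, so treat every
attribute below as an assumption:

    <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- assumption: strip a trailing ".com" so "mydomain.com" also
             indexes the token "mydomain" -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="\.com$" replacement="" replace="all"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>

The Analysis screen in the Solr admin UI is an easy way to check what a
chain like this actually emits for "mydomain.com".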

The actual field definitions are as follows:

  (field definitions stripped by the mail archive)

Re: Large Number of Collections takes down Solr 7.3

2019-01-29 Thread Hendrik Haddorp
How much memory do the Solr instances have? Any more details on what 
happens when the Solr instances start to fail?

We are using multiple Solr clouds to keep the collection count low(er).

On 29.01.2019 06:53, Gus Heck wrote:

Does it all have to be in a single cloud?

On Mon, Jan 28, 2019, 10:34 PM Shawn Heisey wrote:
On 1/28/2019 8:12 PM, Monica Skidmore wrote:

I would have to negotiate with the middle-ware teams - but, we've used a
core per customer in master-slave mode for about 3 years now, with great
success.  Our pool of data is very large, so limiting a customer's searches
to just their core keeps query times fast (or at least reduces the chances
of one customer impacting another with expensive queries).  There is also a
little security added - since the customer is required to provide the core
to search, there is less chance that they'll see another customer's data in
their responses (like they might if they 'forgot' to add a filter to their
query).  We were hoping that moving to Cloud would help our management of
the largest customers - some of which we'd like to sub-shard with the cloud
tooling.  We expected cloud to support as many cores/collections as our
2-versions-old Solr instances - but we didn't count on all the increased
network traffic or the extra complications of bringing up a large cloud
cluster.

At this time, SolrCloud will not handle what you're trying to throw at
it.  Without Cloud, Solr can fairly easily handle thousands of indexes,
because there is no communication between nodes about cluster state.
The immensity of that communication (handled via ZooKeeper) is why
SolrCloud can't scale to thousands of shard replicas.

The solution to this problem will be twofold:  1) Reduce the number of
work items in the Overseer queue.  2) Make the Overseer do its job a lot
faster.  There have been small incremental improvements towards these
goals, but as you've noticed, we're definitely not there yet.

On the subject of a customer forgetting to add a filter ... your systems
should be handling that for them ... if the customer has direct access
to Solr, then all bets are off... they'll be able to do just about
anything they want.  It is possible to configure a proxy to limit what
somebody can get to, but it would be pretty complicated to come up with
a proxy configuration that fully locks things down.

Using shards is completely possible without SolrCloud.  But SolrCloud
certainly does make it a lot easier.

How many records in your largest customer indexes?  How big are those
indexes on disk?

Thanks,
Shawn





Re: Solr relevancy score different on replicated nodes

2019-01-29 Thread Ashish Bisht
Hi Erick,

To test this scenario I added the replica again, and for a few days I have
been monitoring metrics like Num Docs, Max Doc, and Deleted Docs from the
*Overview* section of the core. I checked the *Segments Info* section too.
Everything looks in sync.

http://:8983/solr/#/MyTestCollection_*shard1_replica_n7*/
http://:8983/solr/#/MyTestCollection_*4_shard1_replica_n7*/

If in future they go out of sync, I just wanted to confirm whether this is a
bug, although you mentioned:

bq. Shouldn't both replica and leader come to same state
after this much long period.

No. After that long, the docs will be the same, all the docs
present on one replica will be present and searchable on
the other. However, they will be in different segments so the
"stats skew" will remain.


We need these scores, so as a temporary solution we will monitor these
metrics for any issues and take action accordingly (either optimize, or
delete and re-add the replica). Does that make sense?
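
A hedged sketch of those corrective actions (collection, shard and replica
names are illustrative, and the host is omitted as in the URLs above):

    # force-merge down to one segment (expensive, and blocks other updates)
    curl "http://:8983/solr/MyTestCollection/update?optimize=true&maxSegments=1"

    # or drop and re-create the replica via the Collections API
    curl "http://:8983/solr/admin/collections?action=DELETEREPLICA&collection=MyTestCollection&shard=shard1&replica=core_node8"
    curl "http://:8983/solr/admin/collections?action=ADDREPLICA&collection=MyTestCollection&shard=shard1"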



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Indexing in one collection affect index in another collection

2019-01-29 Thread Zheng Lin Edwin Yeo
Hi Shawn / Jan,

Do we have any further insights into this problem?
The same problem still happens even after we made the changes and re-indexed
all the data.

Regards,
Edwin

On Sun, 27 Jan 2019 at 07:48, Zheng Lin Edwin Yeo 
wrote:

> Hi Shawn,
>
> Thanks for your reply. Below are the replies to your email:
>
> 1) We have tried to set the heap size to be 8g previously when we faced
> the same issue, and changing to 7g does not help too.
>
> 2) We are using standard disk at the moment.
>
> 3) In the link is the screenshot of the process list that is sort by
> Commit.
>
> https://drive.google.com/file/d/1TzxaAqbDJwYO0aHo9GW34p2kncnylRkG/view?usp=sharing
>
> Regards,
> Edwin
>
> On Sun, 27 Jan 2019 at 02:07, Shawn Heisey  wrote:
>
>> On 1/26/2019 9:40 AM, Zheng Lin Edwin Yeo wrote:
>> > We have tried to add -a "-XX:+AlwaysPreTouch" that starts Solr, but
>> there
>> > is no noticeable difference in the performance.
>> >
>> > As for the screenshot, I have captured another one after we added  -a
>> > "-XX:+AlwaysPreTouch", and it is sorted on the Working Set column.
>> > Below is the link to the new screenshot:
>> >
>> https://drive.google.com/file/d/1YEsJxnCeRorvBRCSqeowZOu3Fpena5Mo/view?usp=sharing
>>
>> That would mean that it's probably not a heap issue.  You could try
>> increasing the heap size on each Solr instance to 7g as a test to see
>> whether it helps at all.  I'd be a little bit surprised if that helps.
>>
>> I can't tell much about the software other than Solr that's running on
>> this machine, but my best guess at this point is that Solr index
>> information is being pushed out of the disk cache by the other software
>> running on the machine, making it so that when Solr needs to do a query,
>> a lot of information must be read from disk instead of the cache.  Disks
>> are very very slow compared to memory.  SSD is faster, but still quite a
>> bit slower than main memory.
>>
>> What kind of disk are you using?  If it's standard disks, I don't know
>> how easily you could try putting the index data on SSD.  If doing so
>> makes it quite a bit faster, then my suspicion above is probably correct.
>>
>> A "by the way" question:  What do you see if you sort the process list
>> by Commit instead?  Doing this might not reveal anything useful.  Only
>> software using MMAP for file access (which Solr does by default) would
>> show up near the top of that list, so it's possible that a new sort
>> would not reveal anything interesting.
>>
>> Thanks,
>> Shawn
>>
>