Re: Cost of enabling doc values

2018-06-13 Thread Erick Erickson
I pretty much agree with your business side.

The rough size of a docValues field is one value of size X for each
doc. So say you have an int field: the size is near maxDoc * 4 bytes.
This is not totally accurate (there is some int packing done, for
instance), but it'll do. If you really want an accurate count, look at
the before/after size of the *.dvd and *.dvm segment files in your index.
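
For illustration (field and type names invented), turning docValues on
is just an attribute on the field definition in the schema:

  <field name="price" type="pint" indexed="true" stored="true" docValues="true"/>

and something like `du -ch data/index/*.dvd data/index/*.dvm`, run from
the core's instance directory, will total up what those files cost on disk.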

However, it's "pay me now or pay me later". The critical operations
are faceting, grouping and sorting. If you do any of those operations
on a field that is _not_ docValues=true, it will be uninverted on the
_java heap_, where it will consume GC cycles, put pressure on all your
other operations, etc. This process will be done _every_ time you open
a new searcher and use these fields.
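
For example (hypothetical field name), either of these requests is
enough to trigger that uninversion on a docValues=false field:

  q=*:*&sort=myfield asc
  q=*:*&facet=true&facet.field=myfield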

If the field _does_ have docValues=true, that will be held in the OS's
memory space, _not_ the JVM's heap due to using MMapDirectory (see:
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html).
Among other virtues, it can be swapped out (although you don't want it
to be, it's still better than OOMing). Plus loading it is just reading
it off disk rather than the expensive uninversion process.

And if you don't do any of those operations (grouping, sorting and
faceting), then the bits just sit there on disk doing nothing.

So say you carefully define what fields will be used for any of the
three operations and enable docValues. Then 3 months later the
business side comes back with "oh, we need to facet on another field".
Your choices are:
1> live with the increased heap usage and other resource contention.
Perhaps along the way panicking because your processes OOM and prod
goes down.
or
2> reindex from scratch, starting with a totally new collection.

And note the fragility here. Your application can be humming along
just fine for months. Then one fine day someone innocently submits a
query that sorts on a new field that has docValues=false and B-OOM.

If (and only if) you can _guarantee_ that fieldX will never be used
for any of the three operations, then turning off docValues for that
field will save you some disk space. But that's the only advantage.
Well, alright. If you have to do a full index replication that'll
happen a bit faster too.

So I prefer to err on the side of caution. I recommend making fields
docValues=true unless I can absolutely guarantee (and business _also_
agrees)
1> that fieldX will never be used for sorting, grouping or faceting,
or
2> if they can't promise that, that they guarantee to give me time to
completely reindex.

Best,
Erick


On Wed, Jun 13, 2018 at 4:30 PM, root23  wrote:
> Hi all,
> Does anyone know how much typically index size increments when we enable doc
> value on a field.
> Our business side want to enable sorting fields on most of our fields. I am
> trying to push back saying that it will increase the index size, since
> enabling docvalues will create the univerted index.
>
> I know the size probably depends on what values are in the fields but i need
> a general idea so that i can convince them that enabling on the fields is
> costly and it will incur this much cost.
>
> If anyone knows how to find this out looking at an existing solr index which
> has docvalues enabled , that will  also be great help.
>
> Thanks !!!
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Autoscaling and inactive shards

2018-06-13 Thread Shalin Shekhar Mangar
Yes, I believe Noble is working on this. See
https://issues.apache.org/jira/browse/SOLR-11985

On Wed, Jun 13, 2018 at 1:35 PM Jan Høydahl  wrote:

> Ok, get the meaning of preferences.
>
> Would there be a way to write a generic rule that would suggest moving
> shards to obtain balance, without specifying absolute core counts? I.e. if
> you have three nodes
> A: 3 cores
> B: 5 cores
> C: 3 cores
>
> Then that rule would suggest two moves to end up with 4 cores on all three
> (unless that would violate disk space or load limits)?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > On 12 Jun 2018, at 08:10, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
> >
> > Hi Jan,
> >
> > Comments inline:
> >
> > On Tue, Jun 12, 2018 at 2:19 AM Jan Høydahl  wrote:
> >
> >> Hi
> >>
> >> I'm trying to have Autoscaling move a shard to another node after
> manually
> >> splitting.
> >> We have two nodes, one has a shard1 and the other node is empty.
> >>
> >> After SPLITSHARD you have
> >>
> >> * shard1 (inactive)
> >> * shard1_0
> >> * shard1_1
> >>
> >> For autoscaling we have the {"minimize" : "cores"} cluster preference
> >> active. Because of that I'd expect that Autoscaling would suggest to
> move
> >> e.g. shard1_1 to the other (empty) node, but it doesn't. Then I create a
> >> rule just to test {"cores": "<2", "node": "#ANY"}, but still no
> >> suggestions. Not until I delete the inactive shard1, then it suggests to
> >> move one of the two remaining shards to the other node.
> >>
> >> So my two questions are
> >> 1. Is it by design that inactive shards "count" wrt #cores?
> >>   I understand that it consumes disk but it is not active otherwise,
> >>   so one could argue that it should not be counted in core/replica
> rules?
> >>
> >
> > Today, inactive slices also count towards the number of cores -- though
> > technically correct, it is probably an oversight.
> >
> >
> >> 2. Why is there no suggestion to move a shard due to the "minimize
> cores"
> >> reference itself?
> >>
> >
> > The /autoscaling/suggestions end point only suggests if there are policy
> > violations. Preferences such as minimize:cores are more of a sorting
> order
> > so they aren't really being violated. After you add the rule, the
> framework
> > still cannot give a suggestion that satisfies your rule. This is because
> > even if shard1_1 is moved to node2, node1 still has shard1 and shard1_0.
> So
> > the system ends up not suggesting anything. You should get a suggestion
> if
> > you add a third node to the cluster though.
> >
> > Also see SOLR-11997 <https://issues.apache.org/jira/browse/SOLR-11997> which
> > will tell users that a suggestion could not be returned because we cannot
> > satisfy the policy. There are a slew of other improvements to suggestions
> > planned that will return suggestions even when there are no policy
> > violations.
> >
> >
> >>
> >> --
> >> Jan Høydahl, search solution architect
> >> Cominvent AS - www.cominvent.com 
> >>
> >>
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
>
>

-- 
Regards,
Shalin Shekhar Mangar.


Cost of enabling doc values

2018-06-13 Thread root23
Hi all,
Does anyone know how much the index size typically increases when we enable
docValues on a field?
Our business side wants to enable sorting on most of our fields. I am
trying to push back, saying that it will increase the index size, since
enabling docValues will create the uninverted index.

I know the size probably depends on what values are in the fields, but I need
a general idea so that I can convince them that enabling it on those fields is
costly and will incur this much cost.

If anyone knows how to find this out by looking at an existing Solr index
which has docValues enabled, that would also be a great help.

Thanks !!!



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: How to avoid join queries

2018-06-13 Thread Erik Hatcher



> On Jun 13, 2018, at 4:24 PM, root23  wrote:

...

>  But i
> know use of join is discouraged in solr and i do not want to use it.

…

Why do you say that?   I, for one, find great power and joy using `{!join}`.

Erik



Re: SolrCore Initialization Failures

2018-06-13 Thread shefalid
Thanks for your response.

None of the processes are deleting any index files.
Each data directory is pointed to by only one core.

We are writing data at a high ingestion rate (100,000 records per second).
A commit happens once every 30 seconds.

Also, a periodic service runs to back up the data to our backup data store.
This service starts by saving a commit point and, when it's done, it releases
the commit point.

Could the combination of these factors somehow cause the deletion of
files?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


How to avoid join queries

2018-06-13 Thread root23
Hi all,
I have the following use case.
Let's say my document is like this:
doc = {
  name: abc,
  status: def,
  store_id: store_1,
  parent: nike
}

Now let's say that at some point in time store_1 moves under a
different parent, say adidas.
For our business use case we want to move all the existing documents to the
new parent as well, so the new parent should see all the documents, even
those inserted when store_1 was still under parent nike.

How do we do this?

These are the two approaches I have explored:

1. Write the parent field as part of the document (as shown above) and, if
the parent changes, update all the documents with the new parent.

2. Keep the parent outside of this document, in a different core that
maintains the relationship between a store and its parent, and then join the
two cores at query time to get the desired result.


Option 1 is not feasible as it might require updating millions of
records, so we are mostly leaning towards the second option, a join. But I
know use of joins is discouraged in Solr and I do not want to use one, yet I
am not able to figure out a way around this.
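
For reference, the query-time join we are considering would look roughly like
this (the name of the separate core is made up):

  q={!join fromIndex=stores from=store_id to=store_id}parent:adidas

i.e. select documents whose store_id matches a store whose current parent is
adidas. As I understand it, in SolrCloud the fromIndex collection also has to
be single-shard and present on every node that hosts the main collection.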

If someone has an idea of how to model this differently so that we can
avoid joins, please enlighten me.

Thanks !!!



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Exception when processing streaming expression

2018-06-13 Thread Joel Bernstein
Can your provide some example expressions that are causing these exceptions?

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Jun 13, 2018 at 9:02 AM, Christian Spitzlay <
christian.spitz...@biologis.com> wrote:

> Hi,
>
> I am seeing a lot of (reproducible) exceptions in my solr log file
> when I execute streaming expressions:
>
> o.a.s.s.HttpSolrCall  Unable to write response, client closed connection
> or we are shutting down
> org.eclipse.jetty.io.EofException
> at org.eclipse.jetty.io.ChannelEndPoint.flush(
> ChannelEndPoint.java:292)
> at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:429)
> at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:322)
> at org.eclipse.jetty.io.AbstractEndPoint.write(
> AbstractEndPoint.java:372)
> at org.eclipse.jetty.server.HttpConnection$SendCallback.
> process(HttpConnection.java:794)
> […]
> at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(
> EatWhatYouKill.java:131)
> at org.eclipse.jetty.util.thread.ReservedThreadExecutor$
> ReservedThread.run(ReservedThreadExecutor.java:382)
> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(
> QueuedThreadPool.java:708)
> at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(
> QueuedThreadPool.java:626)
> at java.base/java.lang.Thread.run(Thread.java:844)
> Caused by: java.io.IOException: Broken pipe
> at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
> at java.base/sun.nio.ch.SocketDispatcher.writev(
> SocketDispatcher.java:51)
> at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:148)
> at java.base/sun.nio.ch.SocketChannelImpl.write(
> SocketChannelImpl.java:506)
> at org.eclipse.jetty.io.ChannelEndPoint.flush(
> ChannelEndPoint.java:272)
> ... 69 more
>
>
> I have read up on the exception message and found
> http://lucene.472066.n3.nabble.com/Unable-to-write-response-client-closed-
> connection-or-we-are-shutting-down-tt4350349.html#a4350947
> but I don’t understand how an early client connect can cause what I am
> seeing:
>
> What puzzles me is that the response has been delivered in full to the
> client library, including the document with EOF.
>
> So Solr must have already processed the streaming expression and returned
> the result.
> It’s just that the log is filled with stacktraces of this exception that
> suggests something went wrong.
> I don’t understand why this happens when the query seems to have succeeded.
>
>
> Best regards,
> Christian
>
>
>


Re: 7.3.1 creates thousands of threads after start up

2018-06-13 Thread Shawn Heisey

On 6/13/2018 4:04 AM, Markus Jelsma wrote:

You mentioned shard handler tweaks, thanks. I see we have an incorrect setting 
there for maximumPoolSize, way too high, but that doesn't account for the 
number of threads created. After reducing the number, for dubious reasons, 
twice the number of threads are created and the node dies.


The specific config in the shard handler I was thinking of was socket 
timeouts.


The EofException in Jetty almost always indicates that the client 
disconnected before the server responded.  A low socket timeout will 
cause a client to disconnect if the server takes longer than the timeout 
to finish its work.  When the server finally does finish and try to 
respond, it throws the EofException because the connection it was 
expecting to use is no longer there.


I believe that the default socket timeout in Solr's shard handler is 60 
seconds.  Which is a relative eternity in most situations.  It would 
take a particularly nasty GC pause problem to exceed that timeout.
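
For reference, that timeout is set on the shard handler inside the /select 
handler in solrconfig.xml, along these lines (the 60000 here just mirrors the 
60-second figure above; check your own config for the real values):

  <requestHandler name="/select" class="solr.SearchHandler">
    <shardHandlerFactory class="HttpShardHandlerFactory">
      <int name="socketTimeout">60000</int>
      <int name="connTimeout">60000</int>
    </shardHandlerFactory>
  </requestHandler>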



Regarding memory leaks, of course, the first thing that came to mind is that I 
made an error which only causes trouble on 7.3, but it is unreproducible so 
far, even if I fully replicate production in a test environment. Since it only 
leaks on commits, the first suspects were the URPs, and the URPs are the only 
things I can disable in production without affecting customers. Needless to 
say, it wasn't the URPs.


Custom update processors could cause leaks, but it is not something that 
I would expect from a typical URP implementation. If you've disabled 
them and it's still happening, then that's probably not it.  It's 
plugins for queries that have the most potential for not closing 
resources correctly even when written by experienced programmers.  I'm 
not sure what the potential for leaks is on index-time plugins, but I 
suspect that it's less likely than problems with query plugins.


Thanks,
Shawn



RE: [EXT] Re: Extracting top level URL when indexing document

2018-06-13 Thread Hanjan, Harinder
Thank you Alex.  I have managed to get this to work via
URLClassifyProcessorFactory. If anyone is interested, it can easily be done
with a solrconfig.xml along these lines (double-check the exact parameter
names against the factory source, as Alex suggested):

  <updateRequestProcessorChain name="urlProcessor">
    <processor class="solr.URLClassifyProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="inputField">SolrId</str>
      <str name="domainOutputField">hostname</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

and then pointing the /update handler at the chain:

  <requestHandler name="/update" class="solr.UpdateRequestHandler">
    <lst name="defaults">
      <str name="update.chain">urlProcessor</str>
    </lst>
  </requestHandler>

I will look at how to submit a patch to the Java doc.

Thanks!
Harinder

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Wednesday, June 13, 2018 12:13 AM
To: solr-user 
Subject: [EXT] Re: Extracting top level URL when indexing document

Try URLClassifyProcessorFactory in the processing chain instead, configured in 
solrconfig.xml

There is very little documentation for it, so check the source for exact 
params. Or search for the blog post introducing it several years ago.

Documentation patches would be welcome.

Regards,
Alex

On Wed, Jun 13, 2018, 01:02 Hanjan, Harinder, 
wrote:

> Hello!
>
> I am indexing web documents and have a need to extract their top-level 
> URL to be stored in a different field. I have had some success with 
> the PatternTokenizerFactory (relevant schema bits at the bottom) but 
> the behavior appears to be inconsistent.  Most of the times, the top 
> level URL is extracted just fine but for some documents, it is being cut off.
>
> Examples:
> URL
>
> Extracted URL
>
> Comment
>
> http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf
>
> http://www.calgaryarb.ca
>
> Success
>
> http://www.calgarymlc.ca/about-cmlc/
>
> http://www.calgarymlc.ca
>
> Success
>
> http://www.calgarypolicecommission.ca/reports.php
>
> http://www.calgarypolicecommissio
>
> Fail
>
> https://attainyourhome.com/
>
> https://attai
>
> Fail
>
> https://liveandplay.calgary.ca/DROPIN/page/dropin
>
> https://livea
>
> Fail
>
>
>
>
> Relevant schema:
> 
>
>  multiValued="false"/>
>
>  sortMissingLast="true">
> 
> 
> class="solr.PatternTokenizerFactory"
>
> pattern="^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)"
> group="0"/>
> 
> 
>
>
> I have tested the Regex and it is matching things fine. Please see 
> https://regex101.com/r/wN6cZ7/358.
> So it appears that I have a gap in my understanding of how Solr 
> PatternTokenizerFactory works. I would appreciate any insight on the issue.
> hostname field will be used in facet queries.
>
> Thank you!
> Harinder
>

Re: Suggestions for debugging performance issue

2018-06-13 Thread Chris Troullis
Hi Susheel,

It's not drastically different, no. There are other collections with more
fields and more documents that don't have this issue. And the collection is
not sharded; it's just 1 shard with 2 replicas. Both replicas are similar in
response time.

Thanks,
Chris

On Wed, Jun 13, 2018 at 2:37 PM, Susheel Kumar 
wrote:

> Is this collection anyway drastically different than others in terms of
> schema/# of fields/total document etc is it sharded and if so can you look
> which shard taking more time with shard.info=true.
>
> Thnx
> Susheel
>
> On Wed, Jun 13, 2018 at 2:29 PM, Chris Troullis 
> wrote:
>
> > Thanks Erick,
> >
> > Seems to be a mixed bag in terms of tlog size across all of our indexes,
> > but currently the index with the performance issues has 4 tlog files
> > totally ~200 MB. This still seems high to me since the collections are in
> > sync, and we hard commit every minute, but it's less than the ~8GB it was
> > before we cleaned them up. Spot checking some other indexes show some
> have
> > tlogs >3GB, but none of those indexes are having performance issues (on
> the
> > same solr node), so I'm not sure it's related. We have 13 collections of
> > various sizes running on our solr cloud cluster, and none of them seem to
> > have this issue except for this one index, which is not our largest index
> > in terms of size on disk or number of documents.
> >
> > As far as the response intervals, just running a default search *:*
> sorting
> > on our id field so that we get consistent results across environments,
> and
> > returning 200 results (our max page size in app) with ~20 fields, we see
> > times of ~3.5 seconds in production, compared to ~1 second on one of our
> > lower environments with an exact copy of the index. Both have CDCR
> enabled
> > and have identical clusters.
> >
> > Unfortunately, currently the only instance we are seeing the issue on is
> > production, so we are limited in the tests that we can run. I did confirm
> > in the lower environment that the doc cache is large enough to hold all
> of
> > the results, and that both the doc and query caches should be serving the
> > results. Obviously production we have much more indexing going on, but we
> > do utilize autowarming for our caches so our response times are still
> > stable across new searchers.
> >
> > We did move the lower environment to the same ESX host as our production
> > cluster, so that it is getting resources from the same pool (CPU, RAM,
> > etc). The only thing that is different is the disks, but the lower
> > environment is running on slower disks than production. And if it was a
> > disk issue you would think it would be affecting all of the collections,
> > not just this one.
> >
> > It's a mystery!
> >
> > Chris
> >
> >
> >
> > On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson <
> erickerick...@gmail.com>
> > wrote:
> >
> > > First, nice job of eliminating all the standard stuff!
> > >
> > > About tlogs: Sanity check: They aren't growing again, right? They
> > > should hit a relatively steady state. The tlogs are used as a queueing
> > > mechanism for CDCR to durably store updates until they can
> > > successfully be transmitted to the target. So I'd expect them to hit a
> > > fairly steady number.
> > >
> > > Your lack of CPU/IO spikes is also indicative of something weird,
> > > somehow Solr just sitting around doing nothing. What intervals are we
> > > talking about here for response? 100ms? 5000ms?
> > >
> > > When you hammer the same query over and over, you should see your
> > > queryResultCache hits increase. If that's the case, Solr is doing no
> > > work at all for the search, just assembling the resopnse packet which,
> > > as you say, should be in the documentCache. This assumes it's big
> > > enough to hold all of the docs that are requested by all the
> > > simultaneous requests. The queryResultCache cache will be flushed
> > > every time a new searcher is opened. So if you still get your poor
> > > response times, and your queryResultCache hits are increasing then
> > > Solr is doing pretty much nothing.
> > >
> > > So does this behavior still occur if you aren't adding docs to the
> > > index? If you turn indexing off as a test, that'd be another data
> > > point.
> > >
> > > And, of course, if it's at all possible to just take the CDCR
> > > configuration out of your solrconfig file temporarily that'd nail
> > > whether CDCR is the culprit or whether it's coincidental. You say that
> > > CDCR is the only difference between the environments, but I've
> > > certainly seen situations where it turns out to be a bad disk
> > > controller or something that's _also_ different.
> > >
> > > Now, assuming all that's inconclusive, I'm afraid the next step would
> > > be to throw a profiler at it. Maybe pull a stack traces.
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis 
> > > wrote:
> > > > Thanks Erick. A little more info:
> > > >
> > > > -We do have 

Re: Suggestions for debugging performance issue

2018-06-13 Thread Susheel Kumar
Is this collection in any way drastically different from the others in terms
of schema, # of fields, total documents, etc.? Is it sharded, and if so, can
you look at which shard is taking more time with shards.info=true?

Thnx
Susheel

On Wed, Jun 13, 2018 at 2:29 PM, Chris Troullis 
wrote:

> Thanks Erick,
>
> Seems to be a mixed bag in terms of tlog size across all of our indexes,
> but currently the index with the performance issues has 4 tlog files
> totally ~200 MB. This still seems high to me since the collections are in
> sync, and we hard commit every minute, but it's less than the ~8GB it was
> before we cleaned them up. Spot checking some other indexes show some have
> tlogs >3GB, but none of those indexes are having performance issues (on the
> same solr node), so I'm not sure it's related. We have 13 collections of
> various sizes running on our solr cloud cluster, and none of them seem to
> have this issue except for this one index, which is not our largest index
> in terms of size on disk or number of documents.
>
> As far as the response intervals, just running a default search *:* sorting
> on our id field so that we get consistent results across environments, and
> returning 200 results (our max page size in app) with ~20 fields, we see
> times of ~3.5 seconds in production, compared to ~1 second on one of our
> lower environments with an exact copy of the index. Both have CDCR enabled
> and have identical clusters.
>
> Unfortunately, currently the only instance we are seeing the issue on is
> production, so we are limited in the tests that we can run. I did confirm
> in the lower environment that the doc cache is large enough to hold all of
> the results, and that both the doc and query caches should be serving the
> results. Obviously production we have much more indexing going on, but we
> do utilize autowarming for our caches so our response times are still
> stable across new searchers.
>
> We did move the lower environment to the same ESX host as our production
> cluster, so that it is getting resources from the same pool (CPU, RAM,
> etc). The only thing that is different is the disks, but the lower
> environment is running on slower disks than production. And if it was a
> disk issue you would think it would be affecting all of the collections,
> not just this one.
>
> It's a mystery!
>
> Chris
>
>
>
> On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson 
> wrote:
>
> > First, nice job of eliminating all the standard stuff!
> >
> > About tlogs: Sanity check: They aren't growing again, right? They
> > should hit a relatively steady state. The tlogs are used as a queueing
> > mechanism for CDCR to durably store updates until they can
> > successfully be transmitted to the target. So I'd expect them to hit a
> > fairly steady number.
> >
> > Your lack of CPU/IO spikes is also indicative of something weird,
> > somehow Solr just sitting around doing nothing. What intervals are we
> > talking about here for response? 100ms? 5000ms?
> >
> > When you hammer the same query over and over, you should see your
> > queryResultCache hits increase. If that's the case, Solr is doing no
> > work at all for the search, just assembling the resopnse packet which,
> > as you say, should be in the documentCache. This assumes it's big
> > enough to hold all of the docs that are requested by all the
> > simultaneous requests. The queryResultCache cache will be flushed
> > every time a new searcher is opened. So if you still get your poor
> > response times, and your queryResultCache hits are increasing then
> > Solr is doing pretty much nothing.
> >
> > So does this behavior still occur if you aren't adding docs to the
> > index? If you turn indexing off as a test, that'd be another data
> > point.
> >
> > And, of course, if it's at all possible to just take the CDCR
> > configuration out of your solrconfig file temporarily that'd nail
> > whether CDCR is the culprit or whether it's coincidental. You say that
> > CDCR is the only difference between the environments, but I've
> > certainly seen situations where it turns out to be a bad disk
> > controller or something that's _also_ different.
> >
> > Now, assuming all that's inconclusive, I'm afraid the next step would
> > be to throw a profiler at it. Maybe pull a stack traces.
> >
> > Best,
> > Erick
> >
> > On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis 
> > wrote:
> > > Thanks Erick. A little more info:
> > >
> > > -We do have buffering disabled everywhere, as I had read multiple posts
> > on
> > > the mailing list regarding the issue you described.
> > > -We soft commit (with opensearcher=true) pretty frequently (15 seconds)
> > as
> > > we have some NRT requirements. We hard commit every 60 seconds. We
> never
> > > commit manually, only via the autocommit timers. We have been using
> these
> > > settings for a long time and have never had any issues until recently.
> > And
> > > all of our other indexes are fine (some larger than this one).
> > > -We do have 

Re: Suggestions for debugging performance issue

2018-06-13 Thread Chris Troullis
Thanks Erick,

Seems to be a mixed bag in terms of tlog size across all of our indexes,
but currently the index with the performance issues has 4 tlog files
totaling ~200 MB. This still seems high to me since the collections are in
sync and we hard commit every minute, but it's less than the ~8GB it was
before we cleaned them up. Spot checking some other indexes shows some have
tlogs >3GB, but none of those indexes are having performance issues (on the
same solr node), so I'm not sure it's related. We have 13 collections of
various sizes running on our solr cloud cluster, and none of them seem to
have this issue except for this one index, which is not our largest index
in terms of size on disk or number of documents.

As far as the response intervals, just running a default search *:* sorting
on our id field so that we get consistent results across environments, and
returning 200 results (our max page size in app) with ~20 fields, we see
times of ~3.5 seconds in production, compared to ~1 second on one of our
lower environments with an exact copy of the index. Both have CDCR enabled
and have identical clusters.

Unfortunately, currently the only instance we are seeing the issue on is
production, so we are limited in the tests that we can run. I did confirm
in the lower environment that the doc cache is large enough to hold all of
the results, and that both the doc and query caches should be serving the
results. Obviously production we have much more indexing going on, but we
do utilize autowarming for our caches so our response times are still
stable across new searchers.

We did move the lower environment to the same ESX host as our production
cluster, so that it is getting resources from the same pool (CPU, RAM,
etc). The only thing that is different is the disks, but the lower
environment is running on slower disks than production. And if it was a
disk issue you would think it would be affecting all of the collections,
not just this one.

It's a mystery!

Chris



On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson 
wrote:

> First, nice job of eliminating all the standard stuff!
>
> About tlogs: Sanity check: They aren't growing again, right? They
> should hit a relatively steady state. The tlogs are used as a queueing
> mechanism for CDCR to durably store updates until they can
> successfully be transmitted to the target. So I'd expect them to hit a
> fairly steady number.
>
> Your lack of CPU/IO spikes is also indicative of something weird,
> somehow Solr just sitting around doing nothing. What intervals are we
> talking about here for response? 100ms? 5000ms?
>
> When you hammer the same query over and over, you should see your
> queryResultCache hits increase. If that's the case, Solr is doing no
> work at all for the search, just assembling the resopnse packet which,
> as you say, should be in the documentCache. This assumes it's big
> enough to hold all of the docs that are requested by all the
> simultaneous requests. The queryResultCache cache will be flushed
> every time a new searcher is opened. So if you still get your poor
> response times, and your queryResultCache hits are increasing then
> Solr is doing pretty much nothing.
>
> So does this behavior still occur if you aren't adding docs to the
> index? If you turn indexing off as a test, that'd be another data
> point.
>
> And, of course, if it's at all possible to just take the CDCR
> configuration out of your solrconfig file temporarily that'd nail
> whether CDCR is the culprit or whether it's coincidental. You say that
> CDCR is the only difference between the environments, but I've
> certainly seen situations where it turns out to be a bad disk
> controller or something that's _also_ different.
>
> Now, assuming all that's inconclusive, I'm afraid the next step would
> be to throw a profiler at it. Maybe pull a stack traces.
>
> Best,
> Erick
>
> On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis 
> wrote:
> > Thanks Erick. A little more info:
> >
> > -We do have buffering disabled everywhere, as I had read multiple posts
> on
> > the mailing list regarding the issue you described.
> > -We soft commit (with opensearcher=true) pretty frequently (15 seconds)
> as
> > we have some NRT requirements. We hard commit every 60 seconds. We never
> > commit manually, only via the autocommit timers. We have been using these
> > settings for a long time and have never had any issues until recently.
> And
> > all of our other indexes are fine (some larger than this one).
> > -We do have documentResultCache enabled, although it's not very big. But
> I
> > can literally spam the same query over and over again with no other
> queries
> > hitting the box, so all the results should be cached.
> > -We don't see any CPU/IO spikes when running these queries, our load is
> > pretty much flat on all accounts.
> >
> > I know it seems odd that CDCR would be the culprit, but it's really the
> > only thing we've changed, and we have other environments running the
> 

Re: Suggestions for debugging performance issue

2018-06-13 Thread Erick Erickson
First, nice job of eliminating all the standard stuff!

About tlogs: Sanity check: They aren't growing again, right? They
should hit a relatively steady state. The tlogs are used as a queueing
mechanism for CDCR to durably store updates until they can
successfully be transmitted to the target. So I'd expect them to hit a
fairly steady number.

Your lack of CPU/IO spikes is also indicative of something weird,
somehow Solr just sitting around doing nothing. What intervals are we
talking about here for response? 100ms? 5000ms?

When you hammer the same query over and over, you should see your
queryResultCache hits increase. If that's the case, Solr is doing no
work at all for the search, just assembling the response packet which,
as you say, should be in the documentCache. This assumes it's big
enough to hold all of the docs that are requested by all the
simultaneous requests. The queryResultCache will be flushed
every time a new searcher is opened. So if you still get your poor
response times, and your queryResultCache hits are increasing, then
Solr is doing pretty much nothing.
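
(Both of those caches are defined in solrconfig.xml; the stock definitions
look something like this, with whatever sizes you've configured:

  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="32"/>
  <documentCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>

Note that documentCache entries can't be autowarmed, since they're keyed by
internal Lucene doc IDs that change with each new searcher.)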

So does this behavior still occur if you aren't adding docs to the
index? If you turn indexing off as a test, that'd be another data
point.

And, of course, if it's at all possible to just take the CDCR
configuration out of your solrconfig file temporarily that'd nail
whether CDCR is the culprit or whether it's coincidental. You say that
CDCR is the only difference between the environments, but I've
certainly seen situations where it turns out to be a bad disk
controller or something that's _also_ different.

Now, assuming all that's inconclusive, I'm afraid the next step would
be to throw a profiler at it. Maybe pull some stack traces.

Best,
Erick

On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis  wrote:
> Thanks Erick. A little more info:
>
> -We do have buffering disabled everywhere, as I had read multiple posts on
> the mailing list regarding the issue you described.
> -We soft commit (with opensearcher=true) pretty frequently (15 seconds) as
> we have some NRT requirements. We hard commit every 60 seconds. We never
> commit manually, only via the autocommit timers. We have been using these
> settings for a long time and have never had any issues until recently. And
> all of our other indexes are fine (some larger than this one).
> -We do have documentResultCache enabled, although it's not very big. But I
> can literally spam the same query over and over again with no other queries
> hitting the box, so all the results should be cached.
> -We don't see any CPU/IO spikes when running these queries, our load is
> pretty much flat on all accounts.
>
> I know it seems odd that CDCR would be the culprit, but it's really the
> only thing we've changed, and we have other environments running the exact
> same setup with no issues, so it is really making us tear our hair out. And
> when we cleaned up the huge tlogs it didn't seem to make any difference in
> the query time (I was originally thinking it was somehow searching through
> the tlogs for documents, and that's why it was taking so long to retrieve
> the results, but I don't know if that is actually how it works).
>
> Are you aware of any logger settings we could increase to potentially get a
> better idea of where the time is being spent? I took the eventual query
> response and just hosted as a static file on the same machine via nginx and
> it downloaded lightning fast (I was trying to rule out network as the
> culprit), so it seems like the time is being spent somewhere in solr.
>
> Thanks,
> Chris
>
> On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson 
> wrote:
>
>> Having the tlogs be huge is a red flag. Do you have buffering enabled
>> in CDCR? This was something of a legacy option that's going to be
>> removed, it's been made obsolete by the ability of CDCR to bootstrap
>> the entire index. Buffering should be disabled always.
>>
>> Another reason tlogs can grow is if you have very long times between
>> hard commits. I doubt that's your issue, but just in case.
>>
>> And the final reason tlogs can grow is that the connection between
>> source and target clusters is broken, but that doesn't sound like what
>> you're seeing either since you say the target cluster is keeping up.
>>
>> The process of assembling the response can be long. If you have any
>> stored fields (and not docValues-enabled), Solr will
>> 1> seek the stored data on disk
>> 2> decompress (min 16K blocks)
>> 3> transmit the thing back to your client
>>
>> The decompressed version of the doc will be held in the
>> documentResultCache configured in solrconfig.xml, so it may or may not
>> be cached in memory. That said, this stuff is all MemMapped and the
>> decompression isn't usually an issue, I'd expect you to see very large
>> CPU spikes and/or I/O contention if that was the case.
>>
>> CDCR shouldn't really be that much of a hit, mostly I/O. Solr will
>> have to look in the tlogs to get you the very most 

Logging Every document to particular core

2018-06-13 Thread govind nitk
Hi,

Is there any way to log all the data getting indexed to a particular core
only?


Regards,
govind


Re: Solr 7 + HDFS issue

2018-06-13 Thread Shawn Heisey

On 6/12/2018 10:14 PM, Joe Obernberger wrote:
Thank you Shawn.  It looks like it is being applied.  This could be 
some sort of chain reaction where:


Drive or server fails.  HDFS starts to replicate blocks, which causes 
network congestion.  Solr7 can't talk, so it initiates a replication 
process, which causes more network congestion, which causes more 
replicas to replicate, and which eventually causes HBase (we run 
HBase+Solr on the same machines) to also not be able to talk.  That is 
my running hypothesis anyway!


I was also thinking that there was a possibility that a lot of 
replications were happening at once.  At 75 megabytes per second each, 
it would only take a few of them to saturate a link at 2 gigabits, even 
if the load sharing between gigabit links is perfect. (and depending on 
the type of bonding in use, it might not be perfect)


75 MB per second is about 600 megabits per second, so 
if three of those are happening at the same time and the disks can 
actually keep up, it would be nearly enough to fill a 2Gb/s link.


We've made a change to limit how much bandwidth HDFS can use. One 
issue that we have seen is that the replicas fail to replicate, and 
retry, over and over.  I believe they are getting a timeout error; is 
that parameter adjustable? 


To have any idea whether it's adjustable, I would need to know exactly 
what timeout is being exceeded.  Can you share the full error for 
anything you're seeing?


Thanks,
Shawn



Re: Suggestions for debugging performance issue

2018-06-13 Thread Chris Troullis
Thanks Erick. A little more info:

-We do have buffering disabled everywhere, as I had read multiple posts on
the mailing list regarding the issue you described.
-We soft commit (with opensearcher=true) pretty frequently (15 seconds) as
we have some NRT requirements. We hard commit every 60 seconds. We never
commit manually, only via the autocommit timers. We have been using these
settings for a long time and have never had any issues until recently. And
all of our other indexes are fine (some larger than this one).
-We do have documentResultCache enabled, although it's not very big. But I
can literally spam the same query over and over again with no other queries
hitting the box, so all the results should be cached.
-We don't see any CPU/IO spikes when running these queries, our load is
pretty much flat on all accounts.

I know it seems odd that CDCR would be the culprit, but it's really the
only thing we've changed, and we have other environments running the exact
same setup with no issues, so it is really making us tear our hair out. And
when we cleaned up the huge tlogs it didn't seem to make any difference in
the query time (I was originally thinking it was somehow searching through
the tlogs for documents, and that's why it was taking so long to retrieve
the results, but I don't know if that is actually how it works).

Are you aware of any logger settings we could increase to potentially get a
better idea of where the time is being spent? I took the eventual query
response and just hosted it as a static file on the same machine via nginx and
it downloaded lightning fast (I was trying to rule out network as the
culprit), so it seems like the time is being spent somewhere in solr.

Thanks,
Chris

On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson 
wrote:

> Having the tlogs be huge is a red flag. Do you have buffering enabled
> in CDCR? This was something of a legacy option that's going to be
> removed, it's been made obsolete by the ability of CDCR to bootstrap
> the entire index. Buffering should be disabled always.
>
> Another reason tlogs can grow is if you have very long times between
> hard commits. I doubt that's your issue, but just in case.
>
> And the final reason tlogs can grow is that the connection between
> source and target clusters is broken, but that doesn't sound like what
> you're seeing either since you say the target cluster is keeping up.
>
> The process of assembling the response can be long. If you have any
> stored fields (and not docValues-enabled), Solr will
> 1> seek the stored data on disk
> 2> decompress (min 16K blocks)
> 3> transmit the thing back to your client
>
> The decompressed version of the doc will be held in the
> documentResultCache configured in solrconfig.xml, so it may or may not
> be cached in memory. That said, this stuff is all MemMapped and the
> decompression isn't usually an issue, I'd expect you to see very large
> CPU spikes and/or I/O contention if that was the case.
>
> CDCR shouldn't really be that much of a hit, mostly I/O. Solr will
> have to look in the tlogs to get you the very most recent copy, so the
> first place I'd look is keeping the tlogs under control first.
>
> The other possibility (again unrelated to CDCR) is if your spikes are
> coincident with soft commits or hard-commits-with-opensearcher-true.
>
> In all, though, none of the usual suspects seems to make sense here
> since you say that absent configuring CDCR things seem to run fine. So
> I'd look at the tlogs and my commit intervals. Once the tlogs are
> under control then move on to other possibilities if the problem
> persists...
>
> Best,
> Erick
>
>
> On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis 
> wrote:
> > Hi all,
> >
> > Recently we have gone live using CDCR on our 2 node solr cloud cluster
> > (7.2.1). From a CDCR perspective, everything seems to be working
> > fine...collections are staying in sync across the cluster, everything
> looks
> > good.
> >
> > The issue we are seeing is with 1 collection in particular, after we set
> up
> > CDCR, we are getting extremely slow response times when retrieving
> > documents. Debugging the query shows QTime is almost nothing, but the
> > overall responseTime is like 5x what it should be. The problem is
> > exacerbated by larger result sizes. IE retrieving 25 results is almost
> > normal, but 200 results is way slower than normal. I can run the exact
> same
> > query multiple times in a row (so everything should be cached), and I
> still
> > see response times way higher than another environment that is not using
> > CDCR. It doesn't seem to matter if CDCR is enabled or disabled, just that
> > we are using the CDCRUpdateLog. The problem started happening even before
> > we enabled CDCR.
> >
> > In a lower environment we noticed that the transaction logs were huge
> > (multiple gigs), so we tried stopping solr and deleting the tlogs then
> > restarting, and that seemed to fix the performance issue. We tried the
> same
> > thing in 

Exception when processing streaming expression

2018-06-13 Thread Christian Spitzlay
Hi,

I am seeing a lot of (reproducible) exceptions in my solr log file
when I execute streaming expressions:

o.a.s.s.HttpSolrCall  Unable to write response, client closed connection or we are shutting down
org.eclipse.jetty.io.EofException
    at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:292)
    at org.eclipse.jetty.io.WriteFlusher.flush(WriteFlusher.java:429)
    at org.eclipse.jetty.io.WriteFlusher.write(WriteFlusher.java:322)
    at org.eclipse.jetty.io.AbstractEndPoint.write(AbstractEndPoint.java:372)
    at org.eclipse.jetty.server.HttpConnection$SendCallback.process(HttpConnection.java:794)
    […]
    at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
    at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:382)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:708)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
    at java.base/java.lang.Thread.run(Thread.java:844)
Caused by: java.io.IOException: Broken pipe
    at java.base/sun.nio.ch.FileDispatcherImpl.writev0(Native Method)
    at java.base/sun.nio.ch.SocketDispatcher.writev(SocketDispatcher.java:51)
    at java.base/sun.nio.ch.IOUtil.write(IOUtil.java:148)
    at java.base/sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:506)
    at org.eclipse.jetty.io.ChannelEndPoint.flush(ChannelEndPoint.java:272)
    ... 69 more


I have read up on the exception message and found 
http://lucene.472066.n3.nabble.com/Unable-to-write-response-client-closed-connection-or-we-are-shutting-down-tt4350349.html#a4350947
but I don’t understand how an early client connect can cause what I am seeing:

What puzzles me is that the response has been delivered in full to the 
client library, including the document with EOF.

So Solr must have already processed the streaming expression and returned the 
result.
It’s just that the log is filled with stacktraces of this exception that 
suggests something went wrong.
I don’t understand why this happens when the query seems to have succeeded.


Best regards,
Christian




RE: 7.3.1 creates thousands of threads after start up

2018-06-13 Thread Markus Jelsma
Hello Shawn,

You mentioned shard handler tweaks, thanks. I see we have an incorrect setting 
there for maximumPoolSize, way too high, but that doesn't account for the 
number of threads created. After reducing the number, for dubious reasons, 
twice the number of threads are created and the node dies.

For a short time, there were two identical collections (just for different 
tests) on the nodes. I have removed one of them, but the number of threads 
created doesn't change one bit. So it appears the shard handler config has 
nothing to do with it, or does it?

Regarding memory leaks, of course, the first thing that came to mind is that I 
made an error which only causes trouble on 7.3, but it is unreproducible so 
far, even if I fully replicate production in a test environment. Since it only 
leaks on commits, the first suspects were the URPs, and the URPs are the only 
things I can disable in production without affecting customers. Needless to 
say, it wasn't the URPs.

But thanks anyway, whenever I have the courage to test it again, I'll enable 
INFO logging, which is currently disabled. Maybe it will reveal something.

If anyone has even the weirdest, most unconventional suggestion on how to 
reproduce my production memory leak in a controlled test environment, let me 
know.

Thanks,
Markus
 
-Original message-
> From:Shawn Heisey 
> Sent: Sunday 10th June 2018 22:42
> To: solr-user@lucene.apache.org
> Subject: Re: 7.3.1 creates thousands of threads after start up
> 
> On 6/8/2018 8:59 AM, Markus Jelsma wrote:
> > 2018-06-08 14:02:47.382 ERROR (qtp1458849419-1263) [   ] 
> > o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error 
> > trying to proxy request for url: http://idx2:8983/solr/
> > search/admin/ping
> 
> > Caused by: org.eclipse.jetty.io.EofException
> 
> If you haven't tweaked the shard handler config to drastically reduce
> the socket timeout, that is weird.  The only thing that comes to mind is
> extreme GC pauses that cause the socket timeout to be exceeded.
> 
> > We operate three distinct type of Solr collections, they only share the 
> > same Zookeeper quorum. The other two collections do not seem to have this 
> > problem, but i don't restart those as often as i restart this collection, 
> > as i am STILL trying to REPRODUCE the dreaded memory leak i reported having 
> > on 7.3 about two weeks ago. Sorry, but i drives me nuts!
> 
> I've reviewed the list messages about the leak.  As you might imagine,
> my immediate thought is that the custom plugins you're running are
> probably the cause, because we are not getting OOME reports like I would
> expect if there were a leak in Solr itself.  It would not be unheard of
> for a custom plugin to experience no leaks with one Solr version but
> leak when Solr is upgraded, requiring a change in the plugin to properly
> close resources.  I do not know if that's what's happening.
> 
> A leak could lead to GC pause problems, but it does seem really odd for
> that to happen on a Solr node that's just been started.  You could try
> bumping the heap size by 25 to 50 percent and see if the behavior
> changes at all.  Honestly I don't expect it to change, and if it
> doesn't, then I do not know what the next troubleshooting step should
> be.  I could review your solr.log, though I can't be sure I would see
> something you didn't.
> 
> Thanks,
> Shawn
> 
> 


Re: Autoscaling and inactive shards

2018-06-13 Thread Jan Høydahl
Ok, I get the meaning of preferences.

Would there be a way to write a generic rule that would suggest moving shards 
to obtain balance, without specifying absolute core counts? I.e. if you have 
three nodes
A: 3 cores
B: 5 cores
C: 3 cores

Then that rule would suggest two moves to end up with 4 cores on all three 
(unless that would violate disk space or load limits)?
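
For reference, the preference and the test rule I refer to below look roughly 
like this in the cluster's autoscaling configuration (key names from memory, 
so double-check against the autoscaling API docs):

  {
    "cluster-preferences": [ {"minimize": "cores"} ],
    "cluster-policy": [ {"cores": "<2", "node": "#ANY"} ]
  }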

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 12 Jun 2018, at 08:10, Shalin Shekhar Mangar wrote:
> 
> Hi Jan,
> 
> Comments inline:
> 
> On Tue, Jun 12, 2018 at 2:19 AM Jan Høydahl  wrote:
> 
>> Hi
>> 
>> I'm trying to have Autoscaling move a shard to another node after manually
>> splitting.
>> We have two nodes, one has a shard1 and the other node is empty.
>> 
>> After SPLITSHARD you have
>> 
>> * shard1 (inactive)
>> * shard1_0
>> * shard1_1
>> 
>> For autoscaling we have the {"minimize" : "cores"} cluster preference
>> active. Because of that I'd expect that Autoscaling would suggest to move
>> e.g. shard1_1 to the other (empty) node, but it doesn't. Then I create a
>> rule just to test {"cores": "<2", "node": "#ANY"}, but still no
>> suggestions. Not until I delete the inactive shard1, then it suggests to
>> move one of the two remaining shards to the other node.
>> 
>> So my two questions are
>> 1. Is it by design that inactive shards "count" wrt #cores?
>>   I understand that it consumes disk but it is not active otherwise,
>>   so one could argue that it should not be counted in core/replica rules?
>> 
> 
> Today, inactive slices also count towards the number of cores -- though
> technically correct, it is probably an oversight.
> 
> 
>> 2. Why is there no suggestion to move a shard due to the "minimize cores"
>> reference itself?
>> 
> 
> The /autoscaling/suggestions end point only suggests if there are policy
> violations. Preferences such as minimize:cores are more of a sorting order
> so they aren't really being violated. After you add the rule, the framework
> still cannot give a suggestion that satisfies your rule. This is because
> even if shard1_1 is moved to node2, node1 still has shard1 and shard1_0. So
> the system ends up not suggesting anything. You should get a suggestion if
> you add a third node to the cluster though.
> 
> Also see SOLR-11997 <https://issues.apache.org/jira/browse/SOLR-11997> which
> will tell users that a suggestion could not be returned because we cannot
> satisfy the policy. There are a slew of other improvements to suggestions
> planned that will return suggestions even when there are no policy
> violations.
> 
> 
>> 
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com 
>> 
>> 
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.



Re: Extracting top level URL when indexing document

2018-06-13 Thread Alexandre Rafalovitch
Try URLClassifyProcessorFactory in the processing chain instead, configured
in solrconfig.xml

There is very little documentation for it, so check the source for exact
params. Or search for the blog post introducing it several years ago.

Documentation patches would be welcome.

Regards,
Alex

On Wed, Jun 13, 2018, 01:02 Hanjan, Harinder, 
wrote:

> Hello!
>
> I am indexing web documents and have a need to extract their top-level URL
> to be stored in a different field. I have had some success with the
> PatternTokenizerFactory (relevant schema bits at the bottom) but the
> behavior appears to be inconsistent.  Most of the times, the top level URL
> is extracted just fine but for some documents, it is being cut off.
>
> Examples:
> URL
>
> Extracted URL
>
> Comment
>
> http://www.calgaryarb.ca/eCourtPublic/15M2018.pdf
>
> http://www.calgaryarb.ca
>
> Success
>
> http://www.calgarymlc.ca/about-cmlc/
>
> http://www.calgarymlc.ca
>
> Success
>
> http://www.calgarypolicecommission.ca/reports.php
>
> http://www.calgarypolicecommissio
>
> Fail
>
> https://attainyourhome.com/
>
> https://attai
>
> Fail
>
> https://liveandplay.calgary.ca/DROPIN/page/dropin
>
> https://livea
>
> Fail
>
>
>
>
> Relevant schema:
> 
>
>  multiValued="false"/>
>
>  sortMissingLast="true">
> 
> 
> class="solr.PatternTokenizerFactory"
>
> pattern="^https?://(?:[^@/n]+@)?(?:www.)?([^:/n]+)"
> group="0"/>
> 
> 
>
>
> I have tested the Regex and it is matching things fine. Please see
> https://regex101.com/r/wN6cZ7/358.
> So it appears that I have a gap in my understanding of how Solr
> PatternTokenizerFactory works. I would appreciate any insight on the issue.
> hostname field will be used in facet queries.
>
> Thank you!
> Harinder
>