Simulate "this IndexReader is closed" ?

2020-06-22 Thread Richard Goodman
Hi there,

I've spent some time integrating the Solr prometheus exporter into our Solr
environment. During this, I came across an issue where I was getting
exceptions when fetching the core-level metrics.

Digging into this further, I realised it's actually on the Solr side of
this, in particular, the metrics that come from the following group within
each core;

SEARCHER.*


An example of the output I was getting;

{
  "responseHeader":{
"status":500,
"QTime":44},
  "error":{
"msg":"this IndexReader is closed",
"trace":"org.apache.lucene.store.AlreadyClosedException: this
IndexReader is closed\n\tat
org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:257)\n\tat
org.apache.lucene.index.StandardDirectoryReader.getVersion(StandardDirectoryReader.java:339)\n\tat
org.apache.lucene.index.FilterDirectoryReader.getVersion(FilterDirectoryReader.java:127)\n\tat
org.apache.lucene.index.FilterDirectoryReader.getVersion(FilterDirectoryReader.java:127)\n\tat
org.apache.solr.search.SolrIndexSearcher.lambda$initializeMetrics$13(SolrIndexSearcher.java:2268)


I changed my metric calls to include "regex=^(?!SEARCHER).*" and the
results were coming through *(minus the SEARCHER metrics)*.
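
For anyone else hitting this, the workaround scrape looks something like
the following *(host and port are placeholders)*:

curl -G 'http://localhost:8983/solr/admin/metrics' \
  --data-urlencode 'regex=^(?!SEARCHER).*'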

This was enough to unblock me from getting the rest of the metrics;
however, I want to revisit this and see what I can do. From my point of
view, this is a bug within Solr, because it breaks the entire metrics API
*(for example, if you just hit /solr/admin/metrics and the IndexReader is
closed, it'll return this message and 0 metrics will be
collected/displayed)*.

My problem is, I'm not entirely sure how to replicate this error, and was
hoping I could find some guidance. I saw that the file
"org/apache/solr/search/SolrIndexSearcher.java" registers these metrics,
but got a bit lost at first glance.

If anyone has any information that could help me:
1. Replicate the issue
2. Understand what exactly it means when an IndexReader is closed
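
On point 1, the closest I've got so far is a minimal Lucene-only sketch,
on the assumption that this mirrors what happens inside Solr when the
searcher's underlying reader is closed beneath the metrics gauges:

import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class ClosedReaderRepro {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory(); // in-memory index (Lucene 7.x)
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      writer.addDocument(new Document()); // one doc so a commit point exists
    }
    DirectoryReader reader = DirectoryReader.open(dir);
    reader.close(); // simulate the searcher's reader being closed
    // Any ensureOpen() call now throws AlreadyClosedException:
    // "this IndexReader is closed"
    reader.getVersion();
  }
}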

I would be really grateful,

Kind Regards,
Richard Goodman


Re: solr core metrics & prometheus exporter - indexreader is closed

2020-05-11 Thread Richard Goodman
Hey Dwane,

Thanks for your email, gah, I should have mentioned that I had already
applied the patches from the 8.x branches onto the exporter *(such as the
thread pooling fixes that you mentioned)*. I still haven't gotten to the
bottom of the IndexReader is closed issue; I found that if it was present
on an instance, even calling just http://ip.address:port/solr/admin/metrics
would return that error and 0 metrics. If I added the following parameter
to the call: regex=^(?!SEARCHER).*
It was all fine. I'm trying to wrap my head around the relationship between
a Solr core and an index searcher/reader in the code, but it's quite
complicated; similarly, I'm trying to understand how I could replicate this
for testing purposes. So if you have any guidance/advice in that area, it
would be greatly appreciated.

Cheers,

On Wed, 6 May 2020 at 21:36, Dwane Hall  wrote:

> Hey Richard,
>
> I noticed this issue with the exporter in the 7.x branch. If you look
> through the release notes for Solr since then there have been quite a few
> improvements to the exporter particularly around thread safety and
> concurrency (and the number of nodes it can monitor).  The version of the
> exporter can run independently of your Solr version, so my advice would be
> to download the most recent Solr version, check and modify the exporter
> start script for its library dependencies, extract these files to a
> separate location, and run this version against your 7.x instance. If you
> have the capacity to upgrade your Solr version this will save you having to
> maintain the exporter separately. Since making this change the exporter has
> not missed a beat and we monitor around 100 Solr nodes.
>
> Good luck,
>
> Dwane
> --
> *From:* Richard Goodman 
> *Sent:* Tuesday, 5 May 2020 10:22 PM
> *To:* solr-user@lucene.apache.org 
> *Subject:* solr core metrics & prometheus exporter - indexreader is closed
>
> Hi there,
>
> I've been playing with the prometheus exporter for Solr, and have created
> and deployed my config. So far, all groups were running fine (node,
> jetty, jvm); however, I'm repeatedly getting an issue with the core group:
>
> WARN  - 2020-05-05 12:01:24.812; org.apache.solr.prometheus.scraper.Async;
> Error occurred during metrics collection
> java.util.concurrent.ExecutionException:
> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error
> from server at http://127.0.0.1:8083/solr: Server Error
>
> request:
> http://127.0.0.1:8083/solr/admin/metrics?group=core&wt=json&version=2.2
> at
>
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> ~[?:1.8.0_141]
> at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> ~[?:1.8.0_141]
> at
> org.apache.solr.prometheus.scraper.Async.lambda$null$1(Async.java:45)
> ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
> at
> java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
> ~[?:1.8.0_141]
> at
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
> ~[?:1.8.0_141]
> at
>
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
> ~[?:1.8.0_141]
> at
> java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
> ~[?:1.8.0_141]
> at
>
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
> ~[?:1.8.0_141]
> at
>
> java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
> ~[?:1.8.0_141]
> at
>
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
> ~[?:1.8.0_141]
> at
> java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> ~[?:1.8.0_141]
> at
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
> ~[?:1.8.0_141]
> at
>
> org.apache.solr.prometheus.scraper.Async.lambda$waitForAllSuccessfulResponses$3(Async.java:43)
> ~[solr-prometheus-exporter-7.7.2-SNAPSHOT.jar:7.7.2-SNAPSHOT
> e5d04ab6a061a02e47f9e6df62a3cfa69632987b - jenkins - 2019-11-22 16:23:03]
> at
>
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
> ~[?:1.8.0_141]
> at
>
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
> ~[?:1.8.0_141]
> at
>
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> ~[?:1.8.0_141]
> at
>
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
> ~[?:1.8.0_141]
>

solr core metrics & prometheus exporter - indexreader is closed

2020-05-05 Thread Richard Goodman
a:121)\n\tat
org.apache.solr.handler.admin.MetricsHandler.handleRequestBody(MetricsHandler.java:101)\n\tat
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)\n\tat
org.apache.solr.servlet.HttpSolrCall.handleAdmin(HttpSolrCall.java:736)\n\tat
org.apache.solr.servlet.HttpSolrCall.handleAdminRequest(HttpSolrCall.java:717)\n\tat
org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:496)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:395)\n\tat
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:341)\n\tat
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)\n\tat
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1588)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)\n\tat
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)\n\tat
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1557)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)\n\tat
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)\n\tat
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)\n\tat
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)\n\tat
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)\n\tat
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)\n\tat
org.eclipse.jetty.server.Server.handle(Server.java:502)\n\tat
org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364)\n\tat
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)\n\tat
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)\n\tat
org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)\n\tat
org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)\n\tat
org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)\n\tat
org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765)\n\tat
org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683)\n\tat
java.lang.Thread.run(Thread.java:748)\n",
"code":500}}


Because of these errors, when I then go to the endpoint the prometheus
exporter exposes for the core metrics, I get 0 metrics back. I have yet to
investigate the prometheus exporter further to determine whether, in the
scenario where some metrics cannot be gathered and an error is thrown, no
metrics get recorded at all.

Whilst I was working on SOLR-14325
<https://issues.apache.org/jira/browse/SOLR-14325>, Andrzej noted that the
core level metrics are only reported if there is an open SolrIndexSearcher.
I started looking at the code for this, but wanted to know if anyone else
has encountered this issue before; it seems to be very frequent on the
cluster I am testing with (96 instances, each instance having around 450GB
of indexes on disk w/ 3-way replication).

I guess this also raises the question of whether there should be a better
response than a 500 status error when no metrics are available?
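
Not the actual Solr code, just a sketch of the kind of guard I have in
mind, using the Dropwizard metrics library that Solr's metrics are built
on *(the metric name and wiring here are made up)*:

import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.AlreadyClosedException;

public class SafeSearcherGauge {
  // Register a gauge that degrades to -1 instead of throwing once the
  // reader is closed, so one stale searcher can't turn the whole
  // /admin/metrics response into a 500.
  static void register(MetricRegistry registry, DirectoryReader reader) {
    registry.register("SEARCHER.searcher.indexVersion", (Gauge<Long>) () -> {
      try {
        return reader.getVersion();
      } catch (AlreadyClosedException e) {
        return -1L; // reader was closed between searcher swap-outs
      }
    });
  }
}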

Kind regards,

-- 

Richard Goodman


autoscaling tlog+pull and replicationFactor

2020-04-02 Thread Richard Goodman
Hi there,

I'm currently using solr v7.7.2 *(I applied the patch SOLR-13674
<https://issues.apache.org/jira/browse/SOLR-13674> to my build to preserve
replica types for triggers)* and trying to set up a cluster with the
combination of TLOG+PULL replicas and utilising the solr autoscaling
feature to maintain the stability of the cluster.

I'm getting confused though with the documentation.

When creating a collection, if you specify "replicationFactor=3" then it'll
create 3 NRT replicas, and from my understanding, this information is
preserved within autoscaling, i.e. if a node went down, it would attempt to
add another replica to preserve the replicationFactor.

However, because I'm using TLOG + PULL, I don't add the
"replicationFactor=3" into my collection creation, otherwise it ends up
creating a collection with 3 NRT replicas, 1 TLOG and 2 PULL replicas,
which of course I do not want.

Instead, I add "tlogReplicas=1&pullReplicas=2", which satisfies that.
However, when it comes to creating policies, I'm not getting violations
when expected.
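
For reference, the full create call I'm using looks something like this
*(collection name and shard count are placeholders)*:

/admin/collections?action=CREATE&name=collection_four&numShards=6&nrtReplicas=0&tlogReplicas=1&pullReplicas=2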

I created the following policies:
{
  "replica": "<2",
  "shard": "#EACH",
  "type": "TLOG",
  "node": "#ANY"
},
{
  "replica": "<3",
  "shard": "#EACH",
  "type": "PULL",
  "node": "#ANY"
}

And there appears to be 0 violations; however, when I add another replica,
there are still 0 violations, despite there now being 4 replicas for a
given shard of a collection:

"violations":[],
"config":{
  "cluster-preferences":[{
  "minimize":"cores",
  "precision":1}
,{
  "maximize":"freedisk"}],
  "cluster-policy":[{
  "replica":"<2",
  "shard":"#EACH",
  "type":"TLOG",
  "node":"#ANY"}
,{
  "replica":"<3",
  "shard":"#EACH",
  "type":"PULL",
  "node":"#ANY"}]}},

And here is the information about a collection-shard in which I added an
extra replica to which I expected there to be a violation:
{
  "replica_p35": {
"core": "collection_four_shard1_replica_p35",
"shard": "shard1",
"collection": "collection_four",
"state": "active",
"shard_name": "collection_four_shard1",
"num_shards": 6,
"type": "PULL"
  },
  "replica_p31": {
"core": "collection_four_shard1_replica_p31",
"shard": "shard1",
"collection": "collection_four",
"state": "active",
"shard_name": "collection_four_shard1",
"num_shards": 6,
"type": "PULL"
  },
  "replica_p87": {
"core": "collection_four_shard1_replica_p87",
"shard": "shard1",
"collection": "collection_four",
"state": "active",
"shard_name": "collection_four_shard1",
"num_shards": 6,
"type": "PULL"
  },
  "replica_t75": {
"core": "collection_four_shard1_replica_t75",
"shard": "shard1",
"collection": "collection_four",
"state": "active",
"shard_name": "collection_four_shard1",
"type": "TLOG",
"leader": "true"
  }
}

Am I missing something for preserving replicationFactor for a collection
while keeping collections at 1 TLOG replica and 2 PULL replicas?

I tried adding the following
{"replicas": "<4", "shard":"#EACH", "node": "#ANY"}

However, still no luck

Equally, how would I then go about setting up triggers to only create a
PULL if a PULL goes down, and equally if the TLOG goes down? Would having a
trigger for each type be needed?
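
For context, the kind of trigger I imagine needing is something like the
below, posted to /solr/admin/autoscaling *(whether ComputePlanAction then
respects the lost replica's type is exactly what SOLR-13674 addresses)*:

{
  "set-trigger": {
    "name": "node_lost_trigger",
    "event": "nodeLost",
    "waitFor": "120s",
    "enabled": true,
    "actions": [
      {"name": "compute_plan", "class": "solr.ComputePlanAction"},
      {"name": "execute_plan", "class": "solr.ExecutePlanAction"}
    ]
  }
}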

Any guidance on this would be greatly appreciated
Thanks,
-- 

Richard Goodman


Re: Metrics API - Documentation

2019-10-15 Thread Richard Goodman
Many thanks both for your responses, they've been helpful.

@Andrzej - Sorry I wasn't clear on "A latency of 1mil"; I wasn't aware the
image wouldn't come through. But following your bullet points helped me
present a better unit of measurement on the axis.

In regards to contributing, I would absolutely love to help there, I'm just
not sure what the correct direction is. I wasn't sure if the web page
source code / contributions are in the apache-lucene repository?

Thanks,


On Tue, 8 Oct 2019 at 11:04, Andrzej Białecki  wrote:

> Hi,
>
> Starting with Solr 7.0 all JMX metrics are actually internally driven by
> the metrics API - JMX (or Prometheus) is just a way of exposing them.
>
> I agree that we need more documentation on metrics - contributions are
> welcome :)
>
> Regarding your specific examples (btw. our mailing lists aggressively
> strip all attachments - your graphs didn’t make it):
>
> * time units in time-based counters are in nanoseconds. This is just a
> unit of value, not necessarily precision. In this specific example
> `ADMIN./admin/collections.totalTime` (and similarly named metrics for all
> other request handlers) represents the total elapsed time spent processing
> requests.
> * time-based histograms are expressed in milliseconds, where it is
> indicated by the “_ms” suffix.
> * 1-, 5- and 15-min rates represent an exponentially weighted moving
> average over that time window, expressed in events/second.
> * handlerStart is initialised with System.currentTimeMillis() when this
> instance of request handler is first created.
> * details on GC, memory buffer pools, and similar JVM metrics are
> documented in JDK documentation on Management Beans. For example:
>
> https://docs.oracle.com/javase/7/docs/api/java/lang/management/GarbageCollectorMXBean.html?is-external=true
> <
> https://docs.oracle.com/javase/7/docs/api/java/lang/management/GarbageCollectorMXBean.html?is-external=true
> >
> * "A latency of 1mil” - no idea what that is, I don’t think Solr API uses
> this abbreviation anywhere.
>
> Hope this helps.
>
> —
>
> Andrzej Białecki
>
> > On 7 Oct 2019, at 13:41, Emir Arnautović 
> wrote:
> >
> > Hi Richard,
> > We do not use API to collect metrics but JMX, but I believe that those
> are the same (did not verify it in code). You can see how we handled those
> metrics into reports/charts or even use our agent to send data to
> Prometheus:
> https://github.com/sematext/sematext-agent-integrations/tree/master/solr <
> https://github.com/sematext/sematext-agent-integrations/tree/master/solr>
> >
> > You can also see some links to Solr metric related blog posts in this
> repo. If you find out that managing your own monitoring stack is
> overwhelming, you can try our Solr integration.
> >
> > HTH,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> >> On 7 Oct 2019, at 12:40, Richard Goodman 
> wrote:
> >>
> >> Hi there,
> >>
> >> I'm currently working on using the prometheus exporter to provide some
> detailed insights for our Solr Cloud clusters.
> >>
> >> Using the provided template killed our prometheus server, as well as
> the exporter due to the size of our clusters (each cluster is around 96
> nodes, ~300 collections with 3-way replication and 16 shards), so you can
> imagine the amount of data that comes through /admin/metrics when not
> filtering it down first.
> >>
> >> I've begun working on writing my own template to reduce the amount of
> data being requested and it's working fine, and I'm starting to build some
> nice graphs in Grafana.
> >>
> >> The only difficulty I'm having with this, is I'm struggling to find
> decent documentation on the metrics themselves. I was using the resources
> metrics reporting - metrics-api <
> https://lucene.apache.org/solr/guide/7_7/metrics-reporting.html#metrics-api>
> and monitoring solr with prometheus and grafana <
> https://lucene.apache.org/solr/guide/7_7/monitoring-solr-with-prometheus-and-grafana.html>
> but there is a lack of information on most metrics.
> >>
> >> For example:
> >> "ADMIN./admin/collections.totalTime":6715327903,
> >> I understand this is a counter, however, I'm not sure what unit this
> would be represented when displaying it, for example:
> >>
> >>
> >>
> >> A latency of 1mil, not sure if this means milliseconds, million, etc.,
> >> Another example would be the GC metrics:
> >>  "gc

Unable to log into Jira

2019-10-15 Thread Richard Goodman
Hey,

Sorry if this is the wrong group, I tried to email us...@infra.apache.org a
few weeks ago but haven't heard anything.

I am unable to log into my account; it says my password is incorrect. What
is more odd is that my name on the account has changed from Richard Goodman
to Alex Goodman.

I can send a "forgot username" request, which comes through to my
registered email, which is this one. However, if I do a "forgot password",
the email never shows up.

Does anyone know which contact to use in order to help me sort this issue
out?

Thanks,

Richard Goodman


Metrics API - Documentation

2019-10-07 Thread Richard Goodman
Hi there,

I'm currently working on using the prometheus exporter to provide some
detailed insights for our Solr Cloud clusters.

Using the provided template killed our prometheus server, as well as the
exporter, due to the size of our clusters *(each cluster is around 96 nodes,
~300 collections with 3-way replication and 16 shards)*, so you can imagine
the amount of data that comes through /admin/metrics when it isn't filtered
down first.

I've begun working on writing my own template to reduce the amount of data
being requested, and it's working fine; I'm starting to build some nice
graphs in Grafana.
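
As an example of the kind of narrowing I mean, a scrape can ask the
metrics API for just the series a dashboard needs *(the prefixes here are
illustrative)*:

/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes,INDEX.sizeInBytes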

The only difficulty I'm having with this, is I'm struggling to find decent
documentation on the metrics themselves. I was using the resources metrics
reporting - metrics-api
<https://lucene.apache.org/solr/guide/7_7/metrics-reporting.html#metrics-api>
 and monitoring solr with prometheus and grafana
<https://lucene.apache.org/solr/guide/7_7/monitoring-solr-with-prometheus-and-grafana.html>
but
there is a lack of information on most metrics.

For example:

"ADMIN./admin/collections.totalTime":6715327903,

I understand this is a counter; however, I'm not sure what unit it would be
represented in when displaying it, for example:

[image: image.png]

A latency of 1mil, not sure if this means milliseconds, million, etc.,
Another example would be the GC metrics:

  "gc.ConcurrentMarkSweep.count":7,
  "gc.ConcurrentMarkSweep.time":1247,
  "gc.ParNew.count":16759,
  "gc.ParNew.time":884173,

Which when displayed, doesn't give the clearest insight as to what the unit is:

[image: image.png]
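
(Using the units confirmed in the reply above, nanoseconds for totalTime
and milliseconds for GC times per the GarbageCollectorMXBean docs, these
numbers work out to roughly:

  6715327903 ns ≈ 6.7 s of cumulative request-handling time
  884173 ms / 16759 collections ≈ 52.8 ms per ParNew collection)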

If anyone has any advice / guidance, that would be greatly appreciated. If
there isn't documentation for the API, then this is also something I'd look
into helping contribute to.

Thanks,

-- 

Richard Goodman


HowtoConfigureIntelliJ link is broken

2019-07-18 Thread Richard Goodman
Hi there,

I went to set up the repo with IntelliJ, but it was having some problems
figuring out the source folders etc., so I went to navigate to the
following link <https://wiki.apache.org/lucene-java/HowtoConfigureIntelliJ>
as I remember from the past there were a few commands that helped; however,
it appears to be broken. I used a website archiver to retrieve the original
contents, but wasn't sure if this had been raised before.

Thanks,

-- 

Richard Goodman | Data Infrastructure Engineer

richa...@brandwatch.com


Suggestions API for system properties

2019-04-02 Thread Richard Goodman
Hi there,

I have been slowly building up my cluster policies to reproduce the rules
we set on our collections when they are created *(pre v7)*. One of these
rules was rack awareness, which I've implemented by the following:

{
  "replica": "#EQUAL",
  "shard": "#EACH",
  "sysprop.racklocation": "#EACH"
}
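
For this to match anything, each node is started with the corresponding
Java system property; e.g. something like this in solr.in.sh, with the
value varying per rack:

SOLR_OPTS="$SOLR_OPTS -Dracklocation=/rack/001"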

Whilst this works and shows valid violations as shown below:

"collection":"collection_name",
"shard":"shard2",
"tagKey":"/rack/001",
"violation":{
  "replica":{
"NRT":2,
"count":2},
  "delta":1.0},
"clause":{
  "replica":"#EQUAL",
  "shard":"#EACH",
  "sysprop.racklocation":"#EACH",
  "collection":"collection_name"},
"violatingReplicas":[{
"core_node45":{
  "core":"collection_name_shard2_replica3",
  "shard":"shard2",
  "collection":"collection_name",
  "node_name":"127.0.0.1:8080_solr",
  "type":"NRT",
  "base_url":"http://127.0.0.1:8080/solr;,
  "state":"active",
  "force_set_state":"false",
  "INDEX.sizeInGB":12.280795318074524}}
,{
"core_node2":{
  "core":"collection_name_shard2_replica1",
  "shard":"shard2",
  "collection":"collection_name",
  "node_name":"127.0.0.2:8083_solr",
  "type":"NRT",
  "leader":"true",
  "base_url":"http://127.0.0.2:8083/solr;,
  "state":"active",
  "force_set_state":"false",
  "INDEX.sizeInGB":12.24499356560409}}]},

As you can see, there are two replicas on rack "/rack/001", which isn't
allowed. However, when going to the /autoscaling/suggestions endpoint,
nothing is returned:
{
"responseHeader":{
"status":0,
"QTime":43848},
"suggestions":[],
"WARNING":"This response format is experimental. It is likely to change in
the future."}

I experimented by explicitly stating the racks that are present in the
cluster, i.e.
{
  "replica": "#EQUAL",
  "shard": "#EACH",
  "sysprop.racklocation": ["/rack/001", "/rack/002", "/rack/003",
"/rack/004"]
}

I hoped that Solr would be able to use this to deduce where to place the
violating replicas; however, this still doesn't work.

I was wondering if anyone has had similar experience with using system
properties for cluster policies, and how they affect the suggestions
endpoint, as I'm having difficulty getting results from it.

Cheers,
Richard Goodman


Re: Autoscaling rack awareness

2019-03-27 Thread Richard Goodman
So I managed to get this working by the following policy:

{"replica":"<2","shard":"#EACH","sysprop.racklocation": "#EACH"}


On Tue, 26 Mar 2019 at 14:03, Richard Goodman 
wrote:

> Hi, I'm currently running into some trouble trying to set up rack
> awareness as a cluster policy.
>
> I run my cluster with 3 way replication, currently a few collection-shards
> have 4 replicas, which shows as violations under my current set policies:
>
> {
> "set-cluster-policy":[
> {
> "replica":"<2",
> "shard":"#EACH",
> "node":"#ANY"
> },
> {
> "replica":0,
> "freedisk":"<50",
> "strict":false
> }
> ]
> }
>
> {
> "collection":"collection_name_one",
> "shard":"shard12",
> "node":"1.2.3.4:8080_solr",
> "tagKey":"1.2.3.4:8080_solr",
> "violation":{
>
> "replica":"org.apache.solr.client.solrj.cloud.autoscaling.ReplicaCount:{\n
> \"NRT\":2,\n  \"PULL\":0,\n  \"TLOG\":0,\n  \"count\":2}",
>   "delta":1},
> "clause":{
>   "replica":"<2",
>   "shard":"#EACH",
>   "node":"#ANY",
>   "collection":"collection_name_one"}
> },
>
> I want to implement rack awareness as a policy, there are examples of
> availability zone policies, however, not really anything for rack
> awareness. Currently we set this when creating a collection:
>
> sysprop.racklocation:*,shard:*,replica:<2
>
> So I tried to implement this via the following policy rule
>
> {"replica": "<2", "shard": "#EACH", "sysprop.racklocation": "*"}
>
> However, this hasn't worked *(because with the extra replication I have
> atm, it would certainly raise this as a violation)*, so I'm not sure how
> I can implement this.
> I saw in the 7.7 docs this following example:
> {"replica":"#ALL", "shard":"shard1", "sysprop.rack":"730"}
> However, this forces all replicas of shard1 onto a certain rack, which I
> don't want to do; I'd rather the replicas have free choice of where they
> are placed, provided a violation is raised if two replicas appear on the
> same racklocation.
>
> Has anyone had experience of setting something like this up, or have any
> advice / see an error in my policy set up?
>
> *(Currently running solr 7.4)*
>
> Thanks,
> Richard
>


-- 

Richard Goodman | Data Infrastructure Engineer

richa...@brandwatch.com


Autoscaling rack awareness

2019-03-26 Thread Richard Goodman
Hi, I'm currently running into some trouble trying to set up rack awareness
as a cluster policy.

I run my cluster with 3-way replication; currently a few collection-shards
have 4 replicas, which show as violations under my currently set policies:

{
"set-cluster-policy":[
{
"replica":"<2",
"shard":"#EACH",
"node":"#ANY"
},
{
"replica":0,
"freedisk":"<50",
"strict":false
}
]
}

{
"collection":"collection_name_one",
"shard":"shard12",
"node":"1.2.3.4:8080_solr",
"tagKey":"1.2.3.4:8080_solr",
"violation":{

"replica":"org.apache.solr.client.solrj.cloud.autoscaling.ReplicaCount:{\n
\"NRT\":2,\n  \"PULL\":0,\n  \"TLOG\":0,\n  \"count\":2}",
  "delta":1},
"clause":{
  "replica":"<2",
  "shard":"#EACH",
  "node":"#ANY",
  "collection":"collection_name_one"}
},

I want to implement rack awareness as a policy, there are examples of
availability zone policies, however, not really anything for rack
awareness. Currently we set this when creating a collection:

sysprop.racklocation:*,shard:*,replica:<2

So I tried to implement this via the following policy rule

{"replica": "<2", "shard": "#EACH", "sysprop.racklocation": "*"}

However, this hasn't worked *(because with the extra replication I have
atm, it would certainly raise this as a violation)*, so I'm not sure how I
can implement this.
I saw in the 7.7 docs this following example:
{"replica":"#ALL", "shard":"shard1", "sysprop.rack":"730"}
However, this forces all replicas of shard1 onto a certain rack, which I
don't want to do; I'd rather the replicas have free choice of where they
are placed, provided a violation is raised if two replicas appear on the
same racklocation.

Has anyone had experience of setting something like this up, or have any
advice / see an error in my policy set up?

*(Currently running solr 7.4)*

Thanks,
Richard