2018-01-25 04:50:44 UTC - Jaebin Yoon: I'm seeing a bunch of these messages 
in the broker logs, and producers and consumers get disconnected and then connect 
again. ```2018-01-25 04:48:34,147 - WARN  - 
[BookKeeperClientWorker-24-1:LedgerHandle@919] - Write did not succeed to 
100.82.113.61:3181, bookieIndex 0, but we have already fixed it.
2018-01-25 04:48:34,147 - WARN  - 
[BookKeeperClientWorker-24-1:PendingAddOp@238] - Write did not succeed: L211502 
E56322 on 100.82.113.61:3181, rc = -23
2018-01-25 04:48:34,147 - WARN  - 
[BookKeeperClientWorker-24-1:LedgerHandle@919] - Write did not succeed to 
100.82.113.61:3181, bookieIndex 0, but we have already fixed it.
2018-01-25 04:48:34,148 - WARN  - 
[BookKeeperClientWorker-24-1:PendingAddOp@238] - Write did not succeed: L211502 
E52420 on 100.82.113.61:3181, rc = -23
2018-01-25 04:48:34,148 - WARN  - 
[BookKeeperClientWorker-24-1:LedgerHandle@919] - Write did not succeed to 
100.82.113.61:3181, bookieIndex 0, but we have already fixed it.```
What could the issue be?
----
2018-01-25 04:55:34 UTC - Jaebin Yoon: @Jaebin Yoon uploaded a file: 
<https://apache-pulsar.slack.com/files/U8CM86831/F8YAFQN0L/-.txt|Untitled>
----
2018-01-25 06:52:24 UTC - Sijie Guo: @Jaebin Yoon is the bookie logs from 
`100.82.113.61`?
----
2018-01-25 06:59:40 UTC - Jaebin Yoon: @Sijie Guo I'm not sure. Some time after 
I saw a bunch of "Write did not succeed.." exceptions in the brokers, I saw 
the above exceptions on multiple bookies. 
@Matteo Merli actually I found one issue with our consumer code: it didn't 
acknowledge messages. (I think we forgot to add that when we switched from the 
kafka wrapper to the pulsar client because of the high-partition-count issues.) 
This would explain why the consumer queue was almost always full, leading to many 
different issues. I think this definitely stressed the brokers and bookies, 
since it kept asking for the old messages again and again.
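(For reference, a minimal sketch of a receive loop that does acknowledge each message; the service URL, topic and subscription below are placeholders, not taken from this thread, and this uses the builder-style Pulsar Java client API, which may differ from the client version in use here:)
```
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

public class AckingConsumer {
    public static void main(String[] args) throws Exception {
        // Placeholder service URL, topic and subscription names.
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://my-tenant/my-ns/my-topic")
                .subscriptionName("my-subscription")
                .subscribe();
        while (true) {
            Message<byte[]> msg = consumer.receive();
            // ... process the message ...
            // Acknowledging lets the broker trim the subscription backlog;
            // without it the receiver queue stays full and old messages keep
            // getting replayed.
            consumer.acknowledge(msg);
        }
    }
}
```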
----
2018-01-25 07:15:18 UTC - Jaebin Yoon: But the traffic was not huge (1k ~ 10k 
qps from one producer), so it was strange that it could make things so 
unstable. I'll see whether I run into those issues again with the usual setup.
----
2018-01-25 07:28:44 UTC - Sijie Guo: it sounds like the refcnt issue causes 
“Write did not succeed”. @Matteo Merli is that similar to the issue you fixed 
before?
----
2018-01-25 15:08:55 UTC - Matteo Merli: @Jaebin Yoon for the general 
instability, in particular when replaying old data, I'm starting to suspect 
some non-optimal config/env setting. Can you verify that, in the bookies, journal 
and storage are set to different disk devices? I saw earlier that you mentioned 
the 12 HDDs in RAID 0. The normal config there would be to dedicate 1 or 2 
disks to the journal and the rest to storage, with 2 RAID groups.
----
2018-01-25 17:13:58 UTC - Jaebin Yoon: In the current setup, both journal 
and storage are on a RAID 0 volume over 12 HDDs. I'll check if we can separate 
one of the 12 for the journal; that configuration is hard to change. 
How big should the journal disk usually be?
----
2018-01-25 17:31:12 UTC - Matteo Merli: The journal doesn’t have to be big; 
the key is for it to be a dedicated disk. That’s what allows reads and writes 
to be completely decoupled in the bookies.

The journal is composed of 2 GB files, and it gets “trimmed” whenever the data is 
flushed to the storage device. (The default config is to keep the last 5 
journals around, so most of the time the space required is ~10 GB.) Of course the 
IO bandwidth of the journal is what’s typically critical, but without fsync, 
even a single HDD should give very decent throughput.
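(As a rough illustration of that layout; the mount points below are made-up examples and the exact disk split is an assumption, but these are the standard bookie settings in `bookkeeper.conf`:)
```
# Journal on its own dedicated disk / small RAID group (example path)
journalDirectory=/mnt/journal/bookkeeper/journal
# Ledger / entry-log storage on the remaining disks (second RAID group, example path)
ledgerDirectories=/mnt/storage/bookkeeper/ledgers
```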
----
2018-01-25 17:31:25 UTC - Matteo Merli: (provided the HDD is dedicated)
----
2018-01-25 17:40:01 UTC - Jaebin Yoon: Ok. I'll look into setting up a 
separate disk for the journal before scaling up more.
----
2018-01-25 17:50:50 UTC - Jaebin Yoon: I got this error, and I'm not sure why 
I'm getting it since the consumers are now acknowledging messages. 
```2018-01-25 17:21:13,504 WARN  org.apache.pulsar.client.impl.ClientCnx:359 
[pulsar-client-io-1-1] [handleError] [id: 0x45292a46, L:/100.82.118.47:46366 - 
R:prod.pulsarbroker.us-east-1.dynprod.netflix.net/100.82.102.123:6650]
 Received error from server: Cannot create producer on topic with backlog quota 
exceeded
2018-01-25 17:21:13,504 WARN  org.apache.pulsar.client.impl.ClientCnx:362 
[pulsar-client-io-1-1] [handleError] [id: 0x45292a46, L:/100.82.118.47:46366 - 
R:prod.pulsarbroker.us-east-1.dynprod.netflix.net/100.82.102.123:6650]
 Producer creation has been blocked because backlog quota exceeded for producer 
topic```
----
2018-01-25 17:51:34 UTC - Matteo Merli: You can check the topic stats to see 
the backlog and whether there are any dangling subscriptions
----
2018-01-25 17:51:56 UTC - Matteo Merli: `bin/pulsar-admin persistent stats 
$TOPIC`
----
2018-01-25 17:52:13 UTC - Jaebin Yoon: ah, I see. So if there are subscriptions 
created for the topic but not reading from it, it affects the producing 
side.
----
2018-01-25 17:52:33 UTC - Matteo Merli: that’s the default behavior, you can 
configure to drop all data
----
2018-01-25 17:52:44 UTC - Matteo Merli: all data ---> “old data”
----
2018-01-25 17:54:04 UTC - Matteo Merli: eg: 
`bin/pulsar-admin namespaces set-backlog-quota $MY_NAMESPACE --limit 1T --policy 
consumer_backlog_eviction`
----
2018-01-25 17:54:05 UTC - Jaebin Yoon: ok. got it. thanks! we're so used to 
creating ad-hoc consumers and leaving them there.
----
2018-01-25 17:54:55 UTC - Jaebin Yoon: it makes sense.
----
2018-01-25 17:55:14 UTC - Matteo Merli: that will start “trimming” the backlog 
once the quota is reached. All the defaults we have are chosen so that data is 
never deleted unexpectedly
----
2018-01-26 00:54:38 UTC - Jaebin Yoon: While doing the load tests, all consumer 
connections got closed and I see the following exceptions on the broker. 
```2018-01-26 00:38:08,489 - ERROR - [pulsar-web-61-28:PulsarWebResource@381] - 
[null] Failed to validate namespace bundle 
netflix/prod/ns1/0x84000000_0x86000000
java.lang.IllegalArgumentException: Invalid upper boundary for bundle
        at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
        at 
org.apache.pulsar.common.naming.NamespaceBundles.validateBundle(NamespaceBundles.java:110)
        at 
org.apache.pulsar.broker.web.PulsarWebResource.validateNamespaceBundleRange(PulsarWebResource.java:378)
        at 
org.apache.pulsar.broker.web.PulsarWebResource.validateNamespaceBundleOwnership(PulsarWebResource.java:404)
        at 
org.apache.pulsar.broker.admin.Namespaces.splitNamespaceBundle(Namespaces.java:876)
        at sun.reflect.GeneratedMethodAccessor127.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
        at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
        at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
        at 
org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:143)
        at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
        at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
        at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
        at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
        at 
org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)```
----
2018-01-26 00:56:48 UTC - Jaebin Yoon: It seems it tried to split the bundles 
from 4 to a bigger number when the traffic got heavy.
----
2018-01-26 00:58:02 UTC - Matteo Merli: > all consumer connections got 
closed 

That is expected, since the default configuration is to immediately “unload” the 
newly split bundles so that they get reassigned to a 
different broker right away
----
2018-01-26 01:00:32 UTC - Matteo Merli: I think the exception that gets printed 
is related to the fact that the bundles list gets updated by all the splits happening 
in parallel. I don’t think that should pose a big problem; the split for that 
particular bundle should have been re-attempted at the next iteration
----
2018-01-26 01:01:57 UTC - Matteo Merli: (note that it should be avoided.. )
----
2018-01-26 01:01:57 UTC - Jaebin Yoon: somehow the consumers stopped consuming 
right after that happened.. So in this case, the bundle count should be 8 then; I'll 
check if that is the case.
----
2018-01-26 01:02:44 UTC - Matteo Merli: did they reconnect?
----
2018-01-26 01:07:20 UTC - Jaebin Yoon: @Jaebin Yoon uploaded a file: 
<https://apache-pulsar.slack.com/files/U8CM86831/F8YV113EE/-.txt|Untitled>
----
2018-01-26 01:07:23 UTC - Matteo Merli: the exception itself is thrown in the 
validation phase, so it shouldn’t affect the running state in any way; the 
split for that particular bundle has just failed for now
----
2018-01-26 01:07:56 UTC - Jaebin Yoon: so it seems it failed to reconnect
----
2018-01-26 01:08:01 UTC - Matteo Merli: Oh I see, you got an exception in the 
`consumer.acknowledge()`
----
2018-01-26 01:09:00 UTC - Matteo Merli: which is to be expected from an API 
perspective
----
2018-01-26 01:09:28 UTC - Jaebin Yoon: I see..
----
2018-01-26 01:10:01 UTC - Jaebin Yoon: so do we need to catch that and retry acknowledging the message?
----
2018-01-26 01:10:44 UTC - Matteo Merli: `acknowledge()` will fail when not 
connected to the broker. There’s not much else to do
----
2018-01-26 01:11:09 UTC - Matteo Merli: retrying the acknowledgement is not needed 
either, because the message will be replayed anyway by the broker
----
2018-01-26 01:11:38 UTC - Matteo Merli: the easiest way to deal with it is to 
just use `consumer.acknowledgeAsync()` and not bother to track the future
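(A minimal sketch of that fire-and-forget pattern; `consumer` and `msg` are assumed to come from a receive loop like the one sketched earlier in the thread:)
```
// Instead of the blocking consumer.acknowledge(msg), which throws when the
// client is disconnected, fire the ack asynchronously and ignore the returned
// CompletableFuture; if the ack is lost, the broker simply redelivers the message.
consumer.acknowledgeAsync(msg);
```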
----
2018-01-26 01:12:18 UTC - Jaebin Yoon: ok. let me try that. thanks!
----
2018-01-26 01:13:33 UTC - Jaebin Yoon: regarding the split failure, it 
failed to split because of the race condition, and that is ok?
----
2018-01-26 01:14:23 UTC - Jaebin Yoon: it should still be split though.. if it fails, will 
it retry?
----
2018-01-26 01:14:57 UTC - Matteo Merli: yes, the same logic that leads to the 
split is checked periodically (traffic, # of topics, …)
----
2018-01-26 01:15:34 UTC - Jaebin Yoon: alright. cool. thanks!
----
2018-01-26 01:26:48 UTC - Matteo Merli: btw: I’ve been merging most of the 
fixes for the issues you encountered:

Default limit on number of lookups with 10K partitions:
<https://github.com/apache/incubator-pulsar/pull/1116>

Fix for Kafka consumer wrapper slow consumption (this will be merged shortly):
<https://github.com/apache/incubator-pulsar/pull/1115>

Fix for the race condition in producer code:
<https://github.com/apache/incubator-pulsar/pull/1108>

Reducing the exception printed in logs after disconnections:
<https://github.com/apache/incubator-pulsar/pull/1107>

Load balancer not being enabled by default in embedded broker:
<https://github.com/apache/incubator-pulsar/pull/1104>


My next step is to test with your VM configuration to verify under the same 
conditions. So far my test VMs have been the ones I deploy from the messaging 
benchmark deployment scripts, and at this point I can't see any more issues there
----
