2018-01-26 00:54:38 UTC - Jaebin Yoon: While doing the load tests, all consumer 
connections got closed and I see the following exceptions on the broker. 
```2018-01-26 00:38:08,489 - ERROR - [pulsar-web-61-28:PulsarWebResource@381] - [null] Failed to validate namespace bundle netflix/prod/ns1/0x84000000_0x86000000
java.lang.IllegalArgumentException: Invalid upper boundary for bundle
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:122)
        at org.apache.pulsar.common.naming.NamespaceBundles.validateBundle(NamespaceBundles.java:110)
        at org.apache.pulsar.broker.web.PulsarWebResource.validateNamespaceBundleRange(PulsarWebResource.java:378)
        at org.apache.pulsar.broker.web.PulsarWebResource.validateNamespaceBundleOwnership(PulsarWebResource.java:404)
        at org.apache.pulsar.broker.admin.Namespaces.splitNamespaceBundle(Namespaces.java:876)
        at sun.reflect.GeneratedMethodAccessor127.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81)
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144)
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161)
        at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$VoidOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:143)
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99)
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389)
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347)
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102)
        at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326)```
----
2018-01-26 00:56:48 UTC - Jaebin Yoon: It seems it tried to split the bundles from 4 to a bigger number when the traffic got heavy.
----
2018-01-26 00:58:02 UTC - Matteo Merli: > all consumer connections got closed

That is expected, since the default configuration is to immediately “unload” the newly split bundles so that they get a chance to be reassigned to a different broker
----
2018-01-26 01:00:32 UTC - Matteo Merli: I think the exception that gets printed is related to the fact that the bundles list gets updated by all the splits happening in parallel. I don’t think that should pose a big problem; the split for that particular bundle should have been re-attempted at the next iteration
----
2018-01-26 01:01:57 UTC - Matteo Merli: (note that it should be avoided.. )
----
2018-01-26 01:01:57 UTC - Jaebin Yoon: somehow the consumers stopped consuming right after that happened.. So in this case, the bundle count should be 8 then.. i'll check if that is the case.
----
2018-01-26 01:02:44 UTC - Matteo Merli: did they reconnect?
----
2018-01-26 01:07:20 UTC - Jaebin Yoon: @Jaebin Yoon uploaded a file: 
<https://apache-pulsar.slack.com/files/U8CM86831/F8YV113EE/-.txt|Untitled>
----
2018-01-26 01:07:23 UTC - Matteo Merli: the exception itself is thrown in the validation phase, so it shouldn’t affect the running state in any way; the split for that particular bundle has simply failed for now
----
2018-01-26 01:07:56 UTC - Jaebin Yoon: so it seems it failed to reconnect
----
2018-01-26 01:08:01 UTC - Matteo Merli: Oh I see, you got an exception in the 
`consumer.acknowledge()`
----
2018-01-26 01:09:00 UTC - Matteo Merli: which is to be expected from an API 
perspective
----
2018-01-26 01:09:28 UTC - Jaebin Yoon: I see..
----
2018-01-26 01:10:01 UTC - Jaebin Yoon: so I need to catch that and retry the message?
----
2018-01-26 01:10:44 UTC - Matteo Merli: `acknowledge()` will fail when not connected to the broker. There’s not much else to do
----
2018-01-26 01:11:09 UTC - Matteo Merli: retrying the acknowledgement is not needed either, because the message will be replayed by the broker anyway
----
2018-01-26 01:11:38 UTC - Matteo Merli: the easiest way to deal with it is to 
just use `consumer.acknowledgeAsync()` and not bother to track the future
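As a minimal sketch of that pattern with the Java client (the service URL, topic, and subscription names here are just placeholders):
```
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

public class AckAsyncExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")            // placeholder service URL
                .build();

        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://netflix/prod/ns1/my-topic")  // placeholder topic
                .subscriptionName("my-subscription")              // placeholder subscription
                .subscribe();

        while (true) {
            Message<byte[]> msg = consumer.receive();

            // ... application processing of msg goes here ...

            // Fire-and-forget ack: if the consumer happens to be disconnected
            // (e.g. during a bundle unload) the ack is simply dropped and the
            // broker will replay the message after reconnection, so there is
            // no need to track or retry the returned future.
            consumer.acknowledgeAsync(msg);
        }
    }
}
```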
----
2018-01-26 01:12:18 UTC - Jaebin Yoon: ok. let me try that. thanks!
----
2018-01-26 01:13:33 UTC - Jaebin Yoon: regarding the split failure, it failed to split because of the race condition, and that is ok?
----
2018-01-26 01:14:23 UTC - Jaebin Yoon: it should be split .. if it fails, will 
it retry?
----
2018-01-26 01:14:57 UTC - Matteo Merli: yes, the same logic that leads to the 
split is checked periodically (traffic, # of topics, …)
----
2018-01-26 01:15:34 UTC - Jaebin Yoon: alright. cool. thanks!
----
2018-01-26 01:26:48 UTC - Matteo Merli: btw: I’ve been merging most of the 
fixes to issues you encountered:

Default limit on number of lookups with 10K partitions:
<https://github.com/apache/incubator-pulsar/pull/1116>

Fix for Kafka consumer wrapper slow consumption (this will be merged shortly):
<https://github.com/apache/incubator-pulsar/pull/1115>

Fix for the race condition in producer code:
<https://github.com/apache/incubator-pulsar/pull/1108>

Reducing the exception printed in logs after disconnections:
<https://github.com/apache/incubator-pulsar/pull/1107>

Load balancer not being enabled by default in embedded broker:
<https://github.com/apache/incubator-pulsar/pull/1104>


My next step is to test with your VM configuration to verify under the same conditions. So far my test VMs have been the ones I deploy from the messaging benchmark deployment scripts, and I cannot see any more issues at this point
----
2018-01-26 02:25:41 UTC - Jaebin Yoon: Oh that's great! Thanks @Matteo Merli for quickly addressing the issues. Let me give you the exact test setup and configuration I have. Most of the problems I've seen came from the stressful condition where there were many partitions and consumers kept asking for old data because they didn't ack messages. After the consumers started following the producers closely, everything got smoother. Apparently that reduces the traffic between brokers and bookies. I'm still using one mount point over multiple HDDs for both journals and ledgers, which makes things worse when reading old data.
----
2018-01-26 03:36:48 UTC - Jaebin Yoon: I'm seeing lots of this exception in the 
broker log. It seems the load balancing doesn't kick in because of this.
```2018-01-26 03:33:38,823 - WARN  - [pulsar-1-2:SimpleResourceAllocationPolicies@56] - GetIsolationPolicies: Unable to get the namespaceIsolationPolicies [{}]
java.lang.NullPointerException
        at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:770)
        at com.google.common.base.Joiner.toString(Joiner.java:454)
        at com.google.common.base.Joiner.appendTo(Joiner.java:109)
        at com.google.common.base.Joiner.appendTo(Joiner.java:154)
        at com.google.common.base.Joiner.appendTo(Joiner.java:141)
        at com.google.common.base.Joiner.appendTo(Joiner.java:168)
        at org.apache.pulsar.broker.web.PulsarWebResource.path(PulsarWebResource.java:101)
        at org.apache.pulsar.broker.loadbalance.impl.SimpleResourceAllocationPolicies.getIsolationPolicies(SimpleResourceAllocationPolicies.java:54)
        at org.apache.pulsar.broker.loadbalance.impl.SimpleResourceAllocationPolicies.isSharedBroker(SimpleResourceAllocationPolicies.java:97)
        at org.apache.pulsar.broker.loadbalance.impl.LoadManagerShared.applyPolicies(LoadManagerShared.java:129)
        at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl.selectBrokerForAssignment(ModularLoadManagerImpl.java:659)
        at org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerWrapper.getLeastLoaded(ModularLoadManagerWrapper.java:67)
        at org.apache.pulsar.broker.namespace.NamespaceService.getLeastLoadedFromLoadManager(NamespaceService.java:463)
        at org.apache.pulsar.broker.namespace.NamespaceService.searchForCandidateBroker(NamespaceService.java:338)
        at org.apache.pulsar.broker.namespace.NamespaceService.lambda$15(NamespaceService.java:301)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
        at java.lang.Thread.run(Thread.java:748)```
----
2018-01-26 03:39:48 UTC - Jaebin Yoon: And the split failures keep repeating.. resulting in consumers failing to connect to the broker.
----
2018-01-26 03:43:27 UTC - Jaebin Yoon: Producers and consumers eventually reconnected, but the brokers were unstable for 4~5 min because of this
----
2018-01-26 03:47:01 UTC - Jaebin Yoon: After that I got these exceptions on the brokers:
```2018-01-26 03:42:48,926 - WARN  - [pulsar-modular-load-manager-60-1:BundleSplitterTask@98] - Could not split namespace bundle netflix/prod/ns1/0x12c80000_0x12d00000 because namespace netflix/prod/ns1 has too many bundles: 128
2018-01-26 03:42:48,926 - WARN  - [pulsar-modular-load-manager-60-1:BundleSplitterTask@98] - Could not split namespace bundle netflix/prod/ns1/0xf21b9fff_0xf21bafff because namespace netflix/prod/ns1 has too many bundles: 128
2018-01-26 03:42:48,926 - WARN  - [pulsar-modular-load-manager-60-1:BundleSplitterTask@98] - Could not split namespace bundle netflix/prod/ns1/0x1b780000_0x1b782000 because namespace netflix/prod/ns1 has too many bundles: 128
2018-01-26 03:42:48,927 - WARN  - [pulsar-modular-load-manager-60-1:BundleSplitterTask@98] - Could not split namespace bundle netflix/prod/ns1/0x65c80000_0x65cc0000 because namespace netflix/prod/ns1 has too many bundles: 128
2018-01-26 03:42:48,927 - WARN  - [pulsar-modular-load-manager-60-1:BundleSplitterTask@98] - Could not split namespace bundle netflix/prod/ns1/0x1c100000_0x1c200000 because namespace netflix/prod/ns1 has too many bundles: 128```
----
2018-01-26 03:48:01 UTC - Jaebin Yoon: Somehow it split but kept failing in validation, so it seems it kept doing that again and again without actually distributing topics to bundles
----
2018-01-26 03:50:15 UTC - Jaebin Yoon: Maybe my namespace setup is missing some 
policies to make this load balancing work?
----
2018-01-26 03:52:39 UTC - Matteo Merli: From the log message it looks like it 
already split that into 128 bundles, which is the default limit for auto split. 
Will look into the previous exception
----
2018-01-26 03:53:42 UTC - Jaebin Yoon: How do I check if the topic partitions 
got assigned to different bundles? I have only 10 partitions for my topic now.
----
2018-01-26 03:54:55 UTC - Matteo Merli: The assignment is based on the hash of 
the topic name 
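Roughly, each bundle owns a range of the 32-bit hash space (the hex boundaries in the bundle names above), and a topic lands in whichever bundle's range contains the hash of its name. A toy illustration of the idea, not Pulsar's actual implementation (the hash function, CRC32 here, and the topic name are made up for the example):
```
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class BundleHashSketch {
    public static void main(String[] args) {
        // Boundaries of one bundle, taken from the log line earlier in this thread
        long lower = 0x84000000L;
        long upper = 0x86000000L;

        // Hypothetical partition name, just for illustration
        String topic = "persistent://netflix/prod/ns1/my-topic-partition-3";

        // Hash the topic name into an unsigned 32-bit value
        CRC32 crc = new CRC32();
        crc.update(topic.getBytes(StandardCharsets.UTF_8));
        long hash = crc.getValue();

        // The topic belongs to this bundle if its hash falls in [lower, upper)
        boolean ownedByBundle = hash >= lower && hash < upper;
        System.out.printf("hash=0x%08x ownedByBundle=%b%n", hash, ownedByBundle);
    }
}
```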
----
2018-01-26 03:55:42 UTC - Matteo Merli: You can get the load report for the broker, which contains all the per-bundle stats as well
----
2018-01-26 03:56:10 UTC - Matteo Merli: E.g.: `pulsar-admin broker-stats load-report`
----
2018-01-26 04:10:27 UTC - Jaebin Yoon: ok. I see the 10 topic partitions got distributed over 8 brokers out of 10. Not sure why it ended up split up to 128.
----
2018-01-26 04:55:57 UTC - Jaebin Yoon: So 128 is the auto-split limit, but you can manually set any higher number, right? The bundle is the broker distribution unit, so it definitely needs to be more than 128.
----
2018-01-26 04:56:59 UTC - Matteo Merli: Correct, that's the default for the 
auto split limit. 
----
2018-01-26 04:57:48 UTC - Matteo Merli: You can increase the default, and you can also pre-create a namespace with a larger bundle count to begin with
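Hypothetically, with the Java admin client that would look roughly like the sketch below (the admin URL and the namespace name are placeholders, and 64 bundles is an arbitrary choice); the `pulsar-admin namespaces create` CLI should offer an equivalent bundles option:
```
import org.apache.pulsar.client.admin.PulsarAdmin;

public class CreateNamespaceWithBundles {
    public static void main(String[] args) throws Exception {
        // Placeholder admin service URL
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build()) {
            // Pre-create the namespace with 64 bundles instead of the default,
            // so the load manager has enough units to spread across brokers
            // without relying on auto-split under load.
            admin.namespaces().createNamespace("netflix/prod/ns2", 64);
        }
    }
}
```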
----
2018-01-26 04:58:50 UTC - Jaebin Yoon: I see. It seems better to start with a bigger bundle count. Auto-split can make the data pipeline unstable for a while.
----
2018-01-26 04:59:08 UTC - Jaebin Yoon: regarding the separate disk for bookie 
journal, can i use the loopback device for the test?
----
2018-01-26 05:00:06 UTC - Jaebin Yoon: do you think it would work for tests? It's hard to attach a separate hard disk in our env now.. and we don't recover a machine once it's down anyway
----
2018-01-26 05:00:35 UTC - Jaebin Yoon: when the machine is gone, it's gone. We can't reuse it, not even by rebooting.
----
2018-01-26 05:01:23 UTC - Jaebin Yoon: eventually, i think we should have a 
separate disk for the journal but for testing, do you see any issue with using 
loopback device?
----
2018-01-26 05:02:04 UTC - Jaebin Yoon: I just want to separate read and write IO to make sure that's not the cause of the system instability.
----
2018-01-26 05:03:37 UTC - Matteo Merli: What do you mean by loopback device? 
The disk that comes by default with the vm?
----
2018-01-26 05:04:06 UTC - Jaebin Yoon: like /tmp .. memory based device.
----
2018-01-26 05:04:19 UTC - Jaebin Yoon: memory based block device.
----
2018-01-26 05:04:34 UTC - Matteo Merli: Oh I see, /dev/shm?
----
2018-01-26 05:04:38 UTC - Jaebin Yoon: yeah
----
2018-01-26 05:04:46 UTC - Matteo Merli: That should be good 
----
2018-01-26 05:05:11 UTC - Jaebin Yoon: ok. I'll use that for testing. That'll help us isolate the IO.
----
2018-01-26 05:05:27 UTC - Matteo Merli: You can reduce the number of journal 
files retained from 5 to 1 as well
----
2018-01-26 05:06:02 UTC - Matteo Merli: Let me search for the conf name
----
2018-01-26 05:07:26 UTC - Matteo Merli: journalMaxBackups=1 in bookkeeper.conf
----
2018-01-26 05:08:07 UTC - Jaebin Yoon: i see. ok.
----
2018-01-26 05:08:21 UTC - Jaebin Yoon: i'll change that. thanks!
----
2018-01-26 05:08:56 UTC - Matteo Merli: Just to use less memory 
----
2018-01-26 05:09:08 UTC - Jaebin Yoon: yup
----
2018-01-26 06:20:33 UTC - Jaebin Yoon: I meant "ramdisk" when I said "loopback" device or disk. I haven't used those terms for a long time so I got confused. Anyway, I think it might not be a terrible idea to use that even for a production env where you cannot recover machines by rebooting, as long as the machines have enough memory to hold the journal. I'll think about this setup more for our environment.
----
2018-01-26 06:27:03 UTC - Jaebin Yoon: To upgrade bookies, I think we should do a rolling upgrade of the live cluster. After upgrading one bookie, what metrics should I look for to know that I can upgrade the next one so that I don't lose any data?
----
2018-01-26 07:13:47 UTC - Matteo Merli: In general we were doing rolling upgrades with these steps: 
 * PRE: verify there are no under-replicated ledgers: `bin/bookkeeper shell listunderreplicated` should return an empty list
 * PRE: disable auto-recovery, since we don’t want to trigger data copying when the bookie will come back shortly. The setting here is cluster-wide: `bin/bookkeeper shell autorecovery -disable`
 * Do the upgrade.. 
 * POST: do a bookie sanity check for health: `bin/bookkeeper shell bookiesanity` (this should also be run every X minutes on each bookie for alerting)
 * POST: re-enable auto-recovery, to leave the cluster setting in its initial state: `bin/bookkeeper shell autorecovery -enable`
----
2018-01-26 08:05:30 UTC - Jaebin Yoon: thanks a lot. So these steps are for 
each bookie, right? Let me try.
----
2018-01-26 08:15:30 UTC - Jaebin Yoon: Ah this is a cluster setting. So 
basically "do the upgrade" means rolling restart all bookies I guess.
----
2018-01-26 08:17:54 UTC - Jaebin Yoon: What if we need to provision new machines replacing the existing ones? In this case, you shouldn't disable auto-recovery, but rather kill one bookie, provision a new one, and check that "listunderreplicated" is empty before doing the next one.
----
2018-01-26 09:21:42 UTC - Jaebin Yoon: Do I need to run a separate autorecovery service? (I see there is `bin/bookkeeper autorecovery`.)
I brought up new bookies and killed the existing bookies one by one, but I don't see any 'underreplicated' entries from the bookkeeper shell command, and I'm not sure if the data got copied to the others.
----
2018-01-26 16:43:22 UTC - Matteo Merli: > Ah this is a cluster setting. So basically “do the upgrade” means rolling restart all bookies I guess.

Sorry, I didn’t specify: the steps are per-bookie; the flag to enable/disable auto-recovery is per-cluster. 

That said, since all the automation we had was per-host, we would just flip the flag off and on for each host. It’s also possible to turn it off before the upgrade and back on at the very end
----
2018-01-26 16:46:18 UTC - Matteo Merli: > What if we need to provision new machines replacing the existing ones? In this case, you shouldn’t disable auto-recovery, but rather kill one bookie, provision a new one, and check that “listunderreplicated” is empty before doing the next one.

That’s correct. The auto-recovery should always be “on”, except when there’s planned maintenance and you want to avoid the data getting copied around. (btw: for that there’s also an over-replication check that happens periodically, every 24h by default, to check whether we have extra copies of some ledgers and get rid of them)
----
2018-01-26 16:56:19 UTC - Matteo Merli: > Do I need to run a separate autorecovery service? (I see there is `bin/bookkeeper autorecovery`.)
> I brought up new bookies and killed the existing bookies one by one, but I don’t see any ‘underreplicated’ entries from the bookkeeper shell command, and I’m not sure if the data got copied to the others.

Yes, I think I forgot to mention this. The auto-recovery service is a logically separate component, completely stateless, and it can be started in a few different ways: 
 * In the same JVM process as the bookies (`autoRecoveryDaemonEnabled=true` in bookkeeper.conf). It’s easy to start, but in general you want to avoid the recovery process interfering with the bookie process and polluting the bookie logs, since it’s kind of noisy.
 * In a different process running alongside each bookie process (`bin/bookkeeper autorecovery`). This process needs little memory (-Xmx512M) and no other special tuning.
 * Running a pool of auto-recovery workers independently from the bookies. This is generally a good option when running with any orchestration framework, where it’s easy to manage stateless components.
----
2018-01-26 17:06:39 UTC - Matteo Merli: regarding auto-recovery, there’s more 
documentation here: 
<http://bookkeeper.apache.org/docs/latest/admin/autorecovery/>
----
2018-01-26 17:33:35 UTC - Jaebin Yoon: Thanks @Matteo Merli. I don't see that option (autoRecoveryDaemonEnabled) specified in bookkeeper.conf. (Maybe we should add the default value in bookkeeper.conf so that people become aware of this option?) Without that service running, it seems no data copying is done when a bookie goes down (nobody updates the under-replicated ledgers, so checking listunderreplicated doesn't do anything). I think I lost all my test data. ^^ It's ok since it's just test data.
Can you explain how this auto-recovery works at a high level? I would like to confirm how the data flows.
Does the auto-recovery worker just talk to ZooKeeper to find any missing bookies and update the under-replicated list in a znode? Or is it involved in moving ledgers from one bookie to another? Who moves the actual ledgers?
----
2018-01-26 17:43:53 UTC - Matteo Merli: > I don’t see that option (autoRecoveryDaemonEnabled) specified in bookkeeper.conf.

Yes, for some reason the auto-recovery options with their default values are not there in the `bookkeeper.conf` included with Pulsar. We should fix that.

> Can you explain how this auto-recovery works at a high level? I would like to confirm how the data flows.
> Does the auto-recovery worker just talk to ZooKeeper to find any missing bookies and update the under-replicated list in a znode? Or is it involved in moving ledgers from one bookie to another? Who moves the actual ledgers?

There are 2 parts: the auditor and the workers.
 * The auditor monitors bookies to check for missing ones
 * When it finds one, it marks all ledgers that have a copy on that bookie as “under-replicated”
 * Workers pick up tasks to do, in the form of ledgers to replicate
 * When a worker is done replicating a particular ledger, it updates the metadata (replacing the mention of the failed bookie with the new bookie that now contains the copy of the data) and then clears the under-replication status for that ledger
 * The workers then keep waiting for the next ledger to replicate

<http://bookkeeper.apache.org/docs/latest/admin/autorecovery/#autorecovery-architecture>
----
2018-01-26 17:48:04 UTC - Jaebin Yoon: Oh, I should've read the doc first. It seems the process is well documented there. So the bookies themselves have no ability to replicate, and these auto-recovery workers read from and write to the bookies. So potentially the network between the auto-recovery nodes and the bookies can get saturated when a bookie goes down.
----
2018-01-26 17:48:36 UTC - Matteo Merli: there’s throttling at the worker level
----
2018-01-26 17:48:57 UTC - Matteo Merli: the max number of entries to replicate 
at a given time
----
2018-01-26 17:49:41 UTC - Matteo Merli: These are the options: 
<https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/conf/bk_server.conf#L213>
----
2018-01-26 17:49:52 UTC - Jaebin Yoon: I see. I'll dig more into this recovery, since it will be an important part for an environment where machines can go away easily.
----
2018-01-26 17:50:26 UTC - Matteo Merli: This conf file is from master. The only option that is not there in the current version is `lostBookieRecoveryDelay`; we’ll have that once we switch to BK 4.7
----
2018-01-26 18:02:02 UTC - Jaebin Yoon: Since I lost all the ledgers for the existing topic, I need to clean them up. Currently nobody produces to or consumes from the topic. Can I just delete the topic that lost data and recreate it to clean up the previous metadata?
----
2018-01-26 18:22:15 UTC - Jaebin Yoon: Another question. It seems that this auto-recovery takes care of the case where a bookie is gone. What about the under-replicated case where the broker fails to replicate because of a temporary connection issue?
For example, if write quorum = 2 and ack quorum = 1, then a message may fail to be replicated to two bookies. Will this lead to inconsistency in the replicated ledgers, or is there any mechanism to handle this kind of failure?
----
2018-01-26 19:20:43 UTC - Matteo Merli: > For example, if write quorum = 2 and ack quorum = 1, then a message may fail to be replicated to two bookies. Will this lead to inconsistency in the replicated ledgers, or is there any mechanism to handle this kind of failure?

If the write operation eventually fails on the second bookie, the BK client will take care of it by replacing the failed bookie with a different one from the cluster and re-sending all the entries that were not acknowledged by that particular bookie
----
2018-01-26 19:30:04 UTC - Matteo Merli: > Since I lost all the ledgers for the existing topic, I need to clean them up. Currently nobody produces to or consumes from the topic. Can I just delete the topic that lost data and recreate it to clean up the previous metadata?

Yes, using `bin/pulsar-admin persistent delete-partitioned-topic $MY_TOPIC` should do the trick
----
