2018-09-21 18:29:38 UTC - Grant Wu: ```
2018-09-21T18:22:23.046775688Z 18:22:23.046
[bookkeeper-ml-workers-OrderedExecutor-15-0] WARN
org.apache.bookkeeper.mledger.impl.OpReadEntry -
[public/default/persistent/symphony][be645d03-95dc-4c78-acec-7ee0801c489f] read
failed from ledger at position:4231:4789 : Bookie operation timeout
2018-09-21T18:22:23.046796482Z 18:22:23.046
[broker-topic-workers-OrderedScheduler-4-0] ERROR
org.apache.pulsar.broker.service.persistent.PersistentDispatcherSingleActiveConsumer
- [<persistent://public/default/symphony> /
be645d03-95dc-4c78-acec-7ee0801c489f-Consumer{subscription=PersistentSubscription{topic=<persistent://public/default/symphony>,
name=be645d03-95dc-4c78-acec-7ee0801c489f}, consumerId=3727,
consumerName=02d87, address=/10.244.1.14:40032}] Error reading entries at
4231:4789 : Bookie operation timeout - Retrying to read in 15.0 seconds
```
this sounds like disk error, correct?
----
2018-09-21 18:31:53 UTC - Matteo Merli: read timeout
----
2018-09-21 18:32:04 UTC - Matteo Merli: broker reading from bookies
----
2018-09-21 18:32:20 UTC - Grant Wu: Well yes
----
2018-09-21 18:32:28 UTC - Grant Wu: Could the read timeout be caused by disk
error?
----
2018-09-21 18:32:31 UTC - Grant Wu: I think this is standalone
----
2018-09-21 18:33:24 UTC - Grant Wu: I am also getting a ton of
```
18:32:48.046 [BookKeeperClientScheduler-OrderedScheduler-0-0] ERROR
org.apache.bookkeeper.common.util.SafeRunnable - Unexpected throwable caught
java.lang.NullPointerException: null
18:32:49.046 [BookKeeperClientScheduler-OrderedScheduler-0-0] ERROR
org.apache.bookkeeper.common.util.SafeRunnable - Unexpected throwable caught
java.lang.NullPointerException: null
18:32:50.046 [BookKeeperClientScheduler-OrderedScheduler-0-0] ERROR
org.apache.bookkeeper.common.util.SafeRunnable - Unexpected throwable caught
java.lang.NullPointerException: null
18:32:51.046 [BookKeeperClientScheduler-OrderedScheduler-0-0] ERROR
org.apache.bookkeeper.common.util.SafeRunnable - Unexpected throwable caught
java.lang.NullPointerException: null
18:32:52.046 [BookKeeperClientScheduler-OrderedScheduler-0-0] ERROR
org.apache.bookkeeper.common.util.SafeRunnable - Unexpected throwable caught
java.lang.NullPointerException: null
```
----
2018-09-21 18:35:14 UTC - Grant Wu: This is standalone deployment, I believe
----
2018-09-21 18:35:40 UTC - Sijie Guo: @Grant Wu what is the version?
----
2018-09-21 18:36:25 UTC - Matteo Merli: can you scroll up in the logs? JVM
optimizes out the repeated exception, but the first time it happens there
should be the full stack trace
----
2018-09-21 18:37:23 UTC - Grant Wu: that I can get you, hold on
----
2018-09-21 18:44:10 UTC - Grant Wu:
----
2018-09-21 18:44:26 UTC - Grant Wu: @Matteo Merli sorry about the formatting.
If you’d like, I can try to get rid of some of the noise
----
2018-09-21 18:44:53 UTC - Grant Wu: It’s `2.1.0`
----
2018-09-21 18:49:49 UTC - Matteo Merli: @Grant Wu This looks like the issue
reported in <https://github.com/apache/bookkeeper/pull/1618> . This was fixed
in BK-4.7.2 and picked up in Pulsar 2.1.1
----
2018-09-21 18:50:40 UTC - Grant Wu: Okay, great. I will tell us to move to
that ASAP.
----
2018-09-21 18:50:52 UTC - Grant Wu: Any thoughts on the bookie issues, though?
----
2018-09-21 18:51:23 UTC - Matteo Merli: That was the bookie issue, because of
the NPE the read operation doesn’t complete and gets timedout
----
2018-09-21 18:51:47 UTC - Grant Wu: Ah. Fascinating…
----
2018-09-21 18:53:30 UTC - Grant Wu: Thanks a ton!
----
2018-09-21 18:54:10 UTC - Grant Wu: Could we be mis-using Pulsar to hit upon
this bug? The message says “but this code it only hit in the rare case where a
client is trying to read or write the same entry from the same ledger more than
once at the same time.”
----
2018-09-21 18:54:24 UTC - Grant Wu: So I am curious as to what we’re doing is
strange
----
2018-09-21 18:55:04 UTC - Matteo Merli: Nothing wrong, that’s an internal read,
it should never throw NPE
----
2018-09-21 18:55:51 UTC - David Orr: @David Orr has joined the channel
----
2018-09-21 19:04:02 UTC - Grant Wu: Ah, it appears my coworker has joined.
:wave:
----
2018-09-21 19:04:45 UTC - Grant Wu: I am spreading the Pulsar slack’s
reputation for quick triage, it seems
bug : Matteo Merli, Sijie Guo
+1 : David Orr
----
2018-09-21 19:34:47 UTC - Dave Southwell: I'm seeing this error `Caused by:
org.apache.pulsar.client.api.PulsarClientException$ProducerBlockedQuotaExceededError:
Cannot create producer on topic with backlog quota exceeded` but I don't see
any quotas being set, nor can I see how much of the quota is used currently.
I'm certain this is something very simple, and I'm just overlooking it, but
I've run out of places I can think of to look. This is my command that's
resulting in this error. `bin/pulsaperf produce
<persistent://public/default/exclamation-output> --rate 50000`
----
2018-09-21 19:36:03 UTC - Grant Wu:
<http://pulsar.incubator.apache.org/docs/en/cookbooks-retention-expiry/#backlog-quotas>
----
2018-09-21 19:36:29 UTC - Grant Wu:
<http://pulsar.incubator.apache.org/docs/en/cookbooks-retention-expiry/#get-backlog-threshold-and-backlog-retention-policy>
may help
----
2018-09-21 19:36:33 UTC - Grant Wu: I actually don’t know what the defaults are.
----
2018-09-21 19:36:36 UTC - Matteo Merli: @Dave Southwell Topics have a default
quota if not specified. The default is set in the broker configuration and the
default is 25 GB
----
2018-09-21 19:37:22 UTC - Dave Southwell: Is there a way to see that through
the `pulsar-admin` command?
----
2018-09-21 19:37:41 UTC - Grant Wu: I believe my second link has it…
----
2018-09-21 19:37:57 UTC - Grant Wu: `pulsar-admin namespaces get-backlog-quotas
my-tenant/my-ns`
----
2018-09-21 19:38:17 UTC - Dave Southwell: I thought the same, but I just get `{
}` as a result.
----
2018-09-21 19:38:30 UTC - Matteo Merli: Yes: `pulsar-admin namespaces
set-backlog-quota $NAMESPACE --limit 2G --policy producer_request_hold`
----
2018-09-21 19:38:30 UTC - Grant Wu: Ah, well, I’ll let merlimat figure that out
then
----
2018-09-21 19:38:43 UTC - Grant Wu: unless `{ }` means “default”?
----
2018-09-21 19:39:02 UTC - Dave Southwell: ....indeed `{ }` may just equate to
default.
----
2018-09-21 19:39:11 UTC - Grant Wu: if true I’d say that’s a bit unfortunate
----
2018-09-21 19:40:13 UTC - Dave Southwell: I'm also curious how I can find out
how much of the backlog quota is currently used. I'm guessing that metric must
exist somewhere.
----
2018-09-21 19:40:17 UTC - Matteo Merli: yes, `get-backlog-quotas` only gives
you the namespace specific quota, but doesn’t report from the system default.
It could definitely be improved
----
2018-09-21 19:40:40 UTC - Matteo Merli: `pulsar-admin topics stats $TOPIC`
includes a `storage` value
----
2018-09-21 19:41:10 UTC - Matteo Merli: (it also gets exported to Prometheus,
to setup alerts and such)
----
2018-09-21 19:51:41 UTC - Dave Southwell: This is great info. Thanks very much!
----
2018-09-21 19:56:31 UTC - Dave Southwell: If I'm reading it correctly, my
backlog size is lower than 25Gb, at 11.9Gb. I would think then that I'm still
under the default 25 Gb default quota. ```{
"msgRateIn" : 0.0,
"msgThroughputIn" : 0.0,
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"averageMsgSize" : 0.0,
"storageSize" : 12829343430,
"publishers" : [ ],
"subscriptions" : {
"my-subscription" : {
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"msgRateRedeliver" : 0.0,
"msgBacklog" : 0,
"blockedSubscriptionOnUnackedMsgs" : false,
"unackedMessages" : 0,
"type" : "Shared",
"msgRateExpired" : 0.0,
"consumers" : [ ]
},
"__compaction" : {
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"msgRateRedeliver" : 0.0,
"msgBacklog" : 114384,
"blockedSubscriptionOnUnackedMsgs" : false,
"unackedMessages" : 0,
"type" : "Exclusive",
"msgRateExpired" : 0.0,
"consumers" : [ ]
},
"my-subscription-1" : {
"msgRateOut" : 0.0,
"msgThroughputOut" : 0.0,
"msgRateRedeliver" : 0.0,
"msgBacklog" : 0,
"blockedSubscriptionOnUnackedMsgs" : false,
"unackedMessages" : 0,
"type" : "Exclusive",
"msgRateExpired" : 0.0,
"consumers" : [ ]
}
},
"replication" : { },
"deduplicationStatus" : "Disabled"
}```
----
2018-09-21 19:57:43 UTC - Matteo Merli: actually, I was wrong on the default
setting:
```
# Default per-topic backlog quota limit
backlogQuotaDefaultLimitGB=10
```
----
2018-09-21 19:57:57 UTC - Matteo Merli: in `broker.conf` (or `standalone.conf`)
----
2018-09-21 19:58:24 UTC - Dave Southwell: :+1: Thanks!
----
2018-09-21 19:59:33 UTC - Matteo Merli: (and it doesn’t kick immediately, the
check is done periodically. that’s why it goes up a bit to 11 GB)
----
2018-09-21 20:01:48 UTC - Matteo Merli: For compaction you should setup
automatic “size-based” compaction
----
2018-09-21 20:02:28 UTC - Matteo Merli: `bin/pulsar-admin namespaces
set-compaction-threshold --threshold 100M my-tenant/my-namespace`
----
2018-09-21 20:03:06 UTC - Dave Southwell: What changes the `storageSize` as
compared to the number of messages in the backlog, or are those things not
related? I've let this thing sit over night and the `storageSize` is the same
as it was yesterday and the backlog has been empty for ~24 hrs.
----
2018-09-21 20:03:44 UTC - Dave Southwell: Oh, maybe because I haven't set
anything it just leaves it as it was.
----
2018-09-21 20:04:16 UTC - Matteo Merli: In your stats, the __compaction
internal subscription is keeping the data from being deleted
----
2018-09-21 20:12:01 UTC - Dave Southwell: Oh, Ok. Is there a way to remove the
compaction subscription? Also, when I run the `compaction-status` command with
`-w` I get an error saying `Error in compaction` `null` `Reason: Error
compacting: java.util.concurrent.TimeoutException: Timeout`
----
2018-09-21 20:14:24 UTC - Matteo Merli: You can remove it, if you’re not going
to use compaction (`pulsar-admin topics unsubscribe $TOPIC -s __compaction`) or
enable automatic compaction as shown above
+1 : Dave Southwell
----
2018-09-21 22:30:00 UTC - Grant Wu: Do these bookie issues still happen on
standalone?
----
2018-09-21 22:30:18 UTC - Grant Wu: The infra guy here is saying that Pulsar
does not use a bookie in standalone mode…? Not sure what he means
----
2018-09-21 22:32:02 UTC - Matteo Merli: It's embedded in same process, but it's
the almost the same as a regular cluster
----
2018-09-21 22:32:34 UTC - Grant Wu: So we should bump to 2.1.1 anyways right
----