2018-10-03 21:17:19 UTC - Dave Southwell: I'm confused. I just ran a few
benchmarks on my Pulsar cluster and managed to just about fill my BookKeeper
data disks on the three nodes. The benchmarks are done, and I've gone through
and manually removed any backlogged messages in all the namespaces. So now my
ensemble has 0 backlog messages, yet the disk usage of the data disks hasn't
been cleaned up. I'm clearly missing something configuration-wise or
concept-wise. Any suggestions?
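(For reference, clearing the backlogs looked something like this; namespace
names are placeholders:)
```
# Clear the backlog for every subscription in a namespace
bin/pulsar-admin namespaces clear-backlog my-tenant/my-ns -force
```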
----
2018-10-03 21:19:14 UTC - Dave Southwell: I should maybe add that two of my
three bookies are read-only now; the other is still read-write according to
`./bookkeeper shell listbookies -readwrite`
----
2018-10-03 22:02:53 UTC - Ivan Kelly: Ledger GC needs to run on the bookie to
clear out the disk
----
2018-10-03 22:09:51 UTC - Dave Southwell: I was just reading about that. Looks
like the defaults are maybe just a bit slower than I expected. Once the
storage is reclaimed will one or both of the readonly bookies switch to
readwrite on their own? Or do I need to do something to make that happen?
----
2018-10-03 22:10:30 UTC - Matteo Merli: They will switch back to read-write
once the disk usage goes below the threshold
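The thresholds live in `bookkeeper.conf`; a sketch with what I believe are the
usual defaults (verify against your BookKeeper version):
```
# Bookie flips to read-only when a ledger disk exceeds this fraction
diskUsageThreshold=0.95
# ...and returns to read-write once usage drops below the low-water mark
diskUsageLwmThreshold=0.90
# How often the disk checker runs, in milliseconds
diskCheckInterval=10000
```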
----
2018-10-03 22:11:13 UTC - Dave Southwell: Great! Thanks!
----
2018-10-03 22:11:59 UTC - Matteo Merli: regarding the deletion, there are
multiple tunables to reduce the time for data to be deleted
----
2018-10-03 22:12:14 UTC - Matteo Merli: it’s mostly a tradeoff between disk
space and throughput
----
2018-10-03 22:13:40 UTC - Matteo Merli: (Just found an old message regarding
this):
----
2018-10-03 22:13:48 UTC - Matteo Merli: ```
There are a few layers here:
 * First, messages are stored in BookKeeper "ledgers". Each ledger is an
   append-only replicated log and can only be deleted in its entirety.
   So even if you consume a few entries, the ledger won't be deleted until all
   messages stored in that ledger are consumed and acknowledged for all
   subscriptions (plus, eventually, the retention time).
   Ledgers are rolled over on a size and time basis, and there are a few
   tunables to set in `broker.conf`:
     * `managedLedgerMaxEntriesPerLedger=50000`
     * `managedLedgerMinLedgerRolloverTimeMinutes=10`
     * `managedLedgerMaxLedgerRolloverTimeMinutes=240`
 * When a ledger is deleted, the bookies (storage nodes) won't delete the data
   immediately. Rather, they rely on a garbage collection process. This GC
   runs periodically, checks for deleted ledgers, and sees whether the data on
   disk can be removed.
   Since there is no single file per ledger, the bookie compacts the entry
   log files based on thresholds:
     * Garbage collection time: `gcWaitTime=900000` (default is 15 min)
       - All empty files are removed
     * Minor compaction -- runs every 1 h and compacts all files with
       < 20% "valid" data
       - `minorCompactionThreshold=0.2`
       - `minorCompactionInterval=3600`
     * Major compaction -- runs every 24 h and compacts all files with
       < 50% "valid" data
       - `majorCompactionThreshold=0.5`
       - `majorCompactionInterval=86400`
```
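So if you want the space back sooner after a benchmark, those are the knobs to
push; a sketch with illustrative values only, since each one trades disk space
for extra compaction I/O:
```
# bookkeeper.conf -- reclaim deleted-ledger space more aggressively
# (illustrative values, not recommendations)
gcWaitTime=300000             # run ledger GC every 5 min instead of 15
minorCompactionInterval=900   # minor compaction every 15 min instead of 1 h
minorCompactionThreshold=0.4  # rewrite entry logs with < 40% valid data
```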
----
2018-10-03 22:20:35 UTC - Dave Southwell: ^ very useful! I have a somewhat
related question. How do I scale ledger storage to handle situations where
backlog counts spike? I'm on GCE and am currently using bookie instances with
two SSDs (logs and data). As I understand it, as the backlog grows, so too
will data disk usage. If I add another bookie node (with the same config as
the existing ones), will I just get additional replication rather than
expanded storage capacity, since the index files are striped across all
members of the ensemble?
----
2018-10-03 22:48:02 UTC - Ivan Kelly: you'll get expanded capacity if you add
more nodes. the replication factor of the topics doesn't change depending on
how many bookies you have; each new ledger just picks its ensemble from
whichever bookies are available, so data spreads across the larger cluster
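the per-ledger replication settings are in `broker.conf`; a sketch with the
stock defaults (names as in that file):
```
# broker.conf -- how each managed ledger is spread over bookies
managedLedgerDefaultEnsembleSize=2   # bookies each ledger is striped across
managedLedgerDefaultWriteQuorum=2    # copies written for each entry
managedLedgerDefaultAckQuorum=2      # acks required to confirm an entry
```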
----
2018-10-03 22:49:12 UTC - Dave Southwell: Ahh, Ok. That makes sense.
----
2018-10-03 22:51:07 UTC - Ivan Kelly: and if you use tiered storage, you don't
even need extra nodes; just offload the excess data to GCS
----
2018-10-03 22:51:36 UTC - Ivan Kelly: though there's currently no way to
automatically trigger offload when bookies hit a certain threshold
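you can kick one off manually, though; a sketch (topic name is a placeholder):
```
# Offload everything except the most recent 10G of the topic to the
# configured tiered storage (GCS here), then poll for completion
bin/pulsar-admin topics offload --size-threshold 10G \
    persistent://my-tenant/my-ns/my-topic
bin/pulsar-admin topics offload-status \
    persistent://my-tenant/my-ns/my-topic
```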
----
2018-10-03 22:57:25 UTC - Grant Wu: How do I get the message ID when producing
a message via the Go client?
----
2018-10-03 22:57:45 UTC - Grant Wu: I see in the example, we have
```
// Attempt to send the message asynchronously and handle the response
producer.SendAsync(ctx, asyncMsg, func(msg pulsar.ProducerMessage, err error) {
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Message %s successfully published", msg.ID())
})
```
----
2018-10-03 22:58:18 UTC - Grant Wu: But
<https://godoc.org/github.com/apache/incubator-pulsar/pulsar-client-go/pulsar#ProducerMessage>
doesn’t have an `ID()` field :thinking_face:
----
2018-10-03 23:54:00 UTC - Matteo Merli: @Grant Wu the example is incorrect
:confused:
----
2018-10-03 23:54:21 UTC - Matteo Merli: the `ID()` is only available on the
received messages
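a corrected sketch of that example, minus the `ID()` call (the service URL and
topic are placeholders):
```
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/apache/incubator-pulsar/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{URL: "pulsar://localhost:6650"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	producer, err := client.CreateProducer(pulsar.ProducerOptions{Topic: "my-topic"})
	if err != nil {
		log.Fatal(err)
	}
	// Close() flushes any pending sends before returning
	defer producer.Close()

	asyncMsg := pulsar.ProducerMessage{Payload: []byte("hello")}

	// ProducerMessage carries no ID() here, so the callback can only
	// report success or failure of the publish
	producer.SendAsync(context.Background(), asyncMsg, func(msg pulsar.ProducerMessage, err error) {
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("Message %q successfully published\n", string(msg.Payload))
	})
}
```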
----
2018-10-03 23:54:50 UTC - Grant Wu: Hrm… it’s available through WebSockets
though
<https://pulsar.apache.org/docs/en/client-libraries-websocket/#producer-endpoint>
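(each publish gets acked with its ID; the response shape in the docs is
roughly:)
```
{
  "result": "ok",
  "messageId": "CAAQAw==",
  "context": "1"
}
```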
----
2018-10-03 23:55:26 UTC - Grant Wu: I'm currently converting an in-house
example client from our old home-grown Go client (which was written using the
WebSocket library, which we've tossed) to the official client, which is why I
ask
----
2018-10-03 23:56:39 UTC - Matteo Merli: Yes, it was mostly to simplify the API
----
2018-10-03 23:56:53 UTC - Matteo Merli: from the underlying C++ lib we do have
the message ID
----
2018-10-03 23:57:10 UTC - Grant Wu: I see
----
2018-10-03 23:57:14 UTC - Grant Wu: Well, we don’t really _need_ it
----
2018-10-03 23:57:27 UTC - Grant Wu: Might be useful to have it back somehow
later
----
2018-10-04 01:01:09 UTC - Pablo Valdes: Is it possible to configure a topic's
message capacity and/or the time to hold unacknowledged messages for
subscription consumers?
My scenario is an instant-messaging mobile app. I've assigned each user a
topic, so every time someone publishes a message to that topic the user
receives a new message. When a mobile client's WebSocket is disconnected, it
doesn't receive any messages, of course. Then, when it reconnects, I noticed
that it gets all the unacknowledged messages sent while it was offline. It is
working as expected so far; however, I would like to have fine control over
these offline messages. Is that possible?
----
2018-10-04 03:47:38 UTC - Matteo Merli: @Pablo Valdes you can set a message TTL
to have all messages older than a certain time automatically dropped.
<http://pulsar.apache.org/docs/en/cookbooks-retention-expiry/#time-to-live-ttl>
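For example, to drop anything left unacknowledged for more than an hour
(namespace name is a placeholder):
```
bin/pulsar-admin namespaces set-message-ttl my-tenant/my-ns \
    --messageTTL 3600
```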
----
2018-10-04 03:55:43 UTC - Pablo Valdes: That works, thanks. Is the cookbook
available in hard copy?
----
2018-10-04 04:18:57 UTC - Matteo Merli: what do you mean by hard copy?
----