2018-01-16 19:11:30 UTC - Daniel Ferreira Jorge: do you see what I mean?
basically, it is impossible to reprocess old data with pulsar using
subscriptions (which is one of the strong features that pulsar has over kafka:
being able to act as both a streaming platform and a queue)
----
2018-01-16 19:13:41 UTC - Daniel Ferreira Jorge: What is the lifecycle of a
subscription? If I don't have any consumers consuming from it, will it still
exist? Because then I can create a bootstrap consumer that will not consume
anything just as an excuse to create a subscription. Will this work?
----
2018-01-16 19:15:09 UTC - Matteo Merli: > What is the lifecycle of a
subscription?
Once it’s created, it retains all messages published after that (minus explicit
TTL). Subscriptions can be dropped by explicitly unsubscribing (in the `Consumer`
API) or through the REST/CLI.
----
2018-01-16 19:16:03 UTC - Matteo Merli: > Because then I can create a
bootstrap consumer that will not consume anything just as an excuse to create a
subscription. Will this work?
Yes, that would be the same as creating the subscription and leaving it there
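For example (a minimal sketch against the 1.x Java client used in this thread; the service URL and topic/subscription names are placeholders, not from the conversation):
```
// Sketch: create a subscription by subscribing once and closing the consumer.
// The subscription stays and keeps retaining messages for later reprocessing.
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;

public class BootstrapSubscription {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.create("pulsar://localhost:6650");
        // Subscribing creates the subscription on the topic...
        Consumer consumer = client.subscribe(
                "persistent://sample/standalone/ns1/my-topic", "my-subscription");
        // ...and closing (without unsubscribing) leaves it in place.
        consumer.close();
        client.close();
    }
}
```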
----
2018-01-16 19:24:30 UTC - Daniel Ferreira Jorge: I think that works for me...
thanks!
----
2018-01-17 02:26:06 UTC - Jaebin Yoon: I'm getting the following errors in the
broker. In the classpath, there is "pulsar-checksum-1.21.0-incubating.nar". I'm
bringing up the pulsar broker in an embedded environment with a simple wrapper. Is
there something I need to set in the environment to use that package correctly?
```2018-01-17 02:17:32,604 - WARN - [pulsar-io-49-3:ServerCnx@163] -
[/100.127.66.35:47937] Got exception:
org/apache/pulsar/checksum/utils/Crc32cChecksum
java.lang.NoClassDefFoundError: org/apache/pulsar/checksum/utils/Crc32cChecksum
at
org.apache.pulsar.broker.service.Producer.verifyChecksum(Producer.java:144)
at
org.apache.pulsar.broker.service.Producer.publishMessage(Producer.java:124)
at
org.apache.pulsar.broker.service.ServerCnx.handleSend(ServerCnx.java:622)```
----
2018-01-17 02:29:31 UTC - Sijie Guo: ah
----
2018-01-17 02:30:43 UTC - Sijie Guo: @Jaebin Yoon I think
`pulsar-broker-shaded` is a bit problematic at the moment. Basically, it
relocates `org.apache.pulsar.checksum` to a shaded package, but it doesn’t
update the manifest in the nar package, so it fails to load the class.
----
2018-01-17 02:31:35 UTC - Jaebin Yoon: I'm not using pulsar-broker-shaded any
more (had enough problems).
----
2018-01-17 02:31:43 UTC - Sijie Guo: oh okay
----
2018-01-17 02:32:03 UTC - Jaebin Yoon: But I'm still getting this error. Not
sure if *.nar should be treated differently.
----
2018-01-17 02:32:12 UTC - Sijie Guo: one second let me check
----
2018-01-17 02:33:07 UTC - Sijie Guo: do you see pulsar-checksum.jar in your
classpath?
----
2018-01-17 02:33:29 UTC - Jaebin Yoon: I see
`pulsar-checksum-1.21.0-incubating.nar`, not a jar
----
2018-01-17 02:35:08 UTC - Sijie Guo: is your application a Maven project or
not? Do you mind showing me how you include the pulsar-broker dependency in
your application?
----
2018-01-17 02:36:32 UTC - Jaebin Yoon: it's Gradle. Here's part of build.gradle:
``` compile 'org.apache.pulsar:pulsar-broker:1.21.0-incubating'
compile "org.apache.pulsar:pulsar-client-admin:1.21.0-incubating"
compile "org.apache.pulsar:pulsar-client-tools:1.21.0-incubating"```
----
2018-01-17 02:37:57 UTC - Jaebin Yoon: Does this *.nar have a C++ binary?
----
2018-01-17 02:37:57 UTC - Sijie Guo: can you include `compile
'org.apache.pulsar:pulsar-checksum:1.21.0-incubating'`?
----
2018-01-17 02:38:21 UTC - Jaebin Yoon: actually I tried that, and it installed
the nar package, not the jar.
----
2018-01-17 02:38:36 UTC - Sijie Guo: yes, checksum uses nar to package a
C++ implementation of CRC32C.
----
2018-01-17 02:38:46 UTC - Sijie Guo:
`org.apache.pulsar:pulsar-checksum:jar:1.21.0-incubating`
----
2018-01-17 02:38:53 UTC - Sijie Guo: specify the "jar"?
----
2018-01-17 02:39:11 UTC - Jaebin Yoon: let me try that.
----
2018-01-17 02:41:19 UTC - Jaebin Yoon: I'm getting this error:
```> Could not resolve all dependencies for configuration
':compileClasspath'.
> Could not find pulsar-checksum-1.21.0-incubating.jar
(org.apache.pulsar:pulsar-checksum:1.21.0-incubating).
Searched in the following locations:
<https://repo.maven.apache.org/maven2/org/apache/pulsar/pulsar-checksum/1.21.0-incubating/pulsar-checksum-1.21.0-incubating-1.21.0-incubating.jar>```
----
2018-01-17 02:41:53 UTC - Sijie Guo: oh, I probably wrote it in the wrong way
----
2018-01-17 02:42:03 UTC - Sijie Guo: let me check the Gradle syntax (sorry about
that)
----
2018-01-17 02:44:04 UTC - Sijie Guo:
`org.apache.pulsar:pulsar-checksum:1.21.0-incubating@jar`?
----
2018-01-17 02:45:22 UTC - Jaebin Yoon: yeah, that resolves at least... let me
try to package it to see if it selects the *.jar.
----
2018-01-17 02:45:32 UTC - Sijie Guo: cool
----
2018-01-17 02:48:18 UTC - Jaebin Yoon: I hadn't seen a *.nar package before, so
I'm not sure why it installed the nar package and not the jar.
----
2018-01-17 02:51:55 UTC - Sijie Guo: I think pulsar-checksum is defined as
a `nar` module, <https://search.maven.org/#search%7Cga%7C1%7Cpulsar-checksum> —
both the nar and the jar are published as Maven artifacts. I guess in Gradle, if
you don’t specify an ext/classifier, it downloads the artifact type defined in
the pom file, so if you specify "jar" in the compile dependencies, it
resolves the jar artifact. Maven seems to use a different strategy.
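Putting it together, the dependency block from earlier would look something like this (a sketch based on the snippet above; the `@jar` suffix is Gradle's artifact-only notation):
```
dependencies {
    compile 'org.apache.pulsar:pulsar-broker:1.21.0-incubating'
    compile 'org.apache.pulsar:pulsar-client-admin:1.21.0-incubating'
    compile 'org.apache.pulsar:pulsar-client-tools:1.21.0-incubating'
    // "@jar" forces Gradle to fetch the jar artifact instead of the
    // nar that the module's pom declares as its packaging type
    compile 'org.apache.pulsar:pulsar-checksum:1.21.0-incubating@jar'
}
```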
----
2018-01-17 02:56:15 UTC - Jaebin Yoon: I see. I didn't know what a nar file is;
I'm learning that now. ^^
<https://medium.com/hashmapinc/nifi-nar-files-explained-14113f7796fd>
Any reason that pulsar packages in the nar format?
----
2018-01-17 02:58:39 UTC - Sijie Guo: @Jaebin Yoon I think NAR provides an easy
way to package native code. I am not sure if we can make it a jar module.
@Matteo Merli probably has more context on that.
----
2018-01-17 03:02:50 UTC - Jaebin Yoon: @Sijie Guo that solved the issue. Now I
can produce/consume messages!!! That was the last piece. Thanks a lot, Sijie, for
your help!
----
2018-01-17 03:03:20 UTC - Sijie Guo: awesome! glad to hear that.
----
2018-01-17 03:03:25 UTC - Sijie Guo: you are welcome
----
2018-01-17 16:07:37 UTC - Daniel Ferreira Jorge: Hi again! Are the Kafka
connectors compatible with Pulsar when using the Kafka client wrapper?
----
2018-01-17 16:09:57 UTC - Matteo Merli: Some of them might be, others not.
It really depends on the connector implementation; I've seen some
assumptions made in a few of them, for example around the topic name not
including certain characters...
----
2018-01-17 16:10:49 UTC - Matteo Merli: I think that most would work,
especially if not doing fancy things with the API
----
2018-01-17 16:12:32 UTC - Matteo Merli: In one case (the elasticsearch sink), I
had to fix a couple of bugs in the connector, e.g. a race condition that didn't
surface with Kafka, since the consumer assignment there takes a few seconds
instead of the few milliseconds it takes in Pulsar
----
2018-01-17 16:20:55 UTC - Daniel Ferreira Jorge: Oh, OK, so tests have to be
done for each connector... that is fine! Also, I would like to request
something... When time permits, I think an option to reset the cursor to a
point in time when the subscription is first created, before any messages have
been consumed, would be great.
----
2018-01-17 16:24:56 UTC - Daniel Ferreira Jorge: Also, another question: is it
possible to run BookKeeper with JBOD (meaning multiple ledger/journal disks
per bookie)?
----
2018-01-17 16:27:16 UTC - Matteo Merli: >When time permits, I think that an
option to reset the cursor to a point in time, when the subscription is first
created, before any messages have been consumed yet, would be great.
Yes, that is something I heard more than once at this point
:slightly_smiling_face:
----
2018-01-17 16:31:07 UTC - Matteo Merli: > Also, another question: is it
possible to run BookKeeper with JBOD (meaning multiple ledger/journal disks
per bookie)?
Yes, though the default LedgerStorage that comes with Pulsar bookies
(DbLedgerStorage, which stores indexes in RocksDB) doesn’t support it. (It’s a
trivial thing to add, but since we always used it in combination with RAID,
there was never the need to do it.) Other BookKeeper ledger storage
implementations (e.g. SortedLedgerStorage, which is the default in BookKeeper)
support multiple directories for ledger storage.
----
2018-01-17 16:32:30 UTC - Matteo Merli: For the journal: yes, in any case you
can use multiple disks for the journal. With fast SSDs, it’s also convenient to
have multiple journals on the same disk, to take advantage of the disk's
multiple IO queues.
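As a rough illustration, a JBOD bookie config might look like this (a sketch, not a tested config; the paths are hypothetical, and multiple journal directories require a BookKeeper version that supports them):
```
# bookkeeper.conf (sketch): JBOD with a ledger storage that supports
# multiple directories, e.g. SortedLedgerStorage
ledgerStorageClass=org.apache.bookkeeper.bookie.SortedLedgerStorage
# one ledger directory per data disk
ledgerDirectories=/mnt/disk1/ledgers,/mnt/disk2/ledgers,/mnt/disk3/ledgers
# multiple journals, possibly on the same fast SSD to use its IO queues
journalDirectories=/mnt/ssd/journal1,/mnt/ssd/journal2
```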
----
2018-01-17 16:49:48 UTC - Daniel Ferreira Jorge: Ok, thanks for the info!
Another feature that I've seen around and would be nice to have (in the long
term... I'm not familiar with pulsar's architecture but I think this is non
trivial) Is a second storage layer (cheaper). For use cases that messages are
not deleted, the messages could be sent to maybe S3 / Cloud Storage / HDFS.
Something like "--send-to-cold-storage-after-retention-expires". Then, pulsar
could have a interface that serves messages directly from this storage in case
we need to reprocess messages (this I believe would be non trivial). Or maybe
just a very simple producer that streams messages from the "cold storage" back
into bookkeeper to be reprocessed...
----
2018-01-17 16:56:23 UTC - Matteo Merli: Oh, yes, you mentioned this earlier
regarding Pravega. We have thought about it for a very long time now, but never
got around to actually doing it. I think it’s mostly a prioritization thing. The
change is non-trivial, but it shouldn’t be too hard to implement. It could even
be done with a separate process that copies the data to S3/HDFS and then
updates the metadata to reflect that the data is no longer in a BK ledger. All
the rest can remain the same (e.g. still referring to it by ledgerId/entryId),
and this would be at the storage level, so there would be zero changes in the
broker dispatcher code.
----
2018-01-17 17:05:30 UTC - Daniel Ferreira Jorge: Oh, yes, I also think this
has very low priority... It can be done by other means... for instance,
sending all the messages to BigQuery and back using Dataflow... that is how we
do it with Cloud Pub/Sub to persist important topics...
----
2018-01-17 19:04:02 UTC - Allen Wang: Hey, question on routing mode for
partitioned topics. What is the default configuration and what is used in the
Kafka adaptor?
----
2018-01-17 19:06:23 UTC - Matteo Merli: The default is to use the hash of the
key of a message. If the message has no key, the producer will use a "default"
partition (it picks one random partition and uses it for all the messages it
publishes).
This is to maintain the same ordering guarantee as when there are no partitions:
per-producer ordering.
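As an illustration (a minimal sketch with the 1.x Java client from this thread; the service URL and topic name are placeholders):
```
// Sketch: default routing behavior on a partitioned topic
import org.apache.pulsar.client.api.MessageBuilder;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;

public class RoutingExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.create("pulsar://localhost:6650");
        Producer producer = client.createProducer(
                "persistent://sample/standalone/ns1/partitioned-topic");

        // With a key: the partition is chosen by hashing the key
        producer.send(MessageBuilder.create()
                .setKey("user-42")
                .setContent("hello".getBytes())
                .build());

        // Without a key: all messages from this producer go to the one
        // "default" partition it picked, keeping per-producer ordering
        producer.send("world".getBytes());

        client.close();
    }
}
```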
----
2018-01-17 19:06:44 UTC - Matteo Merli: The same applies when using the Kafka
wrapper
----
2018-01-17 19:08:45 UTC - Allen Wang: I see. Thanks.
----