2020-01-26 10:07:04 UTC - Gaetan SNL: @Sijie Guo thank you! I thought it was 
already implemented.
----
2020-01-26 10:50:56 UTC - Eugen: Suppose I have 2 (redundant) producers that 
under normal circumstances produce the exact same messages with the same 
SequenceIds to the same topic, but sometimes there may be gaps, so e.g. 
producer 1 produces messages 1, 2, 3, whereas producer 2 produces 1, 3 - so it 
does not produce message 2 (but the SequenceIds of producer 2 match those of 
producer 1 for all messages), and sometimes, one producer may fail for an 
extended period of time, but when it does come back, its messages and sequence 
ids are in sync with the other producer, although there will be a large gap for 
the recovered producer. Can Pulsar take care of deduplication in this case as 
well? The docs for `ProducerBuilder.producerName()` seem to indicate otherwise:
```Specify a name for the producer.
...
Brokers will enforce that only a single producer a given name can be publishing 
on a topic.```
So it seems I would either a) have to put a process in-between the 2 redundant 
processes and the pulsar producer process in order to take care of the 
deduplication, hence creating a SPOF, or b) produce into 2 different topics and 
do deduplication using another process (a Pulsar Function perhaps?) that reads 
off of those 2 topics, incurring a lot of overhead and latency
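For what it's worth, option (b) can be sketched in a few lines: a deduplicating merge over the two streams, keyed on sequence id. This is just a batch illustration under the assumption that each stream is individually ordered by sequence id; `dedup_merge` and the sample data are made up for the sketch, not Pulsar API:

```python
# Hypothetical sketch of option (b): merge two redundant (seq_id, payload)
# streams and emit each seq_id exactly once, in ascending order. Assumes
# each stream is individually sorted by seq_id. Names are illustrative.

import heapq

def dedup_merge(stream_a, stream_b):
    """Merge two sorted (seq_id, payload) streams; gaps in one stream
    are filled by the other, duplicates are dropped."""
    last_seen = None
    for seq_id, payload in heapq.merge(stream_a, stream_b):
        if seq_id == last_seen:      # duplicate from the redundant stream
            continue
        last_seen = seq_id
        yield seq_id, payload

# producer 1 saw messages 1, 2, 3; producer 2 missed message 2
p1 = [(1, "a"), (2, "b"), (3, "c")]
p2 = [(1, "a"), (3, "c")]
print(list(dedup_merge(p1, p2)))  # -> [(1, 'a'), (2, 'b'), (3, 'c')]
```

A streaming version would need a buffer and a timeout for gaps, which is exactly the overhead/latency concern.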
----
2020-01-26 12:23:08 UTC - Fernando: So I managed to add mTLS to the 
Elasticsearch sink. It connects and I see a new index in elasticsearch created 
by the sink. However, no documents seem to be pushed to the index. The stats 
suggest messages are going through
```INFO  org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - 
[<persistent://packhunt/billing/aggregated-billing-records>] 
[packhunt/billing/aggregate-records-es-sink] [78d97] Prefetched messages: 0 --- 
Consume throughput received: 1.23 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 
0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0```
Could it be a compatibility issue with the version of elasticsearch?
----
2020-01-26 14:52:36 UTC - Dan: @Dan has joined the channel
----
2020-01-26 16:50:33 UTC - Sijie Guo: &gt;  Can Pulsar take care of 
deduplication in this case as well?
it can, as long as your producer 2 produces the messages with the same, 
increasing sequence ids.

although producer 2 can not connect to the broker while producer 1 is 
connected: only one producer with a given name can be connected to the brokers 
at a time.
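roughly, the broker-side rule is: track the highest sequence id per producer name and drop anything not strictly greater. a toy sketch (`DedupTopic` is an illustrative name, not Pulsar's actual code):

```python
# Toy model of per-producer dedup: the broker keeps the highest sequence
# id seen per producer name and persists a message only if its sequence
# id is strictly greater. Illustrative only, not Pulsar internals.

class DedupTopic:
    def __init__(self):
        self.last_seq = {}   # producer name -> highest sequence id seen
        self.log = []        # messages actually persisted

    def publish(self, producer, seq_id, payload):
        if seq_id <= self.last_seq.get(producer, -1):
            return False                 # duplicate (or stale), dropped
        self.last_seq[producer] = seq_id
        self.log.append(payload)
        return True

topic = DedupTopic()
topic.publish("p", 1, "a")   # accepted
topic.publish("p", 2, "b")   # accepted
topic.publish("p", 1, "a")   # dropped: 1 <= 2
topic.publish("p", 3, "c")   # accepted: gaps are fine, ids only need to increase
print(topic.log)             # -> ['a', 'b', 'c']
```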
----
2020-01-26 16:51:11 UTC - Sijie Guo: I would suggest first trying out the 
connector without mTLS and trying the same elastic version.
----
2020-01-26 20:13:31 UTC - Eugen: So this does not work with both producers 
active at the same time. The use case is this: We have a UDP multicast data 
source that we receive twice, completely redundantly (different switches 
etc.), and - except for the multicast group numbers - both sources are 
completely identical. What I would like to have is two producers, one producer 
each receiving one of those redundant streams, and both of them, active/active, 
pushing into Pulsar. Pulsar would throw away almost half of the data, but in 
cases where one of the producers is missing a UDP packet, it would use the UDP 
packet of the other producer. But I now see this won't work for two reasons: 1) 
as you said only one can be connected at the same time 2) Pulsar does not care 
if there is a gap in sequence numbers, so even if it was able to deduplicate 2 
(normally) identical streams, in cases where there are gaps, the order may not 
be preserved if producer p1 produces x, x+2 (x+1 being a udp packet that did 
not make it to p1) and then p2 produces x + 1
----
2020-01-26 20:20:38 UTC - Dan: hey, has anyone been able to get a Pulsar SQL 
cluster up and running out of the box with 2.5.0? it fails for me. added this 
issue: <https://github.com/apache/pulsar/issues/6146>
----
2020-01-26 20:53:22 UTC - Eugen: So would creating (and implementing) a PIP for 
this "active/active deduplication" use case with these 2 (orthogonal) features 
be realistic?
1. add a "deduplicate across producers" option to topics, which would make sure 
that the last seen SeqId is stored and handled per topic, not per producer
2. add an option for deduplication which orders messages with out-of-order 
SeqIds
For this exact use case, I have had to implement this feature twice already, so 
I know how to do it. With perhaps a difference in that this ordering and 
deduplication was performed in memory, whereas in the case of Pulsar we would 
perhaps want to persist messages as they come in (unordered) first, as Pulsar 
makes guarantees about messages being persisted once acked to the producer. For 
2., it would mean that whenever there is a gap in the incoming SeqIds, the 
broker would hold the message in a buffer and wait a configurable number of 
milliseconds for the gap to get filled; if it isn't, the broker would give up 
and make the gap permanent (if the missing message came in afterwards, it 
would get discarded).
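The buffering logic for 2. can be sketched like this, with logical "ticks" standing in for the millisecond timer; all names (`GapBuffer`, `gap_timeout_ticks`) are made up for the sketch:

```python
# Hypothetical sketch of feature 2: buffer out-of-order messages, wait a
# configurable time for a gap to be filled, then give up and make the gap
# permanent. Logical tick() calls stand in for the ms timer.

class GapBuffer:
    def __init__(self, gap_timeout_ticks=2):
        self.gap_timeout_ticks = gap_timeout_ticks
        self.next_seq = 0          # next sequence id we expect to emit
        self.pending = {}          # seq_id -> payload, held back by a gap
        self.gap_age = 0           # ticks the current gap has been open
        self.out = []              # messages released, in order

    def _drain(self):
        # release the contiguous run starting at next_seq
        while self.next_seq in self.pending:
            self.out.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        self.gap_age = 0

    def receive(self, seq_id, payload):
        if seq_id < self.next_seq:
            return                 # gap already made permanent: discard
        self.pending[seq_id] = payload
        self._drain()

    def tick(self):
        # called periodically; if a gap persists too long, skip past it
        if self.pending:
            self.gap_age += 1
            if self.gap_age >= self.gap_timeout_ticks:
                self.next_seq = min(self.pending)   # give up on the gap
                self._drain()

buf = GapBuffer(gap_timeout_ticks=2)
buf.receive(0, "x")
buf.receive(2, "x+2")       # held back: seq 1 is missing
buf.receive(1, "x+1")       # gap filled by the redundant producer
print(buf.out)              # -> ['x', 'x+1', 'x+2']
```

If the gap is never filled, two ticks later the buffer skips past it, and a late arrival for the skipped seq id is discarded.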
----
2020-01-26 21:15:36 UTC - Eugen: So for 2. there would be at least 1 
configurable parameter: *missing-seqid-message-wait-timeout*.
----
2020-01-26 21:17:01 UTC - Eugen: And for our use-case, we'd need another, also 
orthogonal, independently implementable feature, as the SeqIds are reset to 0 
(or 1, don't remember) at the start of the day, so I'd like to add
3. add admin (?) function to reset HighestSequenceId
----
2020-01-26 21:26:02 UTC - Eugen: This btw is a use-case coming out of the stock 
exchange market data feed world, so it is somewhat different from most web 
based event stream use cases, where you never have 2 (almost) exactly redundant 
streams - although I'd imagine Yahoo Finance, which is said to be an early (and 
ongoing?) user of Pulsar, may be able to make use of this as well
----
2020-01-26 22:04:31 UTC - Eugen: Anyways, this is a topic for the *dev* 
channel, so I'll restate my case over there
----
2020-01-27 02:49:54 UTC - Antti Kaikkonen: @Antti Kaikkonen has joined the 
channel
----
2020-01-27 06:18:28 UTC - Fernando: First thing I'd try is setting the IP of 
the proxy instead of the individual brokers
----
2020-01-27 08:44:12 UTC - Kevin Huber: @Kevin Huber has joined the channel
----