Re: KafkaIO: reset topic for reading from the start with every run

2017-01-24 Thread Ismaël Mejía
One extra reminder, if you use the DirectRunner you can set the
DirectOptions to make the validations of the runner loose (and gain some
speed improvements).

setEnforceImmutability(false)
setEnforceEncodability(false)

On Mon, Jan 23, 2017 at 8:22 PM, Gareth Western 
wrote:

> Thanks Thomas. I'll be sure to convey that in the demo. The Flink local
> runner performs nicely. I'm now setting up the Flink cluster for the next
> test.
>
>
>
> On 23. jan. 2017 20:20, Thomas Groh wrote:
>
>> You should also generally expect the DirectRunner to be slower than
>> production runners - the goals of the DirectRunner are primarily ensuring
>> that Pipelines are portable to other production runners and enforcing the
>> Beam Programming Model while enabling local iteration and development, and
>> exposing bugs early. As a result, there is a relatively large amount of
>> additional work done per-element that will slow the Pipeline, and consume
>> additional local resources as the Pipeline executes.
>>
>>
>


Re: KafkaIO: reset topic for reading from the start with every run

2017-01-23 Thread Gareth Western
Thanks Thomas. I'll be sure to convey that in the demo. The Flink local 
runner performs nicely. I'm now setting up the Flink cluster for the 
next test.



On 23. jan. 2017 20:20, Thomas Groh wrote:
You should also generally expect the DirectRunner to be slower than 
production runners - the goals of the DirectRunner are primarily 
ensuring that Pipelines are portable to other production runners and 
enforcing the Beam Programming Model while enabling local iteration 
and development, and exposing bugs early. As a result, there is a 
relatively large amount of additional work done per-element that will 
slow the Pipeline, and consume additional local resources as the 
Pipeline executes.






Re: KafkaIO: reset topic for reading from the start with every run

2017-01-23 Thread Thomas Groh
You should also generally expect the DirectRunner to be slower than
production runners - the goals of the DirectRunner are primarily ensuring
that Pipelines are portable to other production runners and enforcing the
Beam Programming Model while enabling local iteration and development, and
exposing bugs early. As a result, there is a relatively large amount of
additional work done per-element that will slow the Pipeline, and consume
additional local resources as the Pipeline executes.

On Mon, Jan 23, 2017 at 11:05 AM, Gareth Western 
wrote:

> Actually I made a silly mistake in my configuration and had the wrong
> topic name. Tired eyes. Thanks for confirming the config!
>
> On 23. jan. 2017 19:17, Raghu Angadi wrote:
>
>
> On Mon, Jan 23, 2017 at 10:10 AM, Gareth Western <
> gar...@garethwestern.com> wrote:
>
>> *ConsumerConfig.GROUP_ID_CONFIG,
>> UUID.randomUUID().toString(),*
>> *ConsumerConfig.CLIENT_ID_CONFIG, "your_client_id",*
>> *ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"*
>
>
> Out of these three, you only need the last one. It should work, what did
> you observe?
>
>
>


Re: KafkaIO: reset topic for reading from the start with every run

2017-01-23 Thread Gareth Western
Actually I made a silly mistake in my configuration and had the wrong 
topic name. Tired eyes. Thanks for confirming the config!



On 23. jan. 2017 19:17, Raghu Angadi wrote:


On Mon, Jan 23, 2017 at 10:10 AM, Gareth Western 
mailto:gar...@garethwestern.com>> wrote:


*ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString(),**
**ConsumerConfig.CLIENT_ID_CONFIG,
"your_client_id",**
**ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,
"earliest"*


Out of these three, you only need the last one. It should work, what 
did you observe?




Re: KafkaIO: reset topic for reading from the start with every run

2017-01-23 Thread Raghu Angadi
On Mon, Jan 23, 2017 at 10:10 AM, Gareth Western 
wrote:

> *ConsumerConfig.GROUP_ID_CONFIG,
> UUID.randomUUID().toString(),*
> *ConsumerConfig.CLIENT_ID_CONFIG, "your_client_id",*
> *ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"*


Out of these three, you only need the last one. It should work, what did
you observe?


KafkaIO: reset topic for reading from the start with every run

2017-01-23 Thread Gareth Western

Apologies if this is slightly off-Beam-topic.

I'm setting up a small demo for some colleagues to parse a 21-million 
line CSV file using a few different runners (Direct, Flink, and Google 
Dataflow). The TextIO using the direct runner seems a bit slow, so I was 
wondering if it's perhaps related to the IO. So I pushed the CSV to a 
Kafka topic (one line per entry). Now to test the topic I'd like to read 
from the start every time. I'm quite new to Kafka, so I'm not sure how 
to "reset" it. A bit of searching indicates that something like this for 
the ConsumerProperties should be correct:


p.apply(KafkaIO.read()
.withBootstrapServers(options.getBrokerUrl())
.withTopics(ImmutableList.of(options.getTopic()))
.withValueCoder(StringUtf8Coder.of())
.updateConsumerProperties(
*ImmutableMap.of(**
**ConsumerConfig.GROUP_ID_CONFIG, 
UUID.randomUUID().toString(),**

**ConsumerConfig.CLIENT_ID_CONFIG, "your_client_id",**
**ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"**
**)*
))

But it didn't seem to work. Is there some other setting that I'm missing?

Kind regards,

Gareth