Re: Default Max Events to Read from Kafka

2019-11-17 Thread Vinoth Chandar
Thanks!

On Sun, Nov 17, 2019 at 4:59 AM Pratyaksh Sharma 
wrote:

> https://issues.apache.org/jira/browse/HUDI-340 tracks this.


Re: Default Max Events to Read from Kafka

2019-11-17 Thread Pratyaksh Sharma
https://issues.apache.org/jira/browse/HUDI-340 tracks this.

On Sun, Nov 17, 2019 at 6:00 PM Pratyaksh Sharma 
wrote:

> Yeah,
>
> Would love to do that. Will create a jira and raise a PR.


Re: Default Max Events to Read from Kafka

2019-11-17 Thread Pratyaksh Sharma
Yeah,

Would love to do that. Will create a jira and raise a PR.



Re: Default Max Events to Read from Kafka

2019-11-15 Thread Vinoth Chandar
Concurrent Writes :D

The magic number 1M is from me, actually :) and there is no magic: it was
picked to keep jobs from batch-scanning all of Kafka, since the source-limit
default was Long.MAX_VALUE (for the DFS source). I acknowledge you could go
much larger.
Happy to take a PR to make this limit higher (say 10M) and only use it
when sourceLimit is infinity. Interested in contributing your change back?
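The suggested behavior can be sketched roughly as follows (illustrative names only, not a patch against the actual Hudi code): honor any explicit user limit, and fall back to a higher safety cap only when sourceLimit was left unbounded.

```java
// Sketch only: illustrates the proposed behavior discussed in this thread.
// Class, method, and constant names are made up for illustration.
public class ProposedKafkaCap {

    // The higher fallback cap suggested in the thread ("say 10M").
    static final long SAFETY_CAP = 10_000_000L;

    static long eventsToRead(long sourceLimit) {
        // Honor any explicit user limit; only an unbounded request
        // (Long.MAX_VALUE) falls back to the safety cap.
        return sourceLimit == Long.MAX_VALUE ? SAFETY_CAP : sourceLimit;
    }

    public static void main(String[] args) {
        System.out.println(eventsToRead(Long.MAX_VALUE)); // 10000000
        System.out.println(eventsToRead(3_500_000L));     // 3500000
    }
}
```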



Re: Default Max Events to Read from Kafka

2019-11-15 Thread Bhavani Sudha
Hi Pratyaksh,

I get what you mean. You are concerned that the upper cap on events read is
always 1,000,000 (the 1M default), even though users can only configure it to
be lower than that using sourceLimit. Since we are choosing Math.min(1000000,
sourceLimit), I think it would make sense to make the upper cap configurable
instead of hard-coding the default of 1M.
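The selection being discussed can be sketched in Java like this (a minimal illustration; the class and method names here are invented, not the actual KafkaOffsetGen code):

```java
// Illustrative sketch of the per-batch read cap discussed in this thread.
public class KafkaBatchCap {

    // The hard-coded default cap on events read per batch (the "1M" above).
    static final long DEFAULT_MAX_EVENTS_TO_READ = 1_000_000L;

    // sourceLimit is user-supplied; Long.MAX_VALUE means "unbounded".
    static long eventsToRead(long sourceLimit) {
        // A lower user limit is honored, but a higher one is silently
        // capped at the hard-coded default, which is the behavior
        // being questioned in this thread.
        return Math.min(sourceLimit, DEFAULT_MAX_EVENTS_TO_READ);
    }

    public static void main(String[] args) {
        System.out.println(eventsToRead(Long.MAX_VALUE)); // 1000000
        System.out.println(eventsToRead(3_500_000L));     // 1000000
        System.out.println(eventsToRead(500_000L));       // 500000
    }
}
```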

@vinoth what do you think?

Thanks,
Sudha

>


Re: Default Max Events to Read from Kafka

2019-11-15 Thread Pratyaksh Sharma
Hi Nishith,

I would like to know more about the reasonable payload size and topic
throughput that you mentioned. :)
Could you share a few numbers for these two parameters that went into
deciding the default value of 1,000,000 (1M)?



Re: Default Max Events to Read from Kafka

2019-11-15 Thread Nishith
Pratyaksh,

The default value was chosen based on a “reasonable” payload size and topic 
throughput. 

How many messages fit in a given amount of executor/driver memory depends
heavily on your message size.
It is already a value that you can configure using “sourceLimit”, like you’ve
already tried.
Ideally, this number will be tuned by the user based on the trade-off between
the resources that can be provided and the desired ingestion latency.

Sent from my iPhone

> On Nov 15, 2019, at 5:00 PM, Pratyaksh Sharma  wrote:
> 
> Hi,
> 
> I have a small doubt. The KafkaOffsetGen.java class has a variable called
> DEFAULT_MAX_EVENTS_TO_READ, which is set to 1000000 (1M). When reading
> from Kafka, we take the minimum of sourceLimit and this variable to
> form the RDD in the case of KafkaSource.
> 
> I want to know the following -
> 
> 1. How did we arrive at this number?
> 2. Why are we hard-coding it? Should we not make it configurable for users
> to play around with?
> 
> For bootstrapping purpose, I tried running DeltaStreamer on a kafka topic
> with 1.5 crore events with the following configuration in continuous mode -
> 
> 1. changed the above variable to Integer.MAX_VALUE.
> 2. Kept source limit as 3500000 (35 lacs)
> 3. executor-memory 4g
> 4. driver-memory 6g
> 
> Basically in my case, the RDD had 35 lac events in one iteration and
> the job was able to run fine.
> 
> If I tried running DeltaStreamer with a greater value of sourceLimit, then I
> was getting OutOfMemory and heap-space errors. Keeping it at 35 lacs looks
> like a sweet spot for running DeltaStreamer.