Re: Default Max Events to Read from Kafka
Thanks!

On Sun, Nov 17, 2019 at 4:59 AM Pratyaksh Sharma wrote:
> https://issues.apache.org/jira/browse/HUDI-340 tracks this.
Re: Default Max Events to Read from Kafka
https://issues.apache.org/jira/browse/HUDI-340 tracks this.

On Sun, Nov 17, 2019 at 6:00 PM Pratyaksh Sharma wrote:
> Yeah, would love to do that. Will create a jira and raise a PR.
Re: Default Max Events to Read from Kafka
Yeah, would love to do that. Will create a jira and raise a PR.

On Fri, Nov 15, 2019 at 7:30 PM Vinoth Chandar wrote:
> Concurrent Writes :D
Re: Default Max Events to Read from Kafka
Concurrent Writes :D

The magic number 1M is from me actually :) and there is no magic; it was picked to keep jobs from batch-scanning Kafka, since the source-limit default was Long.MAX_VALUE (for the DFS source). I acknowledge you could go much larger.

Happy to take a PR to make this limit higher (say 10M) and only use it when sourceLimit is infinity. Interested in contributing your change back?

On Fri, Nov 15, 2019 at 5:20 AM Pratyaksh Sharma wrote:
> Hi Nishith,
>
> I would like to know more about the reasonable payload size and topic
> throughput that you mentioned. :)
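(For concreteness, a minimal sketch of the "only apply the default cap when sourceLimit is infinity" idea; the class and method names here are illustrative, not the actual KafkaOffsetGen code.)

// Illustrative sketch only; not the actual Hudi implementation.
public class ProposedCapLogic {

  // Proposed higher default cap ("say 10M"), applied only when the user
  // passed no explicit limit.
  private static final long DEFAULT_MAX_EVENTS = 10_000_000L;

  // If sourceLimit was left at Long.MAX_VALUE (i.e. "infinity"), fall back
  // to the default cap; otherwise honor the user's sourceLimit as-is.
  static long eventsToRead(long sourceLimit) {
    return sourceLimit == Long.MAX_VALUE ? DEFAULT_MAX_EVENTS : sourceLimit;
  }

  public static void main(String[] args) {
    System.out.println(eventsToRead(Long.MAX_VALUE)); // 10000000, cap applies
    System.out.println(eventsToRead(3_500_000L));     // 3500000, user limit wins
  }
}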
Re: Default Max Events to Read from Kafka
Hi Pratyaksh,

I get what you mean. You are concerned that the upper cap on events read is always 1000000 (1M), even though users can configure it to be lower than that using sourceLimit. Since we are choosing Math.min(1000000, sourceLimit), I think it would make sense to make the upper cap configurable instead of hard-coding the default.

@vinoth what do you think?

Thanks,
Sudha

On Fri, Nov 15, 2019 at 5:20 AM Pratyaksh Sharma wrote:
> Hi Nishith,
>
> I would like to know more about the reasonable payload size and topic
> throughput that you mentioned. :)
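(For illustration, a minimal sketch of a configurable upper cap; the property key below is hypothetical, not an actual Hudi config name.)

import java.util.Properties;

// Illustrative sketch only; not the actual KafkaOffsetGen code.
public class ConfigurableCap {

  // Hypothetical property key for tuning the cap.
  private static final String MAX_EVENTS_PROP = "hoodie.kafka.source.maxEvents";
  private static final long DEFAULT_MAX_EVENTS = 1_000_000L;

  // Same Math.min() semantics as today, but with a user-tunable cap
  // instead of a hard-coded default.
  static long maxEventsToRead(Properties props, long sourceLimit) {
    long cap = Long.parseLong(props.getProperty(
        MAX_EVENTS_PROP, String.valueOf(DEFAULT_MAX_EVENTS)));
    return Math.min(cap, sourceLimit);
  }
}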
Re: Default Max Events to Read from Kafka
Hi Nishith,

I would like to know more about the reasonable payload size and topic throughput that you mentioned. :)
Can you share a few numbers for these two parameters that went into deciding the default value of 1000000?

On Fri, Nov 15, 2019 at 5:32 PM Nishith wrote:
> Pratyaksh,
>
> The default value was chosen based on a "reasonable" payload size and
> topic throughput.
Re: Default Max Events to Read from Kafka
Pratyaksh,

The default value was chosen based on a "reasonable" payload size and topic throughput.

How many messages fit in executor/driver memory depends heavily on your message size. It is already a value that you can configure using "sourceLimit", like you have already tried. Ideally, this number will be tuned by the user depending on the resources that can be provided vs the ingestion latency desired.

Sent from my iPhone

> On Nov 15, 2019, at 5:00 PM, Pratyaksh Sharma wrote:
>
> Hi,
>
> I have a small doubt. The KafkaOffsetGen.java class has a variable called
> DEFAULT_MAX_EVENTS_TO_READ, which is set to 1000000. When actually reading
> from Kafka, we take the minimum of sourceLimit and this variable to form
> the RDD in the case of KafkaSource.
>
> I want to know the following -
>
> 1. How did we arrive at this number?
> 2. Why are we hard-coding it? Should we not make it configurable for users
> to play around with?
>
> For bootstrapping purposes, I tried running DeltaStreamer in continuous
> mode on a Kafka topic with 1.5 crore (15 million) events, with the
> following configuration -
>
> 1. Changed the above variable to Integer.MAX_VALUE.
> 2. Kept the source limit as 3500000 (35 lacs, i.e. 3.5 million).
> 3. executor-memory 4g
> 4. driver-memory 6g
>
> In my case the RDD held 35 lac events per iteration and ran fine.
>
> When I tried running DeltaStreamer with a larger sourceLimit, I got
> OutOfMemory / heap errors. Keeping 35 lacs looks like a sweet spot for
> running DeltaStreamer.
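(To make the described behavior concrete, a small self-contained sketch of the min() computation and the numbers from the experiment above; illustrative only, not the actual source.)

// Illustrative sketch of the read-size computation described above.
public class ReadBudget {

  private static final long DEFAULT_MAX_EVENTS_TO_READ = 1_000_000L;

  // The KafkaSource RDD is sized by whichever bound is smaller.
  static long numEventsToRead(long sourceLimit) {
    return Math.min(DEFAULT_MAX_EVENTS_TO_READ, sourceLimit);
  }

  public static void main(String[] args) {
    // With the default cap, a 3.5M sourceLimit would be clamped to 1M:
    System.out.println(numEventsToRead(3_500_000L)); // 1000000
    // After raising the cap to Integer.MAX_VALUE (as in the experiment),
    // the 3.5M sourceLimit wins:
    System.out.println(Math.min((long) Integer.MAX_VALUE, 3_500_000L)); // 3500000
  }
}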
