I mean I don't understand exactly what the issue is.  Can you fill in
these blanks?

My settings are:

My code is:

I expected to see:

Instead, I saw:

On Thu, Nov 17, 2016 at 12:53 PM, Hoang Bao Thien <hbthien0...@gmail.com> wrote:
> I am sorry I don't understand your idea. What do you mean exactly?
>
> On Fri, Nov 18, 2016 at 1:50 AM, Cody Koeninger <c...@koeninger.org> wrote:
>>
>> Ok, I don't think I'm clear on the issue then.  Can you say what the
>> expected behavior is, and what the observed behavior is?
>>
>> On Thu, Nov 17, 2016 at 10:48 AM, Hoang Bao Thien <hbthien0...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > Thanks for your comments. But in fact I don't want to limit the size of
>> > the batches; they can be as large as they happen to be.
>> >
>> > Thien
>> >
>> > On Fri, Nov 18, 2016 at 1:17 AM, Cody Koeninger <c...@koeninger.org>
>> > wrote:
>> >>
>> >> If you want a consistent limit on the size of batches, use
>> >> spark.streaming.kafka.maxRatePerPartition  (assuming you're using
>> >> createDirectStream)
>> >>
>> >> http://spark.apache.org/docs/latest/configuration.html#spark-streaming
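>> >>
>> >> For illustration, a minimal sketch with the 0.8 direct stream (the
>> >> broker, topic, rate, and batch interval below are placeholders, not
>> >> values from your job):
>> >>
>> >>     import kafka.serializer.StringDecoder
>> >>     import org.apache.spark.SparkConf
>> >>     import org.apache.spark.streaming.{Seconds, StreamingContext}
>> >>     import org.apache.spark.streaming.kafka.KafkaUtils
>> >>
>> >>     val conf = new SparkConf()
>> >>       .setAppName("rate-limit-sketch")
>> >>       // Caps each batch at maxRatePerPartition * batch interval *
>> >>       // number of Kafka partitions records, e.g. 5000 * 10 * 1 = 50K.
>> >>       .set("spark.streaming.kafka.maxRatePerPartition", "5000")
>> >>     val ssc = new StreamingContext(conf, Seconds(10))
>> >>
>> >>     val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
>> >>     val stream = KafkaUtils.createDirectStream[
>> >>       String, String, StringDecoder, StringDecoder](
>> >>       ssc, kafkaParams, Set("csv-topic"))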
>> >>
>> >> On Thu, Nov 17, 2016 at 12:52 AM, Hoang Bao Thien
>> >> <hbthien0...@gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I feed CSV and other text files into Kafka just to test Kafka + Spark
>> >> > Streaming with the direct stream. That's why I don't want Spark
>> >> > Streaming to read the CSV or text files directly.
>> >> > In addition, I don't want a giant batch of records like in the link
>> >> > you sent. The problem is that all batches should receive a "similar"
>> >> > number of records, but instead the first two or three batches have a
>> >> > very large number of records (e.g., 100K) while the last ~1000 batches
>> >> > have only 200 records each.
>> >> >
>> >> > I know that the problem does not come from auto.offset.reset=largest,
>> >> > but I don't know what I can do in this case.
>> >> >
>> >> > Could you or anyone else suggest some solutions, please? This seems to
>> >> > be a common situation with Kafka + Spark Streaming.
>> >> >
>> >> > Thanks.
>> >> > Alex
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Nov 17, 2016 at 2:32 AM, Cody Koeninger <c...@koeninger.org>
>> >> > wrote:
>> >> >>
>> >> >> Yeah, if you're reporting issues, please be clear as to whether
>> >> >> backpressure is enabled, and whether maxRatePerPartition is set.
>> >> >>
>> >> >> I expect that there is something wrong with backpressure, see e.g.
>> >> >> https://issues.apache.org/jira/browse/SPARK-18371
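>> >> >>
>> >> >> If you do experiment with backpressure anyway, a sketch of the
>> >> >> relevant settings (the rate value is a placeholder, not a
>> >> >> recommendation):
>> >> >>
>> >> >>     import org.apache.spark.SparkConf
>> >> >>
>> >> >>     val conf = new SparkConf()
>> >> >>       // Let Spark adapt the per-batch rate to processing speed:
>> >> >>       .set("spark.streaming.backpressure.enabled", "true")
>> >> >>       // Backpressure has no rate estimate until the first batch
>> >> >>       // completes, so also cap the initial batches explicitly:
>> >> >>       .set("spark.streaming.kafka.maxRatePerPartition", "5000")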
>> >> >>
>> >> >> On Wed, Nov 16, 2016 at 5:05 PM, bo yang <bobyan...@gmail.com>
>> >> >> wrote:
>> >> >> > I hit a similar issue with Spark Streaming. The batch size seemed a
>> >> >> > little random. Sometimes it was large, with many Kafka messages in
>> >> >> > the same batch; sometimes it was very small, with just a few
>> >> >> > messages. Is it possible that this was caused by the backpressure
>> >> >> > implementation in Spark Streaming?
>> >> >> >
>> >> >> > On Wed, Nov 16, 2016 at 4:22 PM, Cody Koeninger
>> >> >> > <c...@koeninger.org>
>> >> >> > wrote:
>> >> >> >>
>> >> >> >> Moved to user list.
>> >> >> >>
>> >> >> >> I'm not really clear on what you're trying to accomplish (why put
>> >> >> >> the CSV file through Kafka instead of reading it directly with
>> >> >> >> Spark?)
>> >> >> >>
>> >> >> >> auto.offset.reset=largest just means that when starting the job
>> >> >> >> without any defined offsets, it will start at the highest (most
>> >> >> >> recent) available offsets.  That's probably not what you want if
>> >> >> >> you've already loaded CSV lines into Kafka.
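>> >> >> >>
>> >> >> >> For instance, with the 0.8 direct stream, something like this
>> >> >> >> (the broker address is a placeholder) would start from the
>> >> >> >> beginning of the topic when the job has no stored offsets:
>> >> >> >>
>> >> >> >>     // "smallest" = earliest available offsets (0.8 consumer
>> >> >> >>     // naming; the 0.10 consumer calls this "earliest").
>> >> >> >>     val kafkaParams = Map(
>> >> >> >>       "metadata.broker.list" -> "localhost:9092",
>> >> >> >>       "auto.offset.reset" -> "smallest")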
>> >> >> >>
>> >> >> >> On Wed, Nov 16, 2016 at 2:45 PM, Hoang Bao Thien
>> >> >> >> <hbthien0...@gmail.com>
>> >> >> >> wrote:
>> >> >> >> > Hi all,
>> >> >> >> >
>> >> >> >> > I would like to ask a question about the size of Kafka stream
>> >> >> >> > batches. I want to put data (e.g., *.csv files) into Kafka, then
>> >> >> >> > use Spark Streaming to read it from Kafka and save it to Hive
>> >> >> >> > using SparkSQL. The CSV file is about 100MB with ~250K
>> >> >> >> > messages/rows (each row has about 10 integer fields). I see that
>> >> >> >> > Spark Streaming first receives two batches, the first with 60K
>> >> >> >> > messages and the second with 50K. But from the third batch on,
>> >> >> >> > Spark receives only 200 messages per batch (or partition). I
>> >> >> >> > think this problem comes from Kafka or some configuration in
>> >> >> >> > Spark. I already tried the setting "auto.offset.reset=largest",
>> >> >> >> > but every batch still gets only 200 messages.
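>> >> >> >> >
>> >> >> >> > Roughly, the job looks like the following simplified sketch (the
>> >> >> >> > broker, topic, and table names are placeholders, and only two of
>> >> >> >> > the CSV fields are shown):
>> >> >> >> >
>> >> >> >> >     import kafka.serializer.StringDecoder
>> >> >> >> >     import org.apache.spark.SparkConf
>> >> >> >> >     import org.apache.spark.sql.SparkSession
>> >> >> >> >     import org.apache.spark.streaming.{Seconds, StreamingContext}
>> >> >> >> >     import org.apache.spark.streaming.kafka.KafkaUtils
>> >> >> >> >
>> >> >> >> >     val conf = new SparkConf().setAppName("csv-kafka-hive")
>> >> >> >> >     val ssc = new StreamingContext(conf, Seconds(10))
>> >> >> >> >     val kafkaParams = Map(
>> >> >> >> >       "metadata.broker.list" -> "localhost:9092",
>> >> >> >> >       "auto.offset.reset" -> "largest")
>> >> >> >> >     val lines = KafkaUtils.createDirectStream[
>> >> >> >> >       String, String, StringDecoder, StringDecoder](
>> >> >> >> >       ssc, kafkaParams, Set("csv-topic")).map(_._2)
>> >> >> >> >
>> >> >> >> >     lines.foreachRDD { rdd =>
>> >> >> >> >       val spark = SparkSession.builder()
>> >> >> >> >         .enableHiveSupport().getOrCreate()
>> >> >> >> >       import spark.implicits._
>> >> >> >> >       // Parse each CSV row and append it to a Hive table.
>> >> >> >> >       val df = rdd.map(_.split(","))
>> >> >> >> >         .map(a => (a(0).toInt, a(1).toInt))
>> >> >> >> >         .toDF("f0", "f1")
>> >> >> >> >       df.write.mode("append").saveAsTable("csv_table")
>> >> >> >> >     }
>> >> >> >> >     ssc.start()
>> >> >> >> >     ssc.awaitTermination()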
>> >> >> >> >
>> >> >> >> > Could you please tell me how to fix this problem?
>> >> >> >> > Thank you so much.
>> >> >> >> >
>> >> >> >> > Best regards,
>> >> >> >> > Alex
>> >> >> >> >
>> >> >> >>
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>
>
