Re: How to know whether I'm in the first batch of spark streaming

2016-04-21 Thread Yu Xie
Thank you Praveen

  In our Spark Streaming job, we write the data to an HDFS directory and
use the batch time, formatted as MMDDHHHmm00, as the directory name.
  So when we stop the streaming job and start it again (we do not use
checkpointing), the init step of the first batch creates empty directories
for the batch times between the stop and the start.
  If the second batch runs faster than the first batch, it gets the chance
to run the init instead. In that case, the directory that the "first batch"
outputs to will be set to an empty directory by the "second batch", which
corrupts the data.

  I have a question about the StreamingListener.
  Suppose our system has a problem, such as an HDFS issue, and the "first
batch" and the "second batch" are both queued. When the issue is gone, these
two batches will start together. Will onBatchStarted then be called
concurrently for these two batches?
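
For reference, here is a minimal sketch of one way to pin down the first
batch without depending on which job happens to run first. It assumes that
Spark submits batches in batch-time order, so the first onBatchSubmitted
event carries the time of the first batch. FirstBatchTime and
FirstBatchListener are hypothetical names, not part of Spark; the
StreamingListener trait and batchInfo.batchTime are the only Spark APIs used.

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchSubmitted}

// Hypothetical helper (not part of Spark): holds the time of the first
// batch that was submitted after the streaming context started.
final class FirstBatchTime {
  private val first = Promise[Long]()

  // Only the first call wins; later calls are ignored. trySuccess is
  // thread-safe, so concurrent callers cannot overwrite the value.
  def offer(batchTimeMs: Long): Unit = {
    first.trySuccess(batchTimeMs)
    ()
  }

  // Blocks until a batch time has been offered (or the timeout expires),
  // then reports whether the given batch is the first one.
  def isFirst(batchTimeMs: Long, timeout: Duration = 30.seconds): Boolean =
    Await.result(first.future, timeout) == batchTimeMs
}

// Feeds the holder from listener events. Assumption: the first
// onBatchSubmitted event belongs to the first batch.
class FirstBatchListener(holder: FirstBatchTime) extends StreamingListener {
  override def onBatchSubmitted(event: StreamingListenerBatchSubmitted): Unit =
    holder.offer(event.batchInfo.batchTime.milliseconds)
}
```

The listener would be registered with ssc.addStreamingListener(...) before
ssc.start(), and the output function could then call
holder.isFirst(time.milliseconds) inside foreachRDD { (rdd, time) => ... }
to decide whether to run the init. Because isFirst waits until a batch time
is known, a job that starts before the listener has seen its event does not
get a wrong answer.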

Thank you


On Thu, Apr 21, 2016 at 3:11 PM, Praveen Devarao 
wrote:

> Hi Yu,
>
> Could you provide more details on what you are trying to initialize, and
> how? Is this initialization part of the code block in an action of the
> DStream? If the second batch finishes before the first batch, wouldn't your
> results be affected, since the init would not have taken place (you want it
> in the first batch itself)?
>
> One way to identify the first batch is to implement the
> *StreamingListener* trait, which has the methods *onBatchStarted* and
> *onBatchCompleted*. These methods should help you determine the first
> batch (the first batch will definitely start first, though the order of
> completion is not guaranteed with concurrentJobs set to more than 1).
>
> Would be interesting to know your use case...could you share, if
> possible?
>
> Thanking You
>
> -
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
>
> -
> "Courage doesn't always roar. Sometimes courage is the quiet voice at the
> end of the day saying I will try again"
>
>
>
> From: Yu Xie
> To: user@spark.apache.org
> Date: 19/04/2016 01:24 pm
> Subject: How to know whether I'm in the first batch of spark streaming
> --
>
>
>
> hi spark users
>
> I'm running a Spark Streaming application with concurrentJobs > 1, so
> more than one batch may run at the same time.
>
> Now I would like to do some init work in the first batch, based on the
> "time" of the first batch. So even if the second batch runs faster than the
> first batch, I still need to run the init in the literal "first batch".
>
> Is there a way that I can know that?
> Thank you
>
>
>


How to know whether I'm in the first batch of spark streaming

2016-04-19 Thread Yu Xie
hi spark users

I'm running a Spark Streaming application with concurrentJobs > 1, so
more than one batch may run at the same time.

Now I would like to do some init work in the first batch, based on the
"time" of the first batch. So even if the second batch runs faster than the
first batch, I still need to run the init in the literal "first batch".

Is there a way that I can know that?
Thank you


can checkpoint and write ahead log save the data in queued batch?

2016-03-11 Thread Yu Xie
Hi spark user

  I am running a Spark Streaming app that uses a receiver for a pubsub
system, and the pubsub system does NOT support acks.

  I don't want the data to be lost if there is a driver failure while,
by accident, batches are queued up.

  I tested this by generating some queued batches with some input (see the
pic), and then quitting the application.
  When I restarted the application, I saw that there was no input for these
batches.

  Is this expected?
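
For context, a minimal sketch of how checkpointing and the receiver write
ahead log are usually enabled together, so that the setup being tested is
explicit. The checkpoint path, app name and batch interval are assumed
values for illustration, and the receiver itself is left out. This only
shows the configuration; it does not by itself say whether data that was
received but not yet written to the log can be recovered.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableApp {
  // Assumed location, for illustration only; it should be on a
  // fault-tolerant file system such as HDFS.
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

  // Builds a fresh context. Called only when no checkpoint exists yet.
  def createContext(): StreamingContext = {
    val conf = new SparkConf()
      .setAppName("recoverable-streaming")
      // Received blocks are also written to a log under the checkpoint
      // directory, so they can be read back after a driver restart.
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // The receiver stream and its output operations must be defined here,
    // inside this function, so that they are restored from the checkpoint.
    // Without at least one output operation, ssc.start() will fail.
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if one exists, otherwise create anew.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this layout the application is expected to be launched through
spark-submit, which supplies the master URL. If the context is created
directly with new StreamingContext(...) on every start instead of through
getOrCreate, queued batches from the previous run are not restored.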


Before restart
[image: Inline image 1]