Thank you, Praveen.

In our Spark Streaming job, we write the data to an HDFS directory and use the batch time, formatted as YYYYMMDDHHmm00, as the directory name. When we stop the streaming job and start it again (we do not use checkpointing), the init step of the first batch creates empty directories for the batch times that fall between the stop and the start.

If the second batch runs faster than the first batch, it gets the chance to run the "init" instead. In that case, the directory that the "first batch" will write to is set to an empty directory by the "second batch", which corrupts the data.
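To make the gap-filling step concrete, here is a minimal sketch of how the directory names between the stop and the restart could be computed. It is not our actual job code: the function name `missing_batch_dirs`, the one-minute batch interval, and the `%Y%m%d%H%M00` rendering of the directory format are all assumptions for illustration.

```python
from datetime import datetime, timedelta

# Assumed rendering of the batch-time directory name (seconds fixed to "00").
DIR_FORMAT = "%Y%m%d%H%M00"


def missing_batch_dirs(last_written, first_batch, interval=timedelta(minutes=1)):
    """Return directory names for batch times strictly between
    last_written and first_batch (both excluded)."""
    names = []
    t = last_written + interval
    while t < first_batch:
        names.append(t.strftime(DIR_FORMAT))
        t += interval
    return names


# Hypothetical times: last batch written at 15:10, first batch after restart at 15:13.
gap = missing_batch_dirs(datetime(2016, 4, 21, 15, 10), datetime(2016, 4, 21, 15, 13))
print(gap)  # ['20160421151100', '20160421151200']
```

The upper bound is exclusive, so the first batch's own directory is never created as an empty placeholder. That only holds if `first_batch` really is the time of the literal first batch; if a later batch passes its own time instead, the gap wrongly includes the first batch's directory, which is the problem described above.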
I have a question about the StreamingListener. Suppose our system has a problem, such as an HDFS issue, and the "first batch" and "second batch" are both queued. When the issue is resolved, the two batches will start together. Will onBatchStarted then be called concurrently for these two batches?

Thank you

On Thu, Apr 21, 2016 at 3:11 PM, Praveen Devarao <praveen...@in.ibm.com> wrote:

> Hi Yu,
>
> Could you provide more details on what and how you are trying to
> initialize? Are you having this initialization as part of the code block
> in the action of the DStream? Say the second batch finishes before the
> first batch: wouldn't your results be affected, as init would not have
> taken place (since you want it on the first batch itself)?
>
> One way we could think of knowing the first batch is by implementing the
> *StreamingListener* trait, which has the methods *onBatchStarted* and
> *onBatchCompleted*. These methods should help you determine the first
> batch (definitely the first batch will start first, though the order of
> ending is not guaranteed with concurrentJobs set to more than 1).
>
> Would be interesting to know your use case. Could you share, if possible?
>
> Thanking You
>
> Praveen Devarao
> Spark Technology Centre
> IBM India Software Labs
>
> From: Yu Xie <yuu...@gmail.com>
> To: user@spark.apache.org
> Date: 19/04/2016 01:24 pm
> Subject: How to know whether I'm in the first batch of spark streaming
>
> hi spark users
>
> I'm running a spark streaming application, with concurrentJobs > 1, so
> maybe more than one batch could run together.
>
> Now I would like to do some init work in the first batch based on the
> "time" of the first batch. So even if the second batch runs faster than
> the first batch, I still need to init in the literal "first batch".
>
> Then is there a way that I can know that?
>
> Thank you
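
Whatever the answer on callback concurrency turns out to be, the "claim the first batch" step can be made safe against concurrent calls. The sketch below is plain Python with no Spark dependency; `FirstBatchTracker` and its `on_batch_started` method are hypothetical names, not part of the Spark API, and the batch times are made-up millisecond values. It only shows an atomic check-and-set, so that exactly one caller is told to run the init.

```python
import threading


class FirstBatchTracker:
    """Hypothetical helper: remembers the first batch time it is told about."""

    def __init__(self):
        self._lock = threading.Lock()
        self.first_batch_time = None

    def on_batch_started(self, batch_time):
        """Return True only for the first call; every later call returns False."""
        with self._lock:
            if self.first_batch_time is None:
                self.first_batch_time = batch_time
                return True
            return False


tracker = FirstBatchTracker()
results = []


def start(batch_time):
    # Each thread stands in for one batch-start notification.
    results.append(tracker.on_batch_started(batch_time))


# Two illustrative batch times, one minute apart (milliseconds).
threads = [threading.Thread(target=start, args=(t,))
           for t in (1461222660000, 1461222720000)]
for th in threads:
    th.start()
for th in threads:
    th.join()

print(results.count(True))  # 1
```

The lock guarantees that the init is claimed at most once. It does not by itself guarantee that the earliest batch wins the claim; that still relies on the first batch being announced first, as suggested in the quoted reply.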