Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-15 Thread Tathagata Das
On second thought, I am not entirely sure whether that bug is the issue. Are
you continuously appending to the file that you have copied to the
directory? I ask because fileStream works correctly when the files are
atomically moved into the monitored directory.
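
For illustration, here is a minimal sketch of the move-into-place pattern
described above, assuming the monitored directory lives on HDFS; the object
name and both paths are hypothetical:

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object AtomicDropSketch {
      def main(args: Array[String]): Unit = {
        val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())

        // Write the file completely into a staging directory that the
        // streaming job does NOT monitor (both paths are hypothetical)...
        val staged    = new Path("/data/staging/events-0001.log")
        val monitored = new Path("/data/incoming/events-0001.log")

        // ...then rename it into the monitored directory. A rename within the
        // same HDFS filesystem is atomic, so the file only becomes visible to
        // the file stream once it is complete and nothing is appended afterwards.
        fs.rename(staged, monitored)
        fs.close()
      }
    }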

TD


On Mon, Jul 14, 2014 at 9:08 PM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi Team,
>
> Does this issue also affect the
> JavaStreamingContext.textFileStream("hdfsfolderpath") API? Please confirm.
> If so, could you please help me fix it? I'm using Spark 1.0.0.
>
> Regards,
> Rajesh


Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-14 Thread Madabhattula Rajesh Kumar
Hi Team,

Does this issue also affect the JavaStreamingContext.textFileStream("hdfsfolderpath")
API? Please confirm. If so, could you please help me fix it? I'm using
Spark 1.0.0.

Regards,
Rajesh


On Tue, Jul 15, 2014 at 5:42 AM, Tathagata Das 
wrote:

> Oh yes, this was a bug and it has been fixed. Check out the master
> branch!
>
>
> https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC
>
> TD


Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-14 Thread Tathagata Das
Oh yes, this was a bug and it has been fixed. Check out the master
branch!

https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC

TD


On Mon, Jul 7, 2014 at 7:11 AM, Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com> wrote:

> I have a basic Spark Streaming job that watches a folder, processes any
> new file, and updates a column family in Cassandra using the new
> cassandra-spark-driver.
>
> I think there is a problem with SparkStreamingContext.textFileStream... if
> I start my job in local mode with no files in the watched folder and then
> copy in a bunch of files, sometimes Spark keeps processing those files
> again and again.
>
> I have noticed that it usually happens when Spark doesn't detect all the
> new files in one go... e.g. I copied 6 files and Spark detected 3 of them
> as new and processed them; then it detected the other 3 as new and
> processed them. After it finished processing all 6 files, it detected the
> first 3 files as new again and processed them... then the other 3... and
> again... and again... and again.
>
> Should I raise a JIRA issue?
>
> Regards,
>
> Luis
>


Possible bug in Spark Streaming :: TextFileStream

2014-07-07 Thread Luis Ángel Vicente Sánchez
I have a basic Spark Streaming job that watches a folder, processes any new
file, and updates a column family in Cassandra using the new
cassandra-spark-driver.
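
For context, here is a minimal sketch of such a job, assuming the Scala
StreamingContext API from Spark 1.0; the directory path is hypothetical, and
the real per-file processing and cassandra-spark-driver write are stubbed out
with a simple transformation and print():

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object FolderWatchSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("FolderWatchSketch")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // Each batch interval, textFileStream picks up files that have newly
        // appeared in the monitored directory (path is hypothetical).
        val lines = ssc.textFileStream("hdfs://namenode:8020/data/incoming")

        // Stand-in for the real processing and the Cassandra write.
        lines.map(_.toUpperCase).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }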

I think there is a problem with SparkStreamingContext.textFileStream... if
I start my job in local mode with no files in the watched folder and then
copy in a bunch of files, sometimes Spark keeps processing those files again
and again.

I have noticed that it usually happens when Spark doesn't detect all the new
files in one go... e.g. I copied 6 files and Spark detected 3 of them as new
and processed them; then it detected the other 3 as new and processed them.
After it finished processing all 6 files, it detected the first 3 files as
new again and processed them... then the other 3... and again... and again...
and again.

Should I raise a JIRA issue?

Regards,

Luis