Re: Possible bug in Spark Streaming :: TextFileStream
On second thought, I am not entirely sure that bug is the issue. Are you continuously appending to the file that you have copied to the directory? The file stream works correctly only when files are atomically moved into the monitored directory.

TD

On Mon, Jul 14, 2014 at 9:08 PM, Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote:
> [earlier messages in the thread trimmed]
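TD's advice about atomic moves can be sketched with plain filesystem calls. This is a minimal illustration, not Spark code, and the directory and file names are hypothetical: write the complete file into a staging directory on the same filesystem, then rename it into the monitored directory, so a watcher never observes a half-written file.

```python
import os


def publish_atomically(contents: str, staging_dir: str,
                       watched_dir: str, name: str) -> str:
    """Write the full file in staging_dir, then move it into watched_dir.

    os.rename is atomic on POSIX when both paths are on the same
    filesystem, so a process polling watched_dir either sees the
    complete file or no file at all -- never a partial write.
    """
    tmp_path = os.path.join(staging_dir, name)
    with open(tmp_path, "w") as f:
        f.write(contents)            # may internally take many writes
    final_path = os.path.join(watched_dir, name)
    os.rename(tmp_path, final_path)  # single atomic step
    return final_path
```

On HDFS the analogous step would be a rename (e.g. `hdfs dfs -mv` from a staging path into the monitored path); copying directly into the monitored directory creates the file first and fills it afterwards, which is exactly the pattern TD warns against.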
Re: Possible bug in Spark Streaming :: TextFileStream
Hi Team,

Is this an issue with the JavaStreamingContext.textFileStream("hdfsfolderpath") API as well? Please confirm. If yes, could you please help me fix it? I'm using Spark version 1.0.0.

Regards,
Rajesh

On Tue, Jul 15, 2014 at 5:42 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
> [earlier messages in the thread trimmed]
Re: Possible bug in Spark Streaming :: TextFileStream
Oh yes, this was a bug and it has been fixed. Check out the master branch!

https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC

TD

On Mon, Jul 7, 2014 at 7:11 AM, Luis Ángel Vicente Sánchez <langel.gro...@gmail.com> wrote:
> [original message trimmed]
Possible bug in Spark Streaming :: TextFileStream
I have a basic Spark Streaming job that is watching a folder, processing any new file, and updating a column family in Cassandra using the new cassandra-spark-driver.

I think there is a problem with SparkStreamingContext.textFileStream... If I start my job in local mode with no files in the watched folder and then copy a bunch of files into it, Spark sometimes keeps processing those files again and again.

I have noticed that it usually happens when Spark doesn't detect all the new files in one go. For example, I copied 6 files and Spark detected 3 of them as new and processed them; then it detected the other 3 as new and processed them. After it finished processing all 6 files, it detected the first 3 files as new again and processed them... then the other 3... and again, and again.

Should I raise a JIRA issue?

Regards,

Luis
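The re-processing loop Luis describes can be reproduced with a deliberately naive directory poller. This is only an illustration of the failure mode, not Spark's actual FileInputDStream logic: if a poller tracks "new" files solely by modification time with an inclusive comparison, files at the current high-water mark keep satisfying the "new" test and get selected batch after batch.

```python
import os


class NaivePoller:
    """A toy poller that selects 'new' files as those whose mtime is
    >= the newest mtime seen so far. Files sitting exactly at that
    high-water mark are re-selected on every poll, producing the
    repeated-processing loop described above. A robust poller would
    also remember the set of paths it has already processed.
    """

    def __init__(self, directory: str):
        self.directory = directory
        self.last_mtime = 0.0

    def poll(self):
        entries = [
            (os.path.getmtime(os.path.join(self.directory, n)),
             os.path.join(self.directory, n))
            for n in os.listdir(self.directory)
        ]
        # Inclusive comparison: the newest file always qualifies again.
        selected = [path for (mtime, path) in entries
                    if mtime >= self.last_mtime]
        if entries:
            self.last_mtime = max(mtime for (mtime, _) in entries)
        return sorted(selected)
```

Dropping three files into the watched directory and polling twice shows the symptom: the first poll selects all three, and the second poll selects the file(s) at the newest timestamp again even though nothing changed.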