Re: Issues with reading gz files with Spark Streaming
On 22 Oct 2016, at 20:58, Nkechi Achara wrote:

> I do not use rename, and the files are written to, and then moved to a
> directory on HDFS in gz format.

In that case there's nothing obvious to me. Try logging at trace/debug
level for the class org.apache.spark.sql.execution.streaming.FileStreamSource.
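The trace/debug suggestion above can be applied through log4j configuration; a minimal sketch, assuming the default conf/log4j.properties setup. Note the class named above belongs to Structured Streaming; for a DStream fileStream on Spark 1.x, file discovery is logged by FileInputDStream instead, so both loggers are shown here as an assumption about which path is in use:

```properties
# conf/log4j.properties -- sketch, assuming the stock log4j setup.
# Logger for the class named in the message above (Structured Streaming):
log4j.logger.org.apache.spark.sql.execution.streaming.FileStreamSource=DEBUG
# For a DStream fileStream on Spark 1.x, new-file discovery is logged here:
log4j.logger.org.apache.spark.streaming.dstream.FileInputDStream=DEBUG
```

With either logger at DEBUG, the driver log should show each directory scan and which files were accepted or ignored, which narrows down why the .gz files are missed.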
Re: Issues with reading gz files with Spark Streaming
I do not use rename, and the files are written to, and then moved to a
directory on HDFS in gz format.

On 22 October 2016 at 15:14, Steve Loughran wrote:

> Are the files being written in the directory or renamed in? You should
> be using rename() against a filesystem (not an object store) to make
> sure that the file isn't picked up before it is complete.
Re: Issues with reading gz files with Spark Streaming
> On 21 Oct 2016, at 15:53, Nkechi Achara wrote:
>
> I am using Spark 1.5.0 to read gz files with textFileStream, but no data
> is read when new files are dropped in the specified directory. [...]
>
> Any idea why the gz files are not being recognized?

Are the files being written in the directory or renamed in? You should be
using rename() against a filesystem (not an object store) to make sure
that the file isn't picked up before it is complete.
Issues with reading gz files with Spark Streaming
Hi,

I am using Spark 1.5.0 to read gz files with textFileStream, but no data
is read when new files are dropped in the specified directory. I know this
is only the case with gz files, as when I extract a file into the specified
directory it is read on the next window and processed.

My code is here:

    val comments = ssc.fileStream[LongWritable, Text, TextInputFormat](
        "file:///tmp/", (f: Path) => true, newFilesOnly = false)
      .map(pair => pair._2.toString)
    comments.foreachRDD(i => i.foreach(m => println(m)))

Any idea why the gz files are not being recognized?

Thanks in advance,

K
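A minimal sketch of the write-then-rename pattern Steve describes, using the Hadoop FileSystem API. The paths, file names, and the staging directory are hypothetical; the watched directory must match whatever is passed to ssc.fileStream:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical paths for illustration; /tmp/watched stands in for the
// directory the stream is monitoring.
val fs      = FileSystem.get(new Configuration())
val staging = new Path("/tmp/_staging/comments-0001.gz")
val watched = new Path("/tmp/watched/comments-0001.gz")

// 1. Write the .gz file completely into a staging directory that the
//    stream does not watch (writer code elided).
// 2. Move it into the watched directory with a single rename(), which is
//    atomic on HDFS, so the stream never observes a half-written file.
val moved = fs.rename(staging, watched)
require(moved, s"rename $staging -> $watched failed")
```

On an object store such as S3 a "rename" is implemented as copy-and-delete and is not atomic, which is why the advice above distinguishes a real filesystem from an object store.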