Re: SQLContext load. Filtering files
Thanks Akhil, I will have a look.

I have a question about Spark Streaming and fileStream. If Spark Streaming crashes, and new files are created in the input folder while it is down, how can I process those files when Spark Streaming is launched again?

Thanks.
Regards.
Miguel.

On Thu, Aug 27, 2015 at 12:29 PM, Akhil Das ak...@sigmoidanalytics.com wrote:
[...]
Re: SQLContext load. Filtering files
Have a look at Spark Streaming. You can make use of ssc.fileStream. E.g.:

    val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](input)

You can also specify a filter function as the second argument; see http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext

Thanks
Best Regards

On Wed, Aug 19, 2015 at 10:46 PM, Masf masfwo...@gmail.com wrote:
[...]
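The fileStream suggestion, with a filter function passed as the second argument, could be sketched roughly as follows. This is only an illustration, not a tested job: the object name, the acceptName rule (visible files ending in ".avro"), the 60-second batch interval, and taking the input folder from args(0) are all assumptions made for the example. It uses the three-argument fileStream overload (directory, filter, newFilesOnly) from the Spark 1.x StreamingContext API.

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FilteredAvroStream {
  // Hypothetical rule: accept only visible files whose name ends in ".avro".
  def acceptName(name: String): Boolean =
    name.endsWith(".avro") && !name.startsWith(".") && !name.startsWith("_")

  def main(args: Array[String]): Unit = {
    val input = args(0) // folder the external process writes into (assumed argument)
    val conf = new SparkConf().setAppName("FilteredAvroStream")
    val ssc = new StreamingContext(conf, Seconds(60)) // assumed batch interval

    // Second argument: the filter applied to each candidate file path.
    // Third argument: true means only files that appear after the stream starts.
    val avroStream =
      ssc.fileStream[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](
        input,
        (path: Path) => acceptName(path.getName),
        true)

    // Turn each Avro record into a String right away, then print a sample per batch.
    avroStream.map { case (key, _) => key.datum().toString }.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Keeping the naming rule in a plain function (acceptName) makes it easy to check on its own, without starting a streaming context.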
Re: SQLContext load. Filtering files
If you have enabled checkpointing, Spark will handle that for you.

Thanks
Best Regards

On Thu, Aug 27, 2015 at 4:21 PM, Masf masfwo...@gmail.com wrote:
[...]
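For reference, enabling checkpointing so that a restarted job resumes from its saved state usually looks something like the sketch below, which uses StreamingContext.getOrCreate. The object name, the 60-second batch interval, the two command-line arguments, and the use of textFileStream as a stand-in for the Avro fileStream are assumptions made to keep the example short. Which files get picked up after a restart also depends on how long the job was down relative to the file stream's remember window, so this is worth testing against the real folder.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RecoverableStream {
  // Builds a fresh context; only called when no checkpoint exists yet.
  def createContext(input: String, checkpointDir: String): StreamingContext = {
    val conf = new SparkConf().setAppName("RecoverableStream")
    val ssc = new StreamingContext(conf, Seconds(60)) // assumed batch interval
    ssc.checkpoint(checkpointDir) // enable checkpointing to this directory

    // All DStreams must be defined inside this function so they can be
    // rebuilt from the checkpoint. textFileStream stands in for the Avro stream.
    ssc.textFileStream(input).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    val Array(input, checkpointDir) = args // assumed: <input folder> <checkpoint folder>

    // Recover the context from the checkpoint if present, otherwise create it.
    val ssc = StreamingContext.getOrCreate(
      checkpointDir,
      () => createContext(input, checkpointDir))

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The checkpoint directory should be on fault-tolerant storage (for example HDFS) so that it survives the crash it is meant to recover from.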
SQLContext load. Filtering files
Hi.

I'd like to read Avro files using this library: https://github.com/databricks/spark-avro

I need to load several files from a folder, not all of them. Is there some functionality to filter the files to load? And is it possible to know the names of the files loaded from a folder?

My problem is that I have a folder where an external process inserts files every X minutes. I need to process each of these files once, and I can't move, rename, or copy the source files.

Thanks
--
Regards
Miguel Ángel