Re: SQLContext load. Filtering files

2015-08-27 Thread Masf
Thanks Akhil, I will have a look.

I have a doubt regarding Spark Streaming and fileStream. If Spark Streaming
crashes and new files are created in the input folder while Spark is down,
how can I process those files when Spark Streaming is launched again?

Thanks.
Regards.
Miguel.



On Thu, Aug 27, 2015 at 12:29 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Have a look at Spark Streaming. You can make use of ssc.fileStream.

 Eg:

 val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
   AvroKeyInputFormat[GenericRecord]](input)

 You can also specify a filter function
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext
 as the second argument.

 Thanks
 Best Regards

 On Wed, Aug 19, 2015 at 10:46 PM, Masf masfwo...@gmail.com wrote:

 Hi.

 I'd like to read Avro files using this library
 https://github.com/databricks/spark-avro

 I need to load several files from a folder, not all files. Is there some
 functionality to filter the files to load?

 And... is it possible to know the names of the files loaded from a folder?

 My problem is that I have a folder where an external process is inserting
 files every X minutes. I need to process these files only once, and I can't
 move, rename or copy the source files.


 Thanks
 --

 Regards
 Miguel Ángel





-- 


Regards.
Miguel Ángel


Re: SQLContext load. Filtering files

2015-08-27 Thread Akhil Das
Have a look at Spark Streaming. You can make use of ssc.fileStream.

Eg:

val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]](input)

You can also specify a filter function
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext
as the second argument.
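For example, a filter that keeps only completed .avro files. The predicate logic below is just illustrative; the fileStream wiring is shown commented out because it needs a live StreamingContext and the Avro classes on the classpath:

```scala
// Illustrative predicate: keep only .avro files and skip Hadoop's
// underscore files (_SUCCESS, _temporary, partial copies).
def acceptAvro(pathName: String): Boolean = {
  val name = pathName.split('/').last
  name.endsWith(".avro") && !name.startsWith("_")
}

// Wiring sketch (assumes spark-streaming and avro-mapred dependencies):
//
// import org.apache.avro.generic.GenericRecord
// import org.apache.avro.mapred.AvroKey
// import org.apache.avro.mapreduce.AvroKeyInputFormat
// import org.apache.hadoop.fs.Path
// import org.apache.hadoop.io.NullWritable
//
// val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
//   AvroKeyInputFormat[GenericRecord]](
//     input,
//     (p: Path) => acceptAvro(p.toString),  // the filter function
//     newFilesOnly = true)
```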

Thanks
Best Regards

On Wed, Aug 19, 2015 at 10:46 PM, Masf masfwo...@gmail.com wrote:

 Hi.

 I'd like to read Avro files using this library
 https://github.com/databricks/spark-avro

 I need to load several files from a folder, not all files. Is there some
 functionality to filter the files to load?

 And... is it possible to know the names of the files loaded from a folder?

 My problem is that I have a folder where an external process is inserting
 files every X minutes. I need to process these files only once, and I can't
 move, rename or copy the source files.


 Thanks
 --

 Regards
 Miguel Ángel



Re: SQLContext load. Filtering files

2015-08-27 Thread Akhil Das
If you have enabled checkpointing, Spark will handle that for you.

Thanks
Best Regards
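The usual pattern is StreamingContext.getOrCreate with a durable checkpoint directory, building the whole pipeline inside the factory function so it can be reconstructed on restart. A sketch (the Spark calls are commented out since they need a running installation; the paths and names are illustrative):

```scala
val checkpointDir = "hdfs:///checkpoints/avro-ingest"  // illustrative path

// import org.apache.spark.SparkConf
// import org.apache.spark.streaming.{Seconds, StreamingContext}
//
// def createContext(): StreamingContext = {
//   val conf = new SparkConf().setAppName("avro-ingest")
//   val ssc  = new StreamingContext(conf, Seconds(60))
//   ssc.checkpoint(checkpointDir)
//   // define the fileStream and its processing here, inside the factory
//   ssc
// }
//
// // On restart this rebuilds the context from the checkpoint, and
// // fileStream picks up files newer than the last processed batch,
// // subject to its remember window.
// val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
```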

On Thu, Aug 27, 2015 at 4:21 PM, Masf masfwo...@gmail.com wrote:

 Thanks Akhil, I will have a look.

 I have a doubt regarding Spark Streaming and fileStream. If Spark Streaming
 crashes and new files are created in the input folder while Spark is down,
 how can I process those files when Spark Streaming is launched again?

 Thanks.
 Regards.
 Miguel.



 On Thu, Aug 27, 2015 at 12:29 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Have a look at Spark Streaming. You can make use of
 ssc.fileStream.

 Eg:

 val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
   AvroKeyInputFormat[GenericRecord]](input)

 You can also specify a filter function
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext
 as the second argument.

 Thanks
 Best Regards

 On Wed, Aug 19, 2015 at 10:46 PM, Masf masfwo...@gmail.com wrote:

 Hi.

 I'd like to read Avro files using this library
 https://github.com/databricks/spark-avro

 I need to load several files from a folder, not all files. Is there some
 functionality to filter the files to load?

 And... is it possible to know the names of the files loaded from a folder?

 My problem is that I have a folder where an external process is
 inserting files every X minutes. I need to process these files only once,
 and I can't move, rename or copy the source files.


 Thanks
 --

 Regards
 Miguel Ángel





 --


 Regards.
 Miguel Ángel



SQLContext load. Filtering files

2015-08-19 Thread Masf
Hi.

I'd like to read Avro files using this library
https://github.com/databricks/spark-avro

I need to load several files from a folder, not all files. Is there some
functionality to filter the files to load?

And... is it possible to know the names of the files loaded from a folder?

My problem is that I have a folder where an external process is inserting
files every X minutes. I need to process these files only once, and I can't
move, rename or copy the source files.
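Concretely, the selection I'm after might look like this: list the folder myself, pick the files to load, and load those paths explicitly, which also answers the second question since I'd know the names by construction. The helper below is illustrative; the Spark/Hadoop wiring is commented out because it needs a live SQLContext:

```scala
// Illustrative selection: given a directory listing and the set of paths
// already processed, pick the new .avro files to load this run.
def selectNewAvro(listing: Seq[String], processed: Set[String]): Seq[String] =
  listing.filter(p => p.endsWith(".avro") && !processed.contains(p))

// Wiring sketch (assumes spark-avro on the classpath):
//
// import org.apache.hadoop.fs.{FileSystem, Path}
// val fs      = FileSystem.get(sc.hadoopConfiguration)
// val listing = fs.listStatus(new Path("/input")).map(_.getPath.toString).toSeq
// val toLoad  = selectNewAvro(listing, alreadyProcessed)
// // Load each selected file and union them (one load per path is the
// // safe option across spark-avro versions):
// val df = toLoad
//   .map(p => sqlContext.read.format("com.databricks.spark.avro").load(p))
//   .reduce(_ unionAll _)
// // ...process df, then add `toLoad` to `alreadyProcessed`.
```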


Thanks
-- 

Regards
Miguel Ángel