Re: Using Spark SQL with multiple (avro) files

2015-01-16 Thread Michael Armbrust
I'd open an issue on GitHub to ask us to allow you to use Hadoop's glob
syntax for the path.
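In the meantime, a possible workaround (an untested sketch -- the directory
names are made up, and it assumes the spark-avro build for Spark 1.2, where
avroFile returns a SchemaRDD, so same-schema results can be unioned) is to
load each directory separately and union the results:

import com.databricks.spark.avro._

// Hypothetical list of input directories.
val dirs = Seq("s3://my-bucket/avros/a/DATE/",
               "s3://my-bucket/avros/b/DATE/")
// Load each directory, then union the same-schema SchemaRDDs into one.
val all = dirs.map(d => sqlContext.avroFile(d)).reduce(_.unionAll(_))
all.registerTempTable("data")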


Re: Using Spark SQL with multiple (avro) files

2015-01-15 Thread David Jones
I've tried this now. Spark can load all the Avro files in a directory when
given that directory's path. However, passing multiple paths separated by
commas didn't work.
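Concretely, the two calls I tried look like this (paths illustrative):

// Works: a single directory path picks up every Avro file inside it.
val ok = sqlContext.avroFile("s3://my-bucket/avros/a/DATE/")

// Didn't work for me: comma-separated paths in a single string.
val notOk = sqlContext.avroFile(
  "s3://my-bucket/avros/a/DATE/,s3://my-bucket/avros/b/DATE/")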


Is there any way to load all the Avro files in multiple directories using
sqlContext.avroFile?

Re: Using Spark SQL with multiple (avro) files

2015-01-14 Thread Yana Kadiyska
If the wildcard path you have doesn't work, you should probably open a bug
-- I had a similar problem with Parquet, and it was a bug that recently got
closed. Not sure if sqlContext.avroFile shares a codepath with
.parquetFile... you can try running with bits that have the fix for
.parquetFile, or look at the source.
Here was my question, for reference:
http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3ccaaswr-5rfmu-y-7htluj2eqqaecwjs8jh+irrzhm7g1ex7v...@mail.gmail.com%3E


Re: Using Spark SQL with multiple (avro) files

2015-01-14 Thread David Jones
Should I be able to pass multiple paths separated by commas? I haven't
tried it, but I didn't think it'd work -- I'd expected a function that
accepts a list of strings.


Using Spark SQL with multiple (avro) files

2015-01-14 Thread David Jones
Hi,

I have a program that loads a single Avro file using Spark SQL, queries it,
transforms it, and then outputs the data. The file is loaded with:

import com.databricks.spark.avro._  // spark-avro adds avroFile to SQLContext

val records = sqlContext.avroFile(filePath)  // load one Avro file
records.registerTempTable("data")            // register as SQL table "data"
...
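
The query/transform step that follows is along these lines (the column name
someField is hypothetical):

val result = sqlContext.sql("SELECT someField FROM data")
result.collect().foreach(println)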


Now I want to run it over tens of thousands of Avro files (all with schemas
that contain the fields I'm interested in).

Is it possible to load multiple Avro files recursively from a top-level
directory using wildcards? All my Avro files are stored under
s3://my-bucket/avros/*/DATE/*.avro, and I want to run my task across all of
these on EMR.

If that's not possible, is there some way to load multiple Avro files into
the same table/RDD so the whole dataset can be processed? (In that case I'd
supply the path to each file explicitly, but I *really* don't want to have
to do that.)

Thanks
David