Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Michael Armbrust
You can create a partitioned hive table using Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
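
Roughly, the DDL for the layout described below looks like this (a minimal sketch against the Spark 1.2-era HiveContext; the table name, column and S3 location are illustrative, not taken from the docs):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Illustrative schema only: the real column list / CSV SerDe depends
    // on the files. y, m and d are partition columns, so their values come
    // from the y=.../m=.../d=... directory names, not from the CSV contents.
    hiveContext.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS events (line STRING)
      PARTITIONED BY (y STRING, m STRING, d STRING)
      LOCATION 's3n://blah/'
    """)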

On Mon, Jan 26, 2015 at 5:40 AM, Danny Yates da...@codeaholics.org wrote:

 Hi,

 I've got a bunch of data stored in S3 under directories like this:

 s3n://blah/y=2015/m=01/d=25/lots-of-files.csv

 In Hive, if I issue a query WHERE y=2015 AND m=01, I get the benefit that
 it only scans the necessary directories for files to read.

 As far as I can tell from searching and reading the docs, the right way of
 loading this data into Spark is to use sc.textFile("s3n://blah/*/*/*/")

 1) Is there any way in Spark to access y, m and d as fields? In Hive, you
 declare them in the schema, but you don't put them in the CSV files - their
 values are extracted from the path.
 2) Is there any way to get Spark to use the y, m and d fields to minimise
 the files it transfers from S3?

 Thanks,

 Danny.



Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Cheng Lian
Currently no, if you don't want to use Spark SQL's HiveContext. But we're 
working on adding partitioning support to the external data sources API, 
with which you can create, for example, partitioned Parquet tables 
without using Hive.


Cheng

On 1/26/15 8:47 AM, Danny Yates wrote:

Thanks Michael.

I'm not actually using Hive at the moment - in fact, I'm trying to 
avoid it if I can. I'm just wondering whether Spark has anything 
similar I can leverage?


Thanks






Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Chris Gore
Good to hear there will be partitioning support.  I've had some success loading 
partitioned data specified with Unix globbing patterns, e.g.:

sc.textFile("s3://bucket/directory/dt=2014-11-{2[4-9],30}T00-00-00")

would load dates 2014-11-24 through 2014-11-30.  Not the most ideal solution, 
but it seems to work for loading data from a range.
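
If the range is awkward to express as a single glob (say it crosses a month boundary), sc.textFile() also accepts a comma-separated list of paths, so the list can be built programmatically. A rough sketch (the bucket name and path layout are made up to match the example above, and it assumes Java 8's java.time is available):

    import java.time.LocalDate

    // Build one dt=... path per day between start and end (inclusive).
    def dailyPaths(start: LocalDate, end: LocalDate, prefix: String): String =
      Iterator.iterate(start)(_.plusDays(1))
        .takeWhile(!_.isAfter(end))
        .map(d => s"$prefix/dt=${d}T00-00-00")
        .mkString(",")

    val paths = dailyPaths(LocalDate.of(2014, 11, 24), LocalDate.of(2014, 11, 30),
                           "s3://bucket/directory")
    val rdd = sc.textFile(paths)  // comma-separated paths are read as a union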

Best,
Chris

 On Jan 26, 2015, at 10:55 AM, Cheng Lian lian.cs@gmail.com wrote:
 
 Currently no, if you don't want to use Spark SQL's HiveContext. But we're 
 working on adding partitioning support to the external data sources API, with 
 which you can create, for example, partitioned Parquet tables without using 
 Hive.
 
 Cheng
 
 On 1/26/15 8:47 AM, Danny Yates wrote:
 Thanks Michael.
 
 I'm not actually using Hive at the moment - in fact, I'm trying to avoid it 
 if I can. I'm just wondering whether Spark has anything similar I can 
 leverage?
 
 Thanks
 
 
 





Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Danny Yates
Thanks Michael.

I'm not actually using Hive at the moment - in fact, I'm trying to avoid it
if I can. I'm just wondering whether Spark has anything similar I can
leverage?

Thanks


Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Michael Armbrust

 I'm not actually using Hive at the moment - in fact, I'm trying to avoid
 it if I can. I'm just wondering whether Spark has anything similar I can
 leverage?


Let me clarify: you do not need to have Hive installed, and what I'm
suggesting is completely self-contained in Spark SQL. We support the Hive
Query Language for expressing partitioned tables when you are using a
HiveContext, but the execution will be done using RDDs. If you don't
manually configure a Hive installation, Spark will just create a local
metastore in the current directory.

In the future we are planning to support non-HiveQL mechanisms for
expressing partitioning.
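
Concretely, once a partitioned table like the one sketched near the top of the thread exists, registering partitions and querying them might look like this (again a hedged sketch with illustrative names, not Danny's actual schema):

    import org.apache.spark.sql.hive.HiveContext

    // No Hive install needed: without a hive-site.xml on the classpath,
    // HiveContext creates a local Derby metastore (metastore_db/) in the
    // current working directory.
    val hiveContext = new HiveContext(sc)

    // Tell the metastore which partition directories exist on S3.
    hiveContext.sql("""
      ALTER TABLE events ADD IF NOT EXISTS
      PARTITION (y='2015', m='01', d='25')
      LOCATION 's3n://blah/y=2015/m=01/d=25/'
    """)

    // Filters on partition columns are checked against the metastore, so
    // only the matching directories are read from S3.
    val jan = hiveContext.sql("SELECT * FROM events WHERE y = '2015' AND m = '01'")
    println(jan.count())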


Re: Can Spark benefit from Hive-like partitions?

2015-01-26 Thread Danny Yates
Ah, well that is interesting. I'll experiment further tomorrow. Thank you for 
the info!
