Re: Can Spark benefit from Hive-like partitions?
You can create a partitioned Hive table using Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

On Mon, Jan 26, 2015 at 5:40 AM, Danny Yates <da...@codeaholics.org> wrote:

> Hi,
>
> I've got a bunch of data stored in S3 under directories like this:
>
> s3n://blah/y=2015/m=01/d=25/lots-of-files.csv
>
> In Hive, if I issue a query WHERE y=2015 AND m=01, I get the benefit that
> it only scans the necessary directories for files to read.
>
> As far as I can tell from searching and reading the docs, the right way of
> loading this data into Spark is to use sc.textFile("s3n://blah/*/*/*/")
>
> 1) Is there any way in Spark to access y, m and d as fields? In Hive, you
> declare them in the schema, but you don't put them in the CSV files -
> their values are extracted from the path.
>
> 2) Is there any way to get Spark to use the y, m and d fields to minimise
> the files it transfers from S3?
>
> Thanks,
>
> Danny.
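For question 1, one self-contained option is to recover y, m and d from the file paths themselves. A minimal sketch in Scala, assuming the layout above (the regex, glob and paths are illustrative; note wholeTextFiles reads each file in full, so it suits many smallish files):

    // sc is an existing SparkContext.
    // Matches .../y=2015/m=01/d=25/... anywhere in a path.
    val partitionPattern = """y=(\d{4})/m=(\d{2})/d=(\d{2})""".r

    // The glob limits the S3 listing to January 2015 (question 2), and
    // wholeTextFiles keeps the path alongside each file's contents.
    val rows = sc.wholeTextFiles("s3n://blah/y=2015/m=01/*/*.csv")
      .flatMap { case (path, contents) =>
        partitionPattern.findFirstMatchIn(path).toSeq.flatMap { pm =>
          val (y, m, d) = (pm.group(1).toInt, pm.group(2).toInt, pm.group(3).toInt)
          // Tag every line of the file with its partition values.
          contents.split("\n").map(line => (y, m, d, line))
        }
      }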
Re: Can Spark benefit from Hive-like partitions?
Currently no if you don't want to use Spark SQL's HiveContext. But we're working on adding partitioning support to the external data sources API, with which you can create, for example, partitioned Parquet tables without using Hive.

Cheng

On 1/26/15 8:47 AM, Danny Yates wrote:

> Thanks Michael. I'm not actually using Hive at the moment - in fact, I'm
> trying to avoid it if I can. I'm just wondering whether Spark has anything
> similar I can leverage?
>
> Thanks
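To make the planned support concrete: a hypothetical sketch of writing a partitioned Parquet table without Hive. The partitionBy call is an assumption about the API being worked on, not something released at the time of this thread:

    // Hypothetical: assumes a DataFrame-style writer with partition support,
    // and that sqlContext is an SQLContext built on sc.
    case class Event(y: Int, m: Int, d: Int, payload: String)
    val df = sqlContext.createDataFrame(
      sc.parallelize(Seq(Event(2015, 1, 25, "example"))))

    // Each distinct (y, m, d) combination would land in its own
    // y=.../m=.../d=... directory, mirroring the Hive layout.
    df.write.partitionBy("y", "m", "d").parquet("s3n://blah/parquet/")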
Re: Can Spark benefit from Hive-like partitions?
Good to hear there will be partitioning support. I've had some success loading partitioned data specified with Unix globbing format, i.e.:

    sc.textFile("s3://bucket/directory/dt=2014-11-{2[4-9],30}T00-00-00")

would load dates 2014-11-24 through 2014-11-30. Not the most ideal solution, but it seems to work for loading data from a range.

Best,
Chris

On Jan 26, 2015, at 10:55 AM, Cheng Lian <lian.cs@gmail.com> wrote:

> Currently no if you don't want to use Spark SQL's HiveContext. But we're
> working on adding partitioning support to the external data sources API,
> with which you can create, for example, partitioned Parquet tables without
> using Hive.
>
> Cheng
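The same globbing composes with the y=/m=/d= layout from the original question; a minimal sketch (paths are assumptions):

    // Hadoop's glob matcher expands {...} alternatives and [...] character
    // classes before listing S3, so only matching directories are read.
    val lastWeekOfNov = sc.textFile("s3n://blah/y=2014/m=11/d={2[4-9],30}/*")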
Re: Can Spark benefit from Hive-like partitions?
Thanks Michael. I'm not actually using Hive at the moment - in fact, I'm trying to avoid it if I can. I'm just wondering whether Spark has anything similar I can leverage? Thanks
Re: Can Spark benefit from Hive-like partitions?
> I'm not actually using Hive at the moment - in fact, I'm trying to avoid
> it if I can. I'm just wondering whether Spark has anything similar I can
> leverage?

Let me clarify: you do not need to have Hive installed, and what I'm suggesting is completely self-contained in Spark SQL. We support the Hive Query Language for expressing partitioned tables when you are using a HiveContext, but the execution will be done using RDDs. If you don't manually configure a Hive installation, Spark will just create a local metastore in the current directory. In the future we are planning to support non-HiveQL mechanisms for expressing partitioning.
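Concretely, a minimal sketch of the HiveContext route (table name, schema and paths are assumptions matching the layout from the original question):

    import org.apache.spark.sql.hive.HiveContext

    // Works without a Hive install; a local metastore is created in the
    // current directory on first use.
    val hiveContext = new HiveContext(sc)

    // y, m and d are partition columns: declared in the schema, but their
    // values come from the directory names rather than the CSV contents.
    hiveContext.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS events (line STRING)
      PARTITIONED BY (y INT, m INT, d INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION 's3n://blah/'
    """)

    // Register a partition; queries filtering on y/m/d then read only the
    // directories of the matching partitions.
    hiveContext.sql(
      "ALTER TABLE events ADD IF NOT EXISTS PARTITION (y=2015, m=1, d=25) " +
      "LOCATION 's3n://blah/y=2015/m=01/d=25/'")

    val jan25 = hiveContext.sql(
      "SELECT * FROM events WHERE y = 2015 AND m = 1 AND d = 25")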
Re: Can Spark benefit from Hive-like partitions?
Ah, well that is interesting. I'll experiment further tomorrow. Thank you for the info!