You can create a partitioned Hive table using Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
On Mon, Jan 26, 2015 at 5:40 AM, Danny Yates da...@codeaholics.org wrote:
Hi,
I've got a bunch of data stored in S3 under directories like this:
s3n://blah/y=2015/m=01/d=25/lots-of-files.csv
In Hive, if I issue a query WHERE y=2015 AND m=01, I get the benefit that
it only scans the necessary directories for files to read.
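For readers unfamiliar with the behaviour described here, the following is a minimal sketch of the idea behind partition pruning, not how Hive or Spark implement it. It assumes the key=value directory layout shown above; the helper names `parse_partitions` and `prune` and the extra example paths are hypothetical. A real engine avoids listing the non-matching directories in the first place, whereas this sketch simply filters a list that is already in memory.

```python
import re

def parse_partitions(path):
    """Extract key=value directory components from a path into a dict."""
    return dict(re.findall(r"([^/=]+)=([^/]+)/", path))

def prune(paths, **wanted):
    """Keep only paths whose partition values match every requested key."""
    return [
        p for p in paths
        if all(parse_partitions(p).get(k) == v for k, v in wanted.items())
    ]

# Hypothetical listing, modelled on the layout in the message above.
paths = [
    "s3n://blah/y=2015/m=01/d=25/lots-of-files.csv",
    "s3n://blah/y=2015/m=01/d=26/lots-of-files.csv",
    "s3n://blah/y=2015/m=02/d=01/lots-of-files.csv",
    "s3n://blah/y=2014/m=01/d=25/lots-of-files.csv",
]

# Equivalent in spirit to WHERE y=2015 AND m=01: only matching paths remain.
selected = prune(paths, y="2015", m="01")
print(selected)
```

Partition values are compared as strings here, so `m="01"` matches the directory `m=01` but `m="1"` would not.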
As far as I can tell from searching and
Currently no if you don't want to use Spark SQL's HiveContext. But we're
working on adding partitioning support to the external data sources API,
with which you can create, for example, partitioned Parquet tables
without using Hive.
Cheng
On 1/26/15 8:47 AM, Danny Yates wrote:
Thanks
Good to hear there will be partitioning support. I’ve had some success loading
partitioned data specified with Unix globbing format, i.e.:
sc.textFile("s3://bucket/directory/dt=2014-11-{2[4-9],30}T00-00-00")
would load dates 2014-11-24 through 2014-11-30. Not the most ideal solution,
but it
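To see which directory names a pattern like the one above selects, here is a rough sketch in plain Python. It is an approximation under stated assumptions: Python's `fnmatch` does not understand `{a,b}` alternation, so the hypothetical helper `expand_braces` expands those groups first (no nested braces), and the result is only assumed to mirror how the Hadoop glob used by `sc.textFile` treats this particular pattern. The candidate directory names are made up for illustration.

```python
import fnmatch
import re

def expand_braces(pattern):
    """Expand {a,b,...} groups into separate patterns (nested braces unsupported)."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if m is None:
        return [pattern]
    head, tail = pattern[:m.start()], pattern[m.end():]
    expanded = []
    for alternative in m.group(1).split(","):
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded

def glob_match(name, pattern):
    """True if name matches any brace-expanded variant of pattern."""
    return any(fnmatch.fnmatchcase(name, p) for p in expand_braces(pattern))

pattern = "dt=2014-11-{2[4-9],30}T00-00-00"

# Hypothetical directory names, one per day of November 2014.
candidates = ["dt=2014-11-%02dT00-00-00" % day for day in range(1, 31)]

matched = [c for c in candidates if glob_match(c, pattern)]
for name in matched:
    print(name)
```

The alternation splits the pattern into `2[4-9]` (days 24 to 29) and the literal `30`, which together cover the seven days from 2014-11-24 through 2014-11-30.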
Thanks Michael.
I'm not actually using Hive at the moment - in fact, I'm trying to avoid it
if I can. I'm just wondering whether Spark has anything similar I can
leverage?
Thanks
Let me clarify, you do not need to have Hive installed, and what I'm
suggesting is completely self-contained in Spark SQL. We support
Ah, well that is interesting. I'll experiment further tomorrow. Thank you for
the info!