Re: Spark ignoring partition names without equals (=) separator

2016-11-29 Thread Steve Loughran

On 29 Nov 2016, at 05:19, Prasanna Santhanam wrote:

On Mon, Nov 28, 2016 at 4:39 PM, Steve Loughran wrote:

irrespective of naming, know that deep directory trees are performance killers 
when listing files on S3 and setting up jobs. You might actually be better off 
having them in the same directory and using a pattern like 2016-03-11-*
as the pattern to find files.

Thanks Bharath and Steve - I've generally followed the partitioned table format 
over the flat structure since it aids WHERE clause filtering (predicate 
pushdown?). Performance-wise that helps write-once, query-many-times kinds of 
workloads. Changing this in our production application that dumps these files 
is cumbersome. Is there a configuration that would override this restriction 
in Spark? Does it make sense to have one?

If it's done, leave it alone. Just be aware that S3 doesn't like deep 
directories that much, as listing is fairly slow.
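
As an aside, a minimal sketch of what the flat-layout read could look like,
assuming a SparkSession and Parquet files; the paths, file format and app name
here are illustrative, not from this thread:

// Sketch only: date-stamped files kept in one flat directory and selected
// with a glob, instead of walking a deep year/month/day tree.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("flat-layout-read")
  .getOrCreate()
// The glob is matched against a single directory listing.
val df = spark.read.parquet("s3://bucket-company/path/2016-03-11-*")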



Re: Spark ignoring partition names without equals (=) separator

2016-11-28 Thread Prasanna Santhanam
On Mon, Nov 28, 2016 at 4:39 PM, Steve Loughran wrote:

>
> irrespective of naming, know that deep directory trees are performance
> killers when listing files on S3 and setting up jobs. You might actually be
> better off having them in the same directory and using a pattern like
> 2016-03-11-*
> as the pattern to find files.
>

Thanks Bharath and Steve - I've generally followed the partitioned table
format over the flat structure since it aids WHERE clause filtering
(predicate pushdown?). Performance-wise that helps write-once,
query-many-times kinds of workloads. Changing this in our production
application that dumps these files is cumbersome. Is there a configuration
that would override this restriction in Spark? Does it make sense to have one?
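
For reference, a small sketch of the pruning being described, using the table
from the DDL quoted below; the exact plan output depends on the Spark version:

// Sketch: filters on the partition columns become partition pruning, so
// only .../year=2016/month=03/day=11 should be listed and scanned.
val rows = spark.sql(
  """SELECT column1 FROM test_tbl
    |WHERE year = '2016' AND month = '03' AND day = '11'""".stripMargin)
rows.explain()  // the physical plan should show the partition filters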


Re: Spark ignoring partition names without equals (=) separator

2016-11-28 Thread Steve Loughran

irrespective of naming, know that deep directory trees are performance killers 
when listing files on S3 and setting up jobs. You might actually be better off 
having them in the same directory and using a pattern like 2016-03-11-*
as the pattern to find files.



On 28 Nov 2016, at 04:18, Prasanna Santhanam wrote:

I've been toying around with Spark SQL lately and trying to move some workloads 
from Hive. In the Hive world, the partitions below are recovered by an ALTER 
TABLE RECOVER PARTITIONS.

Path:
s3://bucket-company/path/2016/03/11
s3://bucket-company/path/2016/03/12
s3://bucket-company/path/2016/03/13

Spark, on the other hand, ignores these unless the partition information is in 
the format below:

s3://bucket-company/path/year=2016/month=03/day=11
s3://bucket-company/path/year=2016/month=03/day=12
s3://bucket-company/path/year=2016/month=03/day=13

The code for this is in ddl.scala. If my DDL already expresses the partition 
information, why does Spark ignore the partitions and enforce this separator?

DDL:
CREATE EXTERNAL TABLE test_tbl
(
   column1 STRING,
   column2 STRUCT<...>
)
PARTITIONED BY (year STRING, month STRING, day STRING)
LOCATION 's3://bucket-company/path'

Thanks,


Re: Spark ignoring partition names without equals (=) separator

2016-11-27 Thread Bharath Bhushan
Prasanna,
AFAIK Spark does not handle folders without partition column names in them,
and there is no way to get Spark to do it. I think the reason is that Parquet
file hierarchies carried this information and, historically, Spark deals more
with those.
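
As an illustration of that convention, a sketch of how Spark's own writers
emit the key=value layout that partition discovery later parses back into
columns; paths and data are placeholders, and a SparkSession named spark is
assumed in scope:

// Sketch: partitionBy produces .../year=2016/month=03/day=11/part-*.parquet,
// and a plain read of the root recovers year/month/day as columns.
import spark.implicits._
val df = Seq(("a", "2016", "03", "11")).toDF("column1", "year", "month", "day")
df.write.partitionBy("year", "month", "day").parquet("s3://bucket-company/path")
spark.read.parquet("s3://bucket-company/path").printSchema()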

On Mon, Nov 28, 2016 at 9:48 AM, Prasanna Santhanam wrote:

> I've been toying around with Spark SQL lately and trying to move some
> workloads from Hive. In the Hive world, the partitions below are recovered
> by an ALTER TABLE RECOVER PARTITIONS.
>
> *Path:*
> s3://bucket-company/path/2016/03/11
> s3://bucket-company/path/2016/03/12
> s3://bucket-company/path/2016/03/13
>
> Spark, on the other hand, ignores these unless the partition information is
> in the format below:
>
> s3://bucket-company/path/year=2016/month=03/day=11
> s3://bucket-company/path/year=2016/month=03/day=12
> s3://bucket-company/path/year=2016/month=03/day=13
>
> The code for this is in ddl.scala. If my DDL already expresses the
> partition information, why does Spark ignore the partitions and enforce
> this separator?
>
> *DDL:*
> CREATE EXTERNAL TABLE test_tbl
> (
>    column1 STRING,
>    column2 STRUCT<...>
> )
> PARTITIONED BY (year STRING, month STRING, day STRING)
> LOCATION 's3://bucket-company/path'
>
> Thanks,


-- 
Bharath (ಭರತ್)


Spark ignoring partition names without equals (=) separator

2016-11-27 Thread Prasanna Santhanam
I've been toying around with Spark SQL lately and trying to move some
workloads from Hive. In the Hive world, the partitions below are recovered
by an ALTER TABLE RECOVER PARTITIONS.

*Path:*
s3://bucket-company/path/2016/03/11
s3://bucket-company/path/2016/03/12
s3://bucket-company/path/2016/03/13

Spark, on the other hand, ignores these unless the partition information is
in the format below:

s3://bucket-company/path/year=2016/month=03/day=11
s3://bucket-company/path/year=2016/month=03/day=12
s3://bucket-company/path/year=2016/month=03/day=13

The code for this is in ddl.scala.

If my DDL already expresses the partition information, why does Spark ignore
the partitions and enforce this separator?

*DDL:*
CREATE EXTERNAL TABLE test_tbl
(
   column1 STRING,
   column2 STRUCT<...>
)
PARTITIONED BY (year STRING, month STRING, day STRING)
LOCATION 's3://bucket-company/path'

Thanks,
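
A hedged workaround sketch, not confirmed anywhere in this thread: since ADD
PARTITION accepts an explicit LOCATION, directories that lack the key=value
naming can be registered one at a time; whether Spark will execute this for a
given table depends on the Spark/Hive versions involved.

// Sketch: register the existing year/month/day directories explicitly,
// assuming a SparkSession named spark with Hive support enabled.
val days = Seq(("2016", "03", "11"), ("2016", "03", "12"), ("2016", "03", "13"))
for ((y, m, d) <- days) {
  spark.sql(
    s"""ALTER TABLE test_tbl ADD IF NOT EXISTS
       |PARTITION (year='$y', month='$m', day='$d')
       |LOCATION 's3://bucket-company/path/$y/$m/$d'""".stripMargin)
}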