Re: parquet partition discovery

2015-04-08 Thread Michael Armbrust
Back to the user list so everyone can see the result of the discussion...

Ah. It all makes sense now. The issue is that when I created the Parquet
 files, I included an unnecessary directory name (data.parquet) below the
 partition directories. It’s just a leftover from when I started with
 Michael’s sample code, and it only made sense before I added the partition
 directories. I probably thought it was some magic name that was required
 when Spark scanned for Parquet files. The structure looks something like
 this:



 drwxr-xr-x   - user supergroup  0 2015-04-02 13:17
 hdfs://host/tablename/date=20150302/sym=A/data.parquet/...

 If I just move all the files up a level (there goes a day of work), the
 existing code should work fine. Whether it’s useful to handle intermediate
 non-partition directories or whether that just creates some extra risk I
 can’t say, since I’m new to all the technology in this whole stack.


I'm mixed here.  There is always a tradeoff between silently ignoring
structure that people might not be aware of (and thus might be a bug) and
just working.  Having this as an option at least seems reasonable.  I'd be
curious if anyone had other thoughts?
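
For concreteness, a minimal sketch of the layout partition discovery expects
once the extra data.parquet level is removed, and how it would be read. This
assumes Spark 1.3-era APIs (sqlContext.parquetFile was the entry point before
1.4); the path and column names come from the example above:

    // Expected Hive-style layout after moving the files up one level:
    //   hdfs://host/tablename/date=20150302/sym=A/  (data files directly here)
    //
    // Reading the table root lets Spark discover `date` and `sym` as
    // partition columns and add them to the schema.
    val df = sqlContext.parquetFile("hdfs://host/tablename")
    df.printSchema()  // data columns plus date and sym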


   Unfortunately, it takes many minutes (even with mergeSchema=false) to
 create the RDD. It appears that the whole data store will still be
 recursively traversed (even with mergeSchema=false, a manually specified
 schema, and a partition spec [which I can’t pass in through a public API])
 so that all of the metadata FileStatuses can be cached. In my case I’m
 going to have years of data, so there’s no way that will be feasible.



 Should I just explicitly load the partitions I want instead of using
 partition discovery? Is there any plan for a less aggressive version of
 partition support, where metadata is only cached for partitions that are
 used in queries?


We improved the speed here in 1.3.1, so I'd be curious if that helps.  We
definitely need to continue to speed things up here though.  We have to
enumerate all the partitions so we know what data to read when a query
comes in, but I do think we can parallelize it or something.
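
Until enumeration gets cheaper, a hedged sketch of the explicit-loading
workaround the question alludes to, again assuming Spark 1.3-era APIs. The
paths come from the example above; re-attaching partition values with
lit/withColumn is an illustrative pattern, not an official one:

    import org.apache.spark.sql.functions.lit

    // Point at a single leaf directory so only that partition's files
    // are listed, then re-attach the partition values as literal columns
    // so the schema matches what discovery would have produced.
    val partA = sqlContext
      .parquetFile("hdfs://host/tablename/date=20150302/sym=A")
      .withColumn("date", lit("20150302"))
      .withColumn("sym", lit("A"))

    // Further partitions can be loaded the same way and combined:
    // partA.unionAll(partB)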


Re: parquet partition discovery

2015-04-08 Thread Cheng Lian



On 4/9/15 3:09 AM, Michael Armbrust wrote:

I'm mixed here.  There is always a tradeoff between silently ignoring
structure that people might not be aware of (and thus might be a bug) and
just working.  Having this as an option at least seems reasonable.  I'd be
curious if anyone had other thoughts?

Take the following directory name as an example:

   /path/to/partition/a=1/random/b=foo

One possible approach would be to grab both a=1 and b=foo, and then
either report the unexpected random directory by throwing an exception
or ignore it with a WARN log.
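
A rough sketch of that rule (hypothetical code, not Spark's actual
implementation; the strict flag and the assumption that the path is
relative to the table root are mine):

    // Parse the segments of a path relative to the table root, keeping
    // k=v pairs as partition columns. A segment like "random" is either
    // rejected (strict) or logged with a WARN and skipped.
    def parsePartitions(relPath: String, strict: Boolean): Seq[(String, String)] =
      relPath.split("/").filter(_.nonEmpty).toSeq.flatMap { segment =>
        segment.split("=", 2) match {
          case Array(k, v) => Some(k -> v)
          case _ if strict =>
            throw new IllegalArgumentException(s"Unexpected directory: $segment")
          case _ =>
            System.err.println(s"WARN: ignoring non-partition directory $segment")
            None
        }
      }

    // parsePartitions("a=1/random/b=foo", strict = false)
    // => Seq(("a", "1"), ("b", "foo")), with a WARN for "random"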

