Re: parquet optimal file structure - flat vs nested

2017-05-03 Thread Steve Loughran

> On 30 Apr 2017, at 09:19, Zeming Yu  wrote:
> 
> Hi,
> 
> We're building a parquet based data lake. I was under the impression that 
> flat files are more efficient than deeply nested files (say 3 or 4 levels 
> down). Is that correct?
> 
> Thanks,
> Zeming

Where's the data going to live: HDFS or an object store? If it's somewhere like 
Amazon S3, I'd be biased towards the flatter structure: the way the client 
libraries mimic tree walking is pretty expensive in terms of HTTP calls, and 
because those calls all take place during the initial, serialized query planning 
stage, that cost is paid up front before any work is parallelized.
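
To put rough numbers on the listing cost, here is a toy model (not real S3 client code). It assumes one LIST request per directory for a tree walk and one paged LIST of up to 1000 keys per request for a flat prefix scan. The function names and the tree shape are made up for illustration; real clients may issue more or fewer calls.

```python
from typing import List

def build_nested_keys(levels: int, fanout: int, files_per_leaf: int) -> List[str]:
    """Generate object keys for a directory tree that is `levels` deep."""
    prefixes = [""]
    for depth in range(levels):
        prefixes = [f"{p}d{depth}_{i}/" for p in prefixes for i in range(fanout)]
    return [f"{p}part-{n}.parquet" for p in prefixes for n in range(files_per_leaf)]

def treewalk_list_calls(keys: List[str]) -> int:
    """Assumption: a tree walk costs one LIST request per directory, root included."""
    dirs = {""}
    for key in keys:
        parts = key.split("/")[:-1]
        for i in range(1, len(parts) + 1):
            dirs.add("/".join(parts[:i]) + "/")
    return len(dirs)

def flat_list_calls(keys: List[str], page_size: int = 1000) -> int:
    """Assumption: a flat scan costs one LIST request per page of `page_size` keys."""
    return max(1, -(-len(keys) // page_size))  # ceiling division

keys = build_nested_keys(levels=3, fanout=10, files_per_leaf=2)
nested_calls = treewalk_list_calls(keys)
flat_calls = flat_list_calls(keys)
print(len(keys), nested_calls, flat_calls)
```

Under these assumptions the same 2000 objects need over a thousand requests when walked directory by directory, versus a couple of paged requests for a flat scan.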



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
Can you give more details on the schema? Is the 6 TB just airport information 
like the example below? 

> On 30. Apr 2017, at 23:08, Zeming Yu  wrote:
> 
> I thought relational databases with 6 TB of data can be quite expensive?
> 
>> On 1 May 2017 12:56 am, "Muthu Jayakumar"  wrote:
>> I am not sure parquet is a good fit for this. This seems more like a filter 
>> lookup than an aggregate-style query. I am curious to see what others have to 
>> say.
>> Would a relational database with the right index (on the code field in the 
>> above case) perform more efficiently (with Spark using predicate 
>> push-down)? 
>> Hope this helps.
>> 
>> Thanks,
>> Muthu
>> 
>>> On Sun, Apr 30, 2017 at 1:45 AM, Zeming Yu  wrote:
>>> Another question: I need to store airport info in a parquet file and 
>>> present it when a user makes a query. 
>>> 
>>> For example:
>>> 
>>> "airport": {
>>>     "code": "TPE",
>>>     "name": "Taipei (Taoyuan Intl.)",
>>>     "longName": "Taipei, Taiwan (TPE-Taoyuan Intl.)",
>>>     "city": "Taipei",
>>>     "localName": "Taoyuan Intl.",
>>>     "airportCityState": "Taipei, Taiwan"
>>> }
>>> 
>>> Is it best practice to store just the code "TPE" and then look up the name 
>>> "Taipei (Taoyuan Intl.)" from a relational database? Any alternatives?
>>> 
>>>> On Sun, Apr 30, 2017 at 6:34 PM, Jörn Franke  wrote:
>>>> It depends on your queries, the data structure etc. Generally flat is 
>>>> better, but if your query filters on the highest level then you may get 
>>>> better performance with a nested structure. It really depends.
>>>> 
>>>> > On 30. Apr 2017, at 10:19, Zeming Yu  wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > We're building a parquet based data lake. I was under the impression 
>>>> > that flat files are more efficient than deeply nested files (say 3 or 4 
>>>> > levels down). Is that correct?
>>>> >
>>>> > Thanks,
>>>> > Zeming
>>> 
>> 


Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
You have to find out how the user filters: by code? By airport name? Then you 
can choose the right structure. That said, in the scenario below ORC with bloom 
filters may have some advantages.
It is crucial that you sort the data on insert by the columns your users 
filter on. E.g. if they filter by code, then the data needs to be sorted on the 
code.
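
To illustrate why the sort order matters, here is a small simulation (plain Python, not Parquet or ORC itself). It assumes the reader keeps a min/max value per row group and skips any group whose range cannot contain the filter value. The airport codes other than TPE, the group size and the function names are made up for the example.

```python
from typing import List, Tuple

def row_group_stats(codes: List[str], group_size: int) -> List[Tuple[str, str]]:
    """Split the column into fixed-size groups and keep (min, max) per group."""
    groups = [codes[i:i + group_size] for i in range(0, len(codes), group_size)]
    return [(min(g), max(g)) for g in groups]

def groups_to_read(stats: List[Tuple[str, str]], wanted: str) -> int:
    """Count the groups whose [min, max] range could contain `wanted`."""
    return sum(1 for lo, hi in stats if lo <= wanted <= hi)

airport_codes = ["AKL", "LHR", "MEL", "SYD", "TPE"]  # hypothetical sample values

# 1000 rows written in arrival order: every group contains every code.
interleaved = [code for _ in range(200) for code in airport_codes]
# The same 1000 rows sorted on the filter column before writing.
sorted_codes = sorted(interleaved)

unsorted_reads = groups_to_read(row_group_stats(interleaved, 100), "TPE")
sorted_reads = groups_to_read(row_group_stats(sorted_codes, 100), "TPE")
print(unsorted_reads, sorted_reads)
```

In this sketch the unsorted layout forces all 10 groups to be read for a filter on "TPE", while the sorted layout narrows it to the 2 groups that actually hold those rows.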


> On 30. Apr 2017, at 23:08, Zeming Yu  wrote:
> 
> I thought relational databases with 6 TB of data can be quite expensive?
> 
>> On 1 May 2017 12:56 am, "Muthu Jayakumar"  wrote:
>> I am not sure parquet is a good fit for this. This seems more like a filter 
>> lookup than an aggregate-style query. I am curious to see what others have to 
>> say.
>> Would a relational database with the right index (on the code field in the 
>> above case) perform more efficiently (with Spark using predicate 
>> push-down)? 
>> Hope this helps.
>> 
>> Thanks,
>> Muthu
>> 
>>> On Sun, Apr 30, 2017 at 1:45 AM, Zeming Yu  wrote:
>>> Another question: I need to store airport info in a parquet file and 
>>> present it when a user makes a query. 
>>> 
>>> For example:
>>> 
>>> "airport": {
>>>     "code": "TPE",
>>>     "name": "Taipei (Taoyuan Intl.)",
>>>     "longName": "Taipei, Taiwan (TPE-Taoyuan Intl.)",
>>>     "city": "Taipei",
>>>     "localName": "Taoyuan Intl.",
>>>     "airportCityState": "Taipei, Taiwan"
>>> }
>>> 
>>> Is it best practice to store just the code "TPE" and then look up the name 
>>> "Taipei (Taoyuan Intl.)" from a relational database? Any alternatives?
>>> 
>>>> On Sun, Apr 30, 2017 at 6:34 PM, Jörn Franke  wrote:
>>>> It depends on your queries, the data structure etc. Generally flat is 
>>>> better, but if your query filters on the highest level then you may get 
>>>> better performance with a nested structure. It really depends.
>>>> 
>>>> > On 30. Apr 2017, at 10:19, Zeming Yu  wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > We're building a parquet based data lake. I was under the impression 
>>>> > that flat files are more efficient than deeply nested files (say 3 or 4 
>>>> > levels down). Is that correct?
>>>> >
>>>> > Thanks,
>>>> > Zeming
>>> 
>> 


Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
I thought relational databases with 6 TB of data can be quite expensive?

On 1 May 2017 12:56 am, "Muthu Jayakumar"  wrote:

> I am not sure parquet is a good fit for this. This seems more like a
> filter lookup than an aggregate-style query. I am curious to see what others
> have to say.
> Would a relational database with the right index (on the code field in the
> above case) perform more efficiently (with Spark using predicate
> push-down)?
> Hope this helps.
>
> Thanks,
> Muthu
>
> On Sun, Apr 30, 2017 at 1:45 AM, Zeming Yu  wrote:
>
>> Another question: I need to store airport info in a parquet file and
>> present it when a user makes a query.
>>
>> For example:
>>
>> "airport": {
>>     "code": "TPE",
>>     "name": "Taipei (Taoyuan Intl.)",
>>     "longName": "Taipei, Taiwan (TPE-Taoyuan Intl.)",
>>     "city": "Taipei",
>>     "localName": "Taoyuan Intl.",
>>     "airportCityState": "Taipei, Taiwan"
>> }
>>
>> Is it best practice to store just the code "TPE" and then look up the
>> name "Taipei (Taoyuan Intl.)" from a relational database? Any alternatives?
>>
>> On Sun, Apr 30, 2017 at 6:34 PM, Jörn Franke 
>> wrote:
>>
>>> It depends on your queries, the data structure etc. Generally flat is
>>> better, but if your query filters on the highest level then you may get
>>> better performance with a nested structure. It really depends.
>>>
>>> > On 30. Apr 2017, at 10:19, Zeming Yu  wrote:
>>> >
>>> > Hi,
>>> >
>>> > We're building a parquet based data lake. I was under the impression
>>> > that flat files are more efficient than deeply nested files (say 3 or 4
>>> > levels down). Is that correct?
>>> >
>>> > Thanks,
>>> > Zeming
>>>
>>
>>
>


Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
Another question: I need to store airport info in a parquet file and
present it when a user makes a query.

For example:

"airport": {
    "code": "TPE",
    "name": "Taipei (Taoyuan Intl.)",
    "longName": "Taipei, Taiwan (TPE-Taoyuan Intl.)",
    "city": "Taipei",
    "localName": "Taoyuan Intl.",
    "airportCityState": "Taipei, Taiwan"
}

Is it best practice to store just the code "TPE" and then look up the name
"Taipei (Taoyuan Intl.)" from a relational database? Any alternatives?

On Sun, Apr 30, 2017 at 6:34 PM, Jörn Franke  wrote:

> It depends on your queries, the data structure etc. Generally flat is better,
> but if your query filters on the highest level then you may get better
> performance with a nested structure. It really depends.
>
> > On 30. Apr 2017, at 10:19, Zeming Yu  wrote:
> >
> > Hi,
> >
> > We're building a parquet based data lake. I was under the impression
> > that flat files are more efficient than deeply nested files (say 3 or 4
> > levels down). Is that correct?
> >
> > Thanks,
> > Zeming
>


Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Jörn Franke
It depends on your queries, the data structure etc. Generally flat is better, 
but if your query filters on the highest level then you may get better 
performance with a nested structure. It really depends.
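
For reference, "flat" versus "nested" here can be pictured with a minimal sketch like the one below. It assumes a nested record is flattened by joining field names with an underscore. The `flatten` helper and the naming scheme are illustrative choices only, not something Parquet or Spark prescribes.

```python
from typing import Any, Dict

def flatten(record: Dict[str, Any], parent: str = "", sep: str = "_") -> Dict[str, Any]:
    """Turn nested dicts into a single-level dict with joined column names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into the nested struct, carrying the parent name along.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat

nested_row = {
    "airport": {
        "code": "TPE",
        "name": "Taipei (Taoyuan Intl.)",
        "city": "Taipei",
    }
}
flat_row = flatten(nested_row)
print(flat_row)
```

Each leaf field becomes its own top-level column (`airport_code`, `airport_name`, `airport_city`), which is the layout the "flat" option refers to.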

> On 30. Apr 2017, at 10:19, Zeming Yu  wrote:
> 
> Hi,
> 
> We're building a parquet based data lake. I was under the impression that 
> flat files are more efficient than deeply nested files (say 3 or 4 levels 
> down). Is that correct?
> 
> Thanks,
> Zeming




parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
Hi,

We're building a parquet based data lake. I was under the impression that
flat files are more efficient than deeply nested files (say 3 or 4 levels
down). Is that correct?

Thanks,
Zeming