Hi Partha, in the examples you gave:

* Simple struct: 2 string fields, 3 primitive/numeric fields
* Complex struct: 9 string fields, 4 primitive/numeric fields
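
For reference, these counts can be reproduced with a small sketch like
the one below. The nested dicts are a hand-written (hypothetical)
mirror of the two schemas quoted further down, not something read from
the file: a dict value stands for a group, and a string value is the
physical type of a leaf. It assumes every `binary` leaf is a String,
which holds for both schemas here.

```python
# Hand-written mirror of the quoted Parquet schemas (an assumption for
# illustration; nothing here touches the actual file).
DEVICE = {
    "typeIDService": "binary",
    "typeID": "binary",
    "screenWidth": "int32",
    "screenHeight": "int32",
    "colorDepth": "int32",
}

WEB = {
    "webPageDetails": {
        "name": "binary",
        "server": "binary",
        "URL": "binary",
        "isErrorPage": "boolean",
        "isHomePage": "boolean",
        "siteSection": "binary",
        "pageViews": {"value": "double"},
    },
    "webReferrer": {"type": "binary", "URL": "binary"},
    "webInteraction": {
        "type": "binary",
        "name": "binary",
        "linkClicks": {"value": "double"},
        "URL": "binary",
    },
}


def count_leaves(group):
    """Return (string_leaves, other_leaves) for a nested group."""
    strings = others = 0
    for child in group.values():
        if isinstance(child, dict):
            # Nested group: recurse and add its leaf counts.
            s, o = count_leaves(child)
            strings += s
            others += o
        elif child == "binary":
            strings += 1
        else:
            others += 1
    return strings, others


print("simple:", count_leaves(DEVICE))   # simple: (2, 3)
print("complex:", count_leaves(WEB))     # complex: (9, 4)
```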

I would guess that the larger number of binary/string fields and the
overall size of the schema (13 vs. 5 fields) are influencing the
decoding time more than the nesting level. That said, the more
deeply nested case has not been optimized as much as the shallow/flat
case.
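
One way to check this guess would be to time each leaf column on its
own and see whether the string leaves dominate. Below is a minimal
sketch of such a harness. `time_leaf_reads`, `slowest_first`, and the
`read_column` callable are hypothetical names introduced here, and the
stand-in reader only exists so that the snippet runs without a file.

```python
import time


def time_leaf_reads(read_column, leaf_paths, repeats=3):
    """Return {leaf_path: best wall-clock seconds over `repeats` reads}.

    `read_column` is any callable that reads one column given its path.
    """
    timings = {}
    for path in leaf_paths:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            read_column(path)
            best = min(best, time.perf_counter() - start)
        timings[path] = best
    return timings


def slowest_first(timings):
    """Sort (path, seconds) pairs from slowest to fastest."""
    return sorted(timings.items(), key=lambda kv: kv[1], reverse=True)


def fake_read(path):
    # Stand-in reader (assumption): does a little throwaway work
    # instead of decoding a real Parquet column.
    return sum(range(1000))


leaves = ["web.webPageDetails.name", "web.webPageDetails.pageViews.value"]
for path, seconds in slowest_first(time_leaf_reads(fake_read, leaves)):
    print(f"{path}: {seconds:.6f} s")
```

With pyarrow, `read_column` could wrap something like
`pq.ParquetFile(f).read_row_group(0, columns=[path])`, assuming that
dotted paths such as `web.webPageDetails.name` select a single nested
leaf in your version. Reading from a local copy of the file would keep
ADLS network time out of the measurement.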

On Mon, Mar 14, 2022 at 9:59 AM Partha Dutta <[email protected]> wrote:
>
> I've been trying to understand some slowness in my application. I am reading 
> data from Azure ADLS using fsspec, and I am finding that reading columns that 
> have nested structs are much slower.
>
> The file is about 1GB in size, and I am reading a single row group from the 
> file (approximately 453,000 records)
>
> I tried with different column types, and these are the execution times that I 
> observed to read a single row group, and a single column:
>
> timestamp column: 0.468 seconds
> simple struct (no nesting, 5 fields): 0.672 seconds
> nested struct (3 levels of nesting): 4.12 seconds
>
> This is the parquet definition of the simple struct:
> optional group field_id=-1 device {
>     optional binary field_id=676 typeIDService (String);
>     optional binary field_id=677 typeID (String);
>     optional int32 field_id=678 screenWidth;
>     optional int32 field_id=679 screenHeight;
>     optional int32 field_id=680 colorDepth;
>   }
>
> And this is the nested struct:
> optional group field_id=-1 web {
>     optional group field_id=-1 webPageDetails {
>       optional binary field_id=59 name (String);
>       optional binary field_id=60 server (String);
>       optional binary field_id=61 URL (String);
>       optional boolean field_id=62 isErrorPage;
>       optional boolean field_id=63 isHomePage;
>       optional binary field_id=64 siteSection (String);
>       optional group field_id=-1 pageViews {
>         optional double field_id=66 value;
>       }
>     }
>     optional group field_id=-1 webReferrer {
>       optional binary field_id=67 type (String);
>       optional binary field_id=68 URL (String);
>     }
>     optional group field_id=-1 webInteraction {
>       optional binary field_id=69 type (String);
>       optional binary field_id=70 name (String);
>       optional group field_id=-1 linkClicks {
>         optional double field_id=73 value;
>       }
>       optional binary field_id=72 URL (String);
>     }
>   }
>
> I am curious as to why the performance is so slow.
> --
> Partha Dutta
> [email protected]
