Hi Partha,

In the examples you gave:

* Simple struct: 2 string fields, 3 primitive/numeric fields
* Complex struct: 9 string fields, 4 primitive/numeric fields

I would guess that the larger number of binary/string fields and the
overall size of the schema (5 vs 13 fields) are influencing the decoding
time more than the nesting level. That said, the more deeply-nested case
has not been optimized as much as the shallow/flat case.

On Mon, Mar 14, 2022 at 9:59 AM Partha Dutta <[email protected]> wrote:
>
> I've been trying to understand some slowness in my application. I am reading
> data from Azure ADLS using fsspec, and I am finding that reading columns that
> have nested structs are much slower.
>
> The file is about 1GB in size, and I am reading a single row group from the
> file (approximately 453,000 records)
>
> I tried with different column types, and these are the execution times that I
> observed to read a single row group, and a single column:
>
> timestamp column: 0.468 seconds
> simple struct (no nesting, 5 fields): 0.672 seconds
> nested struct (3 levels of nesting): 4.12 seconds
>
> This is the parquet definition of the simple struct:
> optional group field_id=-1 device {
>   optional binary field_id=676 typeIDService (String);
>   optional binary field_id=677 typeID (String);
>   optional int32 field_id=678 screenWidth;
>   optional int32 field_id=679 screenHeight;
>   optional int32 field_id=680 colorDepth;
> }
>
> And this is the nested struct:
> optional group field_id=-1 web {
>   optional group field_id=-1 webPageDetails {
>     optional binary field_id=59 name (String);
>     optional binary field_id=60 server (String);
>     optional binary field_id=61 URL (String);
>     optional boolean field_id=62 isErrorPage;
>     optional boolean field_id=63 isHomePage;
>     optional binary field_id=64 siteSection (String);
>     optional group field_id=-1 pageViews {
>       optional double field_id=66 value;
>     }
>   }
>   optional group field_id=-1 webReferrer {
>     optional binary field_id=67 type (String);
>     optional binary field_id=68 URL (String);
>   }
>   optional group field_id=-1 webInteraction {
>     optional binary field_id=69 type (String);
>     optional binary field_id=70 name (String);
>     optional group field_id=-1 linkClicks {
>       optional double field_id=73 value;
>     }
>     optional binary field_id=72 URL (String);
>   }
> }
>
> I am curious as to why the performance is so slow.
> --
> Partha Dutta
> [email protected]
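
In case it is useful, here is a rough sketch of how the field counts above can be tallied from the two schema dumps. It is plain Python with no Parquet library involved. The `summarize` helper and its dictionary keys are names made up for illustration, and it assumes the one-declaration-per-line layout of the dumps quoted above.

```python
# Tally leaf columns and group depth from a textual Parquet schema dump.
# Assumption: one declaration per line, as in the quoted dumps.

SIMPLE = """
optional group field_id=-1 device {
  optional binary field_id=676 typeIDService (String);
  optional binary field_id=677 typeID (String);
  optional int32 field_id=678 screenWidth;
  optional int32 field_id=679 screenHeight;
  optional int32 field_id=680 colorDepth;
}
"""

NESTED = """
optional group field_id=-1 web {
  optional group field_id=-1 webPageDetails {
    optional binary field_id=59 name (String);
    optional binary field_id=60 server (String);
    optional binary field_id=61 URL (String);
    optional boolean field_id=62 isErrorPage;
    optional boolean field_id=63 isHomePage;
    optional binary field_id=64 siteSection (String);
    optional group field_id=-1 pageViews {
      optional double field_id=66 value;
    }
  }
  optional group field_id=-1 webReferrer {
    optional binary field_id=67 type (String);
    optional binary field_id=68 URL (String);
  }
  optional group field_id=-1 webInteraction {
    optional binary field_id=69 type (String);
    optional binary field_id=70 name (String);
    optional group field_id=-1 linkClicks {
      optional double field_id=73 value;
    }
    optional binary field_id=72 URL (String);
  }
}
"""


def summarize(schema_text):
    """Count binary (string) leaves, other leaves, and the deepest group level."""
    depth = 0
    max_depth = 0
    string_leaves = 0
    other_leaves = 0
    for raw in schema_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "}":
            # A closing brace ends the current group.
            depth -= 1
            continue
        # Declarations look like: <repetition> <type> field_id=<n> <name> ...
        physical_type = line.split()[1]
        if physical_type == "group":
            depth += 1
            max_depth = max(max_depth, depth)
        elif physical_type == "binary":
            string_leaves += 1
        else:
            other_leaves += 1
    return {
        "string_leaves": string_leaves,
        "other_leaves": other_leaves,
        "max_group_depth": max_depth,
    }


if __name__ == "__main__":
    print("simple:", summarize(SIMPLE))
    print("nested:", summarize(NESTED))
```

A tally like this only describes the shape of the schema. It says nothing about timings on its own, so separating the effect of field count from the effect of nesting would still need a measurement on the actual file.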
