I've been trying to understand some slowness in my application. I am
reading data from Azure ADLS using fsspec, and I am finding that reading
columns that contain nested structs is much slower.
The file is about 1 GB in size, and I am reading a single row group from
the file (approximately 453,000 records).
I tried with different column types, and these are the execution times that
I observed to read a single row group, and a single column:
timestamp column: 0.468 seconds
simple struct (no nesting, 5 fields): 0.672 seconds
nested struct (3 levels of nesting): 4.12 seconds
This is the parquet definition of the simple struct:
optional group field_id=-1 device {
  optional binary field_id=676 typeIDService (String);
  optional binary field_id=677 typeID (String);
  optional int32 field_id=678 screenWidth;
  optional int32 field_id=679 screenHeight;
  optional int32 field_id=680 colorDepth;
}
And this is the nested struct:
optional group field_id=-1 web {
  optional group field_id=-1 webPageDetails {
    optional binary field_id=59 name (String);
    optional binary field_id=60 server (String);
    optional binary field_id=61 URL (String);
    optional boolean field_id=62 isErrorPage;
    optional boolean field_id=63 isHomePage;
    optional binary field_id=64 siteSection (String);
    optional group field_id=-1 pageViews {
      optional double field_id=66 value;
    }
  }
  optional group field_id=-1 webReferrer {
    optional binary field_id=67 type (String);
    optional binary field_id=68 URL (String);
  }
  optional group field_id=-1 webInteraction {
    optional binary field_id=69 type (String);
    optional binary field_id=70 name (String);
    optional group field_id=-1 linkClicks {
      optional double field_id=73 value;
    }
    optional binary field_id=72 URL (String);
  }
}
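One thing that may be relevant: Parquet stores each leaf field as its own column chunk, so reading a struct column means reading every leaf underneath it. The sketch below (plain Python; leaf_paths is a hypothetical helper, and it only handles schema dumps shaped like the ones above) counts the leaves in each struct.

```python
def leaf_paths(schema_text):
    """Return dotted paths of the leaf (non-group) fields in a schema dump."""
    stack, leaves = [], []
    for raw in schema_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("{"):
            # Group line, e.g. "optional group field_id=-1 web {"
            stack.append(line.split()[-2])
        elif line.startswith("}"):
            stack.pop()
        else:
            # Leaf line, e.g. "optional binary field_id=59 name (String);"
            tokens = line.rstrip(";").split()
            idx = next(i for i, t in enumerate(tokens) if t.startswith("field_id="))
            leaves.append(".".join(stack + [tokens[idx + 1]]))
    return leaves


DEVICE_SCHEMA = """
optional group field_id=-1 device {
  optional binary field_id=676 typeIDService (String);
  optional binary field_id=677 typeID (String);
  optional int32 field_id=678 screenWidth;
  optional int32 field_id=679 screenHeight;
  optional int32 field_id=680 colorDepth;
}
"""

WEB_SCHEMA = """
optional group field_id=-1 web {
  optional group field_id=-1 webPageDetails {
    optional binary field_id=59 name (String);
    optional binary field_id=60 server (String);
    optional binary field_id=61 URL (String);
    optional boolean field_id=62 isErrorPage;
    optional boolean field_id=63 isHomePage;
    optional binary field_id=64 siteSection (String);
    optional group field_id=-1 pageViews {
      optional double field_id=66 value;
    }
  }
  optional group field_id=-1 webReferrer {
    optional binary field_id=67 type (String);
    optional binary field_id=68 URL (String);
  }
  optional group field_id=-1 webInteraction {
    optional binary field_id=69 type (String);
    optional binary field_id=70 name (String);
    optional group field_id=-1 linkClicks {
      optional double field_id=73 value;
    }
    optional binary field_id=72 URL (String);
  }
}
"""

print(len(leaf_paths(DEVICE_SCHEMA)))  # 5
print(len(leaf_paths(WEB_SCHEMA)))     # 13
```

So the nested struct has 13 leaf columns against 5 for the simple one, a factor of 2.6, while the read time differs by roughly 6x. The leaf count alone does not seem to account for the whole gap.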
I am curious as to why the performance is so slow.
--
Partha Dutta