I've been trying to understand some slowness in my application. I am
reading data from Azure ADLS using fsspec, and I am finding that reading
columns that contain nested structs is much slower.

The file is about 1 GB in size, and I am reading a single row group from the
file (approximately 453,000 records).

I tried different column types, and these are the execution times that
I observed when reading a single column from a single row group:

timestamp column: 0.468 seconds
simple struct (no nesting, 5 fields): 0.672 seconds
nested struct (3 levels of nesting): 4.12 seconds

This is the parquet definition of the simple struct:
optional group field_id=-1 device {
    optional binary field_id=676 typeIDService (String);
    optional binary field_id=677 typeID (String);
    optional int32 field_id=678 screenWidth;
    optional int32 field_id=679 screenHeight;
    optional int32 field_id=680 colorDepth;
  }

And this is the nested struct:
optional group field_id=-1 web {
    optional group field_id=-1 webPageDetails {
      optional binary field_id=59 name (String);
      optional binary field_id=60 server (String);
      optional binary field_id=61 URL (String);
      optional boolean field_id=62 isErrorPage;
      optional boolean field_id=63 isHomePage;
      optional binary field_id=64 siteSection (String);
      optional group field_id=-1 pageViews {
        optional double field_id=66 value;
      }
    }
    optional group field_id=-1 webReferrer {
      optional binary field_id=67 type (String);
      optional binary field_id=68 URL (String);
    }
    optional group field_id=-1 webInteraction {
      optional binary field_id=69 type (String);
      optional binary field_id=70 name (String);
      optional group field_id=-1 linkClicks {
        optional double field_id=73 value;
      }
      optional binary field_id=72 URL (String);
    }
  }

I am curious as to why the performance is so slow.
-- 
Partha Dutta
[email protected]