Larry Parker created ARROW-9637:
-----------------------------------

             Summary: Speed degradation with categoricals
                 Key: ARROW-9637
                 URL: https://issues.apache.org/jira/browse/ARROW-9637
             Project: Apache Arrow
          Issue Type: Bug
    Affects Versions: 1.0.0
            Reporter: Larry Parker


I have noticed a major slowdown when using categorical data types.
For example, I have a Parquet file with 1 million rows and a query that sums
10 float columns and groups by two columns (one a date column and one a
category column).  The cardinality of the category column seems to have a
major effect.  When grouping on a category column of cardinality 10,
performance is decent (the query runs in 150 ms).  But with a cardinality of
100, the query runs in 10 seconds.  If I switch over to my Parquet file that
does *not* have categorical columns, the same query that took 10 seconds with
categoricals runs in 350 ms.

I would be happy to post the Pandas code that I'm using (including how I'm 
creating the Parquet file), but I first wanted to report this and see if it's a 
known issue.
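
A minimal sketch of the kind of query described above may help with
reproduction. It is not the actual code or data behind the timings in this
report: the column names (v0..v9, date, cat), the number of distinct dates,
the random data, and the helper names build_frame and time_groupby are all
assumptions, and the frame is built in memory rather than read from Parquet.
It also times the groupby with pandas' `observed=True` option, since with
`observed=False` pandas returns every category level rather than only the
combinations present in the data; whether that accounts for the slowdown
reported here has not been established.

```python
import time

import numpy as np
import pandas as pd


def build_frame(n_rows=1_000_000, cardinality=100, n_dates=365, seed=0):
    """Build a hypothetical frame: 10 float columns, a date and a string key."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame(
        rng.random((n_rows, 10)), columns=[f"v{i}" for i in range(10)]
    )
    # Random dates drawn from n_dates consecutive days (assumed layout).
    offsets = pd.to_timedelta(rng.integers(0, n_dates, n_rows), unit="D")
    df["date"] = pd.Timestamp("2020-01-01") + offsets
    # Key column with the requested number of distinct values.
    df["cat"] = rng.integers(0, cardinality, n_rows).astype(str)
    return df


def time_groupby(df, **groupby_kwargs):
    """Sum the 10 float columns grouped by (date, cat); return (seconds, result)."""
    value_cols = [c for c in df.columns if c.startswith("v")]
    start = time.perf_counter()
    out = df.groupby(["date", "cat"], **groupby_kwargs)[value_cols].sum()
    return time.perf_counter() - start, out


if __name__ == "__main__":
    plain = build_frame()
    categorical = plain.assign(cat=plain["cat"].astype("category"))
    cases = [
        ("string key", plain, {}),
        # observed=False was the pandas default when this issue was filed.
        ("categorical, observed=False", categorical, {"observed": False}),
        ("categorical, observed=True", categorical, {"observed": True}),
    ]
    for label, frame, kwargs in cases:
        elapsed, out = time_groupby(frame, **kwargs)
        print(f"{label}: {elapsed:.3f} s, {len(out)} result rows")
```

Comparing the three timings at cardinality 10 and 100 would show whether the
degradation follows the categorical dtype itself or the way the groups are
enumerated.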

Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
