qudade edited a comment on issue #26118: [SPARK-24915][Python] Fix Row handling 
with Schema.
URL: https://github.com/apache/spark/pull/26118#issuecomment-552749031
 
 
   @zero323 @HyukjinKwon 
   I did some performance tests (Details in [this 
gist](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d)).
   
   As expected, for `Row`s that are created using kwargs (i.e. that have 
`__from_dict__` set) AND whose fields are already ordered alphabetically, 
performance is worse (by ~15% at 15 fields and ~25% at 150 fields) and 
[memory](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d#gistcomment-3080607)
 
[consumption](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d#gistcomment-3080608)
 increases.
   
   Of course, this is an edge case. I think it is rare to construct dataframes 
from `Row`s (otherwise this bug would have been fixed earlier), except in 
tests/experiments, where performance is less of an issue.
   
   If performance is an issue, we could check if the order of fields is already 
alphabetical (making the performance worse for the general case) or determine 
the order once and reuse this mapping (might require major changes).
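
   A minimal sketch of the two options, in plain Python without PySpark. It 
assumes that kwargs-created `Row`s store their fields in alphabetical order, 
as discussed in this PR. The helper names (`build_reorder_mapping`, 
`is_identity`, `reorder_rows`) are hypothetical and are not part of the PR or 
of PySpark.

```python
from typing import List, Sequence, Tuple


def build_reorder_mapping(row_fields: Sequence[str],
                          schema_fields: Sequence[str]) -> List[int]:
    """For each schema field, return the index of that field in the row."""
    position = {name: i for i, name in enumerate(row_fields)}
    return [position[name] for name in schema_fields]


def is_identity(mapping: Sequence[int]) -> bool:
    """True if the row order already matches the schema order."""
    return all(i == j for i, j in enumerate(mapping))


def reorder_rows(rows: Sequence[Tuple],
                 row_fields: Sequence[str],
                 schema_fields: Sequence[str]) -> List[Tuple]:
    # Determine the order once and reuse the mapping for every row.
    mapping = build_reorder_mapping(row_fields, schema_fields)
    # Fast path: nothing to do if the fields are already in schema order.
    if is_identity(mapping):
        return list(rows)
    return [tuple(row[i] for i in mapping) for row in rows]


# Assumption: the row fields are sorted alphabetically, the schema is not.
row_fields = ["age", "name"]
schema_fields = ["name", "age"]
rows = [(30, "Alice"), (25, "Bob")]
reordered = reorder_rows(rows, row_fields, schema_fields)
```

   Here the mapping is computed once per batch rather than once per row, and 
the identity check costs one pass over the field names instead of one per 
value. Whether this fits the existing conversion code is an open question.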
   
   In my experience, not being able to create a dataframe from dict-like `Row`s 
was a time-consuming annoyance. The value added by this PR is to enable this.
   
   What do you think? Is there any way to improve the code without making it 
unnecessarily complex?
   
   
   
