qudade edited a comment on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.
URL: https://github.com/apache/spark/pull/26118#issuecomment-552749031

@zero323 @HyukjinKwon I did some performance tests (details in [this gist](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d)).

As expected, for `Row`s that are created using kwargs (i.e. that have `__from_dict__` set) AND whose fields are ordered alphabetically, the performance is worse (~15% at 15 fields, ~25% at 150 fields) and [memory](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d#gistcomment-3080607) [consumption](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d#gistcomment-3080608) increases.

Of course, this is an edge case: I think it is rare to construct dataframes from `Row`s (otherwise this bug would have been fixed earlier), except in tests/experiments, where performance is less of an issue.

If performance is an issue, we could either check whether the order of fields is already alphabetical (making the performance worse for the general case) or determine the order once and reuse this mapping (which might require major changes).

In my experience, not being able to create a dataframe from dict-like `Row`s was a time-consuming annoyance. The value added by this PR is that it makes this possible.

What do you think? Is there any way to improve the code without making it unnecessarily complex?
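To make the "determine the order once and reuse this mapping" option concrete, here is a minimal sketch in plain Python. It is not the code in this PR and does not use any PySpark internals; `build_reorder`, `row_fields`, and `schema_fields` are hypothetical names chosen for illustration, and rows are modelled as plain tuples.

```python
def build_reorder(row_fields, schema_fields):
    """Return a function that reorders a row tuple into schema order.

    Hypothetical helper: the index mapping is computed once here and
    then reused for every row, instead of being looked up per row.
    """
    if list(row_fields) == list(schema_fields):
        # Fast path: the row is already in schema order.
        return tuple

    # Map each field name to its position in the incoming row.
    position = {name: i for i, name in enumerate(row_fields)}
    # For each schema field, remember where to find it in the row.
    indices = [position[name] for name in schema_fields]

    def reorder(row):
        return tuple(row[i] for i in indices)

    return reorder


# Fields of a kwargs-style row are assumed to be sorted alphabetically,
# while the schema declares them in a different order.
row_fields = ("age", "name", "zip")
schema_fields = ("name", "zip", "age")

reorder = build_reorder(row_fields, schema_fields)
print(reorder((30, "Alice", "12345")))  # ('Alice', '12345', 30)
```

The per-row cost is then a single pass over a precomputed index list. Whether this could be wired into the existing conversion path without the "major changes" mentioned above is an open question.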