[GitHub] [spark] qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.
qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema. URL: https://github.com/apache/spark/pull/26118#issuecomment-569220466 @HyukjinKwon Thanks for the quick response. I meant just applying this change (#26118) against 2.4. #26496 is certainly a major behaviour change, while #26118 changes behaviour only as much as expected from a bugfix. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema. URL: https://github.com/apache/spark/pull/26118#issuecomment-569120524 @HyukjinKwon I think #26496 will remove the main reason for this change, but that will only be part of Spark 3.0. This change might be more helpful for 2.x. Does it make sense to open a PR against a 2.x branch? I saw examples of backporting fixes, but not of PRs aimed solely at old branches. Any suggestions on how to proceed?
qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema. URL: https://github.com/apache/spark/pull/26118#issuecomment-559249391 Thanks! I agree, I wouldn't have used `Row`s had I not been trying to recreate a problem in a test setting. After reading the `types.py` code I found another way. I hope this change spares someone else this hassle. @zero323 @HyukjinKwon What is the process for getting this PR merged?
qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema. URL: https://github.com/apache/spark/pull/26118#issuecomment-552749031 @zero323 @HyukjinKwon I did some performance tests (details in https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d). As expected, for `Row`s that are created using kwargs (and thus have `__from_dict__`) AND whose fields are ordered alphabetically, performance is worse (~15% at 15 fields, ~25% at 150 fields) and [memory](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d#gistcomment-3080607) [consumption](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d#gistcomment-3080608) increases. Of course, this is an edge case: I think it is rare to construct dataframes from `Row`s (otherwise this bug would have been fixed earlier), except in tests/experiments, where performance is less of an issue. If performance is a concern, we could check whether the order of fields is already alphabetical (slightly slowing down the general case) or determine the order once and reuse that mapping (which might require major changes). In my experience, not being able to create a dataframe from dict-like `Row`s was a time-consuming annoyance, and the value added by this PR is to remove that limitation. What do you think? Is there any way to improve the code without making it unnecessarily complex?
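The "determine the order once and reuse that mapping" idea can be sketched without pyspark. `FakeRow` below is a hypothetical stand-in for a kwargs-created `Row` (which stores its fields in alphabetical order), and `make_reorderer` is an illustrative helper, not Spark code:

```python
class FakeRow(tuple):
    """Mimics a pyspark.sql.Row created from kwargs: fields are stored
    in alphabetical order, regardless of the schema's field order."""
    def __new__(cls, **kwargs):
        names = sorted(kwargs)
        obj = super().__new__(cls, (kwargs[n] for n in names))
        obj.__fields__ = names
        return obj

def make_reorderer(schema_names, row_names):
    # Computed once per schema: where each schema field sits inside the row.
    positions = [row_names.index(n) for n in schema_names]
    # Reused for every row: a cheap positional shuffle, no per-row sorting.
    return lambda row: tuple(row[i] for i in positions)

schema = ["id", "amount"]            # schema declares id first
row = FakeRow(amount=2.5, id=7)      # but the row stores (2.5, 7) alphabetically
reorder = make_reorderer(schema, row.__fields__)
print(reorder(row))                  # (7, 2.5) -- schema order restored
```

The point of the sketch is only that the `index` lookups happen once per schema rather than once per row; how that cache would be threaded through `types.py` is the part that might require major changes.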
qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema. URL: https://github.com/apache/spark/pull/26118#issuecomment-546634920 @zero323 Thanks for looking into this.

> Personally I'd prefer to wait a moment and see where the discussion on SPARK-22232 goes. If the resolution is introduction of legacy mode, then the scope of this particular change could be conditioned on it and Python version.

I agree that an update to `Row` is required and I'm happy to see discussions going in the right direction. However, this will be part of Spark 3.x, and for the many Spark users running it in production that will be a long wait. Do you think it makes sense to apply this change for Spark 2.x?

> If not I'd like to see some memory profiling data (especially memory - timings might be actually better for now, as we skip all the nasty `obj[n]`, but that's not very meaningful*) first.

I can try to do that. Do you have any example of how to do proper timings with Spark?

> It is also worth mentioning that SPARK-24915 is really an edge case. If user wants to provide schema then it is hard to justify using `Rows` (plain `tuple` or different flavors of `dict` are much better choice in such case), so making everyone worse (if I am right about performance impact) to support it doesn't make much sense.

In my case it was about reproducing input for a test case. I just wanted to create a dataframe containing the same rows as the problematic dataframe. In the end, I chose a different way to produce the data, but it feels strange not to be able to simply create `Row`s the way you see them in some dataframe.

> * Is there any reason why we do this:
>
> https://github.com/apache/spark/blob/2115bf61465b504bc21e37465cb34878039b5cb8/python/pyspark/sql/types.py#L615
>
> instead of just `tuple(obj)`? That's huge performance bottleneck with wide schemas. Depending on the resolution of this one, that's something to fix, don't you think?
This was a workaround introduced by #14469. `tuple(obj)` relies on the order of fields, which for a `Row` is alphabetical. If that order doesn't correspond to the schema, the fields will be mixed up. My change here just applies the same workaround to schemas with fields that need serialization.

I'm not sure whether the proposed change will have any negative effect. The code path should only be taken in cases that previously either failed (as in [SPARK-24915](https://issues.apache.org/jira/browse/SPARK-24915)) or might have silently mixed up the `.toInternal()` calls of different types. Even with sub-optimal performance, this would only improve the situation for users.

Would you prefer to replace https://github.com/qudade/spark/blob/a52de2e4b258e7fecad4143e00f01df4b096a513/python/pyspark/sql/types.py#L603

```
return self.toInternal(obj.asDict())
```

with

```
return tuple(f.toInternal(obj[n]) if c else obj[n]
             for n, f, c in zip(self.names, self.fields, self._needConversion))
```

to remove one level of indirection?
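The by-name lookup in the snippet above can be illustrated without pyspark. `FakeDateType`, `FakeLongType`, and `FakeStruct` below are hypothetical stand-ins for Spark's `StructField` types (a plain dict stands in for the `Row`); only the shape of the `zip`-based conversion mirrors the proposal:

```python
from datetime import date

class FakeDateType:
    """Stand-in for a type that needs conversion: dates become
    days-since-epoch, loosely mimicking Spark's internal DateType storage."""
    EPOCH = date(1970, 1, 1)
    def needConversion(self):
        return True
    def toInternal(self, d):
        return (d - self.EPOCH).days

class FakeLongType:
    """Stand-in for a type stored as-is."""
    def needConversion(self):
        return False

class FakeStruct:
    def __init__(self, fields):  # fields: [(name, type), ...] in schema order
        self.names = [n for n, _ in fields]
        self.fields = [t for _, t in fields]
        self._needConversion = [t.needConversion() for t in self.fields]

    def to_internal(self, obj):
        # Look each field up by NAME (obj[n]), so a Row-like object whose
        # positional order is alphabetical still meets the right field type,
        # and toInternal() is only paid for fields that need it.
        return tuple(f.toInternal(obj[n]) if c else obj[n]
                     for n, f, c in zip(self.names, self.fields,
                                        self._needConversion))

struct = FakeStruct([("when", FakeDateType()), ("id", FakeLongType())])
row = {"id": 42, "when": date(1970, 1, 11)}
print(struct.to_internal(row))   # (10, 42) -- converted, in schema order
```

This is the failure mode the PR targets: with positional `tuple(obj)`, the date value could land under the wrong field and either crash or be silently mis-converted; by-name access keeps value and type paired.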