[GitHub] [spark] qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.

2019-12-27 Thread GitBox
qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.
URL: https://github.com/apache/spark/pull/26118#issuecomment-569220466
 
 
   @HyukjinKwon Thanks for the quick response.
   
   I meant just applying this change (#26118) against 2.4. #26496 is certainly 
a major behaviour change, while #26118 only changes behaviour as much as one 
would expect from a bugfix.


[GitHub] [spark] qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.

2019-12-26 Thread GitBox
qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.
URL: https://github.com/apache/spark/pull/26118#issuecomment-569120524
 
 
   @HyukjinKwon I think #26496 will remove the main reason for this change, but 
that will only arrive with Spark 3.0.
   
   This change might be more helpful for 2.x. Does it make sense to open a PR 
against a 2.x branch? I have seen examples of backported fixes, but not of PRs 
opened solely against old branches. Any suggestions on how to proceed?


[GitHub] [spark] qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.

2019-11-27 Thread GitBox
qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.
URL: https://github.com/apache/spark/pull/26118#issuecomment-559249391
 
 
   Thanks!
   
   I agree; I wouldn't have used `Row`s if not for trying to recreate a problem 
in a test setting. After reading the `types.py` code I found another way. I 
hope this change spares someone this hassle.
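   
   The comment doesn't say which alternative that was; one common option (a 
hypothetical sketch, assuming an existing `spark` session) is to build the test 
data from plain tuples in schema order, so no name-based matching is involved:
   
   ```
   from datetime import date
   from pyspark.sql.types import StructType, StructField, StringType, DateType

   # Plain tuples in schema order sidestep Row's alphabetical kwargs
   # ordering entirely; `spark` is an existing SparkSession.
   schema = StructType([
       StructField("name", StringType()),
       StructField("day", DateType()),
   ])
   df = spark.createDataFrame([("alice", date(2019, 1, 1))], schema)
   ```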
   
   @zero323 @HyukjinKwon What is the process to get this PR merged?


[GitHub] [spark] qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.

2019-11-11 Thread GitBox
qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.
URL: https://github.com/apache/spark/pull/26118#issuecomment-552749031
 
 
   @zero323 @HyukjinKwon 
   I did some performance tests (details in 
https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d).
   
   As expected, for `Row`s created from kwargs (i.e. with `__from_dict__` set) 
and with fields ordered alphabetically, performance is worse (~15% at 15 
fields, ~25% at 150 fields), and 
[memory](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d#gistcomment-3080607) 
[consumption](https://gist.github.com/qudade/dc9d01f55d27d65ab66d68e3b8d1588d#gistcomment-3080608) 
increases.
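   
   For reference, a minimal sketch of the kind of micro-benchmark behind such 
numbers (the linked gist has the actual harness; the field and repeat counts 
here are illustrative assumptions):
   
   ```
   import timeit
   from datetime import date
   from pyspark.sql import Row
   from pyspark.sql.types import StructType, StructField, DateType

   # DateType needs conversion, so this exercises the serialization branch
   # of StructType.toInternal touched by this PR, not the tuple() fast path.
   names = ["f%02d" % i for i in range(15)]           # alphabetical order
   schema = StructType([StructField(n, DateType()) for n in names])
   row = Row(**{n: date(2019, 1, 1) for n in names})  # kwargs-created Row

   print(timeit.timeit(lambda: schema.toInternal(row), number=10000))
   ```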
   
   Of course, this is an edge case. I think it is rare to construct dataframes 
from `Row`s (otherwise this bug would have been fixed earlier), but it does 
come up in tests and experiments, where performance is less of an issue.
   
   If performance is an issue, we could check whether the order of fields is 
already alphabetical (at the cost of slowing down the general case) or 
determine the order once and reuse that mapping (which might require major 
changes); a hypothetical sketch of the first option follows.
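   
   A purely hypothetical sketch of that first option, as an extra branch inside 
`StructType.toInternal` (not part of this PR):
   
   ```
   # Hypothetical fast path: if the Row's own field order already matches
   # the schema, keep the cheap positional zip instead of name lookups.
   if getattr(obj, "__fields__", None) == self.names:
       return tuple(f.toInternal(v) if c else v
                    for f, v, c in zip(self.fields, obj, self._needConversion))
   ```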
   
   In my experience, not being able to create a dataframe from dict-like `Row`s 
was a time-consuming annoyance. The value added by this PR is to enable this.
   
   What do you think? Is there any way to improve the code without making it 
unnecessarily complex?
   
   
   


[GitHub] [spark] qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.

2019-10-26 Thread GitBox
qudade commented on issue #26118: [SPARK-24915][Python] Fix Row handling with Schema.
URL: https://github.com/apache/spark/pull/26118#issuecomment-546634920
 
 
   
   @zero323 Thanks for looking into this.
   
   > Personally I'd prefer to wait a moment and see where the discussion on 
SPARK-22232 goes. If the resolution is introduction of legacy mode, then the 
scope of this particular change could be conditioned on it and Python version.
   
   I agree that an update to `Row` is required, and I'm happy to see the 
discussion going in the right direction. However, that will be part of Spark 
3.x, and for the many users running Spark 2.x in production it will be a long 
wait until it arrives.
   
   Do you think it makes sense to apply this change for Spark 2.x?
   
   > If not I'd like to see some memory profiling data (especially memory - 
timings might be actually better for now, as we skip all the nasty `obj[n]`, 
but that's not very meaningful*) first.
   
   I can try to do that. Do you have an example of how to do proper timings 
with Spark?
   
   > It is also worth mentioning that SPARK-24915 is really an edge case. If 
user wants to provide schema then it is hard to justify using `Rows` (plain 
`tuple` or different flavors of `dict` are much better choice in such case), so 
making everyone worse (if I am right about performance impact) to support it 
doesn't make much sense.
   
   In my case it was about reproducing input for a test case. I just wanted to 
create a dataframe containing the same rows as the problematic dataframe. In 
the end, I chose a different way to produce the data, but it feels strange not 
to be able to simply create `Row`s the way you see them in some dataframe.
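   
   For context, a minimal sketch of the kind of failure involved (based on the 
SPARK-24915 report; the exact types and error depend on the Spark 2.x version):
   
   ```
   from datetime import date
   from pyspark.sql import Row, SparkSession
   from pyspark.sql.types import StructType, StructField, StringType, DateType

   spark = SparkSession.builder.master("local[1]").getOrCreate()

   # kwargs-created Rows sort their fields alphabetically: ('day', 'name').
   rows = [Row(name="alice", day=date(2019, 1, 1))]

   # The schema order deliberately differs from the Row's alphabetical
   # order, and DateType forces the serialization branch of toInternal.
   schema = StructType([
       StructField("name", StringType(), False),
       StructField("day", DateType(), False),
   ])

   # Without this fix, values are zipped positionally against the schema,
   # so "alice" lands in the DateType slot and the conversion fails.
   spark.createDataFrame(rows, schema).show()
   ```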
   
   > * Is there any reason why we do this:
   > 
   > 
https://github.com/apache/spark/blob/2115bf61465b504bc21e37465cb34878039b5cb8/python/pyspark/sql/types.py#L615
   > 
   > instead of just `tuple(obj)`? That's huge performance bottleneck with wide 
schemas. Depending on the resolution of this one, that's something to fix, 
don't you think?
   
   This was a workaround introduced by #14469. `tuple(obj)` relies on the order 
of fields, which for a kwargs-created `Row` is alphabetical. If that order 
doesn't correspond to the schema, the fields get mixed up.
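   
   A quick illustration of that ordering on Spark 2.x:
   
   ```
   >>> from pyspark.sql import Row
   >>> r = Row(name="alice", age=1)  # kwargs Row: fields sorted alphabetically
   >>> r.__fields__
   ['age', 'name']
   >>> tuple(r)                      # tuple(obj) yields values in field order
   (1, 'alice')
   ```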
   
   My change here is actually just doing the same workaround for schemas with 
fields that need some serialization.
   
   I'm not sure whether the proposed change can have any negative effect. The 
code path is only taken in cases that previously either failed (as in 
[SPARK-24915](https://issues.apache.org/jira/browse/SPARK-24915)) or might have 
silently mixed up the `.toInternal()` calls of different types.
   
   Even with sub-optimal performance, this would only improve the situation for 
users.
   
   Would you prefer to replace
   
   
https://github.com/qudade/spark/blob/a52de2e4b258e7fecad4143e00f01df4b096a513/python/pyspark/sql/types.py#L603
   
   ```
   return self.toInternal(obj.asDict())
   ```
   
   by
   
   ```
   return tuple(f.toInternal(obj[n]) if c else obj[n]
                for n, f, c in zip(self.names, self.fields,
                                   self._needConversion))
   ```
   
   to reduce one indirection?

