Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@MLnick @srowen @imatiach-msft Thank you for your comments and for helping
making this improvement robust.
---
If your project is set up for it, you can reply to this email and have your
reply
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17059
Merged to master. Thanks @datumbox, and everyone for reviews.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17059
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #73718 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73718/testReport)**
for PR 17059 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17059
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73718/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #73718 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73718/testReport)**
for PR 17059 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17059
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17059
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73706/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #73706 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73706/testReport)**
for PR 17059 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #73706 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73706/testReport)**
for PR 17059 at commit
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@MLnick Cool did not know that. Just updated commit messages. I think I'm
good with this PR from my side.
---
If your project is set up for it, you can reply to this email and have your
reply
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17059
@datumbox once a PR is ok'ed to test from a committer, it should
automatically retest on each commit.
---
If your project is set up for it, you can reply to this email and have your
reply appear on
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17059
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73697/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17059
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #73697 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73697/testReport)**
for PR 17059 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #73697 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73697/testReport)**
for PR 17059 at commit
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@srowen @MLnick I had a typo on the committed version... Could you please
retest it? I don't think I have the rights to do it myself.
---
If your project is set up for it, you can reply to this
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17059
Merged build finished. Test FAILed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17059
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73693/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #73693 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73693/testReport)**
for PR 17059 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #73693 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73693/testReport)**
for PR 17059 at commit
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@MLnick I accepted all proposed changes. Please retest. :)
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17059
@imatiach-msft thanks for reviewing and requesting the test case, I was
going to ask for one :)
---
If your project is set up for it, you can reply to this email and have your
reply appear on
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17059
@datumbox left a few minor comments to clean up. Overall looks good. I
think it is fine for 2.2.
For the regParam even though it is a minor thing it would be best to just
create a new JIRA
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17059
The discussion around numeric types was at
https://github.com/apache/spark/pull/12762#r61539327. The checked cast was
introduced because it was previously implicitly cast to `Int` anyway, and would
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17059
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17059
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73620/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #73620 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73620/testReport)**
for PR 17059 at commit
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
I updated the exception message and I added Unit-tests to cover the
checkedCast method as @imatiach-msft suggested. I have not corrected the value
of regParam but I had a look and it is safe to do
Github user imatiach-msft commented on the issue:
https://github.com/apache/spark/pull/17059
@datumbox ah ok, thanks for the link. LGTM!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@imatiach-msft This is already used in other parts of MLlib. Have a look
here:
Github user imatiach-msft commented on the issue:
https://github.com/apache/spark/pull/17059
@datumbox thanks for adding the tests, I've approved the review, assuming
the other reviewers are ok with minimal double support. I haven't seen
withClue used in the same way you've used it,
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #73620 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73620/testReport)**
for PR 17059 at commit
Github user imatiach-msft commented on the issue:
https://github.com/apache/spark/pull/17059
@datumbox I added an additional comment -- since you believe we should keep
the current code, we should add a test case and slightly modify the error
message to let the user know what other
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@imatiach-msft Thanks mate. Show it and replied. I really don't mind it
either way to be honest.
@MLnick @srowen I am not sure if this PR will make it to Spark 2.2 due to
the upcoming
Github user imatiach-msft commented on the issue:
https://github.com/apache/spark/pull/17059
@datumbox I've added some additional comments with regards to the
fractional part, please take a look, thanks!
---
If your project is set up for it, you can reply to this email and have your
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17059
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73585/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #73585 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73585/testReport)**
for PR 17059 at commit
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17059
Merged build finished. Test PASSed.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #73585 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73585/testReport)**
for PR 17059 at commit
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17059
Ok, let me take a look at this. Thanks!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@MLnick Yeap! I have run into problems with the original implementation
when dealing with several billions records. This is when removing casting
starts paying off. :)
---
If your project is set
Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/17059
@datumbox you mention there is GC & performance overhead which makes some
sense. Have you run into problems with very large scale (like millions users &
items & ratings)? I did regression tests
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@imatiach-msft This comparison is intentional and checks 2 things: That the
number is within integer range and that the Id does not have any non-zero
digits after the decimal point. If the number
Github user imatiach-msft commented on the issue:
https://github.com/apache/spark/pull/17059
@datumbox I like the changes, I just had a minor concern about the code
where we call v.intValue and then compare this to v.doubleValue -- due to
precision issues, I'm not sure if this is
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@srowen @mlnick I updated the PR based on what was discussed above and I
tested it again on Spark 2.1. I also updated the coding styles and the
exception message.
The changes requested
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
Yeah, Scala Long matches. Here is the "stand-alone" script that I used to
confirm that everything works ok (tested on Spark 2.1):
```scala
import org.apache.spark.sql.types._
import
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17059
This one would be good for @mlnick to look at.
The Scala Long will match java.lang.Number right? if so it works even if
there's another conversion there, but maybe that one is trivial, and
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@srowen The following snippet handles explicitly Longs. It can be rewritten
to remove duplicate code by introducing bools for overflow detection but I
don't think it is worth it. In theory you can
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
Ignore my comment about duplicate code. It can be written to avoid it. I
will investigate handling the SQL decimal types as you recommended and I will
update the code tonight.
---
If your
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@srowen: Thanks for the comments. We are getting there. :)
I will handle the Long case as you suggest.
If you think people use SQL decimal types, I can include them at the end of
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17059
That's compelling regarding performance. It's not big but not trivial.
My remaining concern is whether you're handling all the cases the original
did. `Number` covers a lot but does it include
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
I decided to provide a few more bencharks in order to alleviate some of the
concerns raised by @srowen.
To reproduce the results add the following snippet in the ALSSuite class:
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@srowen Yes, the behaviour of the method remains the same. This patch
helped me get a measurable improvement on GC overhead in Spark 2.0, so I though
that it would be beneficial for others. Anyway
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
jenkins retest this please
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17059
You are effectively handling the cases that casting handles. Doesn't a
Scala long get boxed another time now? I bet it works, just wondering why
that's not handled like Int. I'm neutral on this and
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
Could you explain what you mean by "duplicating"? It is safe with Scala
Long; I did lots of tests to ensure that it works well. If you require any
changes, I'm happy to update the PR.
---
If
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17059
I get it, though the drawback is mostly that you've duplicated in a way the
machinery that would construe any number as something comparable to the min/max
int value. (Does this not miss the case of
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
Concerning the failed scala style test, this is caused by the long in-line
comment that I added to explain the if statement. If you decide to approve the
PR, I can remove it. Cheers! :)
---
If
Github user datumbox commented on the issue:
https://github.com/apache/spark/pull/17059
@srowen I believe that this needs to be fixed for 2 reasons:
1. Casting the ids to double just to convert it back to integer is not an
elegant solution and it is rather confusing.
2. The
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #3582 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3582/testReport)**
for PR 17059 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17059
**[Test build #3582 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3582/testReport)**
for PR 17059 at commit
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17059
I get it, but is that cast actually slowing things down measurably? because
this change also adds some overhead in the checks it does. I think it's better
to let the cast to double deal with the
63 matches
Mail list logo