[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15421 merged to master and branch-2.0 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 Ping.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 @falaki I think it is the same issue as JIRA 17781. Your change looks good to me.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15421 @falaki I mean for the case where the exception is raised, but it sounds like it is the same test, just on a different R version. LGTM.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Merged build finished. Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67218/ Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #67218 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67218/consoleFull)** for PR 15421 at commit [`20d0234`](https://github.com/apache/spark/commit/20d0234be212b84c916a13bbe3d343fa05118547).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @wangmiao1981 could that behavior with `list` relate to the issue in this ticket? https://issues.apache.org/jira/browse/SPARK-17781
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @felixcheung do you mean a test for parallelizing NAs and getting them back? The patch includes that test.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15421 Is it possible to have a test to trigger this case?
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 I printed the binary of the serialized data before sending and the bytes sent by the R side; they are the same. I also called `unserialize` after serialization on the R side, and it deserializes successfully. So `list(NA)` must get some special handling during serialization on the R side, but I could not work out where from the R source code.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #67218 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67218/consoleFull)** for PR 15421 at commit [`20d0234`](https://github.com/apache/spark/commit/20d0234be212b84c916a13bbe3d343fa05118547).
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15421 @wangmiao1981 is right - I think we can put back the `catch` in there and add a `TODO` referring to the JIRA. Alternatively, I'll try to reproduce this on a fresh Ubuntu 16.04 VM later tonight and see if I can contribute to the debugging.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/67206/ Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #67206 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67206/consoleFull)** for PR 15421 at commit [`e56948e`](https://github.com/apache/spark/commit/e56948e999112a598136c86ad7545d38a5d322fa).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Merged build finished. Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 It seems that this test case will break the Windows automated tests. We might need to resolve the new JIRA.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 AppVeyor also failed. Error messages below:

    java.lang.NegativeArraySizeException
        at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
        at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
        at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
        at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
        at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.Range.foreach(Range.scala:160)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    16/10/19 17:54:34 ERROR TaskSetManager: Task 0 in stage 108.0 failed 1 times; aborting job
    16/10/19 17:54:34 ERROR RBackendHandler: dfToCols on org.apache.spark.sql.api.r.SQLUtils failed
    java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor119.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
        at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
        at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
        at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:352)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:345)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:352)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:345)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:352)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:345)
        at
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 A new JIRA has been created: https://issues.apache.org/jira/browse/SPARK-18011 I added a comment in the JIRA pointing to the discussion on this PR. I think I am close to the root cause and will spend more time following up.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 @shivaram I will create a JIRA later today. Thanks!
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #67206 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/67206/consoleFull)** for PR 15421 at commit [`e56948e`](https://github.com/apache/spark/commit/e56948e999112a598136c86ad7545d38a5d322fa).
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @shivaram I removed catching `NegativeArraySizeException` from this PR.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15421 That's a good idea. @wangmiao1981 Can you create a new JIRA for this, and @falaki can we add that JIRA as a pointer in the comment close to the `NegativeArraySizeException`? Otherwise the code change looks fine to me. @felixcheung Any other comments?
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 Thanks @wangmiao1981. I think it is best to file a separate JIRA for this issue. Thanks a lot for catching it.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 Master Branch:

    16/10/18 16:10:15 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
    java.lang.NegativeArraySizeException
        at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
        at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
        at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
        at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
        at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.Range.foreach(Range.scala:160)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
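The top frames of this trace (`readDate` calling `readString`, which calls `readStringBytes`) suggest that the JVM side reads a date as a length-prefixed string. A minimal sketch of how such a reader can end up throwing `NegativeArraySizeException`, assuming the length prefix arrives as a negative 32-bit integer; `LengthPrefixedReader`, `frame`, and `decode` are hypothetical names for illustration and are not Spark's `SerDe` code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for a length-prefixed string reader; this is not Spark's SerDe code.
final class LengthPrefixedReader {

    // Reads a 4-byte big-endian length, then that many bytes decoded as UTF-8.
    static String readString(DataInputStream in) throws IOException {
        int len = in.readInt();
        byte[] bytes = new byte[len]; // throws NegativeArraySizeException if len < 0
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Builds a frame: the declared length followed by the payload bytes.
    static byte[] frame(int declaredLength, byte[] payload) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(declaredLength);
            out.write(payload);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes one framed string from a byte array.
    static String decode(byte[] framed) {
        try {
            return readString(new DataInputStream(new ByteArrayInputStream(framed)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] date = "2016-10-01".getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(frame(date.length, date)));
        try {
            decode(frame(-1, new byte[0])); // a negative prefix, as a missing value might produce
        } catch (NegativeArraySizeException e) {
            System.out.println("negative length prefix: " + e);
        }
    }
}
```

Whether the R side actually emits a negative length for an `NA` date is not established in this thread; the sketch only shows that a reader of this shape fails in the same place once it does.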
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @wangmiao1981 would you please also test the master branch?
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 I updated R on another Linux machine:

    R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"
    Copyright (C) 2016 The R Foundation for Statistical Computing
    Platform: x86_64-pc-linux-gnu (64-bit)

    Distributor ID: Ubuntu
    Description:    Ubuntu 16.04.1 LTS
    Release:        16.04
    Codename:       xenial

I still see the negative index failure. It seems very consistent on my setups: 2 out of 3 setups fail; 1 out of 3 passes.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 You are right. There is no change for the two versions.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15421 Sorry my question wasn't clear - Is there a source change that we can spot that might have caused this behavior ? I don't see this line changing recently looking at history of https://github.com/wch/r-source/blob/trunk/src/library/base/R/serialize.R
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 As discussed previously, R 3.3.1 works. For 3.3.0, `NA` is serialized but it is not serialized as `String`.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15421 Did this change in a recent R version? I'm not sure why `NA` is not being serialized. That `if` statement should only affect the value assigned to `type`, right?
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 @shivaram I think we can use that test case. Somehow, I missed the debug message of [3] and [4], but it should not be quite related. The reason should be my `serialize` function, as shown above.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 I think the reason is the code below:

    > serialize
    function (object, connection, ascii = FALSE, xdr = TRUE, version = NULL, refhook = NULL)
    {
        if (!is.null(connection)) {
            if (!inherits(connection, "connection"))
                stop("'connection' must be a connection")
            if (missing(ascii))
                ascii <- summary(connection)$text == "text"
        }
        if (!ascii && inherits(connection, "sockconn"))
            .Internal(serializeb(object, connection, xdr, version, refhook))
        else {
            type <- if (is.na(ascii)) 2L
            else if (ascii) 1L
            else if (!xdr) 3L
            else 0L
            .Internal(serialize(object, connection, type, version, refhook))
        }
    }
    > is.na(list(NA))
    [1] TRUE
    > is.na(list(17116))
    [1] FALSE

So, `"2016-11-11"` and `NA` are serialized as different types (i.e., `NA` is not serialized with my R version).
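Read literally, the `type` branch in the quoted function tests `is.na(ascii)` and `xdr`, not the object being serialized. A small sketch of that selection, transcribed into Java for illustration only; `SerializeTypeSketch` and `formatType` are hypothetical names, and a `null` `Boolean` stands in for R's logical `NA`:

```java
// Hypothetical Java rendering of the `type` selection in the quoted R function.
final class SerializeTypeSketch {

    // `ascii` is nullable so that null can stand in for R's logical NA.
    static int formatType(Boolean ascii, boolean xdr) {
        if (ascii == null) {
            return 2; // is.na(ascii)
        }
        if (ascii) {
            return 1; // ascii
        }
        if (!xdr) {
            return 3; // !xdr
        }
        return 0;     // default binary XDR format
    }

    public static void main(String[] args) {
        // The defaults in the quoted signature are ascii = FALSE and xdr = TRUE.
        System.out.println(formatType(false, true)); // prints 0
    }
}
```

The serialized object is not an input to this choice, so under this reading `list(NA)` and `list(17116)` would get the same `type` whenever `ascii` and `xdr` are the same.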
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15421 Thanks - the lines in [3] and [4] will be called if we do any operation on the DataFrame, i.e. something like `dim(c)`. Also, can we use the same test case that is checked in to the test file?
```
df <- data.frame(id = 1:2, date = c(as.Date("2016-10-01"), NA))
DF <- collect(createDataFrame(df))
is.na(DF$date[2]) # should be TRUE
```
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 I put debug messages at [3] and [4]:

[3] https://github.com/apache/spark/blob/d88a1bae6a9c975c39549ec2326d839ea93949b2/R/pkg/inst/worker/worker.R#L159
[4] https://github.com/apache/spark/blob/d88a1bae6a9c975c39549ec2326d839ea93949b2/R/pkg/inst/worker/worker.R#L78

Neither is called. I also put a debug message at [1]:

[1] https://github.com/apache/spark/blob/d88a1bae6a9c975c39549ec2326d839ea93949b2/R/pkg/R/context.R#L140

For

    > a <- as.Date(c("2016-11-11", "NA"))
    > b <- as.data.frame(a)
    > c <- createDataFrame(b)
    > a
    [1] "2016-11-11" NA
    > b
               a
    1 2016-11-11
    2       <NA>

the `slices` passed to `lapply` are:

    slice list(17116)
    slice list(NA)

`"2016-11-11"` becomes `17116` and `"NA"` is `NA`.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15421 Cool - @falaki let's see if @wangmiao1981's debugging finds anything in the next day or so. The only thing is that it would be good to get this in for the 2.0.2 cut if that is happening soon. I'll coordinate that with @rxin
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 @shivaram Thanks for your explanation! I can continue debugging this as you pointed out, and I can consistently reproduce the issue. For this PR, I think handling the `NA` in the backend is fine, except for the unnecessary exception handling. I can submit a follow-up PR on the serialization part.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15421 Thanks @wangmiao1981 - There are two different kinds of serialization that happen in SparkR. One is the RPC-style serialization, where function arguments are serialized using `writeDate`, `writeInt` etc. The other is batch or bulk serialization, which we use when converting an R `data.frame` to Spark RDDs. This is used in the `createDataFrame` case from [1].

Now the way this is supposed to work is that this is converted by the call to `lapply` and `getJRDD` [2] to be a row-wise serialized `SparkDataFrame`. To do this on the executor side you will have an `unserialize` called on the bulk data [3] and a `writeRowSerialize` called for each row [4]. So the final byte stream to look at is the one here.

But my guess is that things are going wrong somewhere before this -- i.e. the byte stream at [3], for example, has some different type or something like that. Or to put it another way, are we sure `writeString` was called with `NA`, or was it some other function like `writeBin` because the types were wrong? The other reason for such a transient bug might be that the channels are not getting flushed somewhere and this doesn't show up on some R versions. But yeah, your debugging methods are in line with what I would try.

[1] https://github.com/apache/spark/blob/d88a1bae6a9c975c39549ec2326d839ea93949b2/R/pkg/R/context.R#L140
[2] https://github.com/apache/spark/blob/d88a1bae6a9c975c39549ec2326d839ea93949b2/R/pkg/R/SQLContext.R#L275
[3] https://github.com/apache/spark/blob/d88a1bae6a9c975c39549ec2326d839ea93949b2/R/pkg/inst/worker/worker.R#L159
[4] https://github.com/apache/spark/blob/d88a1bae6a9c975c39549ec2326d839ea93949b2/R/pkg/inst/worker/worker.R#L78
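The row-wise format described above can be illustrated with a small standalone sketch. It is written in Python rather than R or Scala so that it runs on its own, and it is not SparkR's actual wire format: the one-byte type tags (`i`, `D`), the helper names, and the choice to send dates as NUL-terminated, length-prefixed strings with a missing date sent as the text `NA` are all assumptions made for illustration.

```python
import io
import struct
from datetime import date


def write_int(out, value):
    # 4-byte big-endian signed integer.
    out.write(struct.pack(">i", value))


def write_string(out, text):
    # Length prefix (counting the trailing NUL), then the bytes themselves.
    data = text.encode("utf-8") + b"\x00"
    write_int(out, len(data))
    out.write(data)


def write_row(out, row):
    # Hypothetical row layout: field count, then (type tag, value) per field.
    # Simplification: None is always treated as a missing date.
    write_int(out, len(row))
    for value in row:
        if isinstance(value, int):
            out.write(b"i")
            write_int(out, value)
        elif value is None or isinstance(value, date):
            out.write(b"D")
            write_string(out, "NA" if value is None else value.isoformat())
        else:
            raise TypeError(f"unsupported field type: {type(value).__name__}")


def read_exact(inp, count):
    data = inp.read(count)
    if len(data) != count:
        raise EOFError(f"expected {count} bytes, got {len(data)}")
    return data


def read_int(inp):
    return struct.unpack(">i", read_exact(inp, 4))[0]


def read_string(inp):
    length = read_int(inp)
    if length < 1:
        raise ValueError(f"invalid string length {length}")
    return read_exact(inp, length)[:-1].decode("utf-8")  # drop the NUL


def read_row(inp):
    fields = []
    for _ in range(read_int(inp)):
        tag = read_exact(inp, 1)
        if tag == b"i":
            fields.append(read_int(inp))
        elif tag == b"D":
            text = read_string(inp)
            # The reader maps the literal text "NA" back to a missing value.
            fields.append(None if text == "NA" else date.fromisoformat(text))
        else:
            raise ValueError(f"unknown type tag {tag!r}")
    return fields
```

Writing the rows `[1, date(2016, 10, 1)]` and `[2, None]` into an `io.BytesIO` and reading them back with `read_row` round-trips both, including the missing date. The point of the sketch is only that reader and writer must agree field by field on the tag and framing; any disagreement shifts every later read in the stream.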
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 @shivaram We found that the negative index error happens with some versions of R. For example, on my Mac with R version 3.3.0 (2016-05-03), and in the previous Windows test with 3.3.2. For the failed cases, I added debug messages and tried to read the byte array. For the `NA` field, the failed case doesn't serialize the length as an integer (i.e., the byte array only includes `NA` but not the length `3`). Therefore, when `readString` reads the byte array in the order of length and then string, it actually reads `NA` as an integer, which is negative.

When creating a DataFrame, it serializes the data.frame as a `jobj`. I checked both the good and bad cases, but I didn't find any differences between the two. I didn't find a way to debug the `jobj` serialization logic, as it just writes binary in batch. Maybe I can try again to dump the binary stream. On the surface, I didn't find the reason why the length is not serialized in the failed case.
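The failure mode described above, a reader that expects a length prefix and instead consumes payload bytes as the length, can be reproduced in a few lines. This is a minimal sketch in Python (not the R or Scala code under discussion), assuming a 4-byte big-endian length followed by a NUL-terminated string; the function names are hypothetical, and which bytes actually sat at the front of the stream in the failing runs is not established in this thread.

```python
import io
import struct


def frame_string(text):
    # Well-formed framing: 4-byte big-endian length (including the NUL), then bytes.
    data = text.encode("utf-8") + b"\x00"
    return struct.pack(">i", len(data)) + data


def read_length(stream):
    raw = stream.read(4)
    if len(raw) != 4:
        raise EOFError("stream ended inside the length prefix")
    return struct.unpack(">i", raw)[0]


def read_string(stream):
    length = read_length(stream)
    if length < 1:
        # A JVM reader that allocates `new Array[Byte](length)` would fail here
        # with a negative array size instead.
        raise ValueError(f"invalid string length {length}")
    data = stream.read(length)
    if len(data) != length:
        raise EOFError(f"expected {length} bytes, got {len(data)}")
    return data[:-1].decode("utf-8")


# Correctly framed: the length 3 precedes the bytes "NA\0".
well_formed = io.BytesIO(frame_string("NA"))

# Length prefix missing: the reader takes "NA\0" plus the next byte as the length.
missing_prefix = io.BytesIO(b"NA\x00" + frame_string("2016-10-01"))

# Any four bytes whose first bit is set decode to a negative length.
high_bit_length = struct.unpack(">i", b"\x80\x00\x00\x00")[0]
```

With these inputs, `read_string(well_formed)` returns `"NA"`, while `read_length(missing_prefix)` yields a very large positive number, which would show up as an oversized allocation rather than a negative size. A negative length needs a leading byte of `0x80` or higher, as in `high_bit_length`. So the sign of the bogus length depends on exactly which bytes end up where the prefix should be, which is worth checking when dumping the stream.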
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15421 @falaki @wangmiao1981 Can we summarize the discussion around the `NegativeArraySizeException` ?
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @shivaram and @felixcheung do I need to do more on this?
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66988/ Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Merged build finished. Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66988 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66988/consoleFull)** for PR 15421 at commit [`59827a1`](https://github.com/apache/spark/commit/59827a19db93604120dc7229f6ed82777b4cd354). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 @falaki As I found, in the failure cases the serialization doesn't write the length of `NA` and only serializes `NA` as characters (a String). I just don't understand exactly how it is serialized, because it is not serialized by field and field type. Therefore, logic like `writeDate` and `writeTime` doesn't kick in. It is good to learn how to debug serialization and the R backend hook. I will keep tracking this issue in my spare time. Thanks for spending time reading this.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @wandjenkins thanks! It is interesting that it worked with R 3.3.1!
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421

    Distributor ID: Ubuntu
    Description:    Ubuntu 16.04.1 LTS
    Release:        16.04
    Codename:       xenial

    R version 3.3.0 beta (2016-03-30 r70404) -- "Supposedly Educational"
    Copyright (C) 2016 The R Foundation for Statistical Computing
    Platform: x86_64-pc-linux-gnu (64-bit)

I tested it on the system above. It passed.

    R version 3.3.0 (2016-05-03) -- "Supposedly Educational"
    Copyright (C) 2016 The R Foundation for Statistical Computing
    Platform: x86_64-apple-darwin13.4.0 (64-bit)

With this version, it still fails. It should be an issue specific to the R version.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66988 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66988/consoleFull)** for PR 15421 at commit [`59827a1`](https://github.com/apache/spark/commit/59827a19db93604120dc7229f6ed82777b4cd354).
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @wandjenkins yes, I think there must be some non-standard issue on your system. I tested on Mac and Linux with different versions of R, and the test case passed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user yhuai commented on the issue: https://github.com/apache/spark/pull/15421 test this please
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 @felixcheung I rebuilt a clean image and used the test case from the JIRA. It still fails.

    > df <- data.frame(Date = as.Date(c(rep("2016-01-10", 10), "NA", "NA")), id = 1:12)
    >
    > dim(createDataFrame(df))
    16/10/14 14:28:51 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
    java.lang.OutOfMemoryError: Java heap space
        at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
        at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
        at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:133)
        at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
        at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)

Since the automated tests passed, it could be related to my specific system. You can ignore the issue for now. I will follow up if I find anything. Thanks!
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 @felixcheung Yes. I think I missed the vector (`c(...)`) in the above example. I am building on a clean checkout and will test again. Thanks!
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @felixcheung and @shivaram The AppVeyor test that @HyukjinKwon kicked off passed, and Jenkins passed too. Do you think this is ready?
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15421 @wangmiao1981 do you mean for this?
```
> a <- as.Date("2016-10-11", "NA")
> a
[1] NA
> a <- as.Date(c("2016-10-11", "NA"))
> a
[1] "2016-10-11" NA
```
In the 2nd case you do have a vector of Date and NA.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 ![r](https://cloud.githubusercontent.com/assets/5033592/19378640/90240106-91a2-11e6-9acc-19dac45f64cb.jpeg) It seems that R doesn't handle `NA` well. I suspect that the root cause is R.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 I traced back to `createDataFrame.default`:

    ...
    jrdd <- getJRDD(lapply(rdd, function(x) x), "row")
    srdd <- callJMethod(jrdd, "rdd")
    sdf <- callJStatic("org.apache.spark.sql.api.r.SQLUtils", "createDF", srdd, schema$jobj, sparkSession)
    dataFrame(sdf)

The `srdd` object should be the one used in `createDF`, which caused the negative index issue in `def readString(in: DataInputStream): String`. I checked `data` in `createDataFrame.default`, and it is correct on the R side. I got lost in the serialization part.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 No. Mac. I can always reproduce this issue on Mac. I modified the code above on the Scala side to make sure that `NA` is correctly read on the Scala side. But the integer (i.e., the length of `NA`) before `NA` is not serialized.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @wangmiao1981 you are talking about Windows right?
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 I modified the code in `readString`. When `len` is negative, I read 3 bytes. `NA` is then read on the Scala side, which is the correct field value. But the integer `len` is not serialized. It seems that the R-side serialization doesn't add the length for `NA`. I am debugging the R side now.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15421 Build started: [SparkR] `ALL` [![PR-15421](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=82AF3188-11E3-40D8-A71C-7806413EDAE3=true)](https://ci.appveyor.com/project/spark-test/spark/branch/82AF3188-11E3-40D8-A71C-7806413EDAE3) Ah, I wrote the scripts locally (separately from Spark's AppVeyor). I can't access Apache's AppVeyor, so I made another account. I wrote some documentation in https://github.com/HyukjinKwon/spark-appveyor to show what the scripts are actually doing :)
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 I just tried the patch on R version 3.3.1 (2016-06-21) -- "Bug in Your Hair" on Linux and it passed the tests. @HyukjinKwon how can I kick off another AppVeyor test?
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 Last night, I debugged the problem again. Now I can reproduce it as follows:

    > a <- as.Date("2016-10-01", "NA")
    > b <- as.data.frame(a)
    > c <- createDataFrame(b)
    > a
    [1] NA
    > b
         a
    1 <NA>

I compared this with the working case. The R-side function calls are the same in both cases, and the data type inferences are correct. The R side writes data as the `raw` type, and the backend encounters a problem when casting the byte stream based on the schema. I am continuing to debug it and hope to find out why reading the bytes as a date type fails. For the above example, I don't know why "2016-10-01" is lost after `as.Date`. But based on my debug messages, the number of args and the value (`NA`) are correctly represented on the R side before serialization.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/15421 FWIW, I wrote a bunch of scripts to automatically launch a build on AppVeyor via the @spark-test account, with a pretty string such as

> Build started: [CORE] `org.apache.spark.storage.DiskStoreSuite` [![PR-15320](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=097F2F95-4748-4435-967F-98980DB2112E=true)](https://ci.appveyor.com/project/spark-test/spark/branch/097F2F95-4748-4435-967F-98980DB2112E)

or

> Build started: [R] `ALL` [![PR-15320](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=097F2F95-4748-4435-967F-98980DB2112E=true)](https://ci.appveyor.com/project/spark-test/spark/branch/097F2F95-4748-4435-967F-98980DB2112E)

Please feel free to cc me if any of you want a build.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @felixcheung and @wangmiao1981 thanks! This is a good point. I will try testing it on different versions of R.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15421 It's possibly the R version - Jenkins is running 3.1.1 I think, the minimum supported version. AppVeyor is running 3.3.2 I believe, which is closer to the one @wangmiao1981 has.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 I suspect that it could be related to my R installation:

    localhost:~ mwang$ R
    R version 3.3.0 (2016-05-03) -- "Supposedly Educational"
    Copyright (C) 2016 The R Foundation for Statistical Computing
    Platform: x86_64-apple-darwin13.4.0 (64-bit)

But I am not sure yet.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 That's interesting. I applied your change to a clean checkout and tested it against the example in the JIRA. It throws the above exception:

    val obj = (sqlSerDe._1)(dis, dataType)
    if (obj == null) {
      throw new IllegalArgumentException(s"Invalid type $dataType")  // <= this line
    } else {
      obj
    }

I have no clue why it fails on my laptop. I can test on my own server (Ubuntu) tonight.
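The pattern in the quoted snippet can be illustrated in isolation. The following is a minimal sketch, not Spark's `SerDe` code: `TypedDispatch`, `readTyped`, and the `Reader` type are hypothetical names, and the single built-in tag `'c'` is an assumption made only to keep the example small. It shows that when a fallback reader's `null` result is treated as "unknown type", any tag the reader does not handle surfaces as `Invalid type <tag>`.

```scala
object TypedDispatch {
  // Hypothetical fallback reader: returns null for tags it does not recognise.
  type Reader = Char => AnyRef

  // Hypothetical dispatcher: one built-in tag, everything else goes to the fallback.
  def readTyped(tag: Char, fallback: Reader): AnyRef = tag match {
    case 'c' => "string payload" // stand-in for a built-in reader
    case _ =>
      val obj = fallback(tag)
      // Same shape as the quoted snippet: a null result is reported as an invalid type.
      if (obj == null) {
        throw new IllegalArgumentException(s"Invalid type $tag")
      } else {
        obj
      }
  }
}
```

With a fallback that always returns `null`, `TypedDispatch.readTyped('N', _ => null)` throws `IllegalArgumentException` with the message `Invalid type N`. The sketch does not explain why a stray tag would appear in the stream in the first place.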
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @wangmiao1981 I don't get the exception that you reported on Mac. Also note that the unit test is passing on Linux. I am not sure why returning null is an issue.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 I did test on Mac, not on Windows. I don't have a Windows setup either. The very first issue that I saw is the negative index issue. In addition, returning `null` when the exception is caught is incorrect: the caller, `readTypedObject`, will throw an invalid data type exception.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 My guess is that on Windows, R serialization behaves differently and serializes `NA` as `null`. Unfortunately, I don't have a Windows machine to verify. Would you please test that?
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Merged build finished. Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66755/ Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66755 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66755/consoleFull)** for PR 15421 at commit [`59827a1`](https://github.com/apache/spark/commit/59827a19db93604120dc7229f6ed82777b4cd354). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 It might not be an R-specific issue. I am trying to create a test case on the Scala side in SQLUtilsSuite.scala.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 I think we should find the root cause of the negative length of the "NA" field. Yesterday I debugged the R side, and I have not found the reason yet.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 New test on Mac:

    > df <- data.frame(Date = as.Date(c(rep("2016-01-10", 10), "NA", "NA")), id = 1:12)
    > dim(createDataFrame(df))
    16/10/11 12:10:30 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
    java.lang.IllegalArgumentException: Invalid type N
        at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:86)
        at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.Range.foreach(Range.scala:160)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 MacBook Pro (Retina, 15-inch, Mid 2015). This is the machine that I tested the patch on.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66755 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66755/consoleFull)** for PR 15421 at commit [`59827a1`](https://github.com/apache/spark/commit/59827a19db93604120dc7229f6ed82777b4cd354).
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 @falaki I saw the exception on Mac too, but I have not found the root cause of the negative length in the input stream. Catching the exception will solve the problem. Do you want to explore the reason for the negative index?
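The "catch the exception" option mentioned above could look roughly like the following. This is a sketch under stated assumptions, not the patch under review: `TolerantDateReader`, `readDateOpt`, and `frame` are hypothetical names, and the wire format (a 4-byte length followed by UTF-8 bytes) is simplified for illustration. It treats both the literal string `"NA"` and a negative length prefix as a missing date.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.nio.charset.StandardCharsets
import java.sql.Date

object TolerantDateReader {
  // Hypothetical, simplified format: 4-byte length, then that many UTF-8 bytes.
  private def readLengthPrefixed(in: DataInputStream): String = {
    val len = in.readInt()
    val bytes = new Array[Byte](len) // throws NegativeArraySizeException when len < 0
    in.readFully(bytes)
    new String(bytes, StandardCharsets.UTF_8)
  }

  // Returns None for "NA" and for a negative length prefix; Some(date) otherwise.
  def readDateOpt(in: DataInputStream): Option[Date] =
    try {
      val s = readLengthPrefixed(in)
      if (s == "NA") None else Some(Date.valueOf(s))
    } catch {
      case _: NegativeArraySizeException => None
    }

  // Test helper (hypothetical): builds a stream holding the given length and payload.
  def frame(len: Int, payload: Array[Byte]): DataInputStream = {
    val buf = new ByteArrayOutputStream()
    val out = new DataOutputStream(buf)
    out.writeInt(len)
    out.write(payload)
    out.flush()
    new DataInputStream(new ByteArrayInputStream(buf.toByteArray))
  }
}
```

Swallowing the exception keeps the job running but hides where the negative length comes from, which is the open question raised in this comment.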
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @wangmiao1981 thanks for testing on Windows. I added a check for this. Would you please try again and let me know? Unfortunately, I don't have access to a Windows box for testing.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user wangmiao1981 commented on the issue: https://github.com/apache/spark/pull/15421 @falaki I applied your fix to a clean build. I still see the following error:

    > df <- data.frame(Date = as.Date(c(rep("2016-01-10", 10), "NA", "NA")), id = 1:12)
    > dim(createDataFrame(df))
    16/10/11 11:51:00 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
    java.lang.NegativeArraySizeException
        at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
        at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
        at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
        at org.apache.spark.api.r.SerDe$.readTypedObject(SerDe.scala:77)
        at org.apache.spark.api.r.SerDe$.readObject(SerDe.scala:61)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:161)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$bytesToRow$1.apply(SQLUtils.scala:160)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.Range.foreach(Range.scala:160)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.AbstractTraversable.map(Traversable.scala:104)
        at org.apache.spark.sql.api.r.SQLUtils$.bytesToRow(SQLUtils.scala:160)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
        at org.apache.spark.sql.api.r.SQLUtils$$anonfun$5.apply(SQLUtils.scala:138)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
        at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
        at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I am still debugging. It seems that the source on the R side has some issue.
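The top frames of this trace (`readStringBytes` called from `readString` and `readDate`) are consistent with a length-prefixed read receiving a negative length. The following is a minimal sketch of that failure mode, not Spark's `SerDe` code: `LengthPrefixedStrings`, `readLengthPrefixed`, and `frame` are hypothetical names, and the format (a 4-byte length followed by UTF-8 bytes) is an assumption made for illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import java.nio.charset.StandardCharsets

object LengthPrefixedStrings {
  // Hypothetical reader: a 4-byte big-endian length followed by that many bytes.
  def readLengthPrefixed(in: DataInputStream): String = {
    val len = in.readInt()
    // On the JVM, allocating an array with a negative size throws
    // java.lang.NegativeArraySizeException before any payload bytes are read.
    val bytes = new Array[Byte](len)
    in.readFully(bytes)
    new String(bytes, StandardCharsets.UTF_8)
  }

  // Test helper (hypothetical): builds a stream holding the given length and payload.
  def frame(len: Int, payload: Array[Byte]): DataInputStream = {
    val buf = new ByteArrayOutputStream()
    val out = new DataOutputStream(buf)
    out.writeInt(len)
    out.write(payload)
    out.flush()
    new DataInputStream(new ByteArrayInputStream(buf.toByteArray))
  }
}
```

Reading `frame(10, "2016-01-10".getBytes("UTF-8"))` returns the date string, while `frame(-1, Array.emptyByteArray)` fails with `NegativeArraySizeException`. The sketch only reproduces the symptom; why the R side would emit a negative length for `NA` is not established in this thread.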
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15421 hmm, still the same error in the new test case in AppVeyor
```
Failed -
1. Error: SPARK-17811: can create DataFrame containing NA as date and time (@test_sparkSQL.R#388)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 105.0 failed 1 times, most recent failure: Lost task 0.0 in stage 105.0 (TID 109, localhost): java.lang.NegativeArraySizeException
    at org.apache.spark.api.r.SerDe$.readStringBytes(SerDe.scala:110)
    at org.apache.spark.api.r.SerDe$.readString(SerDe.scala:119)
    at org.apache.spark.api.r.SerDe$.readDate(SerDe.scala:128)
```
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Merged build finished. Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66746/ Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66746 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66746/consoleFull)** for PR 15421 at commit [`82ec5c8`](https://github.com/apache/spark/commit/82ec5c81e0c0e48bfb6008dcdeef544457cfc014). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Merged build finished. Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66745/ Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66745 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66745/consoleFull)** for PR 15421 at commit [`17aa8ba`](https://github.com/apache/spark/commit/17aa8ba85b5be3bc670c0818390c44afe78c4eef). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user shivaram commented on the issue: https://github.com/apache/spark/pull/15421 Yeah, it should be fine to merge this into branch-2.0 when it is ready.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66746 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66746/consoleFull)** for PR 15421 at commit [`82ec5c8`](https://github.com/apache/spark/commit/82ec5c81e0c0e48bfb6008dcdeef544457cfc014).
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66745 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66745/consoleFull)** for PR 15421 at commit [`17aa8ba`](https://github.com/apache/spark/commit/17aa8ba85b5be3bc670c0818390c44afe78c4eef).
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user felixcheung commented on the issue: https://github.com/apache/spark/pull/15421 This fails on AppVeyor. Any idea?
```
. Error: SPARK-17811: can create DataFrame containing NA as date and time (@test_sparkSQL.R#388)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 105.0 failed 1 times, most recent failure: Lost task 0.0 in stage 105.0 (TID 109, localhost): java.lang.NegativeArraySizeException
```
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66702/ Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Merged build finished. Test PASSed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66702 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66702/consoleFull)** for PR 15421 at commit [`9e621eb`](https://github.com/apache/spark/commit/9e621ebb1b4d9ac20fa294937ebe87e88730f3c9). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user falaki commented on the issue: https://github.com/apache/spark/pull/15421 @shivaram can I nominate this patch for the 2.0 branch?
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66702 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66702/consoleFull)** for PR 15421 at commit [`9e621eb`](https://github.com/apache/spark/commit/9e621ebb1b4d9ac20fa294937ebe87e88730f3c9).
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Merged build finished. Test FAILed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15421 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66686/ Test FAILed.
[GitHub] spark issue #15421: [SPARK-17811] SparkR cannot parallelize data.frame with ...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15421 **[Test build #66686 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66686/consoleFull)** for PR 15421 at commit [`83726fc`](https://github.com/apache/spark/commit/83726fc96e198703e18a02682fc4004cce3bae00). * This patch **fails SparkR unit tests**. * This patch merges cleanly. * This patch adds no public classes.