Shivaram Venkataraman created SPARK-6822:
--------------------------------------------

             Summary: lapplyPartition passes empty list to function
                 Key: SPARK-6822
                 URL: https://issues.apache.org/jira/browse/SPARK-6822
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 1.4.0
            Reporter: Shivaram Venkataraman


I have an RDD containing two elements, as expected and as shown by a collect. 
When I call lapplyPartition on it with a function that prints its arguments to 
stderr, the function gets called three times: the first two times with the 
expected arguments, and the third time with an empty list as its argument. I was 
wondering whether that's a bug or whether there are conditions under which that's 
possible. I apologize that I don't have a simple test case ready yet; I ran into 
this potential bug while developing a separate package, plyrmr. If you are willing 
to install it, the test case is very simple. The RDD that creates this problem is 
the result of a join, but I couldn't replicate the problem using a plain vanilla join.
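
The following is a minimal sketch of the symptom, not the plyrmr case itself: it 
assumes the old SparkR RDD API (sparkR.init, parallelize, lapplyPartition, collect) 
and simply creates an RDD with fewer elements than partitions, so that some 
partitions reach the user function empty.

{code}
library(SparkR)
sc <- sparkR.init(master = "local")

# Two elements spread over four partitions: at least two partitions are empty.
rdd <- parallelize(sc, list(1, 2), 4L)

# Report the size of each partition the function receives; with empty
# partitions present, some calls see length 0, i.e. an empty list.
sizes <- lapplyPartition(rdd, function(part) {
  write(paste("partition size:", length(part)), stderr())
  list(length(part))
})
collect(sizes)
{code}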

Example from Antonio on the SparkR JIRA: I don't have time to try any harder to 
repro this without plyrmr. For the record, this is the example:

{code}
library(plyrmr)
plyrmr.options(backend = "spark")
df1 = mtcars[1:4,]
df2 = mtcars[3:6,]
w = as.data.frame(gapply(merge(input(df1), input(df2)), identity))
{code}
gapply is implemented with lapplyPartition in most cases, merge with a join, and 
as.data.frame with a collect. The join takes an arbitrary argument of 4 partitions; 
if I turn that down to 2L, the problem disappears. I will check in a version with a 
workaround in place, but a debugging statement will leave a record in stderr 
whenever the workaround kicks in, so that we can track it.
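
A sketch of the kind of guard such a workaround might use, assuming it lives in the 
function passed to lapplyPartition and reusing the rdd from the sketch above; the 
function name and debug message below are illustrative, not the actual plyrmr code:

{code}
# Hypothetical guard: skip the real per-partition work when the partition
# arrives empty, and leave a trace in stderr so occurrences can be tracked.
guarded_fun <- function(part) {
  if (length(part) == 0) {
    write("workaround: lapplyPartition passed an empty list", stderr())
    return(list())
  }
  # ... real per-partition processing would go here ...
  part
}

result <- collect(lapplyPartition(rdd, guarded_fun))
{code}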
