[jira] [Created] (SPARK-11000) Derby have booted the database twice in yarn security mode.
SaintBacchus created SPARK-11000:
------------------------------------

             Summary: Derby have booted the database twice in yarn security mode.
                 Key: SPARK-11000
                 URL: https://issues.apache.org/jira/browse/SPARK-11000
             Project: Spark
          Issue Type: Bug
          Components: Spark Shell, SQL, YARN
    Affects Versions: 1.6.0
            Reporter: SaintBacchus

*bin/spark-shell --master yarn-client*

If Spark was built with Hive, this simple command will also hit the problem:
_Another instance of Derby may have already booted the database_
{code:title=Exception|borderStyle=solid}
Caused by: java.sql.SQLException: Another instance of Derby may have already booted the database /opt/client/Spark/spark/metastore_db.
	at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
	at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
	at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
	at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
	... 130 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /opt/client/Spark/spark/metastore_db.
{code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
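[Editor's note] The collision above happens because every spark-shell launched from the same working directory boots the same embedded `metastore_db`. One hypothetical workaround (not the fix in this issue's PR; the helper name and per-session temp directory are illustrative assumptions) is to point each session's Hive `javax.jdo.option.ConnectionURL` at its own Derby directory, so two drivers never boot the same database:

```python
import tempfile

def derby_metastore_conf():
    """Build a per-session Derby connection URL so concurrent
    spark-shell sessions don't contend for the same metastore_db.
    (Illustrative sketch; javax.jdo.option.ConnectionURL is the real
    Hive property, the rest is assumption.)"""
    db_dir = tempfile.mkdtemp(prefix="metastore_db_")
    return {
        "javax.jdo.option.ConnectionURL":
            "jdbc:derby:;databaseName=%s;create=true" % db_dir,
    }

conf = derby_metastore_conf()
```

Each value in `conf` would then be passed as a `--conf spark.hadoop.<key>=<value>` pair, keeping every session's Derby instance isolated.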
[jira] [Commented] (SPARK-11000) Derby have booted the database twice in yarn security mode.
[ https://issues.apache.org/jira/browse/SPARK-11000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948138#comment-14948138 ]

SaintBacchus commented on SPARK-11000:
--------------------------------------

Very similar to it, but this one is in YARN security mode.
[jira] [Commented] (SPARK-11000) Derby have booted the database twice in yarn security mode.
[ https://issues.apache.org/jira/browse/SPARK-11000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948237#comment-14948237 ]

Apache Spark commented on SPARK-11000:
--------------------------------------

User 'SaintBacchus' has created a pull request for this issue:
https://github.com/apache/spark/pull/9026
[jira] [Assigned] (SPARK-11000) Derby have booted the database twice in yarn security mode.
[ https://issues.apache.org/jira/browse/SPARK-11000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11000:
------------------------------------

    Assignee: Apache Spark
[jira] [Assigned] (SPARK-11000) Derby have booted the database twice in yarn security mode.
[ https://issues.apache.org/jira/browse/SPARK-11000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11000:
------------------------------------

    Assignee:     (was: Apache Spark)
[jira] [Comment Edited] (SPARK-10981) R semijoin leads to Java errors, R leftsemi leads to Spark errors
[ https://issues.apache.org/jira/browse/SPARK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948299#comment-14948299 ]

Sun Rui edited comment on SPARK-10981 at 10/8/15 8:48 AM:
----------------------------------------------------------

Yes, this is a bug in SparkR. Your fix looks good. Could you submit a PR for it? In the PR, please:
1. Support all join types defined in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala (you can remove the "_" char from the currently supported join types in SparkR).
2. Add test cases for the missing join types, including "leftsemi".

> R semijoin leads to Java errors, R leftsemi leads to Spark errors
> -----------------------------------------------------------------
>
>                 Key: SPARK-10981
>                 URL: https://issues.apache.org/jira/browse/SPARK-10981
>             Project: Spark
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 1.5.0
>         Environment: SparkR from RStudio on Macbook
>            Reporter: Monica Liu
>            Priority: Minor
>              Labels: easyfix, newbie
>
> I am using SparkR from RStudio, and I ran into an error with the join function that I recreated with a smaller example:
> {code:title=joinTest.R|borderStyle=solid}
> Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> sc <- sparkR.init("local[4]")
> sqlContext <- sparkRSQL.init(sc)
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc")
> b = c(TRUE, FALSE, TRUE)
> df = data.frame(n, s, b)
> df1 = createDataFrame(sqlContext, df)
> showDF(df1)
> x = c(2, 3, 10)
> t = c("dd", "ee", "ff")
> c = c(FALSE, FALSE, TRUE)
> dff = data.frame(x, t, c)
> df2 = createDataFrame(sqlContext, dff)
> showDF(df2)
> res = join(df1, df2, df1$n == df2$x, "semijoin")
> showDF(res)
> {code}
> Running this code, I encountered the error:
> {panel}
> Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
>   java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. Supported join types include: 'inner', 'outer', 'full', 'fullouter', 'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.
> {panel}
> However, if I changed the joinType to "leftsemi",
> {code}
> res = join(df1, df2, df1$n == df2$x, "leftsemi")
> {code}
> I would get the error:
> {panel}
> Error in .local(x, y, ...) :
>   joinType must be one of the following types: 'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'
> {panel}
> Since the join function in R appears to invoke a Java method, I went into DataFrame.R and changed the code on lines 1374 and 1378 from "semijoin" to "leftsemi" to match the Java method's parameters. This also makes the accepted R joinType values match Scala's.
> semijoin:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "semijoin")) {
>   sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } else {
>   stop("joinType must be one of the following types: ",
>        "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
> }
> {code}
> leftsemi:
> {code:title=DataFrame.R: join(x, y, joinExpr, joinType)|borderStyle=solid}
> if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "leftsemi")) {
>   sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
> } else {
>   stop("joinType must be one of the following types: ",
>        "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
> }
> {code}
> This fixed the issue, but I'm not sure whether this solution breaks Hive compatibility or causes other issues. I can submit a pull request to change this.
[jira] [Commented] (SPARK-10981) R semijoin leads to Java errors, R leftsemi leads to Spark errors
[ https://issues.apache.org/jira/browse/SPARK-10981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948299#comment-14948299 ]

Sun Rui commented on SPARK-10981:
---------------------------------

Yes, this is a bug in SparkR. Your fix looks good. Could you submit a PR for it? In the PR, please:
1. Support all join types defined in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/joinTypes.scala (you can remove the "_" char from the currently supported join types in SparkR).
2. Add test cases for the missing join types, including "leftsemi".
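[Editor's note] The mismatch above comes down to two different accepted-value lists on the R and Scala sides. The reconciliation Sun Rui asks for can be sketched as follows (a hypothetical Python stand-in for the DataFrame.R change; `normalize_join_type` and `ALIASES` are illustrative names, while the accepted set is the one the JVM error message lists):

```python
# Accept the join types Spark's Scala side knows about, and normalize
# R-style aliases (e.g. "left_outer") to the Scala spellings before
# calling into the JVM.
SCALA_JOIN_TYPES = {
    "inner", "outer", "full", "fullouter",
    "leftouter", "left", "rightouter", "right", "leftsemi",
}
ALIASES = {
    "left_outer": "leftouter",
    "right_outer": "rightouter",
    "full_outer": "fullouter",
}

def normalize_join_type(join_type):
    jt = ALIASES.get(join_type, join_type)
    if jt not in SCALA_JOIN_TYPES:
        raise ValueError("joinType must be one of: %s"
                         % ", ".join(sorted(SCALA_JOIN_TYPES)))
    return jt
```

With this shape, old R code passing "left_outer" keeps working while "semijoin" is rejected with the full list of valid spellings.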
[jira] [Closed] (SPARK-10879) spark on yarn support priority option
[ https://issues.apache.org/jira/browse/SPARK-10879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yun Zhao closed SPARK-10879.
----------------------------
    Resolution: Later

> spark on yarn support priority option
> -------------------------------------
>
>                 Key: SPARK-10879
>                 URL: https://issues.apache.org/jira/browse/SPARK-10879
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Submit, YARN
>            Reporter: Yun Zhao
>
> Add a YARN-only option to spark-submit: *--priority PRIORITY*. The priority of your YARN application (default: 0).
> Add a property: *spark.yarn.priority*
[jira] [Commented] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript
[ https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948307#comment-14948307 ]

Sun Rui commented on SPARK-10971:
---------------------------------

Just curious: how do you distribute Rscript to the YARN nodes? Why not install R on all YARN nodes, so that it need not be distributed with each job, which would improve performance?

> sparkR: RRunner should allow setting path to Rscript
> ----------------------------------------------------
>
>                 Key: SPARK-10971
>                 URL: https://issues.apache.org/jira/browse/SPARK-10971
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 1.5.1
>            Reporter: Thomas Graves
>
> I'm running Spark on YARN and trying to use R in cluster mode. RRunner seems to just call Rscript and assumes it's on the PATH. But on our YARN deployment R isn't installed on the nodes, so it needs to be distributed along with the job, and we need the ability to point to where it gets installed. sparkR in client mode has the config spark.sparkr.r.command to point to Rscript; RRunner should have something similar so it works in cluster mode.
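[Editor's note] The lookup the ticket asks for can be sketched in a few lines (a hypothetical helper, not RRunner's actual code; `spark.sparkr.r.command` is the existing client-mode property mentioned above):

```python
import shutil

def resolve_rscript(conf):
    """Prefer an explicitly configured Rscript path (as
    spark.sparkr.r.command already allows in client mode),
    falling back to whatever 'Rscript' is found on PATH."""
    configured = conf.get("spark.sparkr.r.command")
    if configured:
        return configured
    found = shutil.which("Rscript")
    if found is None:
        raise FileNotFoundError(
            "Rscript not found on PATH; set spark.sparkr.r.command")
    return found
```

In cluster mode the configured path would point into the job's distributed archive, which is exactly the gap the ticket describes.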
[jira] [Updated] (SPARK-10960) SQL with windowing function cannot reference column in inner select block
[ https://issues.apache.org/jira/browse/SPARK-10960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-10960:
--------------------------------
    Description:

There seems to be a bug in the Spark SQL parser when I use windowing functions. Specifically, when the SELECT refers to a column from an inner select block, the parser throws an error. Here is an example:

When I use a windowing function and add a '1' constant to the result,
{code}
select Rank() OVER ( ORDER BY D1.c3 ) + 1 as c1
{code}
the Spark SQL parser works. The whole SQL is:
{code}
select Rank() OVER ( ORDER BY D1.c3 ) + 1 as c1,
       D1.c3 as c3, D1.c4 as c4, D1.c5 as c5
from (select T3671.ROW_WID as c3,
             T3671.CAL_MONTH as c4,
             T3671.CAL_YEAR as c5,
             1 as c6
      from W_DAY_D T3671
     ) D1
{code}
However, if I change the projection so that it refers to a column in an inner select block, D1.C6, whose value is itself a '1' literal (so it is functionally equivalent to the SQL above), Spark SQL will throw an error:
{code}
select Rank() OVER ( ORDER BY D1.c3 ) + D1.C6 as c1,
       D1.c3 as c3, D1.c4 as c4, D1.c5 as c5
from (select T3671.ROW_WID as c3,
             T3671.CAL_MONTH as c4,
             T3671.CAL_YEAR as c5,
             1 as c6
      from W_DAY_D T3671
     ) D1
{code}
The error message is:
{code}
java.lang.NullPointerException
Error: org.apache.spark.sql.AnalysisException: resolved attribute(s) c6#3386 missing from c5#3390,c3#3383,c4#3389,_we0#3461,c3#3388 in operator !Project [c3#3388,c4#3389,c5#3390,c3#3383,_we0#3461,(_we0#3461 + c6#3386) AS c1#3387]; (state=,code=0)
{code}
The above example is a simplified version of the SQL I was testing. The full SQL I was using, which fails with a similar error, is as follows:
{code}
select Case when case D1.c6 when 1 then D1.c3 else NULL end is not null
            then Rank() OVER ( ORDER BY case when ( case D1.c6 when 1 then D1.c3 else NULL end ) is null then 1 else 0 end,
                                        case D1.c6 when 1 then D1.c3 else NULL end )
       end as c1,
       Case when case D1.c7 when 1 then D1.c3 else NULL end is not null
            then Rank() OVER ( PARTITION BY D1.c4, D1.c5
                               ORDER BY case when ( case D1.c7 when 1 then D1.c3 else NULL end ) is null then 1 else 0 end,
                                        case D1.c7 when 1 then D1.c3 else NULL end )
       end as c2,
       D1.c3 as c3, D1.c4 as c4, D1.c5 as c5
from (select T3671.ROW_WID as c3,
             T3671.CAL_MONTH as c4,
             T3671.CAL_YEAR as c5,
             ROW_NUMBER() OVER (PARTITION BY T3671.CAL_MONTH, T3671.CAL_YEAR
                                ORDER BY T3671.CAL_MONTH DESC, T3671.CAL_YEAR DESC) as c6,
             ROW_NUMBER() OVER (PARTITION BY T3671.CAL_MONTH, T3671.CAL_YEAR, T3671.ROW_WID
                                ORDER BY T3671.CAL_MONTH DESC, T3671.CAL_YEAR DESC, T3671.ROW_WID DESC) as c7
      from W_DAY_D T3671
     ) D1
{code}
Hopefully when fixed, both these sample SQLs should work!
[jira] [Commented] (SPARK-10903) Make sqlContext global
[ https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948268#comment-14948268 ]

Sun Rui commented on SPARK-10903:
---------------------------------

There are a number of functions defined in SQLContext.R that take a sqlContext instance as their first argument. Rather than removing the first argument from these functions, I'd prefer to allow both cases (sqlContext passed or not passed) in these functions, for backward compatibility.

> Make sqlContext global
> ----------------------
>
>                 Key: SPARK-10903
>                 URL: https://issues.apache.org/jira/browse/SPARK-10903
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
>
> Make sqlContext global so that we don't have to always specify it.
> e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)
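[Editor's note] The backward-compatible overload Sun Rui suggests can be sketched like this (hypothetical Python stand-ins for the R functions; the real change would live in SQLContext.R):

```python
# Let createDataFrame be called with or without an explicit sqlContext,
# falling back to a globally registered one when the first argument is
# the data itself.

_global_ctx = None

def set_global_context(ctx):
    global _global_ctx
    _global_ctx = ctx

class SQLContext:
    """Minimal stand-in for the real context object."""
    def create(self, data):
        return ("df", data)

def createDataFrame(ctx_or_data, data=None):
    if isinstance(ctx_or_data, SQLContext):
        # old style: createDataFrame(sqlContext, data)
        return ctx_or_data.create(data)
    if _global_ctx is None:
        raise RuntimeError("no global sqlContext registered")
    # new style: createDataFrame(data)
    return _global_ctx.create(ctx_or_data)
```

Dispatching on the first argument's type is what lets both calling conventions coexist without breaking existing scripts.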
[jira] [Resolved] (SPARK-7869) Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns
[ https://issues.apache.org/jira/browse/SPARK-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-7869.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

> Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns
> --------------------------------------------------------------------------
>
>                 Key: SPARK-7869
>                 URL: https://issues.apache.org/jira/browse/SPARK-7869
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.3.0, 1.3.1
>         Environment: Spark 1.3.1
>            Reporter: Brad Willard
>            Assignee: Alexey Grishchenko
>            Priority: Minor
>             Fix For: 1.6.0
>
> Most of our tables load into dataframes just fine with Postgres. However, we have a number of tables leveraging the JSONB datatype. Spark will error and refuse to load such a table. While asking for Spark to support JSONB might be a tall order in the short term, it would be great if Spark would at least load the table ignoring the columns it can't load, or have that be an option.
> {code}
> pdf = sql_context.load(source="jdbc", url=url, dbtable="table_of_json")
> Py4JJavaError: An error occurred while calling o41.load.
> : java.sql.SQLException: Unsupported type
>     at org.apache.spark.sql.jdbc.JDBCRDD$.getCatalystType(JDBCRDD.scala:78)
>     at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:112)
>     at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:133)
>     at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121)
>     at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219)
>     at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697)
>     at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>     at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>     at py4j.Gateway.invoke(Gateway.java:259)
>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>     at py4j.commands.CallCommand.execute(CallCommand.java:79)
>     at py4j.GatewayConnection.run(GatewayConnection.java:207)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
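[Editor's note] The behaviour the reporter asks for, loading a table while ignoring unmappable columns, could look roughly like this (a hypothetical sketch, not Spark's actual JDBCRDD code; the type-mapping table and `resolve_schema` are illustrative):

```python
# Toy JDBC-type-to-Catalyst-type mapping; real Spark resolves java.sql.Types
# codes, but the skip-vs-raise decision point is the same.
JDBC_TO_CATALYST = {
    "integer": "IntegerType",
    "text": "StringType",
    "boolean": "BooleanType",
}

def resolve_schema(columns, skip_unsupported=False):
    """Resolve (name, sql_type) pairs; either drop columns with no
    Catalyst mapping (e.g. Postgres jsonb) or raise, matching the
    current 'Unsupported type' failure."""
    resolved = []
    for name, sql_type in columns:
        catalyst = JDBC_TO_CATALYST.get(sql_type)
        if catalyst is None:
            if skip_unsupported:
                continue  # ignore the column instead of failing the load
            raise ValueError("Unsupported type: %s" % sql_type)
        resolved.append((name, catalyst))
    return resolved
```

The `skip_unsupported=False` default preserves today's strict behaviour, while opting in gives the degraded-but-usable load the ticket requests.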
[jira] [Commented] (SPARK-10977) SQL injection bugs in JdbcUtils and DataFrameWriter
[ https://issues.apache.org/jira/browse/SPARK-10977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948302#comment-14948302 ]

Sean Owen commented on SPARK-10977:
-----------------------------------

It's a JDBC thing rather than database-specific (i.e., parsed by JDBC), but yes, I don't know whether it works for every possible part of the query. For example, I don't know if you can prepare a statement for "SELECT * FROM ?". It's worth ruling that out before considering another approach, as it would be by far the easiest thing. Quoting is probably the next easiest. Would love to see a test case involving Little Bobby Tables here: https://xkcd.com/327/

> SQL injection bugs in JdbcUtils and DataFrameWriter
> ---------------------------------------------------
>
>                 Key: SPARK-10977
>                 URL: https://issues.apache.org/jira/browse/SPARK-10977
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.1
>            Reporter: Rick Hillegas
>            Priority: Minor
>
> SPARK-10857 identifies a SQL injection bug in the JDBC dialect code. A similar SQL injection bug can be found in 2 places in JdbcUtils and another place in DataFrameWriter:
> {noformat}
> The DROP TABLE logic in JdbcUtils concatenates boilerplate with a user-supplied string:
> def dropTable(conn: Connection, table: String): Unit = {
>   conn.prepareStatement(s"DROP TABLE $table").executeUpdate()
> }
> Same for the INSERT logic in JdbcUtils:
> def insertStatement(conn: Connection, table: String, rddSchema: StructType): PreparedStatement = {
>   val sql = new StringBuilder(s"INSERT INTO $table VALUES (")
>   var fieldsLeft = rddSchema.fields.length
>   while (fieldsLeft > 0) {
>     sql.append("?")
>     if (fieldsLeft > 1) sql.append(", ") else sql.append(")")
>     fieldsLeft = fieldsLeft - 1
>   }
>   conn.prepareStatement(sql.toString())
> }
> Same for the CREATE TABLE logic in DataFrameWriter:
> def jdbc(url: String, table: String, connectionProperties: Properties): Unit = {
>   ...
>   if (!tableExists) {
>     val schema = JdbcUtils.schemaString(df, url)
>     val sql = s"CREATE TABLE $table ($schema)"
>     conn.prepareStatement(sql).executeUpdate()
>   }
>   ...
> }
> {noformat}
> Maybe we can find a common solution to all of these SQL injection bugs. Something like this:
> 1) Parse the user-supplied table name into a table identifier and an optional schema identifier. We can borrow logic from org.apache.derby.iapi.util.IdUtil in order to do this.
> 2) Double-quote (and escape as necessary) the schema and table identifiers so that the database interprets them as delimited ids.
> That should prevent the SQL injection attacks.
> With this solution, if the user specifies table names like cityTable and trafficSchema.congestionTable, then the generated DROP TABLE statements would be:
> {noformat}
> DROP TABLE "CITYTABLE"
> DROP TABLE "TRAFFICSCHEMA"."CONGESTIONTABLE"
> {noformat}
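[Editor's note] The delimited-identifier quoting proposed in the description can be sketched as below (hypothetical helper names; note that the naive split on "." does not handle dots inside already-quoted identifiers, which is the part IdUtil-style parsing from step 1 would cover):

```python
def quote_identifier(part):
    """Double-quote an identifier, doubling any embedded double quotes,
    so the database treats it as a delimited id rather than SQL text."""
    return '"%s"' % part.replace('"', '""')

def quote_table_name(table):
    """Quote an optionally schema-qualified name like
    trafficSchema.congestionTable part by part.
    (Naive: assumes '.' only separates schema and table.)"""
    return ".".join(quote_identifier(p) for p in table.split("."))
```

With this in place, interpolating `quote_table_name(table)` into `DROP TABLE ...` turns a Bobby-Tables payload into a harmless (if oddly named) identifier instead of executable SQL.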
[jira] [Commented] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948133#comment-14948133 ]

Apache Spark commented on SPARK-10999:
--------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/9024
[jira] [Assigned] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10999:
------------------------------------

    Assignee: Cheng Lian  (was: Apache Spark)
[jira] [Assigned] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10999:
------------------------------------

    Assignee: Apache Spark  (was: Cheng Lian)
[jira] [Updated] (SPARK-10974) Add progress bar for output operation column and use red dots for failed batches
[ https://issues.apache.org/jira/browse/SPARK-10974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-10974: - Summary: Add progress bar for output operation column and use red dots for failed batches (was: Add progress bar for output operation column) > Add progress bar for output operation column and use red dots for failed > batches > > > Key: SPARK-10974 > URL: https://issues.apache.org/jira/browse/SPARK-10974 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow
Cheng Lian created SPARK-10999: -- Summary: Physical plan node Coalesce should be able to handle UnsafeRow Key: SPARK-10999 URL: https://issues.apache.org/jira/browse/SPARK-10999 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor The following PySpark snippet shows the problem: {noformat} >>> sqlContext.range(10).selectExpr('id AS a').coalesce(1).explain(True) ... == Physical Plan == Coalesce 1 ConvertToSafe TungstenProject [id#3L AS a#4L] Scan PhysicalRDD[id#3L] {noformat} The {{ConvertToSafe}} is unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10326) Cannot launch YARN job on Windows
[ https://issues.apache.org/jira/browse/SPARK-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948206#comment-14948206 ] Jose Antonio commented on SPARK-10326:
--
C:\WINDOWS\system32>pyspark --master yarn-client
Python 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, Sep 15 2015, 14:26:14) [MSC v.1500 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.
IPython 4.0.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
15/10/08 09:28:05 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
15/10/08 09:28:06 WARN : Your hostname, PC-509512 resolves to a loopback/non-reachable address: fe80:0:0:0:0:5efe:a5f:c318%net3, but we couldn't find any external IP address!
15/10/08 09:28:08 WARN BlockReaderLocal: The short-circuit local reads feature cannot be used because UNIX Domain sockets are not available on Windows.
15/10/08 09:28:08 ERROR SparkContext: Error initializing SparkContext.
java.net.URISyntaxException: Illegal character in opaque part at index 2: C:\spark\bin\..\python\lib\pyspark.zip
    at java.net.URI$Parser.fail(Unknown Source)
    at java.net.URI$Parser.checkChars(Unknown Source)
    at java.net.URI$Parser.parse(Unknown Source)
    at java.net.URI.(Unknown Source)
    at org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$7.apply(Client.scala:558)
    at org.apache.spark.deploy.yarn.Client$$anonfun$setupLaunchEnv$7.apply(Client.scala:557)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.deploy.yarn.Client.setupLaunchEnv(Client.scala:557)
    at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:628)
    at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
    at org.apache.spark.SparkContext.(SparkContext.scala:523)
    at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Unknown Source)
15/10/08 09:28:08 ERROR Utils: Uncaught exception in thread Thread-2
java.lang.NullPointerException
    at org.apache.spark.network.netty.NettyBlockTransferService.close(NettyBlockTransferService.scala:152)
    at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1228)
    at org.apache.spark.SparkEnv.stop(SparkEnv.scala:100)
    at org.apache.spark.SparkContext$$anonfun$stop$12.apply$mcV$sp(SparkContext.scala:1749)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
    at org.apache.spark.SparkContext.stop(SparkContext.scala:1748)
    at org.apache.spark.SparkContext.(SparkContext.scala:593)
    at org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.lang.reflect.Constructor.newInstance(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Unknown Source)
---
Py4JJavaError Traceback (most recent call last)
C:\spark\bin\..\python\pyspark\shell.py in () 41
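The URISyntaxException above comes from java.net.URI treating the drive letter `C:` as a URI scheme, so the backslash that follows it is an "illegal character in opaque part". A minimal, hedged Python sketch of the same parsing pitfall and the usual remedy (converting the Windows path to a `file:` URI before handing it to URI-based APIs); the path shown is illustrative:

```python
from pathlib import PureWindowsPath

# A raw Windows path is not a valid URI: "C:" parses as a URI scheme, so
# the backslash right after it is an illegal character in the opaque part.
raw = r"C:\spark\python\lib\pyspark.zip"  # illustrative path

# Converting via the path API yields a well-formed file: URI that any
# URI parser accepts.
uri = PureWindowsPath(raw).as_uri()
print(uri)  # file:///C:/spark/python/lib/pyspark.zip
```

Presumably the fix on the Spark side would be for Client.setupLaunchEnv to normalize local paths into URIs before constructing java.net.URI; the sketch above only demonstrates why the raw path fails.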
[jira] [Commented] (SPARK-10919) Association rules class should return the support of each rule
[ https://issues.apache.org/jira/browse/SPARK-10919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948235#comment-14948235 ] Tofigh commented on SPARK-10919: sure > Association rules class should return the support of each rule > -- > > Key: SPARK-10919 > URL: https://issues.apache.org/jira/browse/SPARK-10919 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Tofigh >Priority: Minor > Original Estimate: 1h > Remaining Estimate: 1h > > The current implementation of Association rule does not return the frequency > of appearance of each rule. This piece of information is essential for > implementing functional dependency on top of the AR. In order to return the > frequency (support) of each rule, freqUnion: Double, and freqAntecedent: > Double should be: val freqUnion: Double, val freqAntecedent: Double -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
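As a sketch of what the requested change enables (Python here for illustration; Spark's implementation is Scala, and the class and field names below are hypothetical): once `freqUnion` and `freqAntecedent` are public `val`s, callers can derive both confidence and support for each rule.

```python
# Hypothetical sketch, not Spark's actual API: a rule exposing its
# freqUnion / freqAntecedent counts lets callers compute both
# confidence and support.
class Rule:
    def __init__(self, antecedent, consequent, freq_union, freq_antecedent):
        self.antecedent = antecedent            # left-hand-side items
        self.consequent = consequent            # right-hand-side items
        self.freq_union = freq_union            # transactions containing both sides
        self.freq_antecedent = freq_antecedent  # transactions containing the antecedent

    def confidence(self):
        return self.freq_union / self.freq_antecedent

    def support(self, num_transactions):
        # The quantity the issue asks to expose: the rule's frequency.
        return self.freq_union / num_transactions

r = Rule(["a"], ["b"], freq_union=30, freq_antecedent=40)
print(r.confidence())   # 0.75
print(r.support(100))   # 0.3
```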
[jira] [Issue Comment Deleted] (SPARK-11000) Derby have booted the database twice in yarn security mode.
[ https://issues.apache.org/jira/browse/SPARK-11000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-11000: - Comment: was deleted (was: Very similar with it but this is in yarn security mode.) > Derby have booted the database twice in yarn security mode. > --- > > Key: SPARK-11000 > URL: https://issues.apache.org/jira/browse/SPARK-11000 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL, YARN >Affects Versions: 1.6.0 >Reporter: SaintBacchus > > *bin/spark-shell --master yarn-client* > If Spark was built with Hive, this simple command also hits the problem: > _Another instance of Derby may have already booted the database_ > {code:title=Exception|borderStyle=solid} > Caused by: java.sql.SQLException: Another instance of Derby may have already > booted the database /opt/client/Spark/spark/metastore_db. > at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown > Source) > at > org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown > Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > ... 130 more > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database /opt/client/Spark/spark/metastore_db. > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-9040) StructField datatype Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-9040. -- Resolution: Not A Problem Fix Version/s: (was: 1.4.0) > StructField datatype Conversion Error > - > > Key: SPARK-9040 > URL: https://issues.apache.org/jira/browse/SPARK-9040 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 1.3.0 > Environment: Cloudera 5.3 on CDH 6 >Reporter: Sandeep Pal > > The following issue occurs if I specify the StructFields in a specific order in > StructType as follows: > fields = [StructField("d", IntegerType(), True),StructField("b", > IntegerType(), True),StructField("a", StringType(), True),StructField("c", > IntegerType(), True)] > But the following code works fine: > fields = [StructField("d", IntegerType(), True),StructField("b", > IntegerType(), True),StructField("c", IntegerType(), True),StructField("a", > StringType(), True)] > in () > 18 > 19 schema = StructType(fields) > ---> 20 schemasimid_simple = > sqlContext.createDataFrame(simid_simplereqfields, schema) > 21 schemasimid_simple.registerTempTable("simid_simple") > /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/context.py in > createDataFrame(self, data, schema, samplingRatio) > 302 > 303 for row in rows: > --> 304 _verify_type(row, schema) > 305 > 306 # convert python objects to sql data > /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/types.py in > _verify_type(obj, dataType) > 986 "length of fields (%d)" % (len(obj), > len(dataType.fields))) > 987 for v, f in zip(obj, dataType.fields): > --> 988 _verify_type(v, f.dataType) > 989 > 990 _cached_cls = weakref.WeakValueDictionary() > /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/types.py in > _verify_type(obj, dataType) > 970 if type(obj) not in _acceptable_types[_type]: > 971 raise TypeError("%s can not accept object in type %s" > --> 972 % (dataType, type(obj))) > 973 > 974 if isinstance(dataType, ArrayType): > TypeError: StringType can not accept object in type -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
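For context, the `_verify_type` frames in the traceback above check row values against schema fields positionally, which is why reordering the `StructField`s matters: the order of fields in the schema must match the order of values in each row. A small pure-Python re-creation of that positional check (simplified; not PySpark's actual code):

```python
# Simplified re-creation of the positional check in
# pyspark.sql.types._verify_type: each row value is validated against the
# schema field at the SAME position.
schema_ok  = [("d", int), ("b", int), ("c", int), ("a", str)]
schema_bad = [("d", int), ("b", int), ("a", str), ("c", int)]  # "a" moved earlier
row = (1, 2, 3, "x")  # values ordered d, b, c, a

def verify(row, schema):
    for value, (name, expected) in zip(row, schema):
        if not isinstance(value, expected):
            raise TypeError("%s can not accept object in type %s"
                            % (expected.__name__, type(value).__name__))

verify(row, schema_ok)       # positions line up: no error
try:
    verify(row, schema_bad)  # field "a" (str) meets the int value 3
except TypeError as exc:
    print(exc)
```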
[jira] [Updated] (SPARK-10752) Implement corr() and cov in DataFrameStatFunctions
[ https://issues.apache.org/jira/browse/SPARK-10752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10752: -- Assignee: Sun Rui > Implement corr() and cov in DataFrameStatFunctions > -- > > Key: SPARK-10752 > URL: https://issues.apache.org/jira/browse/SPARK-10752 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.5.0 >Reporter: Sun Rui >Assignee: Sun Rui > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948511#comment-14948511 ] Sean Owen commented on SPARK-10914: --- I don't think having it on one machine necessarily matters. You still have two JVMs in play; whereas you can't reproduce when just one JVM is involved. What if the driver has oops on, but the executor does not? and the results from the executor are parsed somewhere as if oops are on? Normally this would be wholly transparent to JVM bytecode but tungsten / SizeEstimator are depending in part on the actual representation of the object in memory. > Incorrect empty join sets when executor-memory >= 32g > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) 
> {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948492#comment-14948492 ] Ben Moran commented on SPARK-10914:
---
I just tried moving the master to the worker box, so it's entirely on one machine. (Ubuntu 14.04 + now Oracle JDK 1.8). It still reproduces the bug. So, entirely on spark-worker:
{code}
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ sbin/start-master.sh
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ sbin/start-slave.sh --master spark://spark-worker:7077
spark@spark-worker:~/spark-1.5.1-bin-hadoop2.6$ bin/spark-shell --master spark://spark-worker:7077 --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.
15/10/08 12:15:12 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
15/10/08 12:15:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/10/08 12:15:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/10/08 12:15:19 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
15/10/08 12:15:20 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
15/10/08 12:15:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/10/08 12:15:21 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
15/10/08 12:15:21 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.

scala> val x = sql("select 1 xx union all select 2")
x: org.apache.spark.sql.DataFrame = [xx: int]

scala> val y = sql("select 1 yy union all select 2")
y: org.apache.spark.sql.DataFrame = [yy: int]

scala> x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */
res0: Long = 0
{code}
does give me the incorrect count.
> Incorrect empty join sets when executor-memory >= 32g > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.)
> {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
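If the driver/executor oops-mismatch hypothesis discussed in this thread is right, one way to narrow it down is to pin the same compressed-oops setting on both JVMs so they must agree; a hedged sketch of such an experiment (the flag values are the thing under test, not a recommended production setting):

```shell
# Force the SAME compressed-oops setting on BOTH JVMs to rule out a
# driver/executor mismatch (heaps >= 32g disable compressed oops implicitly).
bin/spark-shell --master spark://spark-worker:7077 \
  --conf "spark.driver.extraJavaOptions=-XX:-UseCompressedOops" \
  --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
```

If the join counts become correct with both sides pinned, that would support the mismatch theory; repeating with `-XX:+UseCompressedOops` on both sides would be the complementary check.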
[jira] [Reopened] (SPARK-9040) StructField datatype Conversion Error
[ https://issues.apache.org/jira/browse/SPARK-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-9040: -- > StructField datatype Conversion Error > - > > Key: SPARK-9040 > URL: https://issues.apache.org/jira/browse/SPARK-9040 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core, SQL >Affects Versions: 1.3.0 > Environment: Cloudera 5.3 on CDH 6 >Reporter: Sandeep Pal > > The following issue occurs if I specify the StructFields in a specific order in > StructType as follows: > fields = [StructField("d", IntegerType(), True),StructField("b", > IntegerType(), True),StructField("a", StringType(), True),StructField("c", > IntegerType(), True)] > But the following code works fine: > fields = [StructField("d", IntegerType(), True),StructField("b", > IntegerType(), True),StructField("c", IntegerType(), True),StructField("a", > StringType(), True)] > in () > 18 > 19 schema = StructType(fields) > ---> 20 schemasimid_simple = > sqlContext.createDataFrame(simid_simplereqfields, schema) > 21 schemasimid_simple.registerTempTable("simid_simple") > /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/context.py in > createDataFrame(self, data, schema, samplingRatio) > 302 > 303 for row in rows: > --> 304 _verify_type(row, schema) > 305 > 306 # convert python objects to sql data > /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/types.py in > _verify_type(obj, dataType) > 986 "length of fields (%d)" % (len(obj), > len(dataType.fields))) > 987 for v, f in zip(obj, dataType.fields): > --> 988 _verify_type(v, f.dataType) > 989 > 990 _cached_cls = weakref.WeakValueDictionary() > /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/types.py in > _verify_type(obj, dataType) > 970 if type(obj) not in _acceptable_types[_type]: > 971 raise TypeError("%s can not accept object in type %s" > --> 972 % (dataType, type(obj))) > 973 > 974 if isinstance(dataType, ArrayType): > TypeError: StringType can not accept object in type -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10939) Misaligned data with RDD.zip after repartition
[ https://issues.apache.org/jira/browse/SPARK-10939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10939. --- Resolution: Not A Problem Provisionally resolving as not a problem since RDDs don't have a guaranteed ordering. You can only really use zip with something that's sorted. Here you may indeed see different orderings within the same RDD if it gets recalculated. > Misaligned data with RDD.zip after repartition > -- > > Key: SPARK-10939 > URL: https://issues.apache.org/jira/browse/SPARK-10939 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.1, 1.5.0 > Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5 > - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5 >Reporter: Dan Brown > > Split out from https://issues.apache.org/jira/browse/SPARK-10685: > Here's a weird behavior where {{RDD.zip}} after a {{repartition}} produces > "misaligned" data, meaning different column values in the same row aren't > matched, as if a zip shuffled the collections before zipping them. It's > difficult to reproduce because it's nondeterministic, doesn't occur in local > mode, and requires ≥2 workers (≥3 in one case). I was able to repro it using > pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), and 1.5.0 > (bin-without-hadoop). > Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying > to build it ourselves when we ran into this problem. Let me put in my vote > for reopening the issue and supporting {{DataFrame.zip}} in the standard lib. > - https://issues.apache.org/jira/browse/SPARK-7460 > h3. 
Repro > Fail: RDD.zip after repartition > {code} > df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1)) > df = df.repartition(100) > rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, > b=y.b)) > [r for r in rdd.collect() if r.a != r.b][:3] # Should be [] > {code} > Sample outputs (nondeterministic): > {code} > [] > [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)] > [] > [] > [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)] > [] > {code} > Test setup: > - local\[8]: {{MASTER=local\[8]}} > - dist\[N]: 1 driver + 1 master + N workers > {code} > "Fail" tests pass? cluster mode spark version > > yes local[8] 1.3.0-cdh5.4.5 > no dist[4] 1.3.0-cdh5.4.5 > yes local[8] 1.4.1 > yes dist[1] 1.4.1 > no dist[2] 1.4.1 > no dist[4] 1.4.1 > yes local[8] 1.5.0 > yes dist[1] 1.5.0 > no dist[2] 1.5.0 > no dist[4] 1.5.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
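Given that ordering isn't guaranteed after a repartition, the usual workaround for this misalignment is to key both sides explicitly with `zipWithIndex` and join on the index, rather than relying on `zip`'s positional pairing. A pure-Python sketch of that pattern (the shuffle below simulates nondeterministic partition recomputation; in Spark this would be `rdd.zipWithIndex()` followed by a join on the index):

```python
import random

# Simulate two RDDs derived from the same parent whose rows come back in
# different orders after recomputation: identical (index, value) pairs,
# one side shuffled. A plain zip() over these would misalign rows.
pairs_a = [(i, i) for i in range(10)]
pairs_b = [(i, i) for i in range(10)]
random.shuffle(pairs_b)

# Joining on the explicit index (the zipWithIndex-then-join pattern)
# restores alignment regardless of the order either side arrives in.
b_by_index = dict(pairs_b)
aligned = [(v, b_by_index[i]) for i, v in sorted(pairs_a)]
assert all(a == b for a, b in aligned)  # every row matches its partner
```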
[jira] [Updated] (SPARK-10979) SparkR: Add merge to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-10979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10979: -- Component/s: SparkR > SparkR: Add merge to DataFrame > -- > > Key: SPARK-10979 > URL: https://issues.apache.org/jira/browse/SPARK-10979 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan > > Add merge function to DataFrame, which supports R signature. > https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948486#comment-14948486 ] Sean Owen commented on SPARK-10914: --- Still kind of guessing here... but what if the problem is that the computation spans machines that have a different oops configuration (driver vs executor) and that breaks some assumption in the low-level byte munging? > Incorrect empty join sets when executor-memory >= 32g > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) > {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-10940) Too many open files Spark Shuffle
[ https://issues.apache.org/jira/browse/SPARK-10940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-10940: --- > Too many open files Spark Shuffle > - > > Key: SPARK-10940 > URL: https://issues.apache.org/jira/browse/SPARK-10940 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 1.5.0 > Environment: 6 node standalone spark cluster with 1 master and 5 > worker nodes on Centos 6.6 for all nodes. Each node has > 100 GB memory and > 36 cores. >Reporter: Sandeep Pal > > Executing terasort by Spark-SQL on the data generated by teragen in hadoop. > Data size generated is ~456 GB. > Terasort passing with --total-executor-cores = 40, where as failing for > --total-executor-cores = 120. > I have tried to increase the ulimit to 10k but the problem persists. > Note: The above failed configuration of 120 cores worked on spark core code > on the top of rdd. The failure is only in case of using Spark SQL. > Below is the error message from one of the executor node: > java.io.FileNotFoundException: > /tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90 > (Too many open files) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10939) Misaligned data with RDD.zip after repartition
[ https://issues.apache.org/jira/browse/SPARK-10939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948596#comment-14948596 ] Michael Malak commented on SPARK-10939: --- Here Matei explains the explicit design decision to prefer shuffle performance arising from randomization over deterministic RDD computation: https://issues.apache.org/jira/browse/SPARK-3098?focusedCommentId=14110183=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14110183 It has made it into the documentation (though perhaps not clearly enough, especially regarding the rationale): https://issues.apache.org/jira/browse/SPARK-3356 https://github.com/apache/spark/pull/2508/files > Misaligned data with RDD.zip after repartition > -- > > Key: SPARK-10939 > URL: https://issues.apache.org/jira/browse/SPARK-10939 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.1, 1.5.0 > Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5 > - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5 >Reporter: Dan Brown > > Split out from https://issues.apache.org/jira/browse/SPARK-10685: > Here's a weird behavior where {{RDD.zip}} after a {{repartition}} produces > "misaligned" data, meaning different column values in the same row aren't > matched, as if a zip shuffled the collections before zipping them. It's > difficult to reproduce because it's nondeterministic, doesn't occur in local > mode, and requires ≥2 workers (≥3 in one case). I was able to repro it using > pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), and 1.5.0 > (bin-without-hadoop). > Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying > to build it ourselves when we ran into this problem. Let me put in my vote > for reopening the issue and supporting {{DataFrame.zip}} in the standard lib. > - https://issues.apache.org/jira/browse/SPARK-7460 > h3. 
Repro > Fail: RDD.zip after repartition > {code} > df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1)) > df = df.repartition(100) > rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, > b=y.b)) > [r for r in rdd.collect() if r.a != r.b][:3] # Should be [] > {code} > Sample outputs (nondeterministic): > {code} > [] > [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)] > [] > [] > [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)] > [] > {code} > Test setup: > - local\[8]: {{MASTER=local\[8]}} > - dist\[N]: 1 driver + 1 master + N workers > {code} > "Fail" tests pass? cluster mode spark version > > yes local[8] 1.3.0-cdh5.4.5 > no dist[4] 1.3.0-cdh5.4.5 > yes local[8] 1.4.1 > yes dist[1] 1.4.1 > no dist[2] 1.4.1 > no dist[4] 1.4.1 > yes local[8] 1.5.0 > yes dist[1] 1.5.0 > no dist[2] 1.5.0 > no dist[4] 1.5.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11001) SQLContext doesn't support window function
jixing.ji created SPARK-11001: - Summary: SQLContext doesn't support window function Key: SPARK-11001 URL: https://issues.apache.org/jira/browse/SPARK-11001 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.1 Environment: windows pyspark ubuntu pyspark Reporter: jixing.ji Currently, SQLContext doesn't support window functions, which makes useful features such as lag and rank hard to use outside of a HiveContext. If Spark supported window functions in SQLContext, it would help a lot with data analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
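For reference, the window semantics being requested: `lag` yields the value a fixed number of rows earlier within an ordered sequence, and `rank` assigns 1-based positions where ties share a rank and a gap follows. A plain-Python sketch of both (illustrative only, not Spark's implementation; assumes a single ordered partition):

```python
def lag(values, offset=1, default=None):
    # lag(col, n): the value n rows before the current row, else default.
    # Assumes 0 <= offset <= len(values).
    return [default] * offset + values[:len(values) - offset]

def rank(values):
    # SQL RANK(): 1-based position in ascending order; ties share a rank
    # and the rank after a tie is skipped.
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

print(lag([10, 20, 30]))   # [None, 10, 20]
print(rank([30, 10, 10]))  # [3, 1, 1]
```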
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948460#comment-14948460 ] Ben Moran commented on SPARK-10914: --- On latest master for me .count() also always seems to return 5 for everything! I think that is a separate bug - I think I saw it filed already but I can't find it now. > Incorrect empty join sets when executor-memory >= 32g > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) > {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10883) Document building each module individually
[ https://issues.apache.org/jira/browse/SPARK-10883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10883. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8993 [https://github.com/apache/spark/pull/8993] > Document building each module individually > -- > > Key: SPARK-10883 > URL: https://issues.apache.org/jira/browse/SPARK-10883 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Jean-Baptiste Onofré >Priority: Trivial > Fix For: 1.6.0 > > > Right now, due to the location of the scalastyle-config.xml location, it's > not possible to build an individual module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10940) Too many open files Spark Shuffle
[ https://issues.apache.org/jira/browse/SPARK-10940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10940. --- Resolution: Cannot Reproduce > Too many open files Spark Shuffle > - > > Key: SPARK-10940 > URL: https://issues.apache.org/jira/browse/SPARK-10940 > Project: Spark > Issue Type: Bug > Components: Shuffle, SQL >Affects Versions: 1.5.0 > Environment: 6 node standalone spark cluster with 1 master and 5 > worker nodes on Centos 6.6 for all nodes. Each node has > 100 GB memory and > 36 cores. >Reporter: Sandeep Pal > > Executing terasort by Spark-SQL on the data generated by teragen in hadoop. > Data size generated is ~456 GB. > Terasort passing with --total-executor-cores = 40, where as failing for > --total-executor-cores = 120. > I have tried to increase the ulimit to 10k but the problem persists. > Note: The above failed configuration of 120 cores worked on spark core code > on the top of rdd. The failure is only in case of using Spark SQL. > Below is the error message from one of the executor node: > java.io.FileNotFoundException: > /tmp/spark-e15993e8-51a4-452a-8b86-da0169445065/executor-0c661152-3837-4711-bba2-2abf4fd15240/blockmgr-973aab72-feb8-4c60-ba3d-1b2ee27a1cc2/3f/temp_shuffle_7741538d-3ccf-4566-869f-265655ca9c90 > (Too many open files) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
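As rough context for why raising the ulimit to 10k may not be enough: shuffle file-descriptor usage grows multiplicatively with concurrent tasks and shuffle partitions. The numbers below are hypothetical, not taken from this job:

```python
# Rough, illustrative arithmetic for exhausting a 10k open-file limit during
# a shuffle: each concurrently running task on a node may hold open roughly
# one file or stream per shuffle partition it reads or spills to.
# All figures here are hypothetical assumptions, not measured values.

def open_files_estimate(concurrent_tasks_per_node, files_per_task):
    return concurrent_tasks_per_node * files_per_task

# e.g. 24 concurrent tasks each touching 500 shuffle partitions:
estimate = open_files_estimate(24, 500)
print(estimate)          # 12000
print(estimate > 10000)  # True: over a 10k ulimit
```

This is why the failure appears only at 120 total executor cores and not at 40: with more concurrent tasks per node, the per-node descriptor count scales up even though the data volume is unchanged.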
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948513#comment-14948513 ] Ben Moran commented on SPARK-10914: --- I think you've got it - if I also turn off UseCompressedOops for the driver as well as the executor, it gives correct results: bin/spark-shell --master spark://spark-worker:7077 --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" --driver-java-options "-XX:-UseCompressedOops" Does this leave me with a viable workaround? I'm not sure of the impact of UseCompressedOops > Incorrect empty join sets when executor-memory >= 32g > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) 
> {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation
[ https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10978: -- Priority: Minor (was: Major) ([~rspitzer] don't set Fix Version) > Allow PrunedFilterScan to eliminate predicates from further evaluation > -- > > Key: SPARK-10978 > URL: https://issues.apache.org/jira/browse/SPARK-10978 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.3.0, 1.4.0, 1.5.0 >Reporter: Russell Alexander Spitzer >Priority: Minor > Fix For: 1.6.0 > > > Currently PrunedFilterScan allows implementors to push down predicates to an > underlying datasource. This is done solely as an optimization as the > predicate will be reapplied on the Spark side as well. This allows for > bloom-filter like operations but ends up doing a redundant scan for those > sources which can do accurate pushdowns. > In addition it makes it difficult for underlying sources to accept queries > which reference non-existent to provide ancillary function. In our case we > allow a solr query to be passed in via a non-existent solr_query column. > Since this column is not returned when Spark does a filter on "solr_query" > nothing passes. > Suggestion on the ML from [~marmbrus] > {quote} > We have to try and maintain binary compatibility here, so probably the > easiest thing to do here would be to add a method to the class. Perhaps > something like: > def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters > By default, this could return all filters so behavior would remain the same, > but specific implementations could override it. There is still a chance that > this would conflict with existing methods, but hopefully that would not be a > problem in practice. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
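The `unhandledFilters` contract proposed above can be illustrated with a toy model outside Spark. All names here are illustrative, not the actual Spark API:

```python
# Toy model of the proposed unhandledFilters contract: the data source
# reports which pushed-down predicates it applied only approximately, and
# the engine re-evaluates just those, instead of re-applying all of them.

def scan(rows, filters, exact_filters):
    """A fake source: applies exactly the filters it can handle precisely."""
    out = [r for r in rows if all(f(r) for f in filters if f in exact_filters)]
    unhandled = [f for f in filters if f not in exact_filters]
    return out, unhandled

def engine_filter(rows, filters, exact_filters):
    scanned, unhandled = scan(rows, filters, exact_filters)
    # Spark's default behaviour re-applies *all* filters on its side; the
    # proposal is to re-apply only the ones the source declared unhandled.
    return [r for r in scanned if all(f(r) for f in unhandled)]

rows = [{"a": 1}, {"a": 2}, {"a": 3}]
gt1 = lambda r: r["a"] > 1
lt3 = lambda r: r["a"] < 3
print(engine_filter(rows, [gt1, lt3], exact_filters=[gt1]))  # [{'a': 2}]
```

This also shows how the Cassandra/Solr use case works: a synthetic predicate like `solr_query = ...` would simply never appear in the unhandled list, so Spark would not try to evaluate it against a non-existent column.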
[jira] [Created] (SPARK-11002) pyspark doesn't support UDAF
jixing.ji created SPARK-11002: - Summary: pyspark doesn't support UDAF Key: SPARK-11002 URL: https://issues.apache.org/jira/browse/SPARK-11002 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.5.1 Environment: windows pyspark ubuntu pyspark Reporter: jixing.ji Currently, pyspark doesn't support user-defined aggregate functions (UDAFs), which means data analysts cannot compute certain aggregated results, for example: group by column A and concatenate column B into a list -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
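The missing aggregation ("group by A, collect B into a list") has this shape in plain Python; in PySpark 1.5 the usual workaround is to drop to the RDD API, e.g. `df.rdd.map(lambda r: (r.A, r.B)).groupByKey()`:

```python
# Plain-Python sketch of the requested aggregation: group key/value pairs
# by key and collect the values into a list per key.

from collections import defaultdict

def collect_list(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

pairs = [("x", 1), ("y", 2), ("x", 3)]
print(collect_list(pairs))  # {'x': [1, 3], 'y': [2]}
```

A proper Python UDAF API would let this logic run inside `df.groupBy("A").agg(...)` instead of forcing the round-trip through RDDs.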
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948438#comment-14948438 ] Sean Owen commented on SPARK-10914: --- I ran the latest master in standalone mode, with {{/bin/spark-shell --master spark://localhost:7077 --executor-memory 31g --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"}} {code} scala> val x = sql("select 1 xx union all select 2") x: org.apache.spark.sql.DataFrame = [xx: int] scala> val y = sql("select 1 yy union all select 2") y: org.apache.spark.sql.DataFrame = [yy: int] scala> x.join(y, $"xx" === $"yy").count() res0: Long = 5 {code} Without the {{-XX:-UseCompressedOops}} the answer is 2. Could be unrelated to {{SizeEstimator}}, yes. I get 5 when setting {{spark.test.useCompressedOops=false}} too, which seems to indicate it's not {{SizeEstimator}}. Could it be something in the Tungsten machinery? I see it's part of the plan above. I wasn't sure how to test with that disabled. > Incorrect empty join sets when executor-memory >= 32g > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) 
> {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10883) Document building each module individually
[ https://issues.apache.org/jira/browse/SPARK-10883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10883: -- Assignee: Jean-Baptiste Onofré > Document building each module individually > -- > > Key: SPARK-10883 > URL: https://issues.apache.org/jira/browse/SPARK-10883 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Jean-Baptiste Onofré >Assignee: Jean-Baptiste Onofré >Priority: Trivial > Fix For: 1.6.0 > > > Right now, due to the location of the scalastyle-config.xml location, it's > not possible to build an individual module. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948494#comment-14948494 ] Ben Moran commented on SPARK-10914: --- Either using the large heap, or -XX:-UseCompressedOops triggers the bug. > Incorrect empty join sets when executor-memory >= 32g > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) > {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10914) Incorrect empty join sets when executor-memory >= 32g
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948464#comment-14948464 ] Ben Moran commented on SPARK-10914: --- I also don't see it if I run spark-shell without setting --master. > Incorrect empty join sets when executor-memory >= 32g > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) > {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10925: -- Component/s: SQL > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. > Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at 
scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at 
org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520) > at TestCase2$.main(TestCase2.scala:51) > at TestCase2.main(TestCase2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140) > {code} > I'm attaching a test case
[jira] [Updated] (SPARK-10939) Misaligned data with RDD.zip after repartition
[ https://issues.apache.org/jira/browse/SPARK-10939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10939: -- Component/s: Spark Core > Misaligned data with RDD.zip after repartition > -- > > Key: SPARK-10939 > URL: https://issues.apache.org/jira/browse/SPARK-10939 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.1, 1.5.0 > Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5 > - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5 >Reporter: Dan Brown > > Split out from https://issues.apache.org/jira/browse/SPARK-10685: > Here's a weird behavior where {{RDD.zip}} after a {{repartition}} produces > "misaligned" data, meaning different column values in the same row aren't > matched, as if a zip shuffled the collections before zipping them. It's > difficult to reproduce because it's nondeterministic, doesn't occur in local > mode, and requires ≥2 workers (≥3 in one case). I was able to repro it using > pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), and 1.5.0 > (bin-without-hadoop). > Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying > to build it ourselves when we ran into this problem. Let me put in my vote > for reopening the issue and supporting {{DataFrame.zip}} in the standard lib. > - https://issues.apache.org/jira/browse/SPARK-7460 > h3. 
Repro > Fail: RDD.zip after repartition > {code} > df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1)) > df = df.repartition(100) > rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, > b=y.b)) > [r for r in rdd.collect() if r.a != r.b][:3] # Should be [] > {code} > Sample outputs (nondeterministic): > {code} > [] > [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)] > [] > [] > [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)] > [] > {code} > Test setup: > - local\[8]: {{MASTER=local\[8]}} > - dist\[N]: 1 driver + 1 master + N workers > {code} > "Fail" tests pass? cluster mode spark version > > yes local[8] 1.3.0-cdh5.4.5 > no dist[4] 1.3.0-cdh5.4.5 > yes local[8] 1.4.1 > yes dist[1] 1.4.1 > no dist[2] 1.4.1 > no dist[4] 1.4.1 > yes local[8] 1.5.0 > yes dist[1] 1.5.0 > no dist[2] 1.5.0 > no dist[4] 1.5.0 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
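The failure mode above follows from `zip` pairing elements purely by position, while `repartition` may reorder rows independently on each side. One commonly suggested workaround (an assumption, not an official fix) is to attach explicit keys with `zipWithIndex` *before* any repartitioning and then join by key. A plain-Python sketch:

```python
# Sketch of why positional zip is order-sensitive and how explicit keys fix
# it: once each row carries its own index, any reordering can be undone by
# joining on the index instead of relying on partition order.

import random

rows = list(range(10))
keyed = list(enumerate(rows))   # like rdd.zipWithIndex(): (index, value)
random.shuffle(keyed)           # simulate a repartition reordering the rows

# Positional zip against the original order could now misalign rows, but
# joining on the explicit index (here: sorting by it) re-pairs correctly:
realigned = sorted(keyed)
print([v for _, v in realigned] == rows)  # True
```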
[jira] [Commented] (SPARK-10995) Graceful shutdown drops processing in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948693#comment-14948693 ] Sean Owen commented on SPARK-10995: --- TD's the expert, but I don't really get that -- if you're processing the last hour of data each minute, then I'd expect shutdown to process the current minute, not for another hour. Here however your batch interval and window are the same. In that case do you need a window at all? > Graceful shutdown drops processing in Spark Streaming > - > > Key: SPARK-10995 > URL: https://issues.apache.org/jira/browse/SPARK-10995 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Michal Cizmazia > > After triggering the graceful shutdown on the following application, the > application stops before the windowed stream reaches its slide duration. As a > result, the data is not completely processed (i.e. saveToMyStorage is not > called) before shutdown. > According to the documentation, graceful shutdown should ensure that the > data, which has been received, is completely processed before shutdown. 
> https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html#upgrading-application-code > Spark version: 1.4.1 > Code snippet: > {code:java} > Function0 factory = () -> { > JavaStreamingContext context = new JavaStreamingContext(sparkConf, > Durations.minutes(1)); > context.checkpoint("/test"); > JavaDStream records = > context.receiverStream(myReliableReceiver).flatMap(...); > records.persist(StorageLevel.MEMORY_AND_DISK()); > records.foreachRDD(rdd -> { rdd.count(); return null; }); > records > .window(Durations.minutes(15), Durations.minutes(15)) > .foreachRDD(rdd -> saveToMyStorage(rdd)); > return context; > }; > try (JavaStreamingContext context = JavaStreamingContext.getOrCreate("/test", > factory)) { > context.start(); > waitForShutdownSignal(); > Boolean stopSparkContext = true; > Boolean stopGracefully = true; > context.stop(stopSparkContext, stopGracefully); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
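The amount of data at risk can be sketched with simple arithmetic. Assuming (as in the snippet above) 1-minute batches and a window that both spans and slides every 15 minutes, a graceful shutdown drains the in-flight 1-minute batches but may stop before the next window fires:

```python
# Arithmetic sketch of the reported gap: received batches since the last
# completed window boundary are never handed to the windowed foreachRDD.
# The 15-minute slide matches the code snippet in this issue.

SLIDE_MIN = 15

def minutes_dropped(shutdown_minute):
    """Received-but-unsaved minutes since the last completed window boundary."""
    return shutdown_minute % SLIDE_MIN

print(minutes_dropped(29))  # 14 minutes of received data never saved
print(minutes_dropped(30))  # 0: shutdown landed exactly on a boundary
```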
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948678#comment-14948678 ] Imran Rashid commented on SPARK-4105: - [~nadenf] The FetchFailures don't need to be on the same node as the snappy exceptions for the root cause to be SPARK-8029. Of course, we can't be sure that is the cause either, but it is at least a working hypothesis for now. > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Attachments: JavaObjectToSerialize.java, > SparkFailedToUncompressGenerator.scala > > > We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > 
org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at 
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) >
[jira] [Created] (SPARK-11003) Allowing UserDefinedTypes to extend primitives
John Muller created SPARK-11003: --- Summary: Allowing UserDefinedTypes to extend primitives Key: SPARK-11003 URL: https://issues.apache.org/jira/browse/SPARK-11003 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.1, 1.5.0 Reporter: John Muller Priority: Minor Currently, the classes and constructors of all the primitive DataTypes (of StructFields) are private: https://github.com/apache/spark/tree/master/sql/catalyst/src/main/scala/org/apache/spark/sql/types This means that even for simple String-based UDTs, users always have to implement serialize() and deserialize(). UDTs for something as simple as a Northwind database (products, orders, customers) would be very useful for pattern matching / validation. For example:
{code}
import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[ProductNameUDT])
case class ProductName(name: String) extends StringType with Validator {
  private val pattern = """[A-Z][A-Za-z]*""".r

  def validate(): Boolean = name match {
    case pattern(_*) => true
    case _ => false
  }
}

class ProductNameUDT extends UserDefinedType[ProductName] {
  // No need for this; ProductName is a StringType, so we know how to deserialize
  override def serialize(p: Any): Any = p match {
    case p: ProductName => Seq(p.name)
  }

  // Not sure why this override is needed at all; can't we always get this
  // simply from the UDT type parameter?
  override def userClass: Class[ProductName] = classOf[ProductName]

  // Instead of the below, just infer the StructField name via reflection of
  // the wrapper class' name
  override def sqlType: DataType = StructType(Seq(StructField("ProductName", StringType)))

  // Still needed.
  override def deserialize(datum: Any): ProductName = datum match {
    case values: Seq[_] =>
      assert(values.length == 1)
      ProductName(values.head.asInstanceOf[String])
  }
}
{code}
This would simplify the process of creating "primitive extension" UDTs down to just 2 steps:
1. An annotated case class that extends a primitive DataType
2. The UDT itself, which just needs a deserializer
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
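The validation half of the proposal is independent of Spark's type system; the ProductName wrapper above reduces to a small regex-validated string class. A plain-Python analogue, for illustration:

```python
# Plain-Python analogue of the proposed ProductName wrapper: a string
# "primitive extension" whose only extra behaviour is regex validation.

import re

class ProductName:
    # Capitalized word, letters only, matching the Scala pattern above.
    PATTERN = re.compile(r"[A-Z][A-Za-z]*\Z")

    def __init__(self, name: str):
        self.name = name

    def validate(self) -> bool:
        return bool(self.PATTERN.match(self.name))

print(ProductName("Chai").validate())    # True
print(ProductName("42cans").validate())  # False
```

The point of the JIRA is that only this validation logic should be user-written; the (de)serialization to the underlying StringType could be inherited for free.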
[jira] [Commented] (SPARK-7874) Add a global setting for the fine-grained mesos scheduler that limits the number of concurrent tasks of a job
[ https://issues.apache.org/jira/browse/SPARK-7874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948638#comment-14948638 ] Apache Spark commented on SPARK-7874: - User 'dragos' has created a pull request for this issue: https://github.com/apache/spark/pull/9027 > Add a global setting for the fine-grained mesos scheduler that limits the > number of concurrent tasks of a job > - > > Key: SPARK-7874 > URL: https://issues.apache.org/jira/browse/SPARK-7874 > Project: Spark > Issue Type: Wish > Components: Mesos >Affects Versions: 1.3.1 >Reporter: Thomas Dudziak >Priority: Minor > > This would be a very simple yet effective way to prevent a job dominating the > cluster. A way to override it per job would also be nice but not required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10995) Graceful shutdown drops processing in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948667#comment-14948667 ] Michal Cizmazia commented on SPARK-10995: - [On 7 October 2015 at 21:24, Tathagata Das:|http://search-hadoop.com/m/q3RTtftj6Y1Mu9z1] {quote} Aaah, interesting, you are doing 15 minute slide duration. Yeah, internally the streaming scheduler waits for the last "batch" interval which has data to be processed, but if there is a sliding interval (i.e. 15 mins) that is higher than batch interval, then that might not be run. This is indeed a bug and should be fixed. Mind setting up a JIRA and assigning it to me. {quote} > Graceful shutdown drops processing in Spark Streaming > - > > Key: SPARK-10995 > URL: https://issues.apache.org/jira/browse/SPARK-10995 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Michal Cizmazia > > After triggering the graceful shutdown on the following application, the > application stops before the windowed stream reaches its slide duration. As a > result, the data is not completely processed (i.e. saveToMyStorage is not > called) before shutdown. > According to the documentation, graceful shutdown should ensure that the > data, which has been received, is completely processed before shutdown. 
> https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html#upgrading-application-code > Spark version: 1.4.1 > Code snippet: > {code:java} > Function0 factory = () -> { > JavaStreamingContext context = new JavaStreamingContext(sparkConf, > Durations.minutes(1)); > context.checkpoint("/test"); > JavaDStream records = > context.receiverStream(myReliableReceiver).flatMap(...); > records.persist(StorageLevel.MEMORY_AND_DISK()); > records.foreachRDD(rdd -> { rdd.count(); return null; }); > records > .window(Durations.minutes(15), Durations.minutes(15)) > .foreachRDD(rdd -> saveToMyStorage(rdd)); > return context; > }; > try (JavaStreamingContext context = JavaStreamingContext.getOrCreate("/test", > factory)) { > context.start(); > waitForShutdownSignal(); > Boolean stopSparkContext = true; > Boolean stopGracefully = true; > context.stop(stopSparkContext, stopGracefully); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
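The scheduling gap described above — a graceful stop drains every per-batch job but never fires a window whose slide interval exceeds the batch interval — can be sketched as a toy, pure-Python simulation (no Spark APIs involved; all names below are invented for illustration):

```python
# Toy model of the reported bug: per-batch jobs run for every batch, but a
# windowed job only fires on slide boundaries. A graceful stop waits for the
# per-batch jobs, yet no final slide is triggered for the trailing batches.
def simulate(batches_received, slide_min=15):
    processed = []   # batches handled by the per-batch foreachRDD (rdd.count)
    saved = []       # slide times at which the windowed foreachRDD ran
    for t in range(1, batches_received + 1):   # one batch per minute
        processed.append(t)
        if t % slide_min == 0:
            saved.append(t)
    unsaved = batches_received % slide_min     # batches never saved to storage
    return processed, saved, unsaved

processed, saved, unsaved = simulate(20)
# every batch was received and counted, but only the window at t=15 was
# saved; the 5 batches after it are dropped on graceful shutdown
```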
[jira] [Updated] (SPARK-11004) MapReduce Hive-like join operations for RDDs
[ https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Glenn Strycker updated SPARK-11004: --- Description: Could a feature be added to Spark that would use disk-only MapReduce operations for the very largest RDD joins? MapReduce is able to handle incredibly large table joins in a stable, predictable way with gracious failures and recovery. I have applications that are able to join 2 tables without error in Hive, but these same tables, when converted into RDDs, are unable to join in Spark (I am using the same cluster, and have played around with all of the memory configurations, persisting to disk, checkpointing, etc., and the RDDs are just too big for Spark on my cluster) So, Spark is usually able to handle fairly large RDD joins, but occasionally runs into problems when the tables are just too big (e.g. the notorious 2GB shuffle limit issue, memory problems, etc.) There are so many parameters to adjust (number of partitions, number of cores, memory per core, etc.) that it is difficult to guarantee stability on a shared cluster (say, running Yarn) with other jobs. Could a feature be added to Spark that would use disk-only MapReduce commands to do very large joins? That is, instead of myRDD1.join(myRDD2), we would have a special operation myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run MapReduce, and then convert the results of the join back into a standard RDD. This might add stability for Spark jobs that deal with extremely large data, and enable developers to mix-and-match some Spark and MapReduce operations in the same program, rather than writing Hive scripts and stringing together Spark and MapReduce programs, which has extremely large overhead to convert RDDs to Hive tables and back again. Despite memory-level operations being where most of Spark's speed gains lie, sometimes using disk-only may help with stability! 
was: Could a feature be added to Spark that would use disk-only MapReduce operations for the very largest RDD joins? MapReduce is able to handle incredibly large table joins in a stable, predictable way with gracious failures and recovery. I have applications that are able to join 2 tables without error in Hive, but these same tables, when converted into RDDs, are unable to join in Spark (I am using the same cluster, and have played around with all of the memory configurations, persisting to disk, checkpointing, etc., and the RDDs are just too big for Spark on my cluster) So, Spark is usually able to handle fairly large RDD joins, but occasionally runs into problems when the tables are just too big (e.g. the notorious 2GB shuffle limit issue, memory problems, etc.) There are so many parameters to adjust (number of partitions, number of cores, memory per core, etc.) that it is difficult to guarantee stability on a shared cluster (say, running Yarn) with other jobs. Could a feature be added to Spark that would use disk-only MapReduce commands to do very large joins? That is, instead of myRDD1.join(myRDD2), we would have a special operation myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run MapReduce, and then convert the results of the join back into a standard RDD. This might add stability for Spark jobs that deal with extremely large data, and enable developers to mix-and-max some Spark and MapReduce operations in the same program, rather than writing Hive scripts and stringing together Spark and MapReduce programs, which has extremely large overhead to convert RDDs to Hive tables and back again. Despite memory-level operations being where most of Spark's speed gains lie, sometimes using disk-only may help with stability! 
> MapReduce Hive-like join operations for RDDs > > > Key: SPARK-11004 > URL: https://issues.apache.org/jira/browse/SPARK-11004 > Project: Spark > Issue Type: New Feature >Reporter: Glenn Strycker > > Could a feature be added to Spark that would use disk-only MapReduce > operations for the very largest RDD joins? > MapReduce is able to handle incredibly large table joins in a stable, > predictable way with gracious failures and recovery. I have applications > that are able to join 2 tables without error in Hive, but these same tables, > when converted into RDDs, are unable to join in Spark (I am using the same > cluster, and have played around with all of the memory configurations, > persisting to disk, checkpointing, etc., and the RDDs are just too big for > Spark on my cluster) > So, Spark is usually able to handle fairly large RDD joins, but occasionally > runs into problems when the tables are just too big (e.g. the notorious 2GB > shuffle limit issue, memory problems, etc.) There are so many parameters to > adjust (number of
[jira] [Created] (SPARK-11004) MapReduce Hive-like join operations for RDDs
Glenn Strycker created SPARK-11004: -- Summary: MapReduce Hive-like join operations for RDDs Key: SPARK-11004 URL: https://issues.apache.org/jira/browse/SPARK-11004 Project: Spark Issue Type: New Feature Reporter: Glenn Strycker Could a feature be added to Spark that would use disk-only MapReduce operations for the very largest RDD joins? MapReduce is able to handle incredibly large table joins in a stable, predictable way with gracious failures and recovery. I have applications that are able to join 2 tables without error in Hive, but these same tables, when converted into RDDs, are unable to join in Spark (I am using the same cluster, and have played around with all of the memory configurations, persisting to disk, checkpointing, etc., and the RDDs are just too big for Spark on my cluster) So, Spark is usually able to handle fairly large RDD joins, but occasionally runs into problems when the tables are just too big (e.g. the notorious 2GB shuffle limit issue, memory problems, etc.) There are so many parameters to adjust (number of partitions, number of cores, memory per core, etc.) that it is difficult to guarantee stability on a shared cluster (say, running Yarn) with other jobs. Could a feature be added to Spark that would use disk-only MapReduce commands to do very large joins? That is, instead of myRDD1.join(myRDD2), we would have a special operation myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run MapReduce, and then convert the results of the join back into a standard RDD. This might add stability for Spark jobs that deal with extremely large data, and enable developers to mix-and-match some Spark and MapReduce operations in the same program, rather than writing Hive scripts and stringing together Spark and MapReduce programs, which has extremely large overhead to convert RDDs to Hive tables and back again. Despite memory-level operations being where most of Spark's speed gains lie, sometimes using disk-only may help with stability! 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
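A disk-only join of the kind requested is essentially what MapReduce does under the hood: sort both inputs by key into on-disk runs, then stream a merge join over them. A minimal, hypothetical Python sketch of that idea (temp files stand in for HDFS; `mapreduceJoin`-style naming is the reporter's proposal, not an existing Spark API):

```python
import os
import pickle
import tempfile
from itertools import groupby

def spill_sorted(pairs):
    """Spill (key, value) pairs to a temp file, sorted by key."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for kv in sorted(pairs, key=lambda kv: kv[0]):
            pickle.dump(kv, f)
    return path

def read_pairs(path):
    """Stream (key, value) pairs back from a spill file."""
    with open(path, "rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

def _grouped(path):
    # keys are consecutive in a sorted run, so groupby yields one group per key
    for k, g in groupby(read_pairs(path), key=lambda kv: kv[0]):
        yield k, [v for _, v in g]

def merge_join(left_path, right_path):
    """Inner join of two key-sorted spill files, streamed group by group."""
    lit, rit = _grouped(left_path), _grouped(right_path)
    l, r = next(lit, None), next(rit, None)
    while l is not None and r is not None:
        (lk, lvs), (rk, rvs) = l, r
        if lk < rk:
            l = next(lit, None)
        elif rk < lk:
            r = next(rit, None)
        else:
            for lv in lvs:
                for rv in rvs:
                    yield lk, (lv, rv)
            l, r = next(lit, None), next(rit, None)

def mapreduce_join(left, right):
    """Hypothetical disk-backed join: spill both sides, then merge."""
    lp, rp = spill_sorted(left), spill_sorted(right)
    try:
        return list(merge_join(lp, rp))
    finally:
        os.remove(lp)
        os.remove(rp)
```

Only one group per key is ever held in memory, which is what gives the sort-merge strategy its stability on inputs far larger than RAM.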
[jira] [Resolved] (SPARK-10999) Physical plan node Coalesce should be able to handle UnsafeRow
[ https://issues.apache.org/jira/browse/SPARK-10999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-10999. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9024 [https://github.com/apache/spark/pull/9024] > Physical plan node Coalesce should be able to handle UnsafeRow > -- > > Key: SPARK-10999 > URL: https://issues.apache.org/jira/browse/SPARK-10999 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Minor > Fix For: 1.6.0 > > > The following PySpark snippet shows the problem: > {noformat} > >>> sqlContext.range(10).selectExpr('id AS a').coalesce(1).explain(True) > ... > == Physical Plan == > Coalesce 1 > ConvertToSafe > TungstenProject [id#3L AS a#4L] >Scan PhysicalRDD[id#3L] > {noformat} > The {{ConvertToSafe}} is unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs
[ https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948943#comment-14948943 ] Glenn Strycker commented on SPARK-11004: True, fixing the 2GB limit will go a long way. However, this isn't a bug ticket, but a request for a new Spark feature in future versions: running MapReduce on RDDs. This would add a lot of functionality for many possible use cases. This request is asking for an alternate back-end for the join operation, so if the normal rdd.join operation fails, a developer could force using this alternate MapReduce method that would run on disk rather than memory. The reason for this request is that there are actually scenarios where MapReduce really is preferable to Spark, but it is inefficient for developers to mix code and have to run programs with scripts. The request here is for additional back-end functionality for basic RDD operations that fail at scale in Spark but work correctly in MapReduce. > MapReduce Hive-like join operations for RDDs > > > Key: SPARK-11004 > URL: https://issues.apache.org/jira/browse/SPARK-11004 > Project: Spark > Issue Type: New Feature >Reporter: Glenn Strycker > > Could a feature be added to Spark that would use disk-only MapReduce > operations for the very largest RDD joins? > MapReduce is able to handle incredibly large table joins in a stable, > predictable way with gracious failures and recovery. I have applications > that are able to join 2 tables without error in Hive, but these same tables, > when converted into RDDs, are unable to join in Spark (I am using the same > cluster, and have played around with all of the memory configurations, > persisting to disk, checkpointing, etc., and the RDDs are just too big for > Spark on my cluster) > So, Spark is usually able to handle fairly large RDD joins, but occasionally > runs into problems when the tables are just too big (e.g. the notorious 2GB > shuffle limit issue, memory problems, etc.) 
There are so many parameters to > adjust (number of partitions, number of cores, memory per core, etc.) that it > is difficult to guarantee stability on a shared cluster (say, running Yarn) > with other jobs. > Could a feature be added to Spark that would use disk-only MapReduce commands > to do very large joins? > That is, instead of myRDD1.join(myRDD2), we would have a special operation > myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run > MapReduce, and then convert the results of the join back into a standard RDD. > This might add stability for Spark jobs that deal with extremely large data, > and enable developers to mix-and-match some Spark and MapReduce operations in > the same program, rather than writing Hive scripts and stringing together > Spark and MapReduce programs, which has extremely large overhead to convert > RDDs to Hive tables and back again. > Despite memory-level operations being where most of Spark's speed gains lie, > sometimes using disk-only may help with stability! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11005) Spark 1.5 Shuffle performance - (sort-based shuffle)
[ https://issues.apache.org/jira/browse/SPARK-11005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Pal updated SPARK-11005: Summary: Spark 1.5 Shuffle performance - (sort-based shuffle) (was: Spark 1.5 Shuffle performance) > Spark 1.5 Shuffle performance - (sort-based shuffle) > > > Key: SPARK-11005 > URL: https://issues.apache.org/jira/browse/SPARK-11005 > Project: Spark > Issue Type: Question > Components: Shuffle, SQL >Affects Versions: 1.5.0 > Environment: 6 node cluster with 1 master and 5 worker nodes. > Memory > 100 GB each > Cores = 72 each > Input data ~496 GB >Reporter: Sandeep Pal > > In case of terasort by Spark SQL with 20 total cores(4 cores/ executor), the > performance of the map tasks is 14 minutes (around 26s-30s each) where as if > I increase the number of cores to 60(12 cores /executor), the performance of > map degrades to 30 minutes ( ~2.3 minutes per task). I believe the map tasks > are independent of each other in the shuffle. > Each map task has 128 MB input (HDFS block size) in both of the above cases. > So, what makes the performance degradation with increasing number of cores. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10987) yarn-client mode misbehaving with netty-based RPC backend
[ https://issues.apache.org/jira/browse/SPARK-10987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-10987. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 1.6.0 > yarn-client mode misbehaving with netty-based RPC backend > - > > Key: SPARK-10987 > URL: https://issues.apache.org/jira/browse/SPARK-10987 > Project: Spark > Issue Type: Bug > Components: Spark Core, YARN >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Blocker > Fix For: 1.6.0 > > > YARN running in cluster deploy mode seems to be having issues with the new > RPC backend; if you look at unit test runs, tests that run in cluster mode > are taking several minutes to run, instead of the more usual 20-30 seconds. > For example, > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/43349/consoleFull: > {noformat} > [info] YarnClusterSuite: > [info] - run Spark in yarn-client mode (13 seconds, 953 milliseconds) > [info] - run Spark in yarn-cluster mode (6 minutes, 50 seconds) > [info] - run Spark in yarn-cluster mode unsuccessfully (1 minute, 53 seconds) > [info] - run Python application in yarn-client mode (21 seconds, 842 > milliseconds) > [info] - run Python application in yarn-cluster mode (7 minutes, 0 seconds) > [info] - user class path first in client mode (1 minute, 58 seconds) > [info] - user class path first in cluster mode (4 minutes, 49 seconds) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs
[ https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948920#comment-14948920 ] Sean Owen commented on SPARK-11004: --- Spark has had a sort-based shuffle for a while, which is a lot of the battle. Yes there's already a known issue about 2GB block size limits. Is there a specific issue here beyond what's already in JIRA? > MapReduce Hive-like join operations for RDDs > > > Key: SPARK-11004 > URL: https://issues.apache.org/jira/browse/SPARK-11004 > Project: Spark > Issue Type: New Feature >Reporter: Glenn Strycker > > Could a feature be added to Spark that would use disk-only MapReduce > operations for the very largest RDD joins? > MapReduce is able to handle incredibly large table joins in a stable, > predictable way with gracious failures and recovery. I have applications > that are able to join 2 tables without error in Hive, but these same tables, > when converted into RDDs, are unable to join in Spark (I am using the same > cluster, and have played around with all of the memory configurations, > persisting to disk, checkpointing, etc., and the RDDs are just too big for > Spark on my cluster) > So, Spark is usually able to handle fairly large RDD joins, but occasionally > runs into problems when the tables are just too big (e.g. the notorious 2GB > shuffle limit issue, memory problems, etc.) There are so many parameters to > adjust (number of partitions, number of cores, memory per core, etc.) that it > is difficult to guarantee stability on a shared cluster (say, running Yarn) > with other jobs. > Could a feature be added to Spark that would use disk-only MapReduce commands > to do very large joins? > That is, instead of myRDD1.join(myRDD2), we would have a special operation > myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run > MapReduce, and then convert the results of the join back into a standard RDD. 
> This might add stability for Spark jobs that deal with extremely large data, > and enable developers to mix-and-match some Spark and MapReduce operations in > the same program, rather than writing Hive scripts and stringing together > Spark and MapReduce programs, which has extremely large overhead to convert > RDDs to Hive tables and back again. > Despite memory-level operations being where most of Spark's speed gains lie, > sometimes using disk-only may help with stability! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10942) Not all cached RDDs are unpersisted
[ https://issues.apache.org/jira/browse/SPARK-10942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948928#comment-14948928 ] Nick Pritchard commented on SPARK-10942: Thanks [~sowen] for trying! I'll let it go. > Not all cached RDDs are unpersisted > --- > > Key: SPARK-10942 > URL: https://issues.apache.org/jira/browse/SPARK-10942 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Nick Pritchard >Priority: Minor > Attachments: SPARK-10942_1.png, SPARK-10942_2.png, SPARK-10942_3.png > > > I have a Spark Streaming application that caches RDDs inside of a > {{transform}} closure. Looking at the Spark UI, it seems that most of these > RDDs are unpersisted after the batch completes, but not all. > I have copied a minimal reproducible example below to highlight the problem. > I run this and monitor the Spark UI "Storage" tab. The example generates and > caches 30 RDDs, and I see most get cleaned up. However in the end, some still > remain cached. There is some randomness going on because I see different RDDs > remain cached for each run. > I have marked this as Major because I haven't been able to workaround it and > it is a memory leak for my application. I tried setting {{spark.cleaner.ttl}} > but that did not change anything. > {code} > val inputRDDs = mutable.Queue.tabulate(30) { i => > sc.parallelize(Seq(i)) > } > val input: DStream[Int] = ssc.queueStream(inputRDDs) > val output = input.transform { rdd => > if (rdd.isEmpty()) { > rdd > } else { > val rdd2 = rdd.map(identity) > rdd2.setName(rdd.first().toString) > rdd2.cache() > val rdd3 = rdd2.map(identity) > rdd3 > } > } > output.print() > ssc.start() > ssc.awaitTermination() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given
[ https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-10858: - Assignee: Thomas Graves > YARN: archives/jar/files rename with # doesn't work unless scheme given > --- > > Key: SPARK-10858 > URL: https://issues.apache.org/jira/browse/SPARK-10858 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Minor > > The YARN distributed cache feature with --jars, --archives, --files where you > can rename the file/archive using a # symbol only works if you explicitly > include the scheme in the path: > works: > --jars file:///home/foo/my.jar#renamed.jar > doesn't work: > --jars /home/foo/my.jar#renamed.jar > Exception in thread "main" java.io.FileNotFoundException: File > file:/home/foo/my.jar#renamed.jar does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289) > at > org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240) > at > org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) -- This message 
was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given
[ https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14948941#comment-14948941 ] Thomas Graves commented on SPARK-10858: --- Sorry for the delay on this; I didn't have time to look at it. Not sure why you are seeing something different from me. Thanks for looking into this. I agree it's in the parsing, in resolveURI where it's calling new File(path).getAbsoluteFile().toURI(). When I don't specify file://:
15/10/08 15:35:56 INFO Client: local uri is: file:/homes/tgraves/R_install/R_install.tgz%23R_installation
With file://:
15/10/08 15:38:27 INFO Client: local uri is: file:/homes/tgraves/R_install/R_install.tgz#R_installation
That is coming back with the %23 encoded versus the #. When I originally wrote that code it wasn't calling Utils.resolveURIs. Looking at the actual code for File.toURI(), you will see it's not really parsing the fragment out before calling URI(), which I think is the problem:
{code}
public URI toURI() {
    try {
        File f = getAbsoluteFile();
        String sp = slashify(f.getPath(), f.isDirectory());
        if (sp.startsWith("//"))
            sp = "//" + sp;
        return new URI("file", null, sp, null);
    } catch (URISyntaxException x) {
        throw new Error(x); // Can't happen
    }
}
{code}
It seems like a bad idea to call this, given that the string might already be in URI format. We are now going from a possible URI to a File and back to a URI. When the string becomes a File, nothing expects it to already be a URI with a fragment, so the fragment is treated as part of the path. 
> YARN: archives/jar/files rename with # doesn't work unless scheme given > --- > > Key: SPARK-10858 > URL: https://issues.apache.org/jira/browse/SPARK-10858 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Priority: Minor > > The YARN distributed cache feature with --jars, --archives, --files where you > can rename the file/archive using a # symbol only works if you explicitly > include the scheme in the path: > works: > --jars file:///home/foo/my.jar#renamed.jar > doesn't work: > --jars /home/foo/my.jar#renamed.jar > Exception in thread "main" java.io.FileNotFoundException: File > file:/home/foo/my.jar#renamed.jar does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289) > at > org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240) > at > org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
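The File.toURI() pitfall described above has a direct analogue in other URI libraries: converting a filesystem path that already contains a '#' fragment percent-encodes it as %23, because '#' is a legal character in a file name. A small Python illustration of the same effect (not Spark's actual code path):

```python
from pathlib import Path

raw = "/home/foo/my.jar#renamed.jar"

# Treating the whole string as a path encodes '#' as part of the file name,
# just like File.toURI() does in the log lines above.
naive = Path(raw).as_uri()
assert "%23" in naive and "#" not in naive

# Splitting the fragment off *before* building the URI preserves it, which
# is effectively what supplying an explicit file:// scheme accomplishes.
path, _, fragment = raw.partition("#")
fixed = Path(path).as_uri() + "#" + fragment
# fixed == "file:///home/foo/my.jar#renamed.jar"
```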
[jira] [Created] (SPARK-11005) Spark 1.5 Shuffle performance
Sandeep Pal created SPARK-11005: --- Summary: Spark 1.5 Shuffle performance Key: SPARK-11005 URL: https://issues.apache.org/jira/browse/SPARK-11005 Project: Spark Issue Type: Question Components: Shuffle, SQL Affects Versions: 1.5.0 Environment: 6 node cluster with 1 master and 5 worker nodes. Memory > 100 GB each Cores = 72 each Input data ~496 GB Reporter: Sandeep Pal In the case of terasort via Spark SQL with 20 total cores (4 cores/executor), the map tasks complete in 14 minutes (around 26s-30s each), whereas if I increase the number of cores to 60 (12 cores/executor), the map stage degrades to 30 minutes (~2.3 minutes per task). I believe the map tasks are independent of each other in the shuffle. Each map task has 128 MB input (HDFS block size) in both of the above cases. So, what causes the performance degradation as the number of cores increases? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
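For context on the "sort-based shuffle" named in the retitled summary: each map task writes a single output file sorted by target reducer partition, plus an index of offsets, instead of one file per reducer. A toy Python sketch of that map-side layout (illustrative only; lists stand in for the data and index files, and all names are invented):

```python
def map_side_shuffle(records, num_partitions):
    """Mimic the single-file-plus-index layout of a sort-based shuffle:
    sort map output by reducer partition id and record slice offsets."""
    tagged = [(hash(k) % num_partitions, (k, v)) for k, v in records]
    tagged.sort(key=lambda t: t[0])        # sort by target reducer only
    data = [kv for _, kv in tagged]        # the single "shuffle file"
    # index[i] .. index[i+1] delimits the slice that reducer i will fetch
    index = [0] * (num_partitions + 1)
    for pid, _ in tagged:
        index[pid + 1] += 1
    for i in range(num_partitions):
        index[i + 1] += index[i]
    return data, index

def fetch_partition(data, index, pid):
    """What a reducer reads from this map task's output."""
    return data[index[pid]:index[pid + 1]]
```

Because every map task performs this sort and file write independently, contention for disk and memory bandwidth among many concurrent tasks per executor is a plausible (though here unverified) suspect for the slowdown described.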
[jira] [Updated] (SPARK-11005) Spark 1.5 Shuffle performance - (sort-based shuffle)
[ https://issues.apache.org/jira/browse/SPARK-11005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandeep Pal updated SPARK-11005: Environment: 6 node cluster with 1 master and 5 worker nodes. Memory > 100 GB each Cores = 72 each Input data ~94 GB was: 6 node cluster with 1 master and 5 worker nodes. Memory > 100 GB each Cores = 72 each Input data ~496 GB > Spark 1.5 Shuffle performance - (sort-based shuffle) > > > Key: SPARK-11005 > URL: https://issues.apache.org/jira/browse/SPARK-11005 > Project: Spark > Issue Type: Question > Components: Shuffle, SQL >Affects Versions: 1.5.0 > Environment: 6 node cluster with 1 master and 5 worker nodes. > Memory > 100 GB each > Cores = 72 each > Input data ~94 GB >Reporter: Sandeep Pal > > In case of terasort by Spark SQL with 20 total cores(4 cores/ executor), the > performance of the map tasks is 14 minutes (around 26s-30s each) where as if > I increase the number of cores to 60(12 cores /executor), the performance of > map degrades to 30 minutes ( ~2.3 minutes per task). I believe the map tasks > are independent of each other in the shuffle. > Each map task has 128 MB input (HDFS block size) in both of the above cases. > So, what makes the performance degradation with increasing number of cores. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10836) SparkR: Add sort function to dataframe
[ https://issues.apache.org/jira/browse/SPARK-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-10836. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8920 [https://github.com/apache/spark/pull/8920] > SparkR: Add sort function to dataframe > -- > > Key: SPARK-10836 > URL: https://issues.apache.org/jira/browse/SPARK-10836 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > Fix For: 1.6.0 > > > Hi everyone, > the sort function can be used as an alternative to arrange(... ). > As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of > orderings for columns and the list of columns, represented as string names > for example: > sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to > sort some of the columns in the same order > sort(df, decreasing=TRUE, "col1") > sort(df, decreasing=c(TRUE,FALSE), "col1","col2") > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10836) SparkR: Add sort function to dataframe
[ https://issues.apache.org/jira/browse/SPARK-10836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-10836: -- Assignee: Narine Kokhlikyan > SparkR: Add sort function to dataframe > -- > > Key: SPARK-10836 > URL: https://issues.apache.org/jira/browse/SPARK-10836 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Assignee: Narine Kokhlikyan >Priority: Minor > Fix For: 1.6.0 > > > Hi everyone, > the sort function can be used as an alternative to arrange(... ). > As arguments it accepts x - dataframe, decreasing - TRUE/FALSE, a list of > orderings for columns and the list of columns, represented as string names > for example: > sort(df, TRUE, "col1","col2","col3","col5") # for example, if we want to > sort some of the columns in the same order > sort(df, decreasing=TRUE, "col1") > sort(df, decreasing=c(TRUE,FALSE), "col1","col2") > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10998) Show non-children in default Expression.toString
[ https://issues.apache.org/jira/browse/SPARK-10998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-10998. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9022 [https://github.com/apache/spark/pull/9022] > Show non-children in default Expression.toString > > > Key: SPARK-10998 > URL: https://issues.apache.org/jira/browse/SPARK-10998 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8654) Analysis exception when using "NULL IN (...)": invalid cast
[ https://issues.apache.org/jira/browse/SPARK-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-8654. - Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8983 [https://github.com/apache/spark/pull/8983] > Analysis exception when using "NULL IN (...)": invalid cast > --- > > Key: SPARK-8654 > URL: https://issues.apache.org/jira/browse/SPARK-8654 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Santiago M. Mola >Priority: Minor > Fix For: 1.6.0 > > > The following query throws an analysis exception: > {code} > SELECT * FROM t WHERE NULL NOT IN (1, 2, 3); > {code} > The exception is: > {code} > org.apache.spark.sql.AnalysisException: invalid cast from int to null; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:66) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52) > {code} > Here is a test that can be added to AnalysisSuite to check the issue: > {code} > test("SPARK- regression test") { > val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), > "a")() :: Nil, > LocalRelation() > ) > caseInsensitiveAnalyze(plan) > } > {code} > Note that this kind of query is a corner case, but it is still valid SQL. An > expression such as "NULL IN (...)" or "NULL NOT IN (...)" always gives NULL > as a result, even if the list contains NULL. So it is safe to translate these > expressions to Literal(null) during analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs
[ https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949060#comment-14949060 ] Sean Owen commented on SPARK-11004: --- Literally run a Mapper and Reducer on Spark? I think it would be more efficient to run them via MapReduce. I imagine that most people using MR still have a Hadoop cluster handy that they're running Spark on too, so this is quite viable. What's the MapReduce method you're referring to? Not literally calling to MapReduce, right? I'm trying to isolate the distinct change being described here. Otherwise I think this is already covered in existing improvements to Spark itself. > MapReduce Hive-like join operations for RDDs > > > Key: SPARK-11004 > URL: https://issues.apache.org/jira/browse/SPARK-11004 > Project: Spark > Issue Type: New Feature >Reporter: Glenn Strycker > > Could a feature be added to Spark that would use disk-only MapReduce > operations for the very largest RDD joins? > MapReduce is able to handle incredibly large table joins in a stable, > predictable way with graceful failures and recovery. I have applications > that are able to join 2 tables without error in Hive, but these same tables, > when converted into RDDs, are unable to join in Spark (I am using the same > cluster, and have played around with all of the memory configurations, > persisting to disk, checkpointing, etc., and the RDDs are just too big for > Spark on my cluster). > So, Spark is usually able to handle fairly large RDD joins, but occasionally > runs into problems when the tables are just too big (e.g. the notorious 2GB > shuffle limit issue, memory problems, etc.) There are so many parameters to > adjust (number of partitions, number of cores, memory per core, etc.) that it > is difficult to guarantee stability on a shared cluster (say, running Yarn) > with other jobs. > Could a feature be added to Spark that would use disk-only MapReduce commands > to do very large joins? 
> That is, instead of myRDD1.join(myRDD2), we would have a special operation > myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run > MapReduce, and then convert the results of the join back into a standard RDD. > This might add stability for Spark jobs that deal with extremely large data, > and enable developers to mix-and-match some Spark and MapReduce operations in > the same program, rather than writing Hive scripts and stringing together > Spark and MapReduce programs, which has extremely large overhead to convert > RDDs to Hive tables and back again. > Despite memory-level operations being where most of Spark's speed gains lie, > sometimes using disk-only may help with stability! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
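The disk-friendly strategy such a hypothetical {{mapReduceJoin}} would lean on is the sort-merge join that a MapReduce shuffle performs: sort both sides by key, then merge them in one streaming pass so only the current key group needs to live in memory. A minimal in-memory sketch of that merge step, with names and simplifications of my own (this is an illustration of the technique, not a proposed Spark API):

```scala
// Sort-merge join: the strategy behind a MapReduce shuffle join. Both sides
// are sorted by key, then merged in a single streaming pass; on disk, each
// side would be a sorted spill file read sequentially.
def sortMergeJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)])(
    implicit ord: Ordering[K]): Seq[(K, (A, B))] = {
  val l = left.sortBy(_._1)(ord)
  val r = right.sortBy(_._1)(ord)
  val out = scala.collection.mutable.Buffer[(K, (A, B))]()
  var i = 0
  var j = 0
  while (i < l.length && j < r.length) {
    val c = ord.compare(l(i)._1, r(j)._1)
    if (c < 0) i += 1        // left key is smaller: advance left
    else if (c > 0) j += 1   // right key is smaller: advance right
    else {
      // Keys match: emit the cross product of the current key group only.
      val k = l(i)._1
      val iEnd = l.indexWhere(_._1 != k, i) match { case -1 => l.length; case x => x }
      val jEnd = r.indexWhere(_._1 != k, j) match { case -1 => r.length; case x => x }
      for (a <- l.slice(i, iEnd); b <- r.slice(j, jEnd))
        out += ((k, (a._2, b._2)))
      i = iEnd
      j = jEnd
    }
  }
  out.toSeq
}
```

Only the matching key group is materialized at once, which is why this shape of join degrades gracefully as inputs grow.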
[jira] [Commented] (SPARK-11005) Spark 1.5 Shuffle performance - (sort-based shuffle)
[ https://issues.apache.org/jira/browse/SPARK-11005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949059#comment-14949059 ] Sean Owen commented on SPARK-11005: --- [~vnayak053] could I ask you to put this on the mailing list? It's not quite a JIRA as you're asking a question. There's a similar thread at this very moment about why jobs may not scale linearly: https://www.mail-archive.com/user@spark.apache.org/msg38382.html > Spark 1.5 Shuffle performance - (sort-based shuffle) > > > Key: SPARK-11005 > URL: https://issues.apache.org/jira/browse/SPARK-11005 > Project: Spark > Issue Type: Question > Components: Shuffle, SQL >Affects Versions: 1.5.0 > Environment: 6 node cluster with 1 master and 5 worker nodes. > Memory > 100 GB each > Cores = 72 each > Input data ~94 GB >Reporter: Sandeep Pal > > In case of terasort by Spark SQL with 20 total cores (4 cores/executor), the > performance of the map tasks is 14 minutes (around 26s-30s each), whereas if > I increase the number of cores to 60 (12 cores/executor), the performance of > map degrades to 30 minutes (~2.3 minutes per task). I believe the map tasks > are independent of each other in the shuffle. > Each map task has 128 MB input (HDFS block size) in both of the above cases. > So, what causes the performance degradation as the number of cores increases? > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10914: Description: Updated description (by rxin on Oct 8, 2015) To reproduce, launch Spark using {code} MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" {code} *Original bug report description*: Using an inner join, to match together two integer columns, I generally get no results when there should be matches. But the results vary and depend on whether the dataframes are coming from SQL, JSON, or cached, as well as the order in which I cache things and query them. This minimal example reproduces it consistently for me in the spark-shell, on new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from http://spark.apache.org/downloads.html.) {code} /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ val x = sql("select 1 xx union all select 2") val y = sql("select 1 yy union all select 2") x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ /* If I cache both tables it works: */ x.cache() y.cache() x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ /* but this still doesn't work: */ x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ {code} was: Updated description (by rxin on Oct 8, 2015) To reproduce, launch Spark using {code} MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" {code} Original bug report description: Using an inner join, to match together two integer columns, I generally get no results when there should be matches. But the results vary and depend on whether the dataframes are coming from SQL, JSON, or cached, as well as the order in which I cache things and query them. 
This minimal example reproduces it consistently for me in the spark-shell, on new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from http://spark.apache.org/downloads.html.) {code} /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ val x = sql("select 1 xx union all select 2") val y = sql("select 1 yy union all select 2") x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ /* If I cache both tables it works: */ x.cache() y.cache() x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ /* but this still doesn't work: */ x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ {code} > UnsafeRow serialization breaks when two machines have different Oops size > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > Updated description (by rxin on Oct 8, 2015) > To reproduce, launch Spark using > {code} > MASTER=local-cluster[2,1,1024] bin/spark-shell --conf > "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" > {code} > *Original bug report description*: > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) 
> {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11007) Add dictionary support for CatalystDecimalConverter
Cheng Lian created SPARK-11007: -- Summary: Add dictionary support for CatalystDecimalConverter Key: SPARK-11007 URL: https://issues.apache.org/jira/browse/SPARK-11007 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1, 1.5.0, 1.4.1, 1.4.0 Reporter: Cheng Lian Assignee: Cheng Lian Currently {{CatalystDecimalConverter}} doesn't explicitly support dictionary encoding. The consequence is that the underlying Parquet {{ColumnReader}} always sends raw {{Int}}/{{Long}}/{{Binary}} values decoded from the dictionary to {{CatalystDecimalConverter}}, even if the column is encoded using a dictionary. By adding explicit dictionary support (similar to what {{CatalystStringConverter}} does), we can avoid constructing decimals repeatedly. This should be especially effective for binary-backed decimals. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
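The optimization described here can be illustrated outside Spark/Parquet: when a column is dictionary-encoded, each distinct raw value only needs to be converted to a decimal once, up front, after which repeated values are served by id lookup. A self-contained Scala sketch, where the {{Decimal}} case class and converter name are stand-ins of mine rather than Spark's actual internals:

```scala
// Stand-in for Spark's internal decimal representation.
final case class Decimal(unscaled: Long, scale: Int)

// Dictionary-aware converter: pre-convert every distinct dictionary entry
// once, so the per-value hot path is a plain array index with no repeated
// decimal construction (the idea CatalystStringConverter applies to strings).
class DictDecimalConverter(rawDictionary: Array[Long], scale: Int) {
  // One conversion per distinct value, done eagerly at dictionary-load time.
  private val converted: Array[Decimal] =
    rawDictionary.map(raw => Decimal(raw, scale))

  // Hot path: dictionary id -> already-converted decimal.
  def convertById(id: Int): Decimal = converted(id)
}
```

For a column with many repeated values, the saving is proportional to (rows / distinct values), which is why this helps most on heavily dictionary-compressed columns.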
[jira] [Updated] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10914: Summary: UnsafeRow serialization breaks when two machines have different Oops size (was: Incorrect empty join sets when executor-memory >= 32g) > UnsafeRow serialization breaks when two machines have different Oops size > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) > {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs
[ https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949076#comment-14949076 ] Glenn Strycker commented on SPARK-11004: Currently we could do the following from within a Linux script: 1) run Spark through creating the 2 RDDs, 2) save both to a Hive table, 3) end my Spark program, 4) run Hive Beeline on a "select * from table1 join table2 on columnname" and insert this into a new 3rd table, 5) start a new Spark program to continue on from the first program in step 3, loading the output of the Hive join into the results RDD. Rather than doing all of this overhead, it would be pretty cool if we could run the map and reduce jobs as a Hive join would do, only on our existing Spark RDDs, without needing to involve Hive or wrapper scripts, but doing everything from within a single Spark program. I believe the main stability gain is due to Hive performing everything on disk instead of memory. Since we already can checkpoint RDDs to disk, perhaps this ticket request could be accomplished by adding a feature to RDDs that would be performed on the checkpointed files, rather than the in-memory RDDs. We're just looking for a stability gain and the ability to increase the potential size of RDDs in their operations. Right now our Hive Beeline scripts are out-performing Spark for these super massive table joins. > MapReduce Hive-like join operations for RDDs > > > Key: SPARK-11004 > URL: https://issues.apache.org/jira/browse/SPARK-11004 > Project: Spark > Issue Type: New Feature >Reporter: Glenn Strycker > > Could a feature be added to Spark that would use disk-only MapReduce > operations for the very largest RDD joins? > MapReduce is able to handle incredibly large table joins in a stable, > predictable way with graceful failures and recovery. 
I have applications > that are able to join 2 tables without error in Hive, but these same tables, > when converted into RDDs, are unable to join in Spark (I am using the same > cluster, and have played around with all of the memory configurations, > persisting to disk, checkpointing, etc., and the RDDs are just too big for > Spark on my cluster) > So, Spark is usually able to handle fairly large RDD joins, but occasionally > runs into problems when the tables are just too big (e.g. the notorious 2GB > shuffle limit issue, memory problems, etc.) There are so many parameters to > adjust (number of partitions, number of cores, memory per core, etc.) that it > is difficult to guarantee stability on a shared cluster (say, running Yarn) > with other jobs. > Could a feature be added to Spark that would use disk-only MapReduce commands > to do very large joins? > That is, instead of myRDD1.join(myRDD2), we would have a special operation > myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run > MapReduce, and then convert the results of the join back into a standard RDD. > This might add stability for Spark jobs that deal with extremely large data, > and enable developers to mix-and-match some Spark and MapReduce operations in > the same program, rather than writing Hive scripts and stringing together > Spark and MapReduce programs, which has extremely large overhead to convert > RDDs to Hive tables and back again. > Despite memory-level operations being where most of Spark's speed gains lie, > sometimes using disk-only may help with stability! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11006) Rename NullColumnAccess as NullColumnAccessor
[ https://issues.apache.org/jira/browse/SPARK-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949010#comment-14949010 ] Apache Spark commented on SPARK-11006: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/9028 > Rename NullColumnAccess as NullColumnAccessor > - > > Key: SPARK-11006 > URL: https://issues.apache.org/jira/browse/SPARK-11006 > Project: Spark > Issue Type: Task >Reporter: Ted Yu >Priority: Trivial > > In sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala > , NullColumnAccess should be renamed as NullColumnAccessor so that the same > convention is adhered to for the accessors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11006) Rename NullColumnAccess as NullColumnAccessor
[ https://issues.apache.org/jira/browse/SPARK-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11006: Assignee: (was: Apache Spark) > Rename NullColumnAccess as NullColumnAccessor > - > > Key: SPARK-11006 > URL: https://issues.apache.org/jira/browse/SPARK-11006 > Project: Spark > Issue Type: Task >Reporter: Ted Yu >Priority: Trivial > > In sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala > , NullColumnAccess should be renamed as NullColumnAccessor so that the same > convention is adhered to for the accessors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11006) Rename NullColumnAccess as NullColumnAccessor
[ https://issues.apache.org/jira/browse/SPARK-11006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11006: Assignee: Apache Spark > Rename NullColumnAccess as NullColumnAccessor > - > > Key: SPARK-11006 > URL: https://issues.apache.org/jira/browse/SPARK-11006 > Project: Spark > Issue Type: Task >Reporter: Ted Yu >Assignee: Apache Spark >Priority: Trivial > > In sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala > , NullColumnAccess should be renamed as NullColumnAccessor so that the same > convention is adhered to for the accessors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10914: Description: *Updated description (by rxin on Oct 8, 2015)* To reproduce, launch Spark using {code} MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" {code} *Original bug report description*: Using an inner join, to match together two integer columns, I generally get no results when there should be matches. But the results vary and depend on whether the dataframes are coming from SQL, JSON, or cached, as well as the order in which I cache things and query them. This minimal example reproduces it consistently for me in the spark-shell, on new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from http://spark.apache.org/downloads.html.) {code} /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ val x = sql("select 1 xx union all select 2") val y = sql("select 1 yy union all select 2") x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ /* If I cache both tables it works: */ x.cache() y.cache() x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ /* but this still doesn't work: */ x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ {code} was: Updated description (by rxin on Oct 8, 2015) To reproduce, launch Spark using {code} MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" {code} *Original bug report description*: Using an inner join, to match together two integer columns, I generally get no results when there should be matches. But the results vary and depend on whether the dataframes are coming from SQL, JSON, or cached, as well as the order in which I cache things and query them. 
This minimal example reproduces it consistently for me in the spark-shell, on new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from http://spark.apache.org/downloads.html.) {code} /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ val x = sql("select 1 xx union all select 2") val y = sql("select 1 yy union all select 2") x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ /* If I cache both tables it works: */ x.cache() y.cache() x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ /* but this still doesn't work: */ x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ {code} > UnsafeRow serialization breaks when two machines have different Oops size > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > *Updated description (by rxin on Oct 8, 2015)* > To reproduce, launch Spark using > {code} > MASTER=local-cluster[2,1,1024] bin/spark-shell --conf > "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" > {code} > *Original bug report description*: > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) 
> {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10914: Description: Updated description (by rxin on Oct 8, 2015) To reproduce, launch Spark using {code} MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" {code} Original bug report description: Using an inner join, to match together two integer columns, I generally get no results when there should be matches. But the results vary and depend on whether the dataframes are coming from SQL, JSON, or cached, as well as the order in which I cache things and query them. This minimal example reproduces it consistently for me in the spark-shell, on new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from http://spark.apache.org/downloads.html.) {code} /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ val x = sql("select 1 xx union all select 2") val y = sql("select 1 yy union all select 2") x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ /* If I cache both tables it works: */ x.cache() y.cache() x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ /* but this still doesn't work: */ x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ {code} was: Using an inner join, to match together two integer columns, I generally get no results when there should be matches. But the results vary and depend on whether the dataframes are coming from SQL, JSON, or cached, as well as the order in which I cache things and query them. This minimal example reproduces it consistently for me in the spark-shell, on new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from http://spark.apache.org/downloads.html.) 
{code} /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ val x = sql("select 1 xx union all select 2") val y = sql("select 1 yy union all select 2") x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ /* If I cache both tables it works: */ x.cache() y.cache() x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ /* but this still doesn't work: */ x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ {code} > UnsafeRow serialization breaks when two machines have different Oops size > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > Updated description (by rxin on Oct 8, 2015) > To reproduce, launch Spark using > {code} > MASTER=local-cluster[2,1,1024] bin/spark-shell --conf > "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" > {code} > Original bug report description: > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) 
> {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949079#comment-14949079 ] Reynold Xin commented on SPARK-10914: - OK I figured it out. Updated the description. > UnsafeRow serialization breaks when two machines have different Oops size > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > *Updated description (by rxin on Oct 8, 2015)* > To reproduce, launch Spark using > {code} > MASTER=local-cluster[2,1,1024] bin/spark-shell --conf > "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" > {code} > And then run the following > {code} > scala> sql("select 1 xx").collect() > {code} > The problem is that UnsafeRow contains 3 pieces of information when pointing > to some data in memory (an object, a base offset, and length). When the row > is serialized with Java/Kryo serialization, the object layout in memory can > change if two machines have different pointer width (Oops in JVM). > *Original bug report description*: > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) 
> {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
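The failure mode rxin describes can be pictured without Spark: a row that carries a (base object, base offset, length) triple is only meaningful inside one JVM's memory layout, so a layout-safe wire format has to copy the referenced bytes rather than ship the triple. An illustrative sketch under that framing ({{FakeUnsafeRow}} is my stand-in, not Spark's actual {{UnsafeRow}}):

```scala
// A row that, like UnsafeRow, points into a shared buffer via
// (base object, offset, length). The triple is specific to one
// process's memory layout.
final class FakeUnsafeRow(val base: Array[Byte], val offset: Int, val length: Int) {
  // Layout-independent serialization: copy the referenced bytes instead of
  // shipping (base, offset), which a JVM with a different object layout
  // (e.g. a different Oops size) would reinterpret incorrectly.
  def toWire: Array[Byte] = base.slice(offset, offset + length)
}
```

The point is the contrast: byte-copying survives any receiver layout, while naively Java/Kryo-serializing the pointer triple only works when sender and receiver happen to lay objects out identically.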
[jira] [Updated] (SPARK-10914) UnsafeRow serialization breaks when two machines have different Oops size
[ https://issues.apache.org/jira/browse/SPARK-10914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-10914: Description: *Updated description (by rxin on Oct 8, 2015)* To reproduce, launch Spark using {code} MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" {code} And then run the following {code} scala> sql("select 1 xx").collect() {code} The problem is that UnsafeRow contains 3 pieces of information when pointing to some data in memory (an object, a base offset, and length). When the row is serialized with Java/Kryo serialization, the object layout in memory can change if two machines have different pointer width (Oops in JVM). *Original bug report description*: Using an inner join, to match together two integer columns, I generally get no results when there should be matches. But the results vary and depend on whether the dataframes are coming from SQL, JSON, or cached, as well as the order in which I cache things and query them. This minimal example reproduces it consistently for me in the spark-shell, on new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from http://spark.apache.org/downloads.html.) 
{code} /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ val x = sql("select 1 xx union all select 2") val y = sql("select 1 yy union all select 2") x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ /* If I cache both tables it works: */ x.cache() y.cache() x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ /* but this still doesn't work: */ x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ {code} was: *Updated description (by rxin on Oct 8, 2015)* To reproduce, launch Spark using {code} MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" {code} *Original bug report description*: Using an inner join, to match together two integer columns, I generally get no results when there should be matches. But the results vary and depend on whether the dataframes are coming from SQL, JSON, or cached, as well as the order in which I cache things and query them. This minimal example reproduces it consistently for me in the spark-shell, on new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from http://spark.apache.org/downloads.html.) 
{code} /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ val x = sql("select 1 xx union all select 2") val y = sql("select 1 yy union all select 2") x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ /* If I cache both tables it works: */ x.cache() y.cache() x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ /* but this still doesn't work: */ x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ {code} > UnsafeRow serialization breaks when two machines have different Oops size > - > > Key: SPARK-10914 > URL: https://issues.apache.org/jira/browse/SPARK-10914 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Ubuntu 14.04 (spark-slave), 12.04 (master) >Reporter: Ben Moran > > *Updated description (by rxin on Oct 8, 2015)* > To reproduce, launch Spark using > {code} > MASTER=local-cluster[2,1,1024] bin/spark-shell --conf > "spark.executor.extraJavaOptions=-XX:-UseCompressedOops" > {code} > And then run the following > {code} > scala> sql("select 1 xx").collect() > {code} > The problem is that UnsafeRow contains 3 pieces of information when pointing > to some data in memory (an object, a base offset, and length). When the row > is serialized with Java/Kryo serialization, the object layout in memory can > change if two machines have different pointer width (Oops in JVM). > *Original bug report description*: > Using an inner join, to match together two integer columns, I generally get > no results when there should be matches. But the results vary and depend on > whether the dataframes are coming from SQL, JSON, or cached, as well as the > order in which I cache things and query them. > This minimal example reproduces it consistently for me in the spark-shell, on > new installs of both 1.5.0 and 1.5.1 (pre-built against Hadoop 2.6 from > http://spark.apache.org/downloads.html.) 
> {code} > /* x is {"xx":1}{"xx":2} and y is just {"yy":1}{"yy:2} */ > val x = sql("select 1 xx union all select 2") > val y = sql("select 1 yy union all select 2") > x.join(y, $"xx" === $"yy").count() /* expect 2, get 0 */ > /* If I cache both tables it works: */ > x.cache() > y.cache() > x.join(y, $"xx" === $"yy").count() /* expect 2, get 2 */ > /* but this still doesn't work: */ > x.join(y, $"xx" === $"yy").filter("yy=1").count() /* expect 1, get 0 */ > {code}
[jira] [Updated] (SPARK-11008) Spark window function returns inconsistent/wrong results
[ https://issues.apache.org/jira/browse/SPARK-11008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Prasad Chalasani updated SPARK-11008: - Description: Summary: applying a windowing function on a data-frame, followed by count() gives widely varying results in repeated runs: none exceed the correct value, but of course all but one are wrong. On large data-sets I sometimes get as small as HALF of the correct value. A minimal reproducible example is here: (1) start spark-shell (2) run these: val data = 1.to(100).map(x => (x,1)) import sqlContext.implicits._ val tbl = sc.parallelize(data).toDF("id", "time") tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt") (3) exit the shell (this is important) (4) start spark-shell again (5) run these: import org.apache.spark.sql.expressions.Window val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt") val win = Window.partitionBy("id").orderBy("time") df.select($"id", (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count() I get 98, but the correct result is 100. If I re-run the code in step 5 in the same shell, then the result gets "fixed" and I always get 100. Note this is only a minimal reproducible example to reproduce the error. In my real application the size of the data is much larger and the window function is not trivial as above (i.e. there are multiple "time" values per "id", etc), and I see results sometimes as small as HALF of the correct value (e.g. 120,000 while the correct value is 250,000). So this is a serious problem. was: Summary: applying a windowing function on a data-frame, followed by count() gives widely varying results in repeated runs: none exceed the correct value, but of course all but one are wrong. On large data-sets I sometimes get as small as HALF of the correct value. 
A minimal reproducible example is here: (1) start spark-shell (2) run these: val data = 1.to(100).map(x => (x,1)) import sqlContext.implicits._ val tbl = sc.parallelize(data).toDF("id", "time") tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt") (3) exit the shell (this is important) (4) start spark-shell again (5) run these: import org.apache.spark.sql.expressions.Window val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt") val win = Window.partitionBy("id").orderBy("time") df.select($"id", (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count() I get 98, but the correct result is 100. If I re-run the above, then the result gets "fixed" and I always get 100. Note this is only a minimal reproducible example to reproduce the error. In my real application the size of the data is much larger and the window function is not trivial as above (i.e. there are multiple "time" values per "id", etc), and I see results sometimes as small as HALF of the correct value (e.g. 120,000 while the correct value is 250,000). So this is a serious problem. > Spark window function returns inconsistent/wrong results > > > Key: SPARK-11008 > URL: https://issues.apache.org/jira/browse/SPARK-11008 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.4.0, 1.5.0 > Environment: Amazon Linux AMI (Amazon Linux version 2015.09) >Reporter: Prasad Chalasani >Priority: Blocker > > Summary: applying a windowing function on a data-frame, followed by count() > gives widely varying results in repeated runs: none exceed the correct value, > but of course all but one are wrong. On large data-sets I sometimes get as > small as HALF of the correct value. 
> A minimal reproducible example is here: > (1) start spark-shell > (2) run these: > val data = 1.to(100).map(x => (x,1)) > import sqlContext.implicits._ > val tbl = sc.parallelize(data).toDF("id", "time") > tbl.write.parquet("s3n://path/to/mybucket/id-time-tiny.pqt") > (3) exit the shell (this is important) > (4) start spark-shell again > (5) run these: > import org.apache.spark.sql.expressions.Window > val df = sqlContext.read.parquet("s3n://path/to/mybucket/id-time-tiny.pqt") > val win = Window.partitionBy("id").orderBy("time") > df.select($"id", > (rank().over(win)).alias("rnk")).filter("rnk=1").select("id").count() > I get 98, but the correct result is 100. > If I re-run the code in step 5 in the same shell, then the result gets > "fixed" and I always get 100. > Note this is only a minimal reproducible example to reproduce the error. In > my real application the size of the data is much larger and the window > function is not trivial as above (i.e. there are multiple "time" values per > "id", etc), and I see results sometimes as small as HALF of the correct value > (e.g. 120,000 while the correct value is 250,000). So this is a serious problem.
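For readers without an S3 bucket, the same steps can be attempted against a local path; a sketch (the /tmp path is hypothetical, and whether the bug reproduces on local storage as well as s3n is not established here):

{code}
// steps (1)-(2): write the tiny table, then exit the shell
val data = 1.to(100).map(x => (x, 1))
import sqlContext.implicits._
val tbl = sc.parallelize(data).toDF("id", "time")
tbl.write.parquet("/tmp/id-time-tiny.pqt")

// steps (4)-(5): in a fresh shell, read it back and count rank-1 rows
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank
val df = sqlContext.read.parquet("/tmp/id-time-tiny.pqt")
val win = Window.partitionBy("id").orderBy("time")
df.select($"id", rank().over(win).alias("rnk")).filter("rnk = 1").select("id").count()
// correct result: 100 (exactly one rank-1 row per id); the report observed 98
{code}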
[jira] [Updated] (SPARK-11009) RowNumber in HiveContext returns negative values in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-11009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saif Addin Ellafi updated SPARK-11009: -- Environment: Standalone cluster mode. No hadoop/hive is present in the environment (no hive-site.xml), only using HiveContext. Spark build as with hadoop 2.6.0. Default spark configuration variables. cluster has 4 nodes, but happens with n nodes as well. (was: Standalone cluster mode No hadoop/hive is present in the environment (no hive-site.xml), only using HiveContext. Spark build as with hadoop 2.6.0. Default spark configuration variables. cluster has 4 nodes, but happens with n nodes as well.) > RowNumber in HiveContext returns negative values in cluster mode > > > Key: SPARK-11009 > URL: https://issues.apache.org/jira/browse/SPARK-11009 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 > Environment: Standalone cluster mode. No hadoop/hive is present in > the environment (no hive-site.xml), only using HiveContext. Spark build as > with hadoop 2.6.0. Default spark configuration variables. cluster has 4 > nodes, but happens with n nodes as well. >Reporter: Saif Addin Ellafi > > This issue happens when submitting the job into a standalone cluster. Have > not tried YARN or MESOS. Repartition df into 1 piece or default parallelism=1 > does not fix the issue. Also tried having only one node in the cluster, with > same result. Other shuffle configuration changes do not alter the results > either. > The issue does NOT happen in --master local[*]. > val ws = Window. > partitionBy("client_id"). 
> orderBy("date") > > val nm = "repeatMe" > df.select(df.col("*"), rowNumber().over(ws).as(nm)) > > > df.filter(df("repeatMe").isNotNull).orderBy("repeatMe").take(50).foreach(println(_)) > > ---> > > Long, DateType, Int > [219483904822,2006-06-01,-1863462909] > [219483904822,2006-09-01,-1863462909] > [219483904822,2007-01-01,-1863462909] > [219483904822,2007-08-01,-1863462909] > [219483904822,2007-07-01,-1863462909] > [192489238423,2007-07-01,-1863462774] > [192489238423,2007-02-01,-1863462774] > [192489238423,2006-11-01,-1863462774] > [192489238423,2006-08-01,-1863462774] > [192489238423,2007-08-01,-1863462774] > [192489238423,2006-09-01,-1863462774] > [192489238423,2007-03-01,-1863462774] > [192489238423,2006-10-01,-1863462774] > [192489238423,2007-05-01,-1863462774] > [192489238423,2006-06-01,-1863462774] > [192489238423,2006-12-01,-1863462774] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript
[ https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949167#comment-14949167 ] Felix Cheung commented on SPARK-10971: -- I think he is suggesting that the path to R/Rscript be configurable, not that Rscript itself be distributed > sparkR: RRunner should allow setting path to Rscript > > > Key: SPARK-10971 > URL: https://issues.apache.org/jira/browse/SPARK-10971 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Thomas Graves > > I'm running spark on yarn and trying to use R in cluster mode. RRunner seems > to just call Rscript and assumes its in the path. But on our YARN deployment > R isn't installed on the nodes so it needs to be distributed along with the > job and we need the ability to point to where it gets installed. sparkR in > client mode has the config spark.sparkr.r.command to point to Rscript. > RRunner should have something similar so it works in cluster mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10995) Graceful shutdown drops processing in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949169#comment-14949169 ] Sean Owen commented on SPARK-10995: --- Ah right, what I mean is that your _slide duration_ is equal to your window size. Isn't this the same as using a 15 minute batch duration with no windows? This is probably orthogonal anyway. I understand the point now -- it only keeps running for one more batch interval, not one more slide duration. Maybe a 15 minute batch interval is a workaround here, but it would not be in a slightly different situation. I suppose I'm asking what {{waitForShutdownSignal()}} waits for, but that may also be inconsequential. As long as it's not killing threads or initiating some other shutdown in parallel, nothing else should interfere. If you see shutdown starting normally then that seems OK. > Graceful shutdown drops processing in Spark Streaming > - > > Key: SPARK-10995 > URL: https://issues.apache.org/jira/browse/SPARK-10995 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 >Reporter: Michal Cizmazia > > After triggering the graceful shutdown on the following application, the > application stops before the windowed stream reaches its slide duration. As a > result, the data is not completely processed (i.e. saveToMyStorage is not > called) before shutdown. > According to the documentation, graceful shutdown should ensure that the > data, which has been received, is completely processed before shutdown. 
> https://spark.apache.org/docs/1.4.0/streaming-programming-guide.html#upgrading-application-code > Spark version: 1.4.1 > Code snippet: > {code:java} > Function0 factory = () -> { > JavaStreamingContext context = new JavaStreamingContext(sparkConf, > Durations.minutes(1)); > context.checkpoint("/test"); > JavaDStream records = > context.receiverStream(myReliableReceiver).flatMap(...); > records.persist(StorageLevel.MEMORY_AND_DISK()); > records.foreachRDD(rdd -> { rdd.count(); return null; }); > records > .window(Durations.minutes(15), Durations.minutes(15)) > .foreachRDD(rdd -> saveToMyStorage(rdd)); > return context; > }; > try (JavaStreamingContext context = JavaStreamingContext.getOrCreate("/test", > factory)) { > context.start(); > waitForShutdownSignal(); > Boolean stopSparkContext = true; > Boolean stopGracefully = true; > context.stop(stopSparkContext, stopGracefully); > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
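A possible interim workaround, consistent with the discussion above but not a verified fix, is to make the batch interval equal to the window and slide duration, so that the one extra batch processed during graceful shutdown covers the whole window:

{code:java}
// Sketch: 15-minute batches instead of 1-minute batches plus a 15-minute
// window. Each batch then carries the full window's data, so the final
// batch processed during graceful shutdown also reaches saveToMyStorage.
JavaStreamingContext context = new JavaStreamingContext(sparkConf, Durations.minutes(15));
context.checkpoint("/test");
JavaDStream records = context.receiverStream(myReliableReceiver).flatMap(...);
records.foreachRDD(rdd -> saveToMyStorage(rdd));
{code}

This only works because the slide duration equals the window size; overlapping windows would need a different approach.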
[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs
[ https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949171#comment-14949171 ] Glenn Strycker commented on SPARK-11004: So maybe we can simplify this idea down to forcing "disk-only shuffles" only in particular places. Spark could add a "force disk-only" parameter to the existing RDD join function so the command would look like this: rdd1.join(rdd2, diskonly = true) since it is the memory shuffles that seem to be causing my join problems. > MapReduce Hive-like join operations for RDDs > > > Key: SPARK-11004 > URL: https://issues.apache.org/jira/browse/SPARK-11004 > Project: Spark > Issue Type: New Feature >Reporter: Glenn Strycker > > Could a feature be added to Spark that would use disk-only MapReduce > operations for the very largest RDD joins? > MapReduce is able to handle incredibly large table joins in a stable, > predictable way with gracious failures and recovery. I have applications > that are able to join 2 tables without error in Hive, but these same tables, > when converted into RDDs, are unable to join in Spark (I am using the same > cluster, and have played around with all of the memory configurations, > persisting to disk, checkpointing, etc., and the RDDs are just too big for > Spark on my cluster) > So, Spark is usually able to handle fairly large RDD joins, but occasionally > runs into problems when the tables are just too big (e.g. the notorious 2GB > shuffle limit issue, memory problems, etc.) There are so many parameters to > adjust (number of partitions, number of cores, memory per core, etc.) that it > is difficult to guarantee stability on a shared cluster (say, running Yarn) > with other jobs. > Could a feature be added to Spark that would use disk-only MapReduce commands > to do very large joins? 
> That is, instead of myRDD1.join(myRDD2), we would have a special operation > myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run > MapReduce, and then convert the results of the join back into a standard RDD. > This might add stability for Spark jobs that deal with extremely large data, > and enable developers to mix-and-match some Spark and MapReduce operations in > the same program, rather than writing Hive scripts and stringing together > Spark and MapReduce programs, which has extremely large overhead to convert > RDDs to Hive tables and back again. > Despite memory-level operations being where most of Spark's speed gains lie, > sometimes using disk-only may help with stability! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs
[ https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949176#comment-14949176 ] Sean Owen commented on SPARK-11004: --- I suppose I'd be surprised if using disk over memory helped, but you can easily test this already with {{spark.shuffle.memoryFraction=0}} (or maybe it has to be a very small value). Here are the configuration knobs: http://spark.apache.org/docs/latest/configuration.html#shuffle-behavior > MapReduce Hive-like join operations for RDDs > > > Key: SPARK-11004 > URL: https://issues.apache.org/jira/browse/SPARK-11004 > Project: Spark > Issue Type: New Feature >Reporter: Glenn Strycker > > Could a feature be added to Spark that would use disk-only MapReduce > operations for the very largest RDD joins? > MapReduce is able to handle incredibly large table joins in a stable, > predictable way with gracious failures and recovery. I have applications > that are able to join 2 tables without error in Hive, but these same tables, > when converted into RDDs, are unable to join in Spark (I am using the same > cluster, and have played around with all of the memory configurations, > persisting to disk, checkpointing, etc., and the RDDs are just too big for > Spark on my cluster) > So, Spark is usually able to handle fairly large RDD joins, but occasionally > runs into problems when the tables are just too big (e.g. the notorious 2GB > shuffle limit issue, memory problems, etc.) There are so many parameters to > adjust (number of partitions, number of cores, memory per core, etc.) that it > is difficult to guarantee stability on a shared cluster (say, running Yarn) > with other jobs. > Could a feature be added to Spark that would use disk-only MapReduce commands > to do very large joins? 
> That is, instead of myRDD1.join(myRDD2), we would have a special operation > myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run > MapReduce, and then convert the results of the join back into a standard RDD. > This might add stability for Spark jobs that deal with extremely large data, > and enable developers to mix-and-match some Spark and MapReduce operations in > the same program, rather than writing Hive scripts and stringing together > Spark and MapReduce programs, which has extremely large overhead to convert > RDDs to Hive tables and back again. > Despite memory-level operations being where most of Spark's speed gains lie, > sometimes using disk-only may help with stability! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
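The suggestion above can be tried without any changes to Spark itself; a sketch of the relevant Spark 1.x settings (values are starting points to tune, not recommendations):

{code}
val conf = new SparkConf()
  .setAppName("disk-heavy-join")
  // shrink the in-memory shuffle buffer so shuffles spill to disk early
  .set("spark.shuffle.memoryFraction", "0.0")  // or a very small value
  .set("spark.shuffle.spill", "true")          // the default, shown for clarity
val sc = new SparkContext(conf)
{code}

Note these are application-level settings fixed at SparkContext creation, so they apply to every shuffle in the job, not just the largest joins.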
[jira] [Commented] (SPARK-10903) Make sqlContext global
[ https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949185#comment-14949185 ] Felix Cheung commented on SPARK-10903: -- [~sunrui] Agreed. I'd like to propose adding .Deprecated to the existing functions too, and plan to move away from those in the next 1.x release. Thoughts? [~shivaram][~davies][~falaki] > Make sqlContext global > --- > > Key: SPARK-10903 > URL: https://issues.apache.org/jira/browse/SPARK-10903 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > Make sqlContext global so that we don't have to always specify it. > e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11004) MapReduce Hive-like join operations for RDDs
[ https://issues.apache.org/jira/browse/SPARK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949187#comment-14949187 ] Glenn Strycker commented on SPARK-11004: Awesome -- thanks, I'll try that out. Is there a way to change this setting dynamically from within a Spark job, so that the fraction can be higher for most of the job and then drop down to 0 only for the difficult parts? > MapReduce Hive-like join operations for RDDs > > > Key: SPARK-11004 > URL: https://issues.apache.org/jira/browse/SPARK-11004 > Project: Spark > Issue Type: New Feature >Reporter: Glenn Strycker > > Could a feature be added to Spark that would use disk-only MapReduce > operations for the very largest RDD joins? > MapReduce is able to handle incredibly large table joins in a stable, > predictable way with gracious failures and recovery. I have applications > that are able to join 2 tables without error in Hive, but these same tables, > when converted into RDDs, are unable to join in Spark (I am using the same > cluster, and have played around with all of the memory configurations, > persisting to disk, checkpointing, etc., and the RDDs are just too big for > Spark on my cluster) > So, Spark is usually able to handle fairly large RDD joins, but occasionally > runs into problems when the tables are just too big (e.g. the notorious 2GB > shuffle limit issue, memory problems, etc.) There are so many parameters to > adjust (number of partitions, number of cores, memory per core, etc.) that it > is difficult to guarantee stability on a shared cluster (say, running Yarn) > with other jobs. > Could a feature be added to Spark that would use disk-only MapReduce commands > to do very large joins? > That is, instead of myRDD1.join(myRDD2), we would have a special operation > myRDD1.mapReduceJoin(myRDD2) that would checkpoint both RDDs to disk, run > MapReduce, and then convert the results of the join back into a standard RDD. 
> This might add stability for Spark jobs that deal with extremely large data, > and enable developers to mix-and-match some Spark and MapReduce operations in > the same program, rather than writing Hive scripts and stringing together > Spark and MapReduce programs, which has extremely large overhead to convert > RDDs to Hive tables and back again. > Despite memory-level operations being where most of Spark's speed gains lie, > sometimes using disk-only may help with stability! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript
[ https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949199#comment-14949199 ] Thomas Graves commented on SPARK-10971: --- You shouldn't have to install everything a user needs on the YARN nodes. That can cause many different types of issues, the main ones being version conflicts and a maintenance headache. Perhaps there are distributions that don't recommend this, or cases where you want things installed for performance reasons, but a general-use YARN cluster needs to allow users to ship their dependencies with their applications. The only downside is that, if you aren't using the distributed cache properly, there is overhead in downloading those dependencies. So yes, I am just suggesting that the path to Rscript be configurable: you should be able to set a config like spark.sparkr.r.command to point to where Rscript is located. > sparkR: RRunner should allow setting path to Rscript > > > Key: SPARK-10971 > URL: https://issues.apache.org/jira/browse/SPARK-10971 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Thomas Graves > > I'm running spark on yarn and trying to use R in cluster mode. RRunner seems > to just call Rscript and assumes its in the path. But on our YARN deployment > R isn't installed on the nodes so it needs to be distributed along with the > job and we need the ability to point to where it gets installed. sparkR in > client mode has the config spark.sparkr.r.command to point to Rscript. > RRunner should have something similar so it works in cluster mode -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
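Concretely, if RRunner honored such a setting, a cluster-mode submission might look like the following sketch (the archive name, the {{#rlib}} rename, and reusing {{spark.sparkr.r.command}} in cluster mode are all assumptions; today that config only applies in client mode):

{code}
# ship a private R installation with the job and point Spark at its Rscript
spark-submit --master yarn-cluster \
  --archives R-installation.tgz#rlib \
  --conf spark.sparkr.r.command=./rlib/bin/Rscript \
  my_analysis.R
{code}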
[jira] [Commented] (SPARK-10382) Make example code in user guide testable
[ https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949326#comment-14949326 ] Xusen Yin commented on SPARK-10382: --- Or I can customize it for easy use in the Spark project. > Make example code in user guide testable > > > Key: SPARK-10382 > URL: https://issues.apache.org/jira/browse/SPARK-10382 > Project: Spark > Issue Type: Brainstorming > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin >Priority: Critical > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > One option I propose is to move actual example code to spark/examples and > test compilation in Jenkins builds. Then in the markdown, we can reference > part of the code to show in the user guide. This requires adding a Jekyll tag > that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code} > {% include_example scala ml.KMeansExample guide %} > {code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` > and pick code blocks marked "guide" and put them under `{% highlight %}` in > the markdown. We can discuss the syntax for marker comments. > Just one way to implement this. It would be nice to hear more ideas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8546) PMML export for Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-8546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949361#comment-14949361 ] Xusen Yin commented on SPARK-8546: -- Hi [~mengxr], I'd like to work on it. > PMML export for Naive Bayes > --- > > Key: SPARK-8546 > URL: https://issues.apache.org/jira/browse/SPARK-8546 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > The naive Bayes section of PMML standard can be found at > http://www.dmg.org/v4-1/NaiveBayes.html. We should first figure out how to > generate PMML for both binomial and multinomial naive Bayes models using > JPMML (maybe [~vfed] can help). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10858) YARN: archives/jar/files rename with # doesn't work unless scheme given
[ https://issues.apache.org/jira/browse/SPARK-10858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949372#comment-14949372 ] Apache Spark commented on SPARK-10858: -- User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/9035 > YARN: archives/jar/files rename with # doesn't work unless scheme given > --- > > Key: SPARK-10858 > URL: https://issues.apache.org/jira/browse/SPARK-10858 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Minor > > The YARN distributed cache feature with --jars, --archives, --files where you > can rename the file/archive using a # symbol only works if you explicitly > include the scheme in the path: > works: > --jars file:///home/foo/my.jar#renamed.jar > doesn't work: > --jars /home/foo/my.jar#renamed.jar > Exception in thread "main" java.io.FileNotFoundException: File > file:/home/foo/my.jar#renamed.jar does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:534) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:416) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337) > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289) > at > org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:240) > at > org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:329) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:393) > at > org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$6$$anonfun$apply$2.apply(Client.scala:392) > at > 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11015) Add computeCost and clusterCenters to KMeansModel in spark.ml package
Richard Garris created SPARK-11015: -- Summary: Add computeCost and clusterCenters to KMeansModel in spark.ml package Key: SPARK-11015 URL: https://issues.apache.org/jira/browse/SPARK-11015 Project: Spark Issue Type: Improvement Components: ML Reporter: Richard Garris The Transformer version of KMeansModel does not currently have methods to compute the cost or get the centers of the cluster centroids. If there were a way to get these, either by exposing the parentModel or by adding these methods, it would make things easier. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
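A sketch of what the requested surface might look like, following the reporter's suggested route of delegating to the wrapped mllib model (hypothetical code, not an existing 1.5 API):

{code}
// hypothetical additions to org.apache.spark.ml.clustering.KMeansModel,
// assuming the private parentModel (an mllib KMeansModel) is reachable
def clusterCenters: Array[Vector] = parentModel.clusterCenters

def computeCost(dataset: DataFrame): Double = {
  val vectors = dataset.select(col($(featuresCol))).map { case Row(v: Vector) => v }
  parentModel.computeCost(vectors)
}
{code}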
[jira] [Commented] (SPARK-5949) Driver program has to register roaring bitmap classes used by spark with Kryo when number of partitions is greater than 2000
[ https://issues.apache.org/jira/browse/SPARK-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949525#comment-14949525 ] Charles Allen commented on SPARK-5949: -- [~lemire] pinging to see if you have any suggestions on how to handle situations like this. > Driver program has to register roaring bitmap classes used by spark with Kryo > when number of partitions is greater than 2000 > > > Key: SPARK-5949 > URL: https://issues.apache.org/jira/browse/SPARK-5949 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Peter Torok >Assignee: Imran Rashid > Labels: kryo, partitioning, serialization > Fix For: 1.4.0 > > > When more than 2000 partitions are being used with Kryo, the following > classes need to be registered by driver program: > - org.apache.spark.scheduler.HighlyCompressedMapStatus > - org.roaringbitmap.RoaringBitmap > - org.roaringbitmap.RoaringArray > - org.roaringbitmap.ArrayContainer > - org.roaringbitmap.RoaringArray$Element > - org.roaringbitmap.RoaringArray$Element[] > - short[] > Our project doesn't have dependency on roaring bitmap and > HighlyCompressedMapStatus is intended for internal spark usage. Spark should > take care of this registration when Kryo is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
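Until Spark performs this registration itself, the usual workaround is to pre-register the classes from the report via configuration. A hedged spark-defaults.conf fragment (whether every class can be named this way may vary by Spark and RoaringBitmap version, as the comment above notes):

```
spark.serializer                 org.apache.spark.serializer.KryoSerializer
# Workaround: pre-register the internals Spark ships for >2000 partitions.
# Array types (RoaringArray$Element[], short[]) may need programmatic
# registration via SparkConf.registerKryoClasses instead.
spark.kryo.classesToRegister     org.apache.spark.scheduler.HighlyCompressedMapStatus,org.roaringbitmap.RoaringBitmap,org.roaringbitmap.RoaringArray,org.roaringbitmap.ArrayContainer,org.roaringbitmap.RoaringArray$Element
```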
[jira] [Commented] (SPARK-10382) Make example code in user guide testable
[ https://issues.apache.org/jira/browse/SPARK-10382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949324#comment-14949324 ] Xusen Yin commented on SPARK-10382: --- Hi Xiangrui, As far as I know, several Jekyll plugins have already implemented this function. Here are the links: - https://github.com/imathis/octopress/blob/master/plugins/include_code.rb This one inserts code from a local file - https://github.com/bwillis/jekyll-github-sample This one inserts code from GitHub - https://github.com/octopress/render-code This one inserts code from a local file The first one is exactly what we want, so I think that, instead of writing a new one, we can use it. What do you think? > Make example code in user guide testable > > > Key: SPARK-10382 > URL: https://issues.apache.org/jira/browse/SPARK-10382 > Project: Spark > Issue Type: Brainstorming > Components: Documentation, ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xusen Yin >Priority: Critical > > The example code in the user guide is embedded in the markdown and hence it > is not easy to test. It would be nice to automatically test them. This JIRA > is to discuss options to automate example code testing and see what we can do > in Spark 1.6. > One option I propose is to move actual example code to spark/examples and > test compilation in Jenkins builds. Then in the markdown, we can reference > part of the code to show in the user guide. This requires adding a Jekyll tag > that is similar to > https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, > e.g., called include_example. > {code} > {% include_example scala ml.KMeansExample guide %} > {code} > Jekyll will find > `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` > and pick code blocks marked "guide" and put them under `{% highlight %}` in > the markdown. We can discuss the syntax for marker comments. > Just one way to implement this. It would be nice to hear more ideas.
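The proposed include_example tag boils down to pulling marked regions out of a source file. A rough sketch of that extraction step (the marker strings are placeholders; the JIRA explicitly leaves the comment syntax open for discussion):

```java
import java.util.ArrayList;
import java.util.List;

public class IncludeExample {
    // Collects the lines between the "on" and "off" marker comments,
    // dropping the markers themselves and everything outside them.
    static List<String> extractMarked(List<String> lines, String on, String off) {
        List<String> out = new ArrayList<>();
        boolean inside = false;
        for (String line : lines) {
            if (line.trim().equals(on))  { inside = true;  continue; }
            if (line.trim().equals(off)) { inside = false; continue; }
            if (inside) out.add(line);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> source = List.of(
            "import org.apache.spark.ml.clustering.KMeans",
            "// $example on$",
            "val model = new KMeans().setK(2).fit(dataset)",
            "// $example off$",
            "// boilerplate the guide should not show");
        extractMarked(source, "// $example on$", "// $example off$")
            .forEach(System.out::println);
    }
}
```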
[jira] [Created] (SPARK-11014) RPC Time Out Exceptions
Gurpreet Singh created SPARK-11014: -- Summary: RPC Time Out Exceptions Key: SPARK-11014 URL: https://issues.apache.org/jira/browse/SPARK-11014 Project: Spark Issue Type: Bug Affects Versions: 1.5.1 Environment: YARN Reporter: Gurpreet Singh I am seeing lots of the following RPC exception messages in YARN logs: 15/10/08 13:04:27 WARN executor.Executor: Issue communicating with driver in heartbeater org.apache.spark.SparkException: Error sending message [message = Heartbeat(437,[Lscala.Tuple2;@34199eb1,BlockManagerId(437, phxaishdc9dn1294.stratus.phx.ebay.com, 47480))] at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:118) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:452) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:472) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:472) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:472) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. 
This timeout is controlled by spark.rpc.askTimeout at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) ... 14 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241) ... 15 more ## -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
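As the message notes, the 120-second limit comes from `spark.rpc.askTimeout`. When the timeouts are caused by load rather than a hung driver, one common mitigation is raising that timeout; a hedged spark-defaults.conf fragment (values are illustrative, not recommendations):

```
# Illustrative values only -- tune to the cluster.
spark.rpc.askTimeout             300s
# Heartbeats are sent on this interval; keep it well below the timeout.
spark.executor.heartbeatInterval 10s
```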
[jira] [Commented] (SPARK-6723) Model import/export for ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949548#comment-14949548 ] Jayant Shekhar commented on SPARK-6723: --- Thanks [~fliang] [~mengxr]. Can you trigger tests on the PR? Thanks. > Model import/export for ChiSqSelector > - > > Key: SPARK-6723 > URL: https://issues.apache.org/jira/browse/SPARK-6723 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor >
[jira] [Commented] (SPARK-10936) UDAF "mode" for categorical variables
[ https://issues.apache.org/jira/browse/SPARK-10936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949306#comment-14949306 ] Frederick Reiss commented on SPARK-10936: - Mode is not an algebraic aggregate. To find the mode in a single pass over the original data, one needs to track the full set of distinct values in the underlying set, as well as the number of times each value occurs in the records seen so far. For high-cardinality columns, this requirement will result in unbounded state. I can see three ways forward here: a) Stuff a hash table into the PartialAggregate2 API's result buffer, and hope that this buffer does not exhaust the heap, produce O(n^2) behavior when the column cardinality is high, or stop working on future (or present?) versions of codegen. b) Implement an approximate mode with fixed-size intermediate state (for example, a compressed reservoir sample), similar to how the current HyperLogLog++ aggregate works. Approximate computation of the mode will work well most of the time but will have unbounded error in corner cases. c) Add mode as another member of the "distinct" family of Spark aggregates, such as SUM/COUNT/AVERAGE DISTINCT. Use the same pre-Tungsten style of processing to handle mode for now. Create a follow-on JIRA to move mode over to the fast path at the same time that the other DISTINCT aggregates switch over. I think that (c) is the best option overall, but I'm happy to defer to others with deeper understanding. My thinking is that, while it would be good to have a mode aggregate available, mode is a relatively uncommon use case. Slow-path processing for mode is ok as a short-term expedient. Once SUM DISTINCT and related aggregates are fully moved onto the new framework, transitioning mode to the fast path should be easy. 
> UDAF "mode" for categorical variables > - > > Key: SPARK-10936 > URL: https://issues.apache.org/jira/browse/SPARK-10936 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Xiangrui Meng > > This is similar to frequent items except that we don't have a threshold on > the frequency. So an exact implementation might require a global shuffle. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11013) SparkPlan may mistakenly register child plan's accumulators for SQL metrics
Wenchen Fan created SPARK-11013: --- Summary: SparkPlan may mistakenly register child plan's accumulators for SQL metrics Key: SPARK-11013 URL: https://issues.apache.org/jira/browse/SPARK-11013 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan The reason is that when we call the RDD API inside a SparkPlan, we are very likely to reference the SparkPlan in the closure and thus serialize and transfer a whole SparkPlan tree to the executor side. When we deserialize it, the accumulators in the child SparkPlans are also deserialized and registered, and always report a zero value. This is not a problem currently, because we only have one operation to aggregate the accumulators: add. However, if we want to support more complex metrics like min, the extra zero values will lead to wrong results. Taking TungstenAggregate as an example, I logged "stageId, partitionId, accumName, accumId" when an accumulator is deserialized and registered, and logged the "accumId -> accumValue" map when a task ends. The output is: {code} scala> val df = Seq(1 -> "a", 2 -> "b").toDF("a", "b").groupBy().count() df: org.apache.spark.sql.DataFrame = [count: bigint] scala> df.collect register: 0 0 Some(number of input rows) 4 register: 0 0 Some(number of output rows) 5 register: 1 0 Some(number of input rows) 4 register: 1 0 Some(number of output rows) 5 register: 1 0 Some(number of input rows) 2 register: 1 0 Some(number of output rows) 3 Map(5 -> 1, 4 -> 2, 6 -> 4458496) Map(5 -> 0, 2 -> 1, 7 -> 4458496, 3 -> 1, 4 -> 0) res0: Array[org.apache.spark.sql.Row] = Array([2]) {code} The best choice is to avoid serializing and deserializing a SparkPlan tree at all, which can be achieved with LocalNode. Alternatively, we can work around the serialization problem in the problematic SparkPlans like TungstenAggregate and TungstenSort, or improve the SQL metrics framework to make it more robust to this case.
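A toy model of the mechanism, outside Spark: with Java serialization, deserializing a parent object also deserializes its children, so any register-on-read side effect fires once per node in the tree. Every name here is an invented stand-in for SparkPlan and its metric accumulators, not Spark code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.concurrent.atomic.AtomicInteger;

public class RegistrationDemo {
    static final AtomicInteger registrations = new AtomicInteger();

    // Stand-in for a SQL metric accumulator: registers itself on deserialization.
    static class Metric implements Serializable {
        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            registrations.incrementAndGet(); // side effect fires per instance
        }
    }

    // Stand-in for a SparkPlan node holding its own metric plus a child plan.
    static class Plan implements Serializable {
        final Metric metric = new Metric();
        final Plan child;
        Plan(Plan child) { this.child = child; }
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        Plan tree = new Plan(new Plan(null)); // parent plan with one child
        new ObjectInputStream(new ByteArrayInputStream(serialize(tree))).readObject();
        // Both the parent's and the child's metrics got registered, even
        // though only the parent was meant to run here.
        System.out.println(registrations.get()); // 2
    }
}
```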
[jira] [Commented] (SPARK-8654) Analysis exception when using "NULL IN (...)": invalid cast
[ https://issues.apache.org/jira/browse/SPARK-8654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949360#comment-14949360 ] Apache Spark commented on SPARK-8654: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/9034 > Analysis exception when using "NULL IN (...)": invalid cast > --- > > Key: SPARK-8654 > URL: https://issues.apache.org/jira/browse/SPARK-8654 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Santiago M. Mola >Priority: Minor > Fix For: 1.6.0 > > > The following query throws an analysis exception: > {code} > SELECT * FROM t WHERE NULL NOT IN (1, 2, 3); > {code} > The exception is: > {code} > org.apache.spark.sql.AnalysisException: invalid cast from int to null; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:66) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52) > {code} > Here is a test that can be added to AnalysisSuite to check the issue: > {code} > test("SPARK- regression test") { > val plan = Project(Alias(In(Literal(null), Seq(Literal(1), Literal(2))), > "a")() :: Nil, > LocalRelation() > ) > caseInsensitiveAnalyze(plan) > } > {code} > Note that this kind of query is a corner case, but it is still valid SQL. An > expression such as "NULL IN (...)" or "NULL NOT IN (...)" always gives NULL > as a result, even if the list contains NULL. So it is safe to translate these > expressions to Literal(null) during analysis. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
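The three-valued semantics the last paragraph relies on can be written out directly: NULL IN (...) is always NULL because the comparison never yields TRUE or FALSE, and a NULL element in the list can only turn a non-match into NULL, never into TRUE. A small model of those rules (plain Java, not Catalyst):

```java
import java.util.Arrays;
import java.util.List;

public class NullIn {
    // SQL three-valued semantics for "x IN (list)", with Java null as SQL NULL.
    static Boolean in(Integer x, List<Integer> list) {
        if (x == null) return null;          // NULL IN (...) is always NULL
        boolean sawNull = false;
        for (Integer v : list) {
            if (v == null) { sawNull = true; continue; }
            if (v.equals(x)) return true;    // a real match wins
        }
        return sawNull ? null : false;       // a NULL element blocks FALSE
    }

    public static void main(String[] args) {
        System.out.println(in(null, Arrays.asList(1, 2, 3))); // null
        System.out.println(in(4, Arrays.asList(1, null)));    // null
        System.out.println(in(1, Arrays.asList(1, null)));    // true
    }
}
```

Since a WHERE clause keeps only rows where the predicate is TRUE, folding both `NULL IN (...)` and `NULL NOT IN (...)` to a null literal during analysis, as suggested above, preserves these semantics.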
[jira] [Resolved] (SPARK-10988) Reduce duplication in Aggregate2's expression rewriting logic
[ https://issues.apache.org/jira/browse/SPARK-10988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-10988. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9015 [https://github.com/apache/spark/pull/9015] > Reduce duplication in Aggregate2's expression rewriting logic > - > > Key: SPARK-10988 > URL: https://issues.apache.org/jira/browse/SPARK-10988 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.0 > > > In `aggregate/utils.scala`, there is a substantial amount of duplication in > the expression-rewriting logic. As a prerequisite to supporting imperative > aggregate functions in `TungstenAggregate`, we should refactor this file so > that the same expression-rewriting logic is used for both `SortAggregate` and > `TungstenAggregate`.
[jira] [Commented] (SPARK-5949) Driver program has to register roaring bitmap classes used by spark with Kryo when number of partitions is greater than 2000
[ https://issues.apache.org/jira/browse/SPARK-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949518#comment-14949518 ] Charles Allen commented on SPARK-5949: -- This breaks when using more recent versions of Roaring where org.roaringbitmap.RoaringArray$Element is no longer present. The following stack trace appears: {code} A needed class was not found. This could be due to an error in your runpath. Missing class: org/roaringbitmap/RoaringArray$Element java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element at org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338) at org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala) at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93) at org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237) at org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222) at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138) at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102) at org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318) at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006) at org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) at 
org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003) at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818) at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) at org.apache.spark.SparkContext.textFile(SparkContext.scala:816) at io.druid.indexer.spark.SparkDruidIndexer$$anonfun$2.apply(SparkDruidIndexer.scala:84) at io.druid.indexer.spark.SparkDruidIndexer$$anonfun$2.apply(SparkDruidIndexer.scala:84) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at io.druid.indexer.spark.SparkDruidIndexer$.loadData(SparkDruidIndexer.scala:84) at io.druid.indexer.spark.TestSparkDruidIndexer$$anonfun$1.apply$mcV$sp(TestSparkDruidIndexer.scala:131) at io.druid.indexer.spark.TestSparkDruidIndexer$$anonfun$1.apply(TestSparkDruidIndexer.scala:40) at io.druid.indexer.spark.TestSparkDruidIndexer$$anonfun$1.apply(TestSparkDruidIndexer.scala:40) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FlatSpecLike$$anon$1.apply(FlatSpecLike.scala:1647) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FlatSpec.withFixture(FlatSpec.scala:1683) at 
org.scalatest.FlatSpecLike$class.invokeWithFixture$1(FlatSpecLike.scala:1644) at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656) at org.scalatest.FlatSpecLike$$anonfun$runTest$1.apply(FlatSpecLike.scala:1656) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FlatSpecLike$class.runTest(FlatSpecLike.scala:1656) at org.scalatest.FlatSpec.runTest(FlatSpec.scala:1683) at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714) at org.scalatest.FlatSpecLike$$anonfun$runTests$1.apply(FlatSpecLike.scala:1714) at