[jira] [Updated] (SPARK-17617) Remainder(%) expression.eval returns incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-17617:
--------------------------------
    Fix Version/s: 1.6.3

> Remainder(%) expression.eval returns incorrect result
> -----------------------------------------------------
>
> Key: SPARK-17617
> URL: https://issues.apache.org/jira/browse/SPARK-17617
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Sean Zhong
> Assignee: Sean Zhong
> Labels: correctness
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
> h2. Problem
>
> The Remainder (%) expression returns an incorrect result when expression.eval is used to compute it. expression.eval is called in cases like constant folding.
>
> {code}
> scala> -5083676433652386516D % 10
> res19: Double = -6.0
>
> // Wrong answer with eval!
> scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |         0.0|
> +------------+
>
> // Triggers codegen, which does not do constant folding
> scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |        -6.0|
> +------------+
> {code}
>
> Behavior of Postgres:
> {code}
> seanzhong=# select -5083676433652386516.0 % 10;
>  ?column?
> ----------
>      -6.0
> (1 row)
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
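For context on the expected value: the result that the codegen path and Postgres agree on is simply the JVM's IEEE-754 `%` applied after the oversized literal is rounded to the nearest representable `double` (Scala's `Double` `%` has the same JVM semantics). A minimal plain-Java check, shown here as an illustration rather than Spark's own code:

```java
public class RemainderDemo {
    public static void main(String[] args) {
        // The literal below exceeds 2^53, so it is first rounded to the
        // nearest representable double; the JVM's IEEE-754 '%' then yields:
        double r = -5083676433652386516D % 10;
        System.out.println(r); // -6.0, matching the codegen path and Postgres in the repro above
    }
}
```

The eval path in the ticket disagrees with this JVM result, which is what makes it a correctness bug rather than a rounding quirk.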
[jira] [Commented] (SPARK-17621) Accumulator value is doubled when using DataFrame.orderBy()
[ https://issues.apache.org/jira/browse/SPARK-17621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509301#comment-15509301 ]

Sreelal S L commented on SPARK-17621:
-------------------------------------

Hi. Our actual code is a bit different from what I have given. We use streaming, and use transform() to reuse a couple of the DataFrame operations from other parts of our code base. I don't have much control to change the code there (worst case, I'll have to make changes there), but something feels wrong here. I hit the issue there and was trying out samples to figure out exactly where it comes from.

It looks like slightly unexpected behaviour, since if it works for groupBy(), the behaviour should be the same for orderBy(). Also, the map() which increments the accumulator is invoked only once, so it has something to do with the stage result being calculated twice. I could understand the map() adding to the accumulator twice if a task failure happened, but that is not the case here: all tasks are successful, and the map() doing the accumulator addition is called only once.

> Accumulator value is doubled when using DataFrame.orderBy()
> -----------------------------------------------------------
>
> Key: SPARK-17621
> URL: https://issues.apache.org/jira/browse/SPARK-17621
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, SQL
> Affects Versions: 2.0.0
> Environment: Development environment (Eclipse, single process)
> Reporter: Sreelal S L
> Priority: Minor
>
> We are tracing the records read by our source using an accumulator. We do an orderBy on the DataFrame before the output operation. When the job completes, the accumulator value is double the expected value.
>
> Below is the sample code I ran:
> {code}
> val sqlContext = SparkSession.builder()
>   .config("spark.sql.retainGroupColumns", false)
>   .config("spark.sql.warehouse.dir", "file:///C:/Test")
>   .master("local[*]")
>   .getOrCreate()
> val sc = sqlContext.sparkContext
> val accumulator1 = sc.accumulator(0, "accumulator1")
> val usersDF = sqlContext.read.json("C:\\users.json") // single row: {"name":"sreelal", "country":"IND"}
> val usersDFwithCount = usersDF.rdd.map(x => { accumulator1 += 1; x })
> val counterDF = sqlContext.createDataFrame(usersDFwithCount, usersDF.schema)
> val orderedDF = counterDF.orderBy("name")
> val collected = orderedDF.collect()
> collected.foreach { x => println(x) }
> println("accumulator1 : " + accumulator1.value)
> println("Done")
> {code}
>
> I have only one row in the users.json file. I expect accumulator1 to have value 1, but it comes out as 2.
> In the Spark SQL UI, I see two jobs generated for the same action.
[jira] [Resolved] (SPARK-17599) Folder deletion after globbing may fail StructuredStreaming jobs
[ https://issues.apache.org/jira/browse/SPARK-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-17599.
---------------------------------
    Resolution: Fixed
    Assignee: Burak Yavuz
    Fix Version/s: 2.1.0

> Folder deletion after globbing may fail StructuredStreaming jobs
> ----------------------------------------------------------------
>
> Key: SPARK-17599
> URL: https://issues.apache.org/jira/browse/SPARK-17599
> Project: Spark
> Issue Type: Bug
> Components: SQL, Streaming
> Affects Versions: 2.0.0
> Reporter: Burak Yavuz
> Assignee: Burak Yavuz
> Fix For: 2.1.0
>
> The FileStreamSource used by StructuredStreaming first resolves globs and then creates a ListingFileCatalog, which lists files with the resolved glob patterns. If a folder is deleted after glob resolution but before the ListingFileCatalog can list the files, we can run into a 'FileNotFoundException'.
> This should not be a fatal exception for a streaming job; however, we should log a warning.
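The direction the ticket suggests (treat a listing that races with a folder deletion as an empty result plus a warning, rather than a fatal error) can be sketched with plain `java.nio.file`. The helper name `listOrEmpty` is illustrative only, not Spark's actual API:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TolerantListing {
    // Hypothetical helper: list a directory, but treat "deleted between glob
    // resolution and listing" as an empty result with a warning, not a failure.
    static List<Path> listOrEmpty(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.collect(Collectors.toList());
        } catch (NoSuchFileException e) {
            System.err.println("WARN: directory vanished before listing, skipping: " + dir);
            return Collections.emptyList();
        }
    }

    public static void main(String[] args) throws IOException {
        // A path that does not exist stands in for the deleted folder:
        System.out.println(listOrEmpty(Paths.get("/no/such/dir-spark17599")).size()); // 0
    }
}
```

Other I/O errors still propagate; only the specific missing-directory race is downgraded to a warning, mirroring the "non-fatal but logged" behavior the issue asks for.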
[jira] [Updated] (SPARK-17219) QuantileDiscretizer does strange things with NaN values
[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-17219:
------------------------------
    Assignee: Vincent

> QuantileDiscretizer does strange things with NaN values
> -------------------------------------------------------
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.6.2
> Reporter: Barry Becker
> Assignee: Vincent
> Fix For: 2.1.0
>
> How is the QuantileDiscretizer supposed to handle null values? Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that contains NaNs, it will create (possibly more than one) NaN split(s) before the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column using the QuantileDiscretizer with 10 bins specified. The age column has a lot of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket all their own. My suggestion would be to include an initial NaN split value that is always there, just like the sentinel Infinities are. If that were the case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either, because a bucket that is [NaN, -Inf] doesn't make much sense. I'm not sure if the NaN bucket counts toward numBins or not. I do think it should always be there, though, in case future data has nulls even though the fit data did not. Thoughts?
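One correction to the reporter's reading: on the JVM, NaN does not sort "less than infinity" but after it. `Double.compare` and `Arrays.sort` both treat NaN as greater than every other value, including +Infinity, which is why the NaN splits land at the high end. A quick plain-Java check, independent of Spark:

```java
import java.util.Arrays;

public class NaNOrdering {
    public static void main(String[] args) {
        // Double.compare treats NaN as greater than everything, even +Infinity:
        System.out.println(Double.compare(Double.NaN, Double.POSITIVE_INFINITY)); // positive

        // Arrays.sort(double[]) uses the same total order, so NaNs end up last:
        double[] xs = {Double.NaN, 1.0, Double.POSITIVE_INFINITY, -2.0};
        Arrays.sort(xs);
        System.out.println(Arrays.toString(xs)); // [-2.0, 1.0, Infinity, NaN]
    }
}
```

So the observed split array is consistent with JVM ordering; the open design question is only whether NaN deserves its own bucket.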
[jira] [Resolved] (SPARK-17583) Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
[ https://issues.apache.org/jira/browse/SPARK-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-17583.
-------------------------------
    Resolution: Fixed
    Fix Version/s: 2.1.0

Issue resolved by pull request 15138
[https://github.com/apache/spark/pull/15138]

> Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
> ----------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-17583
> URL: https://issues.apache.org/jira/browse/SPARK-17583
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Hyukjin Kwon
> Priority: Minor
> Fix For: 2.1.0
>
> This JIRA includes the several changes below:
>
> 1. Upgrade the Univocity library from 2.1.1 to 2.2.1
> This includes some performance improvements and also enables the auto-expanding buffer for the {{maxCharsPerColumn}} option in CSV. Please refer to the [release notes|https://github.com/uniVocity/univocity-parsers/releases].
>
> 2. Remove the {{rowSeparator}} variable in {{CSVOptions}}
> We have this variable in [CSVOptions|https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127], but it seems to cause confusion because it does not actually handle {{\r\n}}; for example, SPARK-17227 is an open issue describing this variable. The option is virtually unused because we rely on Hadoop's {{LineRecordReader}}, which deals with both {{\n}} and {{\r\n}}.
>
> 3. Set the default value of {{maxCharsPerColumn}} to auto-expanding
> We are setting 100 for the length of each column. It would be more sensible to allow the buffer to auto-expand rather than use a fixed length by default.
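The trade-off behind item 3, that a fixed per-column buffer fails hard on an unexpectedly long field while a growable buffer simply expands, can be illustrated in plain Java. This is only an analogy for the parser setting, not univocity's implementation:

```java
public class BufferDemo {
    public static void main(String[] args) {
        String field = "x".repeat(2000);  // a CSV field longer than the fixed limit

        // Fixed-capacity buffer: overflow is a hard error, like hitting
        // a fixed maxCharsPerColumn limit mid-parse.
        char[] fixed = new char[1000];
        try {
            field.getChars(0, field.length(), fixed, 0);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("fixed buffer overflow");
        }

        // Auto-expanding buffer: just grows to fit the field.
        StringBuilder growable = new StringBuilder(1000);
        growable.append(field);
        System.out.println(growable.length()); // 2000
    }
}
```

With the auto-expanding default, users no longer need to guess an upper bound for the longest field in their data.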
[jira] [Updated] (SPARK-17583) Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
[ https://issues.apache.org/jira/browse/SPARK-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-17583:
------------------------------
    Assignee: Hyukjin Kwon

> Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV
> ----------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-17583
> URL: https://issues.apache.org/jira/browse/SPARK-17583
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Minor
> Fix For: 2.1.0
>
> This JIRA includes the several changes below:
>
> 1. Upgrade the Univocity library from 2.1.1 to 2.2.1
> This includes some performance improvements and also enables the auto-expanding buffer for the {{maxCharsPerColumn}} option in CSV. Please refer to the [release notes|https://github.com/uniVocity/univocity-parsers/releases].
>
> 2. Remove the {{rowSeparator}} variable in {{CSVOptions}}
> We have this variable in [CSVOptions|https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127], but it seems to cause confusion because it does not actually handle {{\r\n}}; for example, SPARK-17227 is an open issue describing this variable. The option is virtually unused because we rely on Hadoop's {{LineRecordReader}}, which deals with both {{\n}} and {{\r\n}}.
>
> 3. Set the default value of {{maxCharsPerColumn}} to auto-expanding
> We are setting 100 for the length of each column. It would be more sensible to allow the buffer to auto-expand rather than use a fixed length by default.
[jira] [Commented] (SPARK-17621) Accumulator value is doubled when using DataFrame.orderBy()
[ https://issues.apache.org/jira/browse/SPARK-17621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509213#comment-15509213 ]

Sreelal S L commented on SPARK-17621:
-------------------------------------

Hi Sean, thanks for your quick reply. I didn't understand what you meant by "evaluating usersDFwithCount twice". Does creating a DataFrame from an existing RDD fire an extra job?

One more catch: I am observing this only for orderBy(). If I try a groupBy instead, i.e. counterDF.groupBy("name").count().collect(), the accumulator value is correct. In the groupBy case I also create the DataFrame from the RDD. What could be the difference here?

> Accumulator value is doubled when using DataFrame.orderBy()
> -----------------------------------------------------------
>
> Key: SPARK-17621
> URL: https://issues.apache.org/jira/browse/SPARK-17621
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, SQL
> Affects Versions: 2.0.0
> Environment: Development environment (Eclipse, single process)
> Reporter: Sreelal S L
> Priority: Minor
>
> We are tracing the records read by our source using an accumulator. We do an orderBy on the DataFrame before the output operation. When the job completes, the accumulator value is double the expected value.
>
> Below is the sample code I ran:
> {code}
> val sqlContext = SparkSession.builder()
>   .config("spark.sql.retainGroupColumns", false)
>   .config("spark.sql.warehouse.dir", "file:///C:/Test")
>   .master("local[*]")
>   .getOrCreate()
> val sc = sqlContext.sparkContext
> val accumulator1 = sc.accumulator(0, "accumulator1")
> val usersDF = sqlContext.read.json("C:\\users.json") // single row: {"name":"sreelal", "country":"IND"}
> val usersDFwithCount = usersDF.rdd.map(x => { accumulator1 += 1; x })
> val counterDF = sqlContext.createDataFrame(usersDFwithCount, usersDF.schema)
> val orderedDF = counterDF.orderBy("name")
> val collected = orderedDF.collect()
> collected.foreach { x => println(x) }
> println("accumulator1 : " + accumulator1.value)
> println("Done")
> {code}
>
> I have only one row in the users.json file. I expect accumulator1 to have value 1, but it comes out as 2.
> In the Spark SQL UI, I see two jobs generated for the same action.
[jira] [Updated] (SPARK-17617) Remainder(%) expression.eval returns incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-17617:
--------------------------------
    Fix Version/s: 2.0.1

> Remainder(%) expression.eval returns incorrect result
> -----------------------------------------------------
>
> Key: SPARK-17617
> URL: https://issues.apache.org/jira/browse/SPARK-17617
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Sean Zhong
> Assignee: Sean Zhong
> Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
> h2. Problem
>
> The Remainder (%) expression returns an incorrect result when expression.eval is used to compute it. expression.eval is called in cases like constant folding.
>
> {code}
> scala> -5083676433652386516D % 10
> res19: Double = -6.0
>
> // Wrong answer with eval!
> scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |         0.0|
> +------------+
>
> // Triggers codegen, which does not do constant folding
> scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |        -6.0|
> +------------+
> {code}
>
> Behavior of Postgres:
> {code}
> seanzhong=# select -5083676433652386516.0 % 10;
>  ?column?
> ----------
>      -6.0
> (1 row)
> {code}
[jira] [Updated] (SPARK-17017) Add a chiSquare Selector based on False Positive Rate (FPR) test
[ https://issues.apache.org/jira/browse/SPARK-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-17017:
------------------------------
    Assignee: Peng Meng

> Add a chiSquare Selector based on False Positive Rate (FPR) test
> ----------------------------------------------------------------
>
> Key: SPARK-17017
> URL: https://issues.apache.org/jira/browse/SPARK-17017
> Project: Spark
> Issue Type: New Feature
> Reporter: Peng Meng
> Assignee: Peng Meng
> Priority: Minor
> Fix For: 2.1.0
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. Should we add a chi-square selector based on the FPR test, as implemented in scikit-learn?
> http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
[jira] [Commented] (SPARK-17621) Accumulator value is doubled when using DataFrame.orderBy()
[ https://issues.apache.org/jira/browse/SPARK-17621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509168#comment-15509168 ]

Sean Owen commented on SPARK-17621:
-----------------------------------

I think you've found the issue: you're actually evaluating usersDFwithCount twice here. I think the other evaluation has to do with creating the DataFrame. So the accumulator is incremented twice.

> Accumulator value is doubled when using DataFrame.orderBy()
> -----------------------------------------------------------
>
> Key: SPARK-17621
> URL: https://issues.apache.org/jira/browse/SPARK-17621
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, SQL
> Affects Versions: 2.0.0
> Environment: Development environment (Eclipse, single process)
> Reporter: Sreelal S L
> Priority: Minor
>
> We are tracing the records read by our source using an accumulator. We do an orderBy on the DataFrame before the output operation. When the job completes, the accumulator value is double the expected value.
>
> Below is the sample code I ran:
> {code}
> val sqlContext = SparkSession.builder()
>   .config("spark.sql.retainGroupColumns", false)
>   .config("spark.sql.warehouse.dir", "file:///C:/Test")
>   .master("local[*]")
>   .getOrCreate()
> val sc = sqlContext.sparkContext
> val accumulator1 = sc.accumulator(0, "accumulator1")
> val usersDF = sqlContext.read.json("C:\\users.json") // single row: {"name":"sreelal", "country":"IND"}
> val usersDFwithCount = usersDF.rdd.map(x => { accumulator1 += 1; x })
> val counterDF = sqlContext.createDataFrame(usersDFwithCount, usersDF.schema)
> val orderedDF = counterDF.orderBy("name")
> val collected = orderedDF.collect()
> collected.foreach { x => println(x) }
> println("accumulator1 : " + accumulator1.value)
> println("Done")
> {code}
>
> I have only one row in the users.json file. I expect accumulator1 to have value 1, but it comes out as 2.
> In the Spark SQL UI, I see two jobs generated for the same action.
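The mechanism Sean describes, a side-effecting map() re-executed because the upstream computation runs once per job, can be mimicked outside Spark with a counter and a lazily re-evaluated pipeline. This is an illustration only; in the reporter's actual code, the analogous fix would be persisting the RDD before building the DataFrame, so both jobs reuse one materialization:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DoubleEvalDemo {
    public static void main(String[] args) {
        AtomicInteger accumulator = new AtomicInteger();

        // "Lineage": a lazy pipeline whose map() step bumps the accumulator,
        // like usersDFwithCount in the report.
        Supplier<List<Integer>> lineage = () ->
                Stream.of(1)
                      .map(x -> { accumulator.incrementAndGet(); return x; })
                      .collect(Collectors.toList());

        // Two jobs over the same lineage (e.g. an extra job plus the collect())
        // each re-run the side effect, doubling the count:
        lineage.get();
        lineage.get();
        System.out.println("uncached: " + accumulator.get()); // uncached: 2

        // Materializing once ("cache()" in Spark terms) and reusing the result
        // increments the accumulator only once:
        accumulator.set(0);
        List<Integer> cached = lineage.get();
        cached.size();
        cached.size();
        System.out.println("cached: " + accumulator.get()); // cached: 1
    }
}
```

This is also why accumulators inside transformations are documented as unreliable for exact counts: any extra evaluation of the lineage re-applies the side effect.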
[jira] [Assigned] (SPARK-11918) Better error from WLS for cases like singular input
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11918:
------------------------------------
    Assignee: Apache Spark  (was: Sean Owen)

> Better error from WLS for cases like singular input
> ---------------------------------------------------
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Yanbo Liang
> Assignee: Apache Spark
> Priority: Minor
> Attachments: R_GLM_output
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0/1 label used for classification, making the system underdetermined), WLS fails, while "l-bfgs" can still train the model. The failure is caused by the underlying LAPACK library returning an error value from the Cholesky decomposition.
> This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
> The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}
[jira] [Assigned] (SPARK-11918) Better error from WLS for cases like singular input
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-11918:
------------------------------------
    Assignee: Sean Owen  (was: Apache Spark)

> Better error from WLS for cases like singular input
> ---------------------------------------------------
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Yanbo Liang
> Assignee: Sean Owen
> Priority: Minor
> Attachments: R_GLM_output
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0/1 label used for classification, making the system underdetermined), WLS fails, while "l-bfgs" can still train the model. The failure is caused by the underlying LAPACK library returning an error value from the Cholesky decomposition.
> This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
> The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}
[jira] [Updated] (SPARK-11918) Better error from WLS for cases like singular input
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated SPARK-11918:
------------------------------
    Assignee: Sean Owen
    Labels:  (was: starter)
    Summary: Better error from WLS for cases like singular input  (was: WLS can not resolve some kinds of equation)

> Better error from WLS for cases like singular input
> ---------------------------------------------------
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Yanbo Liang
> Assignee: Sean Owen
> Priority: Minor
> Attachments: R_GLM_output
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0/1 label used for classification, making the system underdetermined), WLS fails, while "l-bfgs" can still train the model. The failure is caused by the underlying LAPACK library returning an error value from the Cholesky decomposition.
> This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
> The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}
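The failure mode in the description, Cholesky factorization breaking on a singular (rank-deficient) normal-equations matrix, can be reproduced with a tiny hand-rolled 2x2 Cholesky. This is a sketch of the math, not the LAPACK routine Spark actually calls:

```java
public class CholeskyDemo {
    // Minimal 2x2 Cholesky feasibility check for the symmetric matrix
    // [[a, b], [b, c]]: returns false when the matrix is not strictly
    // positive definite, the situation LAPACK reports as a nonzero info code.
    static boolean cholesky2x2(double a, double b, double c) {
        if (a <= 0) return false;
        double l11 = Math.sqrt(a);
        double l21 = b / l11;
        double pivot = c - l21 * l21; // second diagonal pivot
        return pivot > 0;             // zero/negative pivot => singular or indefinite
    }

    public static void main(String[] args) {
        System.out.println(cholesky2x2(4, 2, 3)); // true: positive definite, factorization succeeds
        System.out.println(cholesky2x2(4, 2, 1)); // false: det = 0, singular like the ill-conditioned WLS input
    }
}
```

An underdetermined regression yields exactly such a rank-deficient Gram matrix, which is why a clearer error (rather than a bare assertion) is the improvement requested here.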
[jira] [Assigned] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17556:
------------------------------------
    Assignee: Apache Spark

> Executor side broadcast for broadcast joins
> -------------------------------------------
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core, SQL
> Reporter: Reynold Xin
> Assignee: Apache Spark
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must collect the result of an RDD and then broadcast it. This introduces some extra latency. It might be possible to broadcast directly from executors.
[jira] [Commented] (SPARK-17596) Streaming job lacks Scala runtime methods
[ https://issues.apache.org/jira/browse/SPARK-17596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509229#comment-15509229 ]

Evgeniy Tsvigun commented on SPARK-17596:
-----------------------------------------

Thanks Sean! One more check revealed I had SPARK_HOME pointing to a 1.6.2 Spark distro in my profile.

> Streaming job lacks Scala runtime methods
> -----------------------------------------
>
> Key: SPARK-17596
> URL: https://issues.apache.org/jira/browse/SPARK-17596
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 2.0.0
> Environment: Linux 4.4.20 x86_64 GNU/Linux
> openjdk version "1.8.0_102"
> Scala 2.11.8
> Reporter: Evgeniy Tsvigun
> Labels: kafka-0.8, streaming
>
> When using -> in Spark Streaming 2.0.0 jobs, or using spark-streaming-kafka-0-8_2.11 v2.0.0, and submitting the job with spark-submit, I get the following error:
>
>     Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 72.0 failed 1 times, most recent failure: Lost task 0.0 in stage 72.0 (TID 37, localhost): java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>
> This only happens with spark-streaming; using ArrowAssoc in plain non-streaming Spark jobs works fine.
> I put a brief illustration of this phenomenon in a GitHub repo: https://github.com/utgarda/spark-2-streaming-nosuchmethod-arrowassoc
>
> Putting only provided dependencies in build.sbt:
>     "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
>     "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided"
> then using -> anywhere in the driver code, packaging it with sbt-assembly, and submitting the job results in an error. This isn't a big problem by itself, since using ArrowAssoc can be avoided, but spark-streaming-kafka-0-8_2.11 v2.0.0 has it somewhere inside and generates the same error.
> Packaging with scala-library, I can see the class in the jar after packaging, but it is still reported missing at runtime.
> The issue is reported on StackOverflow: http://stackoverflow.com/questions/39395521/spark-2-0-0-streaming-job-packed-with-sbt-assembly-lacks-scala-runtime-methods
[jira] [Commented] (SPARK-10835) Change Output of NGram to Array(String, True)
[ https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509249#comment-15509249 ]

Apache Spark commented on SPARK-10835:
--------------------------------------

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15179

> Change Output of NGram to Array(String, True)
> ---------------------------------------------
>
> Key: SPARK-10835
> URL: https://issues.apache.org/jira/browse/SPARK-10835
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Sumit Chawla
> Assignee: yuhao yang
> Priority: Minor
>
> Currently the output type of NGram is Array(String, false), which is not compatible with LDA, since LDA's input type is Array(String, true).
[jira] [Assigned] (SPARK-10835) Change Output of NGram to Array(String, True)
[ https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10835:
------------------------------------
    Assignee: Apache Spark  (was: yuhao yang)

> Change Output of NGram to Array(String, True)
> ---------------------------------------------
>
> Key: SPARK-10835
> URL: https://issues.apache.org/jira/browse/SPARK-10835
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Sumit Chawla
> Assignee: Apache Spark
> Priority: Minor
>
> Currently the output type of NGram is Array(String, false), which is not compatible with LDA, since LDA's input type is Array(String, true).
[jira] [Assigned] (SPARK-10835) Change Output of NGram to Array(String, True)
[ https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-10835:
------------------------------------
    Assignee: yuhao yang  (was: Apache Spark)

> Change Output of NGram to Array(String, True)
> ---------------------------------------------
>
> Key: SPARK-10835
> URL: https://issues.apache.org/jira/browse/SPARK-10835
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Sumit Chawla
> Assignee: yuhao yang
> Priority: Minor
>
> Currently the output type of NGram is Array(String, false), which is not compatible with LDA, since LDA's input type is Array(String, true).
[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509228#comment-15509228 ]

Apache Spark commented on SPARK-17556:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15178

> Executor side broadcast for broadcast joins
> -------------------------------------------
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core, SQL
> Reporter: Reynold Xin
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must collect the result of an RDD and then broadcast it. This introduces some extra latency. It might be possible to broadcast directly from executors.
[jira] [Assigned] (SPARK-17556) Executor side broadcast for broadcast joins
[ https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17556: Assignee: (was: Apache Spark) > Executor side broadcast for broadcast joins > --- > > Key: SPARK-17556 > URL: https://issues.apache.org/jira/browse/SPARK-17556 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Reporter: Reynold Xin > > Currently in Spark SQL, in order to perform a broadcast join, the driver must > collect the result of an RDD and then broadcast it. This introduces some > extra latency. It might be possible to broadcast directly from executors.
[jira] [Resolved] (SPARK-17595) Inefficient selection in Word2VecModel.findSynonyms
[ https://issues.apache.org/jira/browse/SPARK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17595. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15150 [https://github.com/apache/spark/pull/15150] > Inefficient selection in Word2VecModel.findSynonyms > --- > > Key: SPARK-17595 > URL: https://issues.apache.org/jira/browse/SPARK-17595 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: William Benton >Priority: Minor > Fix For: 2.1.0 > > > The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements > with the highest similarity to the query vector currently sorts the > similarities for every vocabulary element. This involves making multiple > copies of the collection of similarities while doing a (relatively) expensive > sort. It would be more efficient to find the best matches by maintaining a > bounded priority queue and populating it with a single pass over the > vocabulary.
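The single-pass, bounded-priority-queue selection proposed in this ticket can be sketched in Python with a size-k min-heap. This is an illustration of the technique, not the actual Spark patch; `top_k_synonyms` is a hypothetical name.

```python
import heapq

def top_k_synonyms(similarities, k):
    """Return the k (index, score) pairs with the highest scores.

    Maintains a bounded min-heap of at most k entries in a single pass,
    avoiding a full O(n log n) sort over every vocabulary element.
    """
    heap = []  # min-heap of (score, index), size <= k
    for idx, score in enumerate(similarities):
        if len(heap) < k:
            heapq.heappush(heap, (score, idx))
        elif score > heap[0][0]:
            # New score beats the worst retained candidate: swap it in.
            heapq.heapreplace(heap, (score, idx))
    # Report best-first.
    return [(i, s) for s, i in sorted(heap, reverse=True)]

scores = [0.1, 0.9, 0.3, 0.7, 0.5]
print(top_k_synonyms(scores, 2))  # [(1, 0.9), (3, 0.7)]
```

The heap only ever holds k entries, so memory stays bounded regardless of vocabulary size.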
[jira] [Updated] (SPARK-17595) Inefficient selection in Word2VecModel.findSynonyms
[ https://issues.apache.org/jira/browse/SPARK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17595: -- Assignee: William Benton > Inefficient selection in Word2VecModel.findSynonyms > --- > > Key: SPARK-17595 > URL: https://issues.apache.org/jira/browse/SPARK-17595 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: William Benton >Assignee: William Benton >Priority: Minor > Fix For: 2.1.0 > > > The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements > with the highest similarity to the query vector currently sorts the > similarities for every vocabulary element. This involves making multiple > copies of the collection of similarities while doing a (relatively) expensive > sort. It would be more efficient to find the best matches by maintaining a > bounded priority queue and populating it with a single pass over the > vocabulary.
[jira] [Updated] (SPARK-17617) Remainder(%) expression.eval returns incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-17617: Assignee: Sean Zhong > Remainder(%) expression.eval returns incorrect result > - > > Key: SPARK-17617 > URL: https://issues.apache.org/jira/browse/SPARK-17617 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong >Assignee: Sean Zhong > Labels: correctness > Fix For: 2.0.1, 2.1.0 > > > h2.Problem > Remainder(%) expression returns incorrect result when using expression.eval > to calculate the result. expression.eval is called in case like constant > folding. > {code} > scala> -5083676433652386516D % 10 > res19: Double = -6.0 > // Wrong answer with eval!!! > scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show > |(value % 10)| > ++ > | 0.0| > ++ > // Triggers codegen, will not do constant folding > scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % > 10).show > ++ > |(value % 10)| > ++ > |-6.0| > ++ > {code} > Behavior of postgres: > {code} > seanzhong=# select -5083676433652386516.0 % 10; > ?column? > -- > -6.0 > (1 row) > {code}
[jira] [Resolved] (SPARK-17617) Remainder(%) expression.eval returns incorrect result
[ https://issues.apache.org/jira/browse/SPARK-17617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17617. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15171 [https://github.com/apache/spark/pull/15171] > Remainder(%) expression.eval returns incorrect result > - > > Key: SPARK-17617 > URL: https://issues.apache.org/jira/browse/SPARK-17617 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Sean Zhong > Labels: correctness > Fix For: 2.1.0 > > > h2.Problem > Remainder(%) expression returns incorrect result when using expression.eval > to calculate the result. expression.eval is called in case like constant > folding. > {code} > scala> -5083676433652386516D % 10 > res19: Double = -6.0 > // Wrong answer with eval!!! > scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show > |(value % 10)| > ++ > | 0.0| > ++ > // Triggers codegen, will not do constant folding > scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % > 10).show > ++ > |(value % 10)| > ++ > |-6.0| > ++ > {code} > Behavior of postgres: > {code} > seanzhong=# select -5083676433652386516.0 % 10; > ?column? > -- > -6.0 > (1 row) > {code}
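The expected answer can be checked outside Spark. A Python sketch: `math.fmod` follows the dividend's sign, matching Java/Scala `%` semantics (the -6.0 that codegen and Postgres return); Python's own `%` operator follows the divisor's sign and gives a different value. Note that the literal is too large for an exact 64-bit double, so it is rounded before the remainder is taken.

```python
import math

# This integer needs 63 significant bits, so as a double it is rounded
# to the nearest representable value before any arithmetic happens.
x = -5083676433652386516.0

print(math.fmod(x, 10))  # keeps the dividend's sign, like Java's %
print(x % 10)            # Python's % follows the divisor's sign instead
```

So the constant-folded 0.0 in the report is wrong under both conventions; the codegen path's -6.0 is the correct Java-style remainder.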
[jira] [Resolved] (SPARK-17017) Add a chiSquare Selector based on False Positive Rate (FPR) test
[ https://issues.apache.org/jira/browse/SPARK-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17017. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14597 [https://github.com/apache/spark/pull/14597] > Add a chiSquare Selector based on False Positive Rate (FPR) test > > > Key: SPARK-17017 > URL: https://issues.apache.org/jira/browse/SPARK-17017 > Project: Spark > Issue Type: New Feature >Reporter: Peng Meng >Priority: Minor > Fix For: 2.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Univariate feature selection works by selecting the best features based on > univariate statistical tests. False Positive Rate (FPR) is a popular > univariate statistical test for feature selection. Is it necessary to add a > chiSquare Selector based on False Positive Rate (FPR) test, like it is > implemented in scikit-learn. > http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
[jira] [Assigned] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively
[ https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-17585: --- Assignee: Yanbo Liang > PySpark SparkContext.addFile supports adding files recursively > -- > > Key: SPARK-17585 > URL: https://issues.apache.org/jira/browse/SPARK-17585 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.1.0 > > > Users would like to add a directory as dependency in some cases, they can use > {{SparkContext.addFile}} with argument {{recursive=true}} to recursively add > all files under the directory by using Scala. But Python users can only add > file not directory, we should also make it supported.
[jira] [Resolved] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively
[ https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang resolved SPARK-17585. - Resolution: Fixed Fix Version/s: 2.1.0 > PySpark SparkContext.addFile supports adding files recursively > -- > > Key: https://issues.apache.org/jira/browse/SPARK-17585 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 2.1.0 > > > Users would like to add a directory as dependency in some cases, they can use > {{SparkContext.addFile}} with argument {{recursive=true}} to recursively add > all files under the directory by using Scala. But Python users can only add > file not directory, we should also make it supported.
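The directory enumeration that a recursive add needs can be sketched with `os.walk`. This is only an illustration of the idea, not PySpark's `SparkContext.addFile` implementation; `list_files_recursively` is a hypothetical helper name.

```python
import os
import tempfile

def list_files_recursively(root):
    """Collect every regular file under root, the way a recursive
    add-file operation must enumerate a directory's contents."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for relpath in ("a.txt", os.path.join("sub", "b.txt")):
        with open(os.path.join(root, relpath), "w") as f:
            f.write("x")
    files = list_files_recursively(root)
    rel = [os.path.relpath(p, root) for p in files]
    print(rel)  # both the top-level file and the nested one are found
```

A non-recursive add would only ever see `a.txt`; the walk picks up `sub/b.txt` as well.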
[jira] [Commented] (SPARK-11918) Better error from WLS for cases like singular input
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509222#comment-15509222 ] Apache Spark commented on SPARK-11918: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/15177 > Better error from WLS for cases like singular input > --- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Sean Owen >Priority: Minor > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization method for solve > Linear Regression (when #feature < 4096). But if the dataset is very ill > condition (such as 0-1 based label used for classification and the equation > is underdetermined), the WLS failed (But "l-bfgs" can train and get the > model). The failure is caused by the underneath lapack library return error > value when Cholesky decomposition. > This issue is easy to reproduce, you can train a LinearRegressionModel by > "normal" solver with the example > dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. 
> at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code}
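The `lapack.dpotrs returned 1` assertion above comes from Cholesky factorization hitting a non-positive pivot, which is exactly what happens to the normal equations of a singular or underdetermined system. A minimal 2x2 sketch (pure Python, not Spark's WLS code; `cholesky_2x2` is a hypothetical helper) of why the factorization fails and how a clearer error can be surfaced:

```python
import math

def cholesky_2x2(a, b, c):
    """Cholesky factorization of the symmetric matrix [[a, b], [b, c]].

    Raises ValueError with a readable message when the matrix is not
    positive definite -- the situation a singular normal-equations
    matrix produces, which the ticket asks WLS to report clearly.
    """
    if a <= 0:
        raise ValueError("matrix is not positive definite (singular input?)")
    l11 = math.sqrt(a)
    l21 = b / l11
    pivot = c - l21 * l21
    if pivot <= 0:
        raise ValueError("matrix is not positive definite (singular input?)")
    return (l11, l21, math.sqrt(pivot))

print(cholesky_2x2(4.0, 2.0, 3.0))  # well-conditioned: factorizes fine
try:
    cholesky_2x2(1.0, 1.0, 1.0)     # rank-1, i.e. singular: pivot hits zero
except ValueError as e:
    print("clear error instead of an opaque LAPACK code:", e)
```

Spark delegates to LAPACK rather than hand-rolling this, but the failure mode and the case for a descriptive error are the same.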
[jira] [Updated] (SPARK-17622) Cannot run SparkR function on Windows- Spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] renzhi he updated SPARK-17622: -- Description: sc <- sparkR.session(master="local[*]", appName="sparkR", sparkConfig = list(spark.driver.memory = "2g")) df <- as.DataFrame(faithful) get error below: Error in invokeJava(isStatic = TRUE, className, methodName, ...) : java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) at org.apache.spark.sql.hive.HiveSharedSt on spark 1.6.1 and spark 1.6.2 can run the corresponding codes. sc1 <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g")) sqlContext <- sparkRSQL.init(sc1) df <- as.DataFrame(sqlContext,faithful) was: sc <- sparkR.session(master="spark://spark01.cmua.dom:7077", appName="sparkR", sparkConfig = list(spark.driver.memory = "2g")) df <- as.DataFrame(faithful) get error below: Error in invokeJava(isStatic = TRUE, className, methodName, ...) 
: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) at org.apache.spark.sql.hive.HiveSharedSt > Cannot run SparkR function on Windows- Spark 2.0.0 > -- > > Key: SPARK-17622 > URL: https://issues.apache.org/jira/browse/SPARK-17622 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.0 > Environment: windows 10 > R 3.3.1 > RStudio 1.0.20 >Reporter: renzhi he > Labels: windows > Fix For: 1.6.1, 1.6.2 > > > sc <- sparkR.session(master="local[*]", appName="sparkR", sparkConfig = > list(spark.driver.memory = "2g")) > df <- as.DataFrame(faithful) > get error below: > Error in invokeJava(isStatic = TRUE, className, methodName, ...) 
: > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) > at org.apache.spark.sql.hive.HiveSharedSt > on spark 1.6.1 and spark 1.6.2 can run the corresponding codes. > sc1 <- sparkR.init(master = "local[*]", sparkEnvir = > list(spark.driver.memory="2g")) > sqlContext <- sparkRSQL.init(sc1) > df <- as.DataFrame(sqlContext,faithful) -- This message was sent by
[jira] [Updated] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support
[ https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17614: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra > does not support > - > > Key: SPARK-17614 > URL: https://issues.apache.org/jira/browse/SPARK-17614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Environment: Any Spark Runtime >Reporter: Paul Wu >Priority: Minor > Labels: cassandra-jdbc, sql > > I have the code like the following with Cassandra JDBC > (https://github.com/adejanovski/cassandra-jdbc-wrapper): > final String dbTable= "sql_demo"; > Dataset jdbcDF > = sparkSession.read() > .jdbc(CASSANDRA_CONNECTION_URL, dbTable, > connectionProperties); > List rows = jdbcDF.collectAsList(); > It threw the error: > Exception in thread "main" java.sql.SQLTransientException: > com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable > alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...) > at > com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48) > The reason is that the Spark jdbc code uses the sql syntax "where 1=0" > somewhere (to get the schema?), but Cassandra does not support this syntax. > Not sure how this issue can be resolved...this is because CQL is not standard > sql. 
> The following log shows more information: > 16/09/20 13:16:35 INFO CassandraConnection 138: Datacenter: %s; Host: %s; > Rack: %s > 16/09/20 13:16:35 TRACE CassandraPreparedStatement 98: CQL: SELECT * FROM > sql_demo WHERE 1=0 > 16/09/20 13:16:35 TRACE RequestHandler 71: [19400322] > com.datastax.driver.core.Statement$1@41ccb3b9 > 16/09/20 13:16:35 TRACE RequestHandler 272: [19400322-1] Starting
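For context, `WHERE 1=0` is a common way for JDBC-style clients to learn a table's schema without scanning any rows: the query is guaranteed empty, but the result-set metadata still carries the column names. The sketch below demonstrates the pattern against SQLite (standing in for a SQL engine that, unlike Cassandra's CQL, accepts the syntax); the `sql_demo` table is recreated locally for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sql_demo (id INTEGER, name TEXT)")

# Zero-row probe: no data is read, but the cursor still exposes the schema.
cur = conn.execute("SELECT * FROM sql_demo WHERE 1=0")
cols = [d[0] for d in cur.description]
rows = cur.fetchall()

print(cols)  # column names recovered from metadata alone
print(rows)  # empty -- the probe scanned nothing
```

Cassandra rejects this probe at the CQL parser, which is why the Spark JDBC source fails before it can even fetch the schema.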
[jira] [Commented] (SPARK-17621) Accumulator value is doubled when using DataFrame.orderBy()
[ https://issues.apache.org/jira/browse/SPARK-17621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509234#comment-15509234 ] Sean Owen commented on SPARK-17621: --- I think you're generally relying on the RDD being evaluated once, but that's not the case in some of your examples. Why not just use count()? You said you see two jobs generated, and that will tell you what is running that may evaluate the RDD. > Accumulator value is doubled when using DataFrame.orderBy() > --- > > Key: SPARK-17621 > URL: https://issues.apache.org/jira/browse/SPARK-17621 > Project: Spark > Issue Type: Bug > Components: Scheduler, SQL >Affects Versions: 2.0.0 > Environment: Development environment. (Eclipse . Single process) >Reporter: Sreelal S L >Priority: Minor > > We are tracing the records read by our source using an accumulator. We do a > orderBy on the Dataframe before the output operation. When the job is > completed, the accumulator values is becoming double of the expected value . > . > Below is the sample code i ran . > {code} > val sqlContext = SparkSession.builder() > .config("spark.sql.retainGroupColumns", > false).config("spark.sql.warehouse.dir", "file:///C:/Test").master("local[*]") > .getOrCreate() > val sc = sqlContext.sparkContext > val accumulator1 = sc.accumulator(0, "accumulator1") > val usersDF = sqlContext.read.json("C:\\users.json") // single row > {"name":"sreelal" ,"country":"IND"} > val usersDFwithCount = usersDF.rdd.map(x => { accumulator1 += 1; x }); > val counterDF = sqlContext.createDataFrame(usersDFwithCount, > usersDF.schema); > val oderedDF = counterDF.orderBy("name") > val collected = oderedDF.collect() > collected.foreach { x => println(x) } > println("accumulator1 : " + accumulator1.value) > println("Done"); > {code} > I have only one row in the users.json file. I expect accumulator1 to have > value 1. But its coming as 2. > In the Spark Sql UI , i see two jobs getting generated for the same. 
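One explanation consistent with the two jobs seen in the UI (offered here as a plausible reading, not a confirmed diagnosis from the thread): a range-partitioned sort first runs a sampling job over the input to pick partition boundaries, then runs the real sort, so a side-effecting upstream `map` executes twice even though every task succeeds. A Python sketch of the effect, illustrative only and not Spark internals:

```python
counter = {"reads": 0}

def read_source():
    """Stands in for the mapped RDD: bumps the counter per record read."""
    for record in ["sreelal"]:
        counter["reads"] += 1
        yield record

# Job 1: sample the lineage to choose range-partition boundaries.
sample = list(read_source())
# Job 2: the actual sort/collect re-evaluates the same lineage.
result = sorted(read_source())

print(result)
print(counter["reads"])  # one record, two evaluations
```

Caching the mapped RDD (or counting before the sort) avoids the double evaluation.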
[jira] [Closed] (SPARK-17596) Streaming job lacks Scala runtime methods
[ https://issues.apache.org/jira/browse/SPARK-17596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Evgeniy Tsvigun closed SPARK-17596. --- Resolution: Not A Problem Found that my SPARK_HOME environment variable was pointing to a wrong Spark version. > Streaming job lacks Scala runtime methods > - > > Key: SPARK-17596 > URL: https://issues.apache.org/jira/browse/SPARK-17596 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 2.0.0 > Environment: Linux 4.4.20 x86_64 GNU/Linux > openjdk version "1.8.0_102" > Scala 2.11.8 >Reporter: Evgeniy Tsvigun > Labels: kafka-0.8, streaming > > When using -> in Spark Streaming 2.0.0 jobs, or using > spark-streaming-kafka-0-8_2.11 v2.0.0, and submitting it with spark-submit, I > get the following error: > Exception in thread "main" org.apache.spark.SparkException: Job aborted > due to stage failure: Task 0 in stage 72.0 failed 1 times, most recent > failure: Lost task 0.0 in stage 72.0 (TID 37, localhost): > java.lang.NoSuchMethodError: > scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object; > This only happens with spark-streaming, using ArrowAssoc in plain > non-streaming Spark jobs works fine. > I put a brief illustration of this phenomenon to a GitHub repo: > https://github.com/utgarda/spark-2-streaming-nosuchmethod-arrowassoc > Putting only provided dependencies to build.sbt > "org.apache.spark" %% "spark-core" % "2.0.0" % "provided", > "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided" > using -> anywhere in the driver code, packing it with sbt-assembly and > submitting the job results in an error. This isn't a big problem by itself, > using ArrayAssoc can be avoided, but spark-streaming-kafka-0-8_2.11 v2.0.0 > has it somewhere inside, and generates the same error. > Packing with scala-library, can see the class in the jar after packing, but > it's still reported missing in runtime. 
> The issue reported on StackOverflow: > http://stackoverflow.com/questions/39395521/spark-2-0-0-streaming-job-packed-with-sbt-assembly-lacks-scala-runtime-methods
[jira] [Created] (SPARK-17622) Cannot run SparkR function on Windows- Spark 2.0.0
renzhi he created SPARK-17622: - Summary: Cannot run SparkR function on Windows- Spark 2.0.0 Key: SPARK-17622 URL: https://issues.apache.org/jira/browse/SPARK-17622 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.0.0 Environment: windows 10 R 3.3.1 RStudio 1.0.20 Reporter: renzhi he Fix For: 1.6.2, 1.6.1 sc <- sparkR.session(master="spark://spark01.cmua.dom:7077", appName="sparkR", sparkConfig = list(spark.driver.memory = "2g")) df <- as.DataFrame(faithful) get error below: Error in invokeJava(isStatic = TRUE, className, methodName, ...) : java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) at org.apache.spark.sql.hive.HiveSharedSt
[jira] [Resolved] (SPARK-17219) QuantileDiscretizer does strange things with NaN values
[ https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17219. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14858 [https://github.com/apache/spark/pull/14858] > QuantileDiscretizer does strange things with NaN values > --- > > Key: SPARK-17219 > URL: https://issues.apache.org/jira/browse/SPARK-17219 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.2 >Reporter: Barry Becker > Fix For: 2.1.0 > > > How is the QuantileDiscretizer supposed to handle null values? > Actual nulls are not allowed, so I replace them with Double.NaN. > However, when you try to run the QuantileDiscretizer on a column that > contains NaNs, it will create (possibly more than one) NaN split(s) before > the final PositiveInfinity value. > I am using the attache titanic csv data and trying to bin the "age" column > using the QuantileDiscretizer with 10 bins specified. The age column as a lot > of null values. > These are the splits that I get: > {code} > -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity > {code} > Is that expected. It seems to imply that NaN is larger than any positive > number and less than infinity. > I'm not sure of the best way to handle nulls, but I think they need a bucket > all their own. My suggestions would be to include an initial NaN split value > that is always there, just like the sentinel Infinities are. If that were the > case, then the splits for the example above might look like this: > {code} > NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity > {code} > This does not seem great either because a bucket that is [NaN, -Inf] doesn't > make much sense. Not sure if the NaN bucket counts toward numBins or not. I > do think it should always be there though in case future data has null even > though the fit data did not. Thoughts? 
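The "bucket of their own" suggestion in the ticket can be sketched outside Spark. The `bucketize` helper and the dedicated trailing NaN bucket below are hypothetical choices for illustration, not Spark's eventual behavior; the splits echo the titanic-age example with the spurious NaN split points removed.

```python
import bisect
import math

def bucketize(value, splits):
    """Assign value to a bin given ascending splits (-inf ... +inf),
    routing NaN to a dedicated extra bucket instead of letting NaN
    leak into the split points themselves."""
    if math.isnan(value):
        return len(splits) - 1  # one extra bucket, reserved for NaN
    return bisect.bisect_right(splits, value) - 1

splits = [float("-inf"), 15.0, 28.0, 48.0, float("inf")]
print(bucketize(20.5, splits))          # falls in the (15, 28] bin
print(bucketize(float("nan"), splits))  # the NaN-only bucket
```

Keeping the NaN bucket always present (whether or not the fit data contained nulls) sidesteps the question of whether NaN is "larger than infinity" entirely.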
[jira] [Updated] (SPARK-17057) ProbabilisticClassifierModels' thresholds should be > 0
[ https://issues.apache.org/jira/browse/SPARK-17057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17057: -- Summary: ProbabilisticClassifierModels' thresholds should be > 0 (was: ProbabilisticClassifierModels' thresholds should be > 0 and sum < 1 to match randomForest cutoff) > ProbabilisticClassifierModels' thresholds should be > 0 > --- > > Key: SPARK-17057 > URL: https://issues.apache.org/jira/browse/SPARK-17057 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.0.0 >Reporter: zhengruifeng >Assignee: Sean Owen >Priority: Minor > > {code} > val path = "./data/mllib/sample_multiclass_classification_data.txt" > val data = spark.read.format("libsvm").load(path) > val rfm = rf.fit(data) > scala> rfm.setThresholds(Array(0.0,0.0,0.0)) > res4: org.apache.spark.ml.classification.RandomForestClassificationModel = > RandomForestClassificationModel (uid=rfc_cbe640b0eccc) with 20 trees > scala> rfm.transform(data).show(5) > +-++--+-+--+ > |label|features| rawPrediction| probability|prediction| > +-++--+-+--+ > | 1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]| 0.0| > | 1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]| 0.0| > | 1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]| 0.0| > | 1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]| 0.0| > | 0.0|(4,[0,1,2,3],[0.1...|[20.0,0.0,0.0]|[1.0,0.0,0.0]| 0.0| > +-++--+-+--+ > only showing top 5 rows > {code} > If multi thresholds are set zero, the prediction of > {{ProbabilisticClassificationModel}} is the first index whose corresponding > threshold is 0. > However, in this case, the index with max {{probability}} among indices with > 0-threshold should be more reasonable to mark as > {{prediction}}.
[jira] [Commented] (SPARK-15071) Check the result of all TPCDS queries
[ https://issues.apache.org/jira/browse/SPARK-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509559#comment-15509559 ] Nirman Narang commented on SPARK-15071: --- Started working on this. > Check the result of all TPCDS queries > - > > Key: SPARK-15071 > URL: https://issues.apache.org/jira/browse/SPARK-15071 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Davies Liu > > We should compare the results of all TPCDS queries against other databases that > could support all of them (for example, IBM Big SQL, PostgreSQL)
[jira] [Resolved] (SPARK-17590) Analyze CTE definitions at once and allow CTE subquery to define CTE
[ https://issues.apache.org/jira/browse/SPARK-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17590. --- Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.1.0 > Analyze CTE definitions at once and allow CTE subquery to define CTE > > > Key: SPARK-17590 > URL: https://issues.apache.org/jira/browse/SPARK-17590 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh > Fix For: 2.1.0 > > > We substitute logical plan with CTE definitions in the analyzer rule > CTESubstitution. A CTE definition can be used in the logical plan for > multiple times, and its analyzed logical plan should be the same. We should > not analyze CTE definitions multiple times when they are reused in the query. > By analyzing CTE definitions before substitution, we can support defining CTE > in subquery.
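The SQL shape this ticket enables, a WITH clause defined inside a subquery, can be demonstrated against SQLite, which already accepts it; the `nums` CTE below is a made-up example, not taken from the ticket:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A CTE declared at the start of a subquery in the FROM clause.
rows = conn.execute(
    "SELECT total FROM ("
    "  WITH nums(n) AS (SELECT 1 UNION ALL SELECT 2)"
    "  SELECT SUM(n) AS total FROM nums"
    ")"
).fetchall()
print(rows)  # the CTE is scoped to, and usable inside, the subquery
```

Analyzing each CTE definition once and substituting the analyzed plan also means a definition reused several times in the outer query is not re-analyzed per use.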
[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata
[ https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510098#comment-15510098 ] Shawn Lavelle commented on SPARK-9686: -- It's been a few months, any progress on this bug? > Spark Thrift server doesn't return correct JDBC metadata > - > > Key: SPARK-9686 > URL: https://issues.apache.org/jira/browse/SPARK-9686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2 >Reporter: pin_zhang >Assignee: Cheng Lian >Priority: Critical > Attachments: SPARK-9686.1.patch.txt > > > 1. Start start-thriftserver.sh > 2. connect with beeline > 3. create table > 4.show tables, the new created table returned > 5. > Class.forName("org.apache.hive.jdbc.HiveDriver"); > String URL = "jdbc:hive2://localhost:1/default"; >Properties info = new Properties(); > Connection conn = DriverManager.getConnection(URL, info); > ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(), >null, null, null); > Problem: >No tables with returned this API, that work in spark1.3
[jira] [Updated] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] renzhi he updated SPARK-17622: -- Description: Under Spark 2.0.0 on Windows, when trying to load or create data with code like the following, I get an error message and cannot execute the functions. |sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = "2g")) | |df <- as.DataFrame(faithful) | Here is the error message: #Error in invokeJava(isStatic = TRUE, className, methodName, ...) : #java.lang.reflect.InvocationTargetException #at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) #at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) #at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) #at java.lang.reflect.Constructor.newInstance(Constructor.java:423) #at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) #at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) #at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) #at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) #at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) #at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) #at org.apache.spark.sql.hive.HiveSharedSt However, under Spark 1.6.1 or 1.6.2, the equivalent code runs without problems. |sc1 <- sparkR.init(master = "local", sparkEnvir = list(spark.driver.memory="2g"))| |sqlContext <- sparkRSQL.init(sc1)| |df <- as.DataFrame(sqlContext,faithful)| was: Under spark2.0.0- on Windows- when try to load or create data with the similar codes below, I also get error message and cannot execute the functions. 
|sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = "2g")) | |df <- as.DataFrame(faithful) | Here is the error message: #Error in invokeJava(isStatic = TRUE, className, methodName, ...) : #java.lang.reflect.InvocationTargetException #at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) #at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) #at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) #at java.lang.reflect.Constructor.newInstance(Constructor.java:423) #at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) #at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) #at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) #at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) #at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) #at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) #at org.apache.spark.sql.hive.HiveSharedSt However, under spark1.6.1 or spark1.6.2, run the same functional functions, there will be no problem. |sc1 <- sparkR.init(master = "local", sparkEnvir = list(spark.driver.memory="2g"))| |sqlContext <- sparkRSQL.init(sc1)| |df <- as.DataFrame(sqlContext,faithful| > Cannot run create or load DF on Windows- Spark 2.0.0 > > > Key: SPARK-17622 > URL: https://issues.apache.org/jira/browse/SPARK-17622 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.0 > Environment: windows 10 > R 3.3.1 > RStudio 1.0.20 >Reporter: renzhi he > Labels: windows > Fix For: 1.6.1, 1.6.2 > > > Under spark2.0.0- on Windows- when try to load or create data with the > similar codes below, I also get error message and cannot execute the > functions. 
> |sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = > "2g")) | > |df <- as.DataFrame(faithful) | > Here is the error message: > #Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > > #java.lang.reflect.InvocationTargetException > #at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > #at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > #at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > #at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > #at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) > #at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) >
[jira] [Closed] (SPARK-17610) The failed stage caused by FetchFailed may never be resubmitted
[ https://issues.apache.org/jira/browse/SPARK-17610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang closed SPARK-17610. Resolution: Not A Problem > The failed stage caused by FetchFailed may never be resubmitted > --- > > Key: SPARK-17610 > URL: https://issues.apache.org/jira/browse/SPARK-17610 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0 >Reporter: Tao Wang >Priority: Critical > > We have a problem in our environment in which a failed stage is never > resubmitted. Because it is caused by a FetchFailed exception, I took a > look at the corresponding code segment and found some issues: > In DAGScheduler.handleTaskCompletion, it first checks whether `failedStages` is > empty, and does two steps when the answer is true: > 1. send `ResubmitFailedStages` to eventProcessLoop > 2. add the failed stages into `failedStages` > In `eventProcessLoop`, it first takes all elements in `failedStages` to > resubmit them, then clears the set. > If the events happen as below, there can be a problem: > assume t1 < t2 < t3 > at t1, failed stage 1 was handled and ResubmitFailedStages was sent to > eventProcessLoop > at t2, eventProcessLoop handled ResubmitFailedStages and cleared the empty > `failedStages` > at t3, failed stage 1 was added into `failedStages` > Now failed stage 1 has not been resubmitted. > Any time after t3, `failedStages` will never be empty even if new failed > stages caused by FetchFailed come in, because `failedStages`, > containing failed stage 1, is not empty. > The code is below: > {code} > } else if (failedStages.isEmpty) { > // Don't schedule an event to resubmit failed stages if failed > isn't empty, because > // in that case the event will already have been scheduled. 
> // TODO: Cancel running tasks in the stage > logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " + > s"$failedStage (${failedStage.name}) due to fetch failure") > messageScheduler.schedule(new Runnable { > override def run(): Unit = > eventProcessLoop.post(ResubmitFailedStages) > }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS) > } > failedStages += failedStage > failedStages += mapStage > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
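The interleaving the reporter describes can be replayed step by step with a small model. This is a sketch in Python of the check-then-add ordering, not the real DAGScheduler:

```python
from collections import deque

# Minimal model of the suspected race: a resubmit event is only scheduled when
# failed_stages is empty, but the failed stage is added to the set *after*
# the event is scheduled.
failed_stages = set()
event_queue = deque()

# t1: stage 1 hits a fetch failure; failed_stages is empty, so a
# ResubmitFailedStages event is scheduled
if not failed_stages:
    event_queue.append("ResubmitFailedStages")

# t2: the event loop runs *before* the stage is added: it takes the (empty)
# set to resubmit and then clears it
event_queue.popleft()
failed_stages.clear()

# t3: the failed stage is finally added, but no resubmit event is pending
failed_stages.add(1)

# A later fetch failure now sees a non-empty set, so it never schedules a new
# resubmit event: under this model, stage 1 is stuck forever.
later_failure_schedules_event = not failed_stages
```

After the three steps, the set still holds stage 1, the event queue is empty, and the emptiness check that gates scheduling can never fire again, which is exactly the symptom described (the issue was later closed as Not A Problem in the linked PR discussion).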
[jira] [Updated] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] renzhi he updated SPARK-17622: -- Description: (was: sc <- sparkR.session(master="local[*]", sparkConfig = list(spark.driver.memory = "2g")) df <- as.DataFrame(faithful) get error below: Error in invokeJava(isStatic = TRUE, className, methodName, ...) : java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) at org.apache.spark.sql.hive.HiveSharedSt on spark 1.6.1 and spark 1.6.2 can run the corresponding codes. 
sc1 <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g")) sqlContext <- sparkRSQL.init(sc1) df <- as.DataFrame(sqlContext,faithful)) > Cannot run create or load DF on Windows- Spark 2.0.0 > > > Key: SPARK-17622 > URL: https://issues.apache.org/jira/browse/SPARK-17622 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.0 > Environment: windows 10 > R 3.3.1 > RStudio 1.0.20 >Reporter: renzhi he > Labels: windows > Fix For: 1.6.1, 1.6.2 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17610) The failed stage caused by FetchFailed may never be resubmitted
[ https://issues.apache.org/jira/browse/SPARK-17610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509712#comment-15509712 ] Tao Wang commented on SPARK-17610: -- As explained in https://github.com/apache/spark/pull/15176, this is not a bug, so I am closing it. > The failed stage caused by FetchFailed may never be resubmitted > --- > > Key: SPARK-17610 > URL: https://issues.apache.org/jira/browse/SPARK-17610 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0 >Reporter: Tao Wang >Priority: Critical > > We have a problem in our environment in which a failed stage is never > resubmitted. Because it is caused by a FetchFailed exception, I took a > look at the corresponding code segment and found some issues: > In DAGScheduler.handleTaskCompletion, it first checks whether `failedStages` is > empty, and does two steps when the answer is true: > 1. send `ResubmitFailedStages` to eventProcessLoop > 2. add the failed stages into `failedStages` > In `eventProcessLoop`, it first takes all elements in `failedStages` to > resubmit them, then clears the set. > If the events happen as below, there can be a problem: > assume t1 < t2 < t3 > at t1, failed stage 1 was handled and ResubmitFailedStages was sent to > eventProcessLoop > at t2, eventProcessLoop handled ResubmitFailedStages and cleared the empty > `failedStages` > at t3, failed stage 1 was added into `failedStages` > Now failed stage 1 has not been resubmitted. > Any time after t3, `failedStages` will never be empty even if new failed > stages caused by FetchFailed come in, because `failedStages`, > containing failed stage 1, is not empty. > The code is below: > {code} > } else if (failedStages.isEmpty) { > // Don't schedule an event to resubmit failed stages if failed > isn't empty, because > // in that case the event will already have been scheduled. 
> // TODO: Cancel running tasks in the stage > logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " + > s"$failedStage (${failedStage.name}) due to fetch failure") > messageScheduler.schedule(new Runnable { > override def run(): Unit = > eventProcessLoop.post(ResubmitFailedStages) > }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS) > } > failedStages += failedStage > failedStages += mapStage > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17607) --driver-url doesn't point to my master_ip.
[ https://issues.apache.org/jira/browse/SPARK-17607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509773#comment-15509773 ] Sasi commented on SPARK-17607: -- It's different: I was able to start my master with IP 10.5.5.2, and I saw that each worker registered on 10.5.5.2. I was also able to see that each worker has its own private IP, e.g. worker1 - 10.5.5.3, worker3 - 10.5.5.4, etc. My spark-env.sh contained both the master IP and the worker IP, as it should. The problem in this bug is that once I requested data from the workers, the driverUrl was set not to 10.5.5.2 (my master IP) but to 10.0.42.230. Could you guide me on which logs/info you need for this issue? Thanks, Sasi > --driver-url doesn't point to my master_ip. > --- > > Key: SPARK-17607 > URL: https://issues.apache.org/jira/browse/SPARK-17607 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2 >Reporter: Sasi >Priority: Critical > > Hi, > I have a master machine and a slave machine. > My master machine contains 2 interfaces. > The first interface has the IP 10.5.5.2, and the other interface has > the IP 10.0.42.230. > I configured the MASTER_IP to be 10.5.5.2, so once the master and its > worker go up I see the following INFO lines: > {code} > 16/09/20 12:32:32 INFO Worker: Successfully registered with master > spark://10.5.5.2:7077 > 16/09/20 12:39:15 INFO Worker: Asked to launch executor > app-20160920123915-/0 for Spark-DataAccessor-JBoss > {code} > I set the SPARK_LOCAL_IP on each worker to be its own IP, e.g. 10.5.5.5. > Both constants were configured in spark-env.sh. > The problem started when I tried to get data from my workers. > I got the following INFO line in each worker log. > {code} > "--driver-url" > "akka.tcp://sparkDriver@10.0.42.230:43683/user/CoarseGrainedScheduler" " > {code} > As you can see, the master IP is different from the driver-url IP. 
> Master ip is 10.5.5.2 but driver-url is 10.0.42.230, therefore i'm getting > the following errors: > {code} > 16/09/20 12:17:57 INFO Slf4jLogger: Slf4jLogger started > 16/09/20 12:17:57 INFO Remoting: Starting remoting > 16/09/20 12:17:57 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://driverPropsFetcher@10.5.5.5:34961] > 16/09/20 12:17:57 INFO Utils: Successfully started service > 'driverPropsFetcher' on port 34961. > 16/09/20 12:19:00 WARN ReliableDeliverySupervisor: Association with remote > system [akka.tcp://sparkDriver@10.0.42.230:36711] has failed, address is now > gated for [5000] ms. Reason: [Association failed with > [akka.tcp://sparkDriver@10.0.42.230:36711]] Caused by: [Connection timed out: > /10.0.42.230:36711] > Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: > ActorSelection[Anchor(akka.tcp://sparkDriver@10.0.42.230:36711/), > Path(/user/CoarseGrainedScheduler)] > at > {code} > {code} > "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" > "akka.tcp://sparkDriver@10.0.42.230:43683/user/CoarseGrainedScheduler" > {code} > The master is listen and open for communicate via 10.5.5.2 and not > 10.0.42.230. > Looks like the driver-url ignore the real MASTER_IP. > Thanks, > Sasi -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] renzhi he updated SPARK-17622: -- Summary: Cannot run create or load DF on Windows- Spark 2.0.0 (was: Cannot run SparkR function on Windows- Spark 2.0.0) > Cannot run create or load DF on Windows- Spark 2.0.0 > > > Key: SPARK-17622 > URL: https://issues.apache.org/jira/browse/SPARK-17622 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.0 > Environment: windows 10 > R 3.3.1 > RStudio 1.0.20 >Reporter: renzhi he > Labels: windows > Fix For: 1.6.1, 1.6.2 > > > sc <- sparkR.session(master="local[*]", appName="sparkR", sparkConfig = > list(spark.driver.memory = "2g")) > df <- as.DataFrame(faithful) > get error below: > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) > at org.apache.spark.sql.hive.HiveSharedSt > on spark 1.6.1 and spark 1.6.2 can run the corresponding codes. 
> sc1 <- sparkR.init(master = "local[*]", sparkEnvir = > list(spark.driver.memory="2g")) > sqlContext <- sparkRSQL.init(sc1) > df <- as.DataFrame(sqlContext,faithful) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] renzhi he updated SPARK-17622: -- Description: Under spark2.0.0- on Windows- when try to load or create data with the similar codes below, I also get error message and cannot execute the functions. |sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = "2g")) | |df <- as.DataFrame(faithful) | Here is the error message: #Error in invokeJava(isStatic = TRUE, className, methodName, ...) : #java.lang.reflect.InvocationTargetException #at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) #at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) #at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) #at java.lang.reflect.Constructor.newInstance(Constructor.java:423) #at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) #at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) #at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) #at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) #at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) #at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) #at org.apache.spark.sql.hive.HiveSharedSt However, under spark1.6.1 or spark1.6.2, run the same functional functions, there will be no problem. 
|sc1 <- sparkR.init(master = "local", sparkEnvir = list(spark.driver.memory="2g"))| |sqlContext <- sparkRSQL.init(sc1)| |df <- as.DataFrame(sqlContext,faithful| > Cannot run create or load DF on Windows- Spark 2.0.0 > > > Key: SPARK-17622 > URL: https://issues.apache.org/jira/browse/SPARK-17622 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.0 > Environment: windows 10 > R 3.3.1 > RStudio 1.0.20 >Reporter: renzhi he > Labels: windows > Fix For: 1.6.1, 1.6.2 > > > Under spark2.0.0- on Windows- when try to load or create data with the > similar codes below, I also get error message and cannot execute the > functions. > |sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = > "2g")) | > |df <- as.DataFrame(faithful) | > Here is the error message: > #Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > > #java.lang.reflect.InvocationTargetException > #at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > #at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > #at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > #at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > #at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) > #at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) > #at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) > #at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > #at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > #at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) > #at org.apache.spark.sql.hive.HiveSharedSt > However, under spark1.6.1 or spark1.6.2, run the same functional functions, > there will be no 
problem. > |sc1 <- sparkR.init(master = "local", sparkEnvir = > list(spark.driver.memory="2g"))| > |sqlContext <- sparkRSQL.init(sc1)| > |df <- as.DataFrame(sqlContext,faithful| -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17606) New batches are not created when there are 1000 created after restarting streaming from checkpoint.
[ https://issues.apache.org/jira/browse/SPARK-17606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509821#comment-15509821 ] etienne commented on SPARK-17606: - Sorry, I asked Ops for the logs, but they have been lost. I will have to wait for another long break in my Spark streaming job to get them. The best would be to reproduce this in a test environment and take care to keep the logs. > New batches are not created when there are 1000 created after restarting > streaming from checkpoint. > --- > > Key: SPARK-17606 > URL: https://issues.apache.org/jira/browse/SPARK-17606 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.1 >Reporter: etienne > > When spark restarts from a checkpoint after being down for a while, > it recreates the missing batches since the downtime. > When there are few missing batches, spark creates a new incoming batch every > batchTime, but when there is enough missing time to create 1000 batches, no > new batch is created. > So when all these batches are completed, the stream is idle ... > I think there is a rigid limit set somewhere. > I was expecting spark to continue to recreate missed batches, maybe not all > at once (because it looks like that causes driver memory problems), and then > create new batches each batchTime. > Another solution would be to not recreate missing batches but still restart the > direct input. > Right now, the only solution I have to restart a stream after a long break is > to remove the checkpoint to allow the creation of a new stream, but that loses > all my state. > ps: I'm speaking about the direct Kafka input because it's the source I'm > currently using; I don't know what happens with other sources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
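The reporter's hypothesis (a rigid cap on how many batches are recreated on recovery) can be stated as a tiny model. The cap value and where it lives in Spark are assumptions taken from the report, not confirmed from the Spark source:

```python
# Model of the reported behavior: on restart from a checkpoint, the number of
# recreated batches appears to be capped, so a long outage leaves the stream
# idle once the recreated batches complete.
BATCH_INTERVAL_MS = 1000   # example batch interval
HYPOTHETICAL_CAP = 1000    # assumed limit, inferred from the report only

def recreated_batches(downtime_ms):
    """How many missed batches would be recreated under this hypothesis."""
    missed = downtime_ms // BATCH_INTERVAL_MS
    return min(missed, HYPOTHETICAL_CAP)
```

Under this model a 5-second outage recreates all 5 missed batches, while a 30-minute outage misses 1800 batches but recreates only 1000, matching the "idle after 1000 batches" symptom described above.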
[jira] [Created] (SPARK-17623) Failed tasks end reason is always a TaskFailedReason, types should reflect this
Imran Rashid created SPARK-17623: Summary: Failed tasks end reason is always a TaskFailedReason, types should reflect this Key: SPARK-17623 URL: https://issues.apache.org/jira/browse/SPARK-17623 Project: Spark Issue Type: Improvement Components: Scheduler, Spark Core Affects Versions: 2.0.0 Reporter: Imran Rashid Assignee: Imran Rashid Priority: Minor Minor code cleanup. In TaskResultGetter, enqueueFailedTask currently deserializes the result as a TaskEndReason, but the type is actually more specific: it's a TaskFailedReason. This just leads to more blind casting later on -- it would be clearer if the msg were cast to the right type immediately, so method parameter types could be tightened. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
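The cleanup can be illustrated with simplified stand-ins (not Spark's real classes, and modeled in Python): treating the deserialized result as the narrower failed-reason type up front removes the blind cast that the broad type forces on every later caller.

```python
# Simplified stand-ins for the type hierarchy described in the issue.
class TaskEndReason: ...
class Success(TaskEndReason): ...

class TaskFailedReason(TaskEndReason):
    def to_error_string(self): raise NotImplementedError

class ExceptionFailure(TaskFailedReason):
    def __init__(self, msg): self.msg = msg
    def to_error_string(self): return self.msg

def describe_broad(reason: TaskEndReason) -> str:
    # "Blind cast": assumes every end reason is a failure; passing Success
    # would blow up at runtime rather than being rejected up front.
    assert isinstance(reason, TaskFailedReason)
    return reason.to_error_string()

def describe_failed(reason: TaskFailedReason) -> str:
    # Tightened signature: callers must already hold a TaskFailedReason,
    # so no cast or runtime check is needed here.
    return reason.to_error_string()
```

Checking (or, in Scala, casting) once where the result is deserialized lets every downstream signature take the specific type, which is the point of the proposed cleanup.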
[jira] [Assigned] (SPARK-17623) Failed tasks end reason is always a TaskFailedReason, types should reflect this
[ https://issues.apache.org/jira/browse/SPARK-17623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17623: Assignee: Apache Spark (was: Imran Rashid) > Failed tasks end reason is always a TaskFailedReason, types should reflect > this > --- > > Key: SPARK-17623 > URL: https://issues.apache.org/jira/browse/SPARK-17623 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Assignee: Apache Spark >Priority: Minor > > Minor code cleanup. In TaskResultGetter, enqueueFailedTask currently > deserializes the result as a TaskEndReason. But the type is actually more > specific, its a TaskFailedReason. This just leads to more blind casting > later on -- it would be more clear if the msg was cast to the right type > immediately, so method parameter types could be tightened. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17623) Failed tasks end reason is always a TaskFailedReason, types should reflect this
[ https://issues.apache.org/jira/browse/SPARK-17623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510377#comment-15510377 ] Apache Spark commented on SPARK-17623: -- User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/15181 > Failed tasks end reason is always a TaskFailedReason, types should reflect > this > --- > > Key: SPARK-17623 > URL: https://issues.apache.org/jira/browse/SPARK-17623 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > > Minor code cleanup. In TaskResultGetter, enqueueFailedTask currently > deserializes the result as a TaskEndReason. But the type is actually more > specific, its a TaskFailedReason. This just leads to more blind casting > later on -- it would be more clear if the msg was cast to the right type > immediately, so method parameter types could be tightened. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17623) Failed tasks end reason is always a TaskFailedReason, types should reflect this
[ https://issues.apache.org/jira/browse/SPARK-17623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17623: Assignee: Imran Rashid (was: Apache Spark) > Failed tasks end reason is always a TaskFailedReason, types should reflect > this > --- > > Key: SPARK-17623 > URL: https://issues.apache.org/jira/browse/SPARK-17623 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > > Minor code cleanup. In TaskResultGetter, enqueueFailedTask currently > deserializes the result as a TaskEndReason. But the type is actually more > specific, its a TaskFailedReason. This just leads to more blind casting > later on -- it would be more clear if the msg was cast to the right type > immediately, so method parameter types could be tightened. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17044) Add window function test in SQLQueryTestSuite
[ https://issues.apache.org/jira/browse/SPARK-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510464#comment-15510464 ] Dongjoon Hyun commented on SPARK-17044: --- Hi, [~rxin] Could you review this issue? > Add window function test in SQLQueryTestSuite > - > > Key: SPARK-17044 > URL: https://issues.apache.org/jira/browse/SPARK-17044 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Priority: Minor > > This issue adds a SQL query test for Window functions for new > `SQLQueryTestSuite`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
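For context, a window-function test of the kind proposed above exercises queries such as a per-partition running sum. A plain-Python model of what such a query computes (an illustration of the semantics, not Spark code):

```python
from itertools import accumulate, groupby

# Models: SUM(value) OVER (PARTITION BY key ORDER BY ord
#                          ROWS UNBOUNDED PRECEDING)
def running_sum(rows):
    """rows: iterable of (key, ord, value); returns {(key, ord): running sum}."""
    out = {}
    rows = sorted(rows, key=lambda r: (r[0], r[1]))
    for key, part in groupby(rows, key=lambda r: r[0]):
        part = list(part)
        # accumulate yields the cumulative sum within the partition, in order
        for (k, ord_, _), total in zip(part, accumulate(v for _, _, v in part)):
            out[(k, ord_)] = total
    return out

sums = running_sum([("a", 1, 10), ("a", 2, 5), ("b", 1, 7)])
```

Partition "a" accumulates 10 then 15, while partition "b" is independent; a SQL-file test in SQLQueryTestSuite would assert the same expected rows for the corresponding window query.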
[jira] [Created] (SPARK-17624) Flaky test? StateStoreSuite maintenance
Adam Roberts created SPARK-17624: Summary: Flaky test? StateStoreSuite maintenance Key: SPARK-17624 URL: https://issues.apache.org/jira/browse/SPARK-17624 Project: Spark Issue Type: Test Components: Tests Affects Versions: 2.0.1 Reporter: Adam Roberts Priority: Minor I've noticed this test failing consistently (25x in a row) with a two core machine but not on an eight core machine If we increase the spark.rpc.numRetries value used in the test from 1 to 2 (3 being the default in Spark), the test reliably passes, we can also gain reliability by setting the master to be anything other than just local. Is there a reason spark.rpc.numRetries is set to be 1? I see this failure is also mentioned here so it's been flaky for a while http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-td18367.html If we run without the "quietly" code so we get debug info: {code} 16:26:15.213 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error sending message [message = VerifyIfInstanceActive(StateStoreId(/home/aroberts/Spark-DK/sql/core/target/tmp/spark-cc44f5fa-b675-426f-9440-76785c365507/ૺꎖ鮎衲넅-28e9196f-8b2d-43ba-8421-44a5c5e98ceb,0,0),driver)] in 1 attempts org.apache.spark.SparkException: Exception thrown in awaitResult at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) at 
org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:91) at org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227) at org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227) at scala.Option.map(Option.scala:146) at org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$verifyIfStoreInstanceActive(StateStore.scala:227) at org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:199) at org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:197) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance(StateStore.scala:197) at org.apache.spark.sql.execution.streaming.state.StateStore$$anon$1.run(StateStore.scala:180) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.lang.Thread.run(Thread.java:785) Caused by: org.apache.spark.SparkException: Could not find StateStoreCoordinator. 
at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154) at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:129) at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:225) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:508) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) ... 19 more 16:26:15.217 WARN org.apache.spark.sql.execution.streaming.state.StateStore: Error managing StateStore[id = (op=0, part=0), dir =
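The report's observation, that the test passes once spark.rpc.numRetries goes from 1 to 2, suggests the first RPC attempt fails transiently on slow (two-core) machines. A minimal, Spark-independent sketch of why a single extra retry masks a transient failure (all class and method names here are hypothetical, for illustration only):

```python
# Toy model of a transiently failing endpoint: the first ask() fails,
# later attempts succeed -- mirroring how raising spark.rpc.numRetries
# from 1 to 2 can hide a transient "Could not find StateStoreCoordinator".
class TransientlyFailingEndpoint:
    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success

    def ask(self):
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("Could not find StateStoreCoordinator")
        return "verified"

def ask_with_retry(endpoint, num_retries):
    # Attempt the call up to num_retries times, re-raising the last error.
    last_error = None
    for _ in range(num_retries):
        try:
            return endpoint.ask()
        except ConnectionError as e:
            last_error = e
    raise last_error

# One attempt fails; two attempts succeed -- the behaviour described above.
try:
    ask_with_retry(TransientlyFailingEndpoint(1), num_retries=1)
    one_attempt_ok = True
except ConnectionError:
    one_attempt_ok = False

two_attempts = ask_with_retry(TransientlyFailingEndpoint(1), num_retries=2)
```

Of course, if the first attempt's failure is itself a bug (as the reporter suspects), a higher retry count only hides it rather than fixing it.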
[jira] [Updated] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] renzhi he updated SPARK-17622: -- Description: sc <- sparkR.session(master="local[*]", sparkConfig = list(spark.driver.memory = "2g")) df <- as.DataFrame(faithful) get error below: Error in invokeJava(isStatic = TRUE, className, methodName, ...) : java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) at org.apache.spark.sql.hive.HiveSharedSt on spark 1.6.1 and spark 1.6.2 can run the corresponding codes. sc1 <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g")) sqlContext <- sparkRSQL.init(sc1) df <- as.DataFrame(sqlContext,faithful) was: sc <- sparkR.session(master="local[*]", appName="sparkR", sparkConfig = list(spark.driver.memory = "2g")) df <- as.DataFrame(faithful) get error below: Error in invokeJava(isStatic = TRUE, className, methodName, ...) 
: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) at org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) at org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) at org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) at org.apache.spark.sql.hive.HiveSharedSt on spark 1.6.1 and spark 1.6.2 can run the corresponding codes. sc1 <- sparkR.init(master = "local[*]", sparkEnvir = list(spark.driver.memory="2g")) sqlContext <- sparkRSQL.init(sc1) df <- as.DataFrame(sqlContext,faithful) > Cannot run create or load DF on Windows- Spark 2.0.0 > > > Key: SPARK-17622 > URL: https://issues.apache.org/jira/browse/SPARK-17622 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.0 > Environment: windows 10 > R 3.3.1 > RStudio 1.0.20 >Reporter: renzhi he > Labels: windows > Fix For: 1.6.1, 1.6.2 > > > sc <- sparkR.session(master="local[*]", sparkConfig = > list(spark.driver.memory = "2g")) > df <- as.DataFrame(faithful) > get error below: > Error in invokeJava(isStatic = TRUE, className, methodName, ...) 
: > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39) > at > org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38) > at > org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46) > at org.apache.spark.sql.hive.HiveSharedSt > on spark 1.6.1 and spark 1.6.2 can run the corresponding codes. > sc1 <- sparkR.init(master = "local[*]", sparkEnvir = > list(spark.driver.memory="2g")) > sqlContext <-
[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510198#comment-15510198 ] Seth Hendrickson commented on SPARK-17134: -- Hmm, it would be nice to see this vs the old mlor in rdd API, just as a sanity check. I conducted performance testing against mllib initially, though, so there shouldn't be any regressions. > Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Multinomial logistic regression uses LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
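The core idea of the ticket is that in multinomial logistic regression, the per-class margins for one example are K separate dot products, and grouping them into one matrix-vector product is exactly what a level-2 BLAS call ("gemv") performs in a single optimized pass. A pure-Python sketch of the two formulations (shapes and numbers are illustrative, not Spark's LogisticAggregator code):

```python
# Level-1 style: one dot product per class (K level-1 BLAS calls).
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margins_level1(coefficients, x):
    return [dot(row, x) for row in coefficients]

# Level-2 style: the same computation expressed as a single pass over the
# whole coefficient matrix -- the work a single "gemv" call would do.
def margins_level2(coefficients, x):
    out = []
    for row in coefficients:
        acc = 0.0
        for j, xj in enumerate(x):
            acc += row[j] * xj
        out.append(acc)
    return out

# 3 classes, 2 features: coefficient matrix is (num_classes, num_features).
coeffs = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
x = [2.0, 1.0]
m1 = margins_level1(coeffs, x)
m2 = margins_level2(coeffs, x)
```

The payoff in Spark would come from handing the grouped form to an optimized native BLAS rather than looping in JVM code, which matters most when the coefficient matrix is large (many classes times many features).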
[jira] [Assigned] (SPARK-17625) expectedOutputAttributes should be set when converting SimpleCatalogRelation to LogicalRelation
[ https://issues.apache.org/jira/browse/SPARK-17625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17625: Assignee: Apache Spark > expectedOutputAttributes should be set when converting SimpleCatalogRelation > to LogicalRelation > --- > > Key: SPARK-17625 > URL: https://issues.apache.org/jira/browse/SPARK-17625 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark >Priority: Minor > > expectedOutputAttributes should be set when converting SimpleCatalogRelation > to LogicalRelation, otherwise the outputs of LogicalRelation are different > from outputs of SimpleCatalogRelation - they have different exprId's. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17625) expectedOutputAttributes should be set when converting SimpleCatalogRelation to LogicalRelation
[ https://issues.apache.org/jira/browse/SPARK-17625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510658#comment-15510658 ] Apache Spark commented on SPARK-17625: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/15182 > expectedOutputAttributes should be set when converting SimpleCatalogRelation > to LogicalRelation > --- > > Key: SPARK-17625 > URL: https://issues.apache.org/jira/browse/SPARK-17625 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Zhenhua Wang >Priority: Minor > > expectedOutputAttributes should be set when converting SimpleCatalogRelation > to LogicalRelation, otherwise the outputs of LogicalRelation are different > from outputs of SimpleCatalogRelation - they have different exprId's. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support
[ https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Wu updated SPARK-17614: Comment: was deleted (was: Create pull request: https://github.com/apache/spark/pull/15183) > sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra > does not support > - > > Key: SPARK-17614 > URL: https://issues.apache.org/jira/browse/SPARK-17614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Environment: Any Spark Runtime >Reporter: Paul Wu > Labels: cassandra-jdbc, sql > > I have the code like the following with Cassandra JDBC > (https://github.com/adejanovski/cassandra-jdbc-wrapper): > final String dbTable= "sql_demo"; > Dataset jdbcDF > = sparkSession.read() > .jdbc(CASSANDRA_CONNECTION_URL, dbTable, > connectionProperties); > List rows = jdbcDF.collectAsList(); > It threw the error: > Exception in thread "main" java.sql.SQLTransientException: > com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable > alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...) > at > com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48) > The reason is that the Spark jdbc code uses the sql syntax "where 1=0" > somewhere (to get the schema?), but Cassandra does not support this syntax. > Not sure how this issue can be resolved...this is because CQL is not standard > sql. 
> The following log shows more information: > 16/09/20 13:16:35 INFO CassandraConnection 138: Datacenter: %s; Host: %s; > Rack: %s > 16/09/20 13:16:35 TRACE CassandraPreparedStatement 98: CQL: SELECT * FROM > sql_demo WHERE 1=0 > 16/09/20 13:16:35 TRACE RequestHandler 71: [19400322] > com.datastax.driver.core.Statement$1@41ccb3b9 > 16/09/20 13:16:35 TRACE RequestHandler 272: [19400322-1] Starting -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
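The "WHERE 1=0" probe the reporter mentions is a common JDBC idiom: the query returns zero rows but still produces result-set metadata, which is how the driver learns the table's schema without fetching data. A small demonstration using Python's sqlite3 (SQLite, like most SQL databases, accepts the syntax; CQL rejects it, which is exactly the failure reported here):

```python
import sqlite3

# Build a throwaway table, then issue the zero-row schema probe.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sql_demo (id INTEGER, name TEXT)")
conn.execute("INSERT INTO sql_demo VALUES (1, 'a')")

cursor = conn.execute("SELECT * FROM sql_demo WHERE 1=0")
rows = cursor.fetchall()                          # no rows come back...
columns = [d[0] for d in cursor.description]      # ...but the schema does
```

So the probe is cheap and standard on SQL databases; the bug is that Cassandra's CQL is not standard SQL and has no way to parse the `1=0` predicate.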
[jira] [Comment Edited] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support
[ https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510626#comment-15510626 ] Paul Wu edited comment on SPARK-17614 at 9/21/16 5:42 PM: -- No, a custom JdbcDialect won't resolve the problem, since DataFrameReader uses JDBCRDD and the latter has a hard-coded line val statement = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0") for checking the table's existence. See line 61 at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala Line 61 needs to use the dialect's "table existence" query rather than hard-coding the query there. was (Author: zwu@gmail.com): No, a custom JdbcDialect won't resolve the problem, since DataFrameReader uses JDBCRDD and the latter has a hard-coded line val statement = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0") for checking the table's existence. See line 61 at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala > sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra > does not support > - > > Key: SPARK-17614 > URL: https://issues.apache.org/jira/browse/SPARK-17614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Environment: Any Spark Runtime >Reporter: Paul Wu > Labels: cassandra-jdbc, sql > > I have the code like the following with Cassandra JDBC > (https://github.com/adejanovski/cassandra-jdbc-wrapper): > final String dbTable= "sql_demo"; > Dataset jdbcDF > = sparkSession.read() > .jdbc(CASSANDRA_CONNECTION_URL, dbTable, > connectionProperties); > List rows = jdbcDF.collectAsList(); > It threw the error: > Exception in thread "main" java.sql.SQLTransientException: > com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable > alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...) 
> at > com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48) > The reason is that the Spark jdbc code uses the sql syntax "where 1=0" > somewhere (to get the schema?), but Cassandra does not support this syntax. > Not sure how this issue can be resolved...this is because CQL is not standard > sql. > The following log shows more information: > 16/09/20 13:16:35 INFO CassandraConnection 138: Datacenter: %s; Host: %s; > Rack: %s > 16/09/20 13:16:35 TRACE CassandraPreparedStatement 98: CQL: SELECT * FROM > sql_demo WHERE 1=0 > 16/09/20 13:16:35 TRACE RequestHandler 71: [19400322] > com.datastax.driver.core.Statement$1@41ccb3b9 > 16/09/20 13:16:35 TRACE RequestHandler 272: [19400322-1] Starting -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
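The fix suggested above is an indirection: instead of hard-coding the probe query in JDBCRDD, route it through the per-database dialect so non-standard databases can substitute a query they support. A minimal Python model of that pattern (class and method names are illustrative, not Spark's actual JdbcDialect API; the `LIMIT 1` probe mirrors the style some Spark dialects use, and whether it suits Cassandra is an assumption):

```python
# Default dialect: the standard-SQL zero-row probe.
class DefaultDialect:
    def table_exists_query(self, table):
        return f"SELECT * FROM {table} WHERE 1=0"

# Hypothetical Cassandra dialect: CQL rejects "WHERE 1=0", so override
# the probe with a query CQL can parse.
class CassandraDialect(DefaultDialect):
    def table_exists_query(self, table):
        return f"SELECT 1 FROM {table} LIMIT 1"

def probe_query(dialect, table):
    # The JDBCRDD-equivalent asks the dialect instead of hard-coding SQL.
    return dialect.table_exists_query(table)

default_q = probe_query(DefaultDialect(), "sql_demo")
cassandra_q = probe_query(CassandraDialect(), "sql_demo")
```

With this indirection in place, a custom dialect registered for the Cassandra JDBC wrapper could supply its own probe without touching the core read path.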
[jira] [Created] (SPARK-17625) expectedOutputAttributes should be set when converting SimpleCatalogRelation to LogicalRelation
Zhenhua Wang created SPARK-17625: Summary: expectedOutputAttributes should be set when converting SimpleCatalogRelation to LogicalRelation Key: SPARK-17625 URL: https://issues.apache.org/jira/browse/SPARK-17625 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Zhenhua Wang Priority: Minor expectedOutputAttributes should be set when converting SimpleCatalogRelation to LogicalRelation, otherwise the outputs of LogicalRelation are different from outputs of SimpleCatalogRelation - they have different exprId's. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
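The mismatch described above comes from attribute ids being minted fresh on every conversion: unless the expected output attributes are threaded through, the converted relation's outputs get new exprId's that no longer match the original plan's. A tiny Python model of the failure mode (names are illustrative, not Spark's catalyst classes):

```python
import itertools

# Every attribute minted gets a globally fresh id, like catalyst exprIds.
_next_id = itertools.count()

def fresh_attribute(name):
    return (name, next(_next_id))

# Outputs of the original (SimpleCatalogRelation-like) relation.
catalog_output = [fresh_attribute("a"), fresh_attribute("b")]

def convert(expected_output=None):
    # Without expectedOutputAttributes, conversion re-mints the ids even
    # though the attribute names are identical.
    if expected_output is not None:
        return list(expected_output)
    return [fresh_attribute(name) for name, _ in catalog_output]

buggy = convert()                                # same names, new ids
fixed = convert(expected_output=catalog_output)  # ids preserved
```

In the model, `buggy` has the right attribute names but different ids, which is precisely why references resolved against the original relation fail to line up after conversion.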
[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510672#comment-15510672 ] DB Tsai commented on SPARK-17134: - I'll try the old mlor in rdd tonight when the cluster is not busy. Actually, this is a very large training dataset, and around 160GB in memory. Since there are 22533 classes, and 100 features, the total parameters are 2.2M. I expect that level 2 blas will help significantly in this case. > Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Multinomial logistic regression uses LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics
Ioana Delaney created SPARK-17626: - Summary: TPC-DS performance improvements using star-schema heuristics Key: SPARK-17626 URL: https://issues.apache.org/jira/browse/SPARK-17626 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 2.1.0 Reporter: Ioana Delaney Priority: Critical *TPC-DS performance improvements using star-schema heuristics* \\ \\ TPC-DS consists of multiple snowflake schemas, which are star schemas with dimensions linking to further dimensions. A star schema consists of a fact table referencing a number of dimension tables. The fact table holds the main data about a business; a dimension table, usually smaller, describes data reflecting a dimension/attribute of the business. \\ \\ As part of the benchmark performance investigation, we observed a pattern of sub-optimal execution plans for large fact-table joins. Manually rewriting some of the queries into selective fact-dimension joins resulted in significant performance improvement. This prompted us to develop a simple join reordering algorithm based on star schema detection. Performance testing using the *1TB TPC-DS workload* shows an overall improvement of *19%*. \\ \\ *Summary of the results:*
{code}
Passed            99
Failed            0
Total q time (s)  14,962
Max time          1,467
Min time          3
Mean time         145
Geomean           44
{code}
*Compared to baseline* (Negative = improvement; Positive = degradation):
{code}
End to end improved (%)            -19%
Mean time improved (%)             -19%
Geomean improved (%)               -24%
End to end improved (seconds)      -3,603
Number of queries improved (>10%)  45
Number of queries degraded (>10%)  6
Number of queries unchanged        48
Top 10 queries improved (%)        -20%
{code}
Cluster: 20-node cluster with each node having: * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors, 128 GB RAM, 10 Gigabit Ethernet. * Total memory for the cluster: 2.5TB * Total storage: 400TB * Total CPU cores: 480 Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA Database info: * Schema: TPCDS * Scale factor: 1TB total space * Storage format: Parquet with Snappy compression Our investigation and results are included in the attached document. There are two parts to this improvement: # Join reordering using star schema detection # New selectivity hint to specify the selectivity of the predicates over base tables. \\ \\ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
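The heart of the proposal is a heuristic reorder: once a star schema is detected (one fact table referenced by several dimension tables), join the most selective dimensions first so intermediate results shrink early. A toy sketch of that ordering step (names and selectivity numbers are illustrative, not the proposed Spark implementation):

```python
# Reorder a star join so the most selective dimension tables (lowest
# estimated fraction of surviving fact rows) are joined first.
def reorder_star_join(fact, dimensions):
    ordered = sorted(dimensions, key=lambda d: d["selectivity"])
    return [fact] + [d["name"] for d in ordered]

# Hypothetical TPC-DS-style tables with assumed selectivity estimates.
dims = [
    {"name": "date_dim", "selectivity": 0.5},
    {"name": "store",    "selectivity": 0.01},
    {"name": "item",     "selectivity": 0.1},
]
plan = reorder_star_join("store_sales", dims)
```

The proposed selectivity hint would feed exactly this kind of estimate to the optimizer when table statistics alone cannot provide it.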
[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support
[ https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510707#comment-15510707 ] Sean Owen commented on SPARK-17614: --- Yup, that much is clearly a bug. Go for a fix, anyone who wants to - or I'll fix that to try to unblock further experimentation. > sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra > does not support > - > > Key: SPARK-17614 > URL: https://issues.apache.org/jira/browse/SPARK-17614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Environment: Any Spark Runtime >Reporter: Paul Wu > Labels: cassandra-jdbc, sql > > I have the code like the following with Cassandra JDBC > (https://github.com/adejanovski/cassandra-jdbc-wrapper): > final String dbTable= "sql_demo"; > Dataset jdbcDF > = sparkSession.read() > .jdbc(CASSANDRA_CONNECTION_URL, dbTable, > connectionProperties); > List rows = jdbcDF.collectAsList(); > It threw the error: > Exception in thread "main" java.sql.SQLTransientException: > com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable > alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...) > at > com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48) > The reason is that the Spark jdbc code uses the sql syntax "where 1=0" > somewhere (to get the schema?), but Cassandra does not support this syntax. > Not sure how this issue can be resolved...this is because CQL is not standard > sql. 
> The following log shows more information: > 16/09/20 13:16:35 INFO CassandraConnection 138: Datacenter: %s; Host: %s; > Rack: %s > 16/09/20 13:16:35 TRACE CassandraPreparedStatement 98: CQL: SELECT * FROM > sql_demo WHERE 1=0 > 16/09/20 13:16:35 TRACE RequestHandler 71: [19400322] > com.datastax.driver.core.Statement$1@41ccb3b9 > 16/09/20 13:16:35 TRACE RequestHandler 272: [19400322-1] Starting -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support
[ https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510709#comment-15510709 ] Paul Wu commented on SPARK-17614: - Create pull request: https://github.com/apache/spark/pull/15183 > sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra > does not support > - > > Key: SPARK-17614 > URL: https://issues.apache.org/jira/browse/SPARK-17614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Environment: Any Spark Runtime >Reporter: Paul Wu > Labels: cassandra-jdbc, sql > > I have the code like the following with Cassandra JDBC > (https://github.com/adejanovski/cassandra-jdbc-wrapper): > final String dbTable= "sql_demo"; > Dataset jdbcDF > = sparkSession.read() > .jdbc(CASSANDRA_CONNECTION_URL, dbTable, > connectionProperties); > List rows = jdbcDF.collectAsList(); > It threw the error: > Exception in thread "main" java.sql.SQLTransientException: > com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable > alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...) > at > com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48) > The reason is that the Spark jdbc code uses the sql syntax "where 1=0" > somewhere (to get the schema?), but Cassandra does not support this syntax. > Not sure how this issue can be resolved...this is because CQL is not standard > sql. 
> The following log shows more information: > 16/09/20 13:16:35 INFO CassandraConnection 138: Datacenter: %s; Host: %s; > Rack: %s > 16/09/20 13:16:35 TRACE CassandraPreparedStatement 98: CQL: SELECT * FROM > sql_demo WHERE 1=0 > 16/09/20 13:16:35 TRACE RequestHandler 71: [19400322] > com.datastax.driver.core.Statement$1@41ccb3b9 > 16/09/20 13:16:35 TRACE RequestHandler 272: [19400322-1] Starting -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support
[ https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510704#comment-15510704 ] Apache Spark commented on SPARK-17614: -- User 'paulzwu' has created a pull request for this issue: https://github.com/apache/spark/pull/15183 > sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra > does not support > - > > Key: SPARK-17614 > URL: https://issues.apache.org/jira/browse/SPARK-17614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Environment: Any Spark Runtime >Reporter: Paul Wu > Labels: cassandra-jdbc, sql > > I have the code like the following with Cassandra JDBC > (https://github.com/adejanovski/cassandra-jdbc-wrapper): > final String dbTable= "sql_demo"; > Dataset jdbcDF > = sparkSession.read() > .jdbc(CASSANDRA_CONNECTION_URL, dbTable, > connectionProperties); > List rows = jdbcDF.collectAsList(); > It threw the error: > Exception in thread "main" java.sql.SQLTransientException: > com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable > alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...) > at > com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48) > The reason is that the Spark jdbc code uses the sql syntax "where 1=0" > somewhere (to get the schema?), but Cassandra does not support this syntax. > Not sure how this issue can be resolved...this is because CQL is not standard > sql. 
> The following log shows more information: > 16/09/20 13:16:35 INFO CassandraConnection 138: Datacenter: %s; Host: %s; > Rack: %s > 16/09/20 13:16:35 TRACE CassandraPreparedStatement 98: CQL: SELECT * FROM > sql_demo WHERE 1=0 > 16/09/20 13:16:35 TRACE RequestHandler 71: [19400322] > com.datastax.driver.core.Statement$1@41ccb3b9 > 16/09/20 13:16:35 TRACE RequestHandler 272: [19400322-1] Starting -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support
[ https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17614: Assignee: (was: Apache Spark) > sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra > does not support > - > > Key: SPARK-17614 > URL: https://issues.apache.org/jira/browse/SPARK-17614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Environment: Any Spark Runtime >Reporter: Paul Wu > Labels: cassandra-jdbc, sql > > I have the code like the following with Cassandra JDBC > (https://github.com/adejanovski/cassandra-jdbc-wrapper): > final String dbTable= "sql_demo"; > Dataset jdbcDF > = sparkSession.read() > .jdbc(CASSANDRA_CONNECTION_URL, dbTable, > connectionProperties); > List rows = jdbcDF.collectAsList(); > It threw the error: > Exception in thread "main" java.sql.SQLTransientException: > com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable > alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...) > at > com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48) > The reason is that the Spark jdbc code uses the sql syntax "where 1=0" > somewhere (to get the schema?), but Cassandra does not support this syntax. > Not sure how this issue can be resolved...this is because CQL is not standard > sql. 
> The following log shows more information: > 16/09/20 13:16:35 INFO CassandraConnection 138: Datacenter: %s; Host: %s; > Rack: %s > 16/09/20 13:16:35 TRACE CassandraPreparedStatement 98: CQL: SELECT * FROM > sql_demo WHERE 1=0 > 16/09/20 13:16:35 TRACE RequestHandler 71: [19400322] > com.datastax.driver.core.Statement$1@41ccb3b9 > 16/09/20 13:16:35 TRACE RequestHandler 272: [19400322-1] Starting -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support
[ https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17614: Assignee: Apache Spark > sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra > does not support > - > > Key: SPARK-17614 > URL: https://issues.apache.org/jira/browse/SPARK-17614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Environment: Any Spark Runtime >Reporter: Paul Wu >Assignee: Apache Spark > Labels: cassandra-jdbc, sql > > I have the code like the following with Cassandra JDBC > (https://github.com/adejanovski/cassandra-jdbc-wrapper): > final String dbTable= "sql_demo"; > Dataset jdbcDF > = sparkSession.read() > .jdbc(CASSANDRA_CONNECTION_URL, dbTable, > connectionProperties); > List rows = jdbcDF.collectAsList(); > It threw the error: > Exception in thread "main" java.sql.SQLTransientException: > com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable > alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...) > at > com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48) > The reason is that the Spark jdbc code uses the sql syntax "where 1=0" > somewhere (to get the schema?), but Cassandra does not support this syntax. > Not sure how this issue can be resolved...this is because CQL is not standard > sql. 
> The following log shows more information: > 16/09/20 13:16:35 INFO CassandraConnection 138: Datacenter: %s; Host: %s; > Rack: %s > 16/09/20 13:16:35 TRACE CassandraPreparedStatement 98: CQL: SELECT * FROM > sql_demo WHERE 1=0 > 16/09/20 13:16:35 TRACE RequestHandler 71: [19400322] > com.datastax.driver.core.Statement$1@41ccb3b9 > 16/09/20 13:16:35 TRACE RequestHandler 272: [19400322-1] Starting -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics
[ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ioana Delaney updated SPARK-17626: -- Description: *TPC-DS performance improvements using star-schema heuristics* \\ \\ TPC-DS consists of multiple snowflake schemas, which are star schemas with dimensions linking to further dimensions. A star schema consists of a fact table referencing a number of dimension tables. The fact table holds the main data about a business; a dimension table, usually smaller, describes data reflecting a dimension/attribute of the business. \\ \\ As part of the benchmark performance investigation, we observed a pattern of sub-optimal execution plans for large fact-table joins. Manually rewriting some of the queries into selective fact-dimension joins resulted in significant performance improvement. This prompted us to develop a simple join reordering algorithm based on star schema detection. Performance testing using the *1TB TPC-DS workload* shows an overall improvement of *19%*. \\ \\ *Summary of the results:*
{code}
Passed            99
Failed            0
Total q time (s)  14,962
Max time          1,467
Min time          3
Mean time         145
Geomean           44
{code}
*Compared to baseline* (Negative = improvement; Positive = degradation):
{code}
End to end improved (%)            -19%
Mean time improved (%)             -19%
Geomean improved (%)               -24%
End to end improved (seconds)      -3,603
Number of queries improved (>10%)  45
Number of queries degraded (>10%)  6
Number of queries unchanged        48
Top 10 queries improved (%)        -20%
{code}
Cluster: 20-node cluster with each node having: * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors, 128 GB RAM, 10 Gigabit Ethernet. * Total memory for the cluster: 2.5TB * Total storage: 400TB * Total CPU cores: 480 Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA Database info: * Schema: TPCDS * Scale factor: 1TB total space * Storage format: Parquet with Snappy compression Our investigation and results are included in the attached document. There are two parts to this improvement: # Join reordering using star schema detection # New selectivity hint to specify the selectivity of the predicates over base tables. Selectivity hint is optional and it was not used in the above TPC-DS tests. \\ was: *TPC-DS performance improvements using star-schema heuristics* \\ \\ TPC-DS consists of multiple snowflake schemas, which are star schemas with dimensions linking to further dimensions. A star schema consists of a fact table referencing a number of dimension tables. The fact table holds the main data about a business; a dimension table, usually smaller, describes data reflecting a dimension/attribute of the business. \\ \\ As part of the benchmark performance investigation, we observed a pattern of sub-optimal execution plans for large fact-table joins. Manually rewriting some of the queries into selective fact-dimension joins resulted in significant performance improvement. This prompted us to develop a simple join reordering algorithm based on star schema detection. Performance testing using the *1TB TPC-DS workload* shows an overall improvement of *19%*. \\ \\ *Summary of the results:*
{code}
Passed            99
Failed            0
Total q time (s)  14,962
Max time          1,467
Min time          3
Mean time         145
Geomean           44
{code}
*Compared to baseline* (Negative = improvement; Positive = degradation):
{code}
End to end improved (%)            -19%
Mean time improved (%)             -19%
Geomean improved (%)               -24%
End to end improved (seconds)      -3,603
Number of queries improved (>10%)  45
Number of queries degraded (>10%)  6
Number of queries unchanged        48
Top 10 queries improved (%)        -20%
{code}
Cluster: 20-node cluster with each node having: * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors, 128 GB RAM, 10 Gigabit Ethernet. * Total memory for the cluster: 2.5TB * Total storage: 400TB * Total CPU cores: 480 Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA Database info: * Schema: TPCDS * Scale factor: 1TB total space * Storage format: Parquet with Snappy compression Our investigation and results are included in the attached document. There are two parts to this improvement: # Join reordering using star schema detection # New selectivity hint to specify the selectivity of the predicates over base tables. \\ \\ > TPC-DS performance improvements using star-schema heuristics > > > Key: SPARK-17626 > URL:
[jira] [Commented] (SPARK-11702) Guava ClassLoading Issue When Using Different Hive Metastore Version
[ https://issues.apache.org/jira/browse/SPARK-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510748#comment-15510748 ] Joey Paskhay commented on SPARK-11702: -- Apologies for the super late response, Sabs. In case you or anyone else is still having issues with the work-around, the guava JAR needs to be in both the spark.driver.extraClassPath and spark.executor.extraClassPath properties. So our spark-defaults.conf ended up containing something like the following: ... spark.driver.extraClassPath=/usr/lib/hive/lib/guava-15.0.jar: spark.executor.extraClassPath=/usr/lib/hive/lib/guava-15.0.jar: ... Hope that helps, Joey > Guava ClassLoading Issue When Using Different Hive Metastore Version > > > Key: SPARK-11702 > URL: https://issues.apache.org/jira/browse/SPARK-11702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Joey Paskhay > > A Guava classloading error can occur when using a different version of the > Hive metastore. > Running the latest version of Spark at this time (1.5.1) and patched versions > of Hadoop 2.2.0 and Hive 1.0.0. We set "spark.sql.hive.metastore.version" to > "1.0.0" and "spark.sql.hive.metastore.jars" to > "/lib/*:". When trying to > launch the spark-shell, the sqlContext would fail to initialize with: > {code} > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > com/google/common/base/Predicate when creating Hive client using classpath: > > Please make sure that jars for your version of hive and hadoop are included > in the paths passed to SQLConfEntry(key = spark.sql.hive.metastore.jars, > defaultValue=builtin, doc=... > {code} > We verified the Guava libraries are in the huge list of the included jars, > but we saw that in the > org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method it > seems to assume that *all* "com.google" (excluding "com.google.cloud") > classes should be loaded from the base class loader. 
The Spark libraries seem > to have *some* "com.google.common.base" classes shaded in but not all. > See > [https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAB51Vx4ipV34e=eishlg7bzldm0uefd_mpyqfe4dodbnbv9...@mail.gmail.com%3E] > and its replies. > The work-around is to add the guava JAR to the "spark.driver.extraClassPath" > property.
[jira] [Commented] (SPARK-14849) shuffle broken when accessing standalone cluster through NAT
[ https://issues.apache.org/jira/browse/SPARK-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510745#comment-15510745 ] Shixiong Zhu commented on SPARK-14849: -- [~skyluc] do you still see the error in Spark 2.0.0? > shuffle broken when accessing standalone cluster through NAT > > > Key: SPARK-14849 > URL: https://issues.apache.org/jira/browse/SPARK-14849 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Luc Bourlier > Labels: nat, network > > I have the following network configuration: > {code} > ++ > || > | spark-shell | > || > +- ip: 10.110.101.2 -+ >| >| > +- ip: 10.110.101.1 -+ > || NAT + routing > | spark-master | configured > || > +- ip: 10.110.100.1 -+ >| > ++ > || > +- ip: 10.110.101.2 -++- ip: 10.110.101.3 -+ > |||| > | spark-worker 1|| spark-worker 2| > |||| > ++++ > {code} > I have NAT, DNS and routing correctly configure such as each machine can > communicate with each other. > Launch spark-shell against the cluster works well. Simple map operations work > too: > {code} > scala> sc.makeRDD(1 to 5).map(_ * 5).collect > res0: Array[Int] = Array(5, 10, 15, 20, 25) > {code} > But operations requiring shuffling fail: > {code} > scala> sc.makeRDD(1 to 5).map(i => (i,1)).reduceByKey(_ + _).collect > 16/04/22 15:33:17 WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 19, > 10.110.101.1): FetchFailed(BlockManagerId(0, 10.110.101.1, 42842), > shuffleId=0, mapId=6, reduceId=4, message= > org.apache.spark.shuffle.FetchFailedException: Failed to connect to > /10.110.101.1:42842 > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) > [ ... 
] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.io.IOException: Failed to connect to /10.110.101.1:42842 > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) > [ ... ] > at org.apache.spark.network.shuffle.RetryingBlockFetcher.access > [ ... ] > {code} > It makes sense that a connection to 10.110.101.1:42842 would fail, no part of > the system should have a direct knowledge of the IP address 10.110.101.1. > So a part of the system is wrongly discovering this IP address.
[jira] [Commented] (SPARK-11918) Better error from WLS for cases like singular input
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510792#comment-15510792 ] DB Tsai commented on SPARK-11918: - +1 on QR decomposition. We may add a feature that uses LBFGS/OWLQN to optimize the objective function once AtA is computed. Thus, we can do one-pass LiR with elastic net. This approach will not suffer from ill-conditioning issues. +[~sethah], who is interested in one-pass LiR with elastic net using OWLQN. > Better error from WLS for cases like singular input > --- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Sean Owen >Priority: Minor > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization method for solve > Linear Regression (when #feature < 4096). But if the dataset is very ill > condition (such as 0-1 based label used for classification and the equation > is underdetermined), the WLS failed (But "l-bfgs" can train and get the > model). The failure is caused by the underneath lapack library return error > value when Cholesky decomposition. > This issue is easy to reproduce, you can train a LinearRegressionModel by > "normal" solver with the example > dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
> at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code}
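The failure mode above can be reproduced without Spark or LAPACK. A plain Cholesky factorization breaks down on a rank-deficient (singular) Gram matrix because a pivot reaches zero, which is roughly the condition dpotrf/dpotrs signal with a nonzero return code. The sketch below is illustrative Java, not Spark's actual code path; the class and method names are invented for the example.

```java
public class CholeskySingularDemo {
    // Attempt a Cholesky factorization A = L * L^T. Returns null when A is not
    // positive definite (a pivot becomes non-positive), mimicking LAPACK's
    // nonzero return code instead of throwing an assertion error.
    static double[][] cholesky(double[][] a) {
        int n = a.length;
        double[][] l = new double[n][n];
        for (int j = 0; j < n; j++) {
            double d = a[j][j];
            for (int k = 0; k < j; k++) d -= l[j][k] * l[j][k];
            if (d <= 1e-12) return null; // singular or indefinite: factorization fails
            l[j][j] = Math.sqrt(d);
            for (int i = j + 1; i < n; i++) {
                double s = a[i][j];
                for (int k = 0; k < j; k++) s -= l[i][k] * l[j][k];
                l[i][j] = s / l[j][j];
            }
        }
        return l;
    }

    public static void main(String[] args) {
        // Well-conditioned Gram matrix: factorization succeeds.
        double[][] good = {{4, 2}, {2, 3}};
        // Rank-deficient Gram matrix (second row = 2 * first): factorization fails,
        // like an underdetermined 0-1 label regression solved by the "normal" path.
        double[][] singular = {{1, 2}, {2, 4}};
        System.out.println(cholesky(good) != null);     // true
        System.out.println(cholesky(singular) != null); // false
    }
}
```

This is why an iterative solver such as L-BFGS can still train on the same data: it never factors AtA, so a singular Gram matrix degrades convergence rather than aborting.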
[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support
[ https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510525#comment-15510525 ] Paul Wu commented on SPARK-17614: - Thanks. I tried to register my custom dialect as follows, but it never reaches the getTableExistsQuery() method. Could anyone help?
{code}
import org.apache.spark.sql.jdbc.JdbcDialect;

public class NRSCassandraDialect extends JdbcDialect {
    @Override
    public boolean canHandle(String url) {
        System.out.println("came here.." + url.startsWith("jdbc:cassandra"));
        return url.startsWith("jdbc:cassandra");
    }

    @Override
    public String getTableExistsQuery(String table) {
        System.out.println("query?");
        return "SELECT * from " + table + " LIMIT 1";
    }
}
{code}
{code}
import java.io.Serializable;
import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.jdbc.JdbcDialects;

public class CassJDBC implements Serializable {
    private static final org.apache.log4j.Logger LOGGER = org.apache.log4j.Logger.getLogger(CassJDBC.class);
    private static final String _CONNECTION_URL = "jdbc:cassandra://ulpd326..com/test?loadbalancing=DCAwareRoundRobinPolicy(%22datacenter1%22)";
    private static final String _USERNAME = "";
    private static final String _PWD = "";
    private static final SparkSession sparkSession = SparkSession.builder()
            .config("spark.sql.warehouse.dir", "file:///home/zw251y/tmp")
            .master("local[*]").appName("Spark2JdbcDs").getOrCreate();

    public static void main(String[] args) {
        JdbcDialects.registerDialect(new NRSCassandraDialect());
        final Properties connectionProperties = new Properties();
        final String dbTable = "sql_demo";
        Dataset<Row> jdbcDF = sparkSession.read()
                .jdbc(_CONNECTION_URL, dbTable, connectionProperties);
        jdbcDF.show();
    }
}
{code}
Error message:
{code}
came here..true
parameters = "datacenter1"
Exception in thread "main" java.sql.SQLTransientException: com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
    at com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.<init>(CassandraPreparedStatement.java:108)
    at com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
    at com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
    at com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
{code}
> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra > does not support > - > > Key: SPARK-17614 > URL: https://issues.apache.org/jira/browse/SPARK-17614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Environment: Any Spark Runtime >Reporter: Paul Wu >Priority: Minor > Labels: cassandra-jdbc, sql > > I have the code like the following with Cassandra JDBC > (https://github.com/adejanovski/cassandra-jdbc-wrapper): > final String dbTable= "sql_demo"; > Dataset<Row> jdbcDF > = sparkSession.read() > .jdbc(CASSANDRA_CONNECTION_URL, dbTable, > connectionProperties); > List rows = jdbcDF.collectAsList(); > It threw the error: > Exception in thread "main" java.sql.SQLTransientException: > com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable > alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...) > at > com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.<init>(CassandraPreparedStatement.java:108) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48) > The reason is that the Spark jdbc code uses the sql syntax "where 1=0" > somewhere (to get the schema?), but Cassandra does not support this syntax. > Not sure how this issue can be resolved...this is because CQL is not standard > sql.
> The following log shows more information: > 16/09/20 13:16:35 INFO CassandraConnection 138: Datacenter: %s; Host: %s; > Rack: %s > 16/09/20 13:16:35 TRACE CassandraPreparedStatement 98: CQL: SELECT * FROM > sql_demo WHERE 1=0 > 16/09/20 13:16:35 TRACE RequestHandler 71: [19400322] > com.datastax.driver.core.Statement$1@41ccb3b9 > 16/09/20 13:16:35 TRACE RequestHandler 272: [19400322-1] Starting
[jira] [Updated] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support
[ https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Wu updated SPARK-17614: Priority: Major (was: Minor) > sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra > does not support > - > > Key: SPARK-17614 > URL: https://issues.apache.org/jira/browse/SPARK-17614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Environment: Any Spark Runtime >Reporter: Paul Wu > Labels: cassandra-jdbc, sql > > I have the code like the following with Cassandra JDBC > (https://github.com/adejanovski/cassandra-jdbc-wrapper): > final String dbTable= "sql_demo"; > Dataset jdbcDF > = sparkSession.read() > .jdbc(CASSANDRA_CONNECTION_URL, dbTable, > connectionProperties); > List rows = jdbcDF.collectAsList(); > It threw the error: > Exception in thread "main" java.sql.SQLTransientException: > com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable > alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...) > at > com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48) > The reason is that the Spark jdbc code uses the sql syntax "where 1=0" > somewhere (to get the schema?), but Cassandra does not support this syntax. > Not sure how this issue can be resolved...this is because CQL is not standard > sql. 
> The following log shows more information: > 16/09/20 13:16:35 INFO CassandraConnection 138: Datacenter: %s; Host: %s; > Rack: %s > 16/09/20 13:16:35 TRACE CassandraPreparedStatement 98: CQL: SELECT * FROM > sql_demo WHERE 1=0 > 16/09/20 13:16:35 TRACE RequestHandler 71: [19400322] > com.datastax.driver.core.Statement$1@41ccb3b9 > 16/09/20 13:16:35 TRACE RequestHandler 272: [19400322-1] Starting
[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support
[ https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510626#comment-15510626 ] Paul Wu commented on SPARK-17614: - No, a custom JdbcDialect won't resolve the problem, since DataFrameReader uses JDBCRDD and the latter has a hard-coded line {code}val statement = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0"){code} to check for table existence. See line 61 at https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala > sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra > does not support > - > > Key: SPARK-17614 > URL: https://issues.apache.org/jira/browse/SPARK-17614 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 > Environment: Any Spark Runtime >Reporter: Paul Wu >Priority: Minor > Labels: cassandra-jdbc, sql > > I have the code like the following with Cassandra JDBC > (https://github.com/adejanovski/cassandra-jdbc-wrapper): > final String dbTable= "sql_demo"; > Dataset jdbcDF > = sparkSession.read() > .jdbc(CASSANDRA_CONNECTION_URL, dbTable, > connectionProperties); > List rows = jdbcDF.collectAsList(); > It threw the error: > Exception in thread "main" java.sql.SQLTransientException: > com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable > alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
> at > com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.<init>(CassandraPreparedStatement.java:108) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348) > at > com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48) > The reason is that the Spark jdbc code uses the sql syntax "where 1=0" > somewhere (to get the schema?), but Cassandra does not support this syntax. > Not sure how this issue can be resolved...this is because CQL is not standard > sql. > The following log shows more information: > 16/09/20 13:16:35 INFO CassandraConnection 138: Datacenter: %s; Host: %s; > Rack: %s > 16/09/20 13:16:35 TRACE CassandraPreparedStatement 98: CQL: SELECT * FROM > sql_demo WHERE 1=0 > 16/09/20 13:16:35 TRACE RequestHandler 71: [19400322] > com.datastax.driver.core.Statement$1@41ccb3b9 > 16/09/20 13:16:35 TRACE RequestHandler 272: [19400322-1] Starting
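To illustrate the point of the comment above: the fix would be for the table-existence probe to be dialect-driven rather than hard-coded in JDBCRDD. The sketch below is a hypothetical illustration, not Spark API; the helper names are invented. It shows the idea of selecting the probe query from the JDBC URL, the way a registered dialect is selected, so a CQL-backed JDBC wrapper can substitute a query Cassandra accepts.

```java
import java.util.function.Function;

public class TableExistsProbe {
    // Default ANSI-style probe: cheap because "WHERE 1=0" returns no rows.
    static String ansiProbe(String table) {
        return "SELECT * FROM " + table + " WHERE 1=0";
    }

    // CQL-friendly probe: Cassandra rejects "1=0", but "LIMIT 1" is valid CQL.
    static String cqlProbe(String table) {
        return "SELECT * FROM " + table + " LIMIT 1";
    }

    // Pick a probe builder from the JDBC URL, analogous to how a dialect's
    // canHandle(url) decides which dialect applies.
    static Function<String, String> probeFor(String url) {
        return url.startsWith("jdbc:cassandra") ? TableExistsProbe::cqlProbe
                                                : TableExistsProbe::ansiProbe;
    }

    public static void main(String[] args) {
        System.out.println(probeFor("jdbc:cassandra://host/test").apply("sql_demo"));
        // SELECT * FROM sql_demo LIMIT 1
        System.out.println(probeFor("jdbc:mysql://host/db").apply("sql_demo"));
        // SELECT * FROM sql_demo WHERE 1=0
    }
}
```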
[jira] [Assigned] (SPARK-17625) expectedOutputAttributes should be set when converting SimpleCatalogRelation to LogicalRelation
[ https://issues.apache.org/jira/browse/SPARK-17625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17625: Assignee: (was: Apache Spark) > expectedOutputAttributes should be set when converting SimpleCatalogRelation > to LogicalRelation > --- > > Key: SPARK-17625 > URL: https://issues.apache.org/jira/browse/SPARK-17625 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Zhenhua Wang >Priority: Minor > > expectedOutputAttributes should be set when converting SimpleCatalogRelation > to LogicalRelation, otherwise the outputs of LogicalRelation are different > from outputs of SimpleCatalogRelation - they have different exprId's.
[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics
[ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ioana Delaney updated SPARK-17626: -- Attachment: StarSchemaJoinReordering.pptx > TPC-DS performance improvements using star-schema heuristics > > > Key: SPARK-17626 > URL: https://issues.apache.org/jira/browse/SPARK-17626 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 2.1.0 >Reporter: Ioana Delaney >Priority: Critical > Attachments: StarSchemaJoinReordering.pptx > > > *TPC-DS performance improvements using star-schema heuristics* > \\ > \\ > TPC-DS consists of multiple snowflake schema, which are multiple star schema > with dimensions linking to dimensions. A star schema consists of a fact table > referencing a number of dimension tables. Fact table holds the main data > about a business. Dimension table, a usually smaller table, describes data > reflecting the dimension/attribute of a business. > \\ > \\ > As part of the benchmark performance investigation, we observed a pattern of > sub-optimal execution plans of large fact tables joins. Manual rewrite of > some of the queries into selective fact-dimensions joins resulted in > significant performance improvement. This prompted us to develop a simple > join reordering algorithm based on star schema detection. The performance > testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\ > \\ > *Summary of the results:* > {code} > Passed 99 > Failed 0 > Total q time (s) 14,962 > Max time 1,467 > Min time 3 > Mean time 145 > Geomean 44 > {code} > *Compared to baseline* (Negative = improvement; Positive = Degradation): > {code} > End to end improved (%) -19% > Mean time improved (%) -19% > Geomean improved (%) -24% > End to end improved (seconds) -3,603 > Number of queries improved (>10%) 45 > Number of queries degraded (>10%) 6 > Number of queries unchanged 48 > Top 10 queries improved (%) -20% > {code} > Cluster: 20-node cluster with each node having: > * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 > v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet. > * Total memory for the cluster: 2.5TB > * Total storage: 400TB > * Total CPU cores: 480 > Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA > Database info: > * Schema: TPCDS > * Scale factor: 1TB total space > * Storage format: Parquet with Snappy compression > Our investigation and results are included in the attached document. > There are two parts to this improvement: > # Join reordering using star schema detection > # New selectivity hint to specify the selectivity of the predicates over base > tables. > \\ > \\
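As a rough illustration of the heuristic this issue describes (not the actual Spark optimizer code), once a fact table and its dimension tables are detected, a simple reordering joins the most selective dimensions first so intermediate results stay small. The sketch below is an invented example: the table names mirror TPC-DS, and the selectivity estimates (the fraction of fact rows each dimension's predicates retain) are assumed inputs, such as the proposed selectivity hint would supply.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class StarJoinReorder {
    // Given a fact table and an estimated selectivity per dimension table,
    // return a join order: fact table first, then dimensions from most
    // selective (smallest retained fraction) to least selective.
    static List<String> reorder(String fact, Map<String, Double> dimSelectivity) {
        List<String> dims = new ArrayList<>(dimSelectivity.keySet());
        dims.sort(Comparator.comparingDouble(dimSelectivity::get)); // most selective first
        List<String> order = new ArrayList<>();
        order.add(fact);
        order.addAll(dims);
        return order;
    }

    public static void main(String[] args) {
        // date_dim's predicate keeps 5% of rows, customer's 20%, item's 50%.
        List<String> plan = reorder("store_sales", Map.of(
                "date_dim", 0.05, "item", 0.50, "customer", 0.20));
        System.out.println(plan); // [store_sales, date_dim, customer, item]
    }
}
```

A real implementation would also have to detect the star (e.g., pick the largest table joined on its foreign keys as the fact table) and guard against reorderings that break join conditions; this sketch only shows the ordering step.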
[jira] [Updated] (SPARK-11702) Guava ClassLoading Issue When Using Different Hive Metastore Version
[ https://issues.apache.org/jira/browse/SPARK-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joey Paskhay updated SPARK-11702: - Description: A Guava classloading error can occur when using a different version of the Hive metastore. Running the latest version of Spark at this time (1.5.1) and patched versions of Hadoop 2.2.0 and Hive 1.0.0. We set "spark.sql.hive.metastore.version" to "1.0.0" and "spark.sql.hive.metastore.jars" to "/lib/*:". When trying to launch the spark-shell, the sqlContext would fail to initialize with: {code} java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: com/google/common/base/Predicate when creating Hive client using classpath: Please make sure that jars for your version of hive and hadoop are included in the paths passed to SQLConfEntry(key = spark.sql.hive.metastore.jars, defaultValue=builtin, doc=... {code} We verified the Guava libraries are in the huge list of the included jars, but we saw that in the org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method it seems to assume that *all* "com.google" (excluding "com.google.cloud") classes should be loaded from the base class loader. The Spark libraries seem to have *some* "com.google.common.base" classes shaded in but not all. See [https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAB51Vx4ipV34e=eishlg7bzldm0uefd_mpyqfe4dodbnbv9...@mail.gmail.com%3E] and its replies. The work-around is to add the guava JAR to the "spark.driver.extraClassPath" and "spark.executor.extraClassPath" properties. was: A Guava classloading error can occur when using a different version of the Hive metastore. Running the latest version of Spark at this time (1.5.1) and patched versions of Hadoop 2.2.0 and Hive 1.0.0. We set "spark.sql.hive.metastore.version" to "1.0.0" and "spark.sql.hive.metastore.jars" to "/lib/*:". 
When trying to launch the spark-shell, the sqlContext would fail to initialize with: {code} java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: com/google/common/base/Predicate when creating Hive client using classpath: Please make sure that jars for your version of hive and hadoop are included in the paths passed to SQLConfEntry(key = spark.sql.hive.metastore.jars, defaultValue=builtin, doc=... {code} We verified the Guava libraries are in the huge list of the included jars, but we saw that in the org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method it seems to assume that *all* "com.google" (excluding "com.google.cloud") classes should be loaded from the base class loader. The Spark libraries seem to have *some* "com.google.common.base" classes shaded in but not all. See [https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAB51Vx4ipV34e=eishlg7bzldm0uefd_mpyqfe4dodbnbv9...@mail.gmail.com%3E] and its replies. The work-around is to add the guava JAR to the "spark.driver.extraClassPath" property. > Guava ClassLoading Issue When Using Different Hive Metastore Version > > > Key: SPARK-11702 > URL: https://issues.apache.org/jira/browse/SPARK-11702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Joey Paskhay > > A Guava classloading error can occur when using a different version of the > Hive metastore. > Running the latest version of Spark at this time (1.5.1) and patched versions > of Hadoop 2.2.0 and Hive 1.0.0. We set "spark.sql.hive.metastore.version" to > "1.0.0" and "spark.sql.hive.metastore.jars" to > "/lib/*:". 
When trying to > launch the spark-shell, the sqlContext would fail to initialize with: > {code} > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > com/google/common/base/Predicate when creating Hive client using classpath: > > Please make sure that jars for your version of hive and hadoop are included > in the paths passed to SQLConfEntry(key = spark.sql.hive.metastore.jars, > defaultValue=builtin, doc=... > {code} > We verified the Guava libraries are in the huge list of the included jars, > but we saw that in the > org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method it > seems to assume that *all* "com.google" (excluding "com.google.cloud") > classes should be loaded from the base class loader. The Spark libraries seem > to have *some* "com.google.common.base" classes shaded in but not all. > See > [https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAB51Vx4ipV34e=eishlg7bzldm0uefd_mpyqfe4dodbnbv9...@mail.gmail.com%3E] > and its replies. > The work-around is to add the guava JAR to the "spark.driver.extraClassPath" > and "spark.executor.extraClassPath" properties.
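Putting the work-around into concrete form, a spark-defaults.conf fragment along the lines quoted earlier in this thread would look like the following. The guava path is the one from the comment above; adjust it to wherever the JAR lives in your installation, and note both the driver and executor properties are needed.

```
# spark-defaults.conf -- the Guava JAR must be on BOTH class paths
spark.driver.extraClassPath=/usr/lib/hive/lib/guava-15.0.jar
spark.executor.extraClassPath=/usr/lib/hive/lib/guava-15.0.jar
```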
[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510787#comment-15510787 ] Michael Armbrust commented on SPARK-16407: -- I'm still a little unclear on the use cases we are trying to enable, so the dev list sounds like a good place to me. > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for user equivalent of ForeachSink or other sink with > non-string parameters.
[jira] [Resolved] (SPARK-17418) Spark release must NOT distribute Kinesis related assembly artifact
[ https://issues.apache.org/jira/browse/SPARK-17418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17418. Resolution: Fixed Assignee: Josh Rosen Fix Version/s: 2.1.0 2.0.1 1.6.3 Fixed by my PR for master, branch-2.0, and branch-1.6. > Spark release must NOT distribute Kinesis related assembly artifact > --- > > Key: SPARK-17418 > URL: https://issues.apache.org/jira/browse/SPARK-17418 > Project: Spark > Issue Type: Bug > Components: Build, Streaming >Affects Versions: 1.6.2, 2.0.0 >Reporter: Luciano Resende >Assignee: Josh Rosen >Priority: Blocker > Fix For: 1.6.3, 2.0.1, 2.1.0 > > > The Kinesis streaming connector is based on the Amazon Software License, and > based on the Apache Legal resolved issues > (http://www.apache.org/legal/resolved.html#category-x) it's not allowed to be > distributed by Apache projects. > More details is available in LEGAL-198
[jira] [Resolved] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "
[ https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17616. Resolution: Duplicate > Getting "java.lang.RuntimeException: Distinct columns cannot exist in > Aggregate " > - > > Key: SPARK-17616 > URL: https://issues.apache.org/jira/browse/SPARK-17616 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Minor > > I execute: > {code} > select platform, > collect_set(user_auth) as paid_types, > count(distinct sessionid) as sessions > from non_hss.session > where > event = 'stop' and platform != 'testplatform' and > not (month = MONTH(current_date()) AND year = YEAR(current_date()) > and day = day(current_date())) and > ( > (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = > YEAR(add_months(CURRENT_DATE(), -5))) > OR > (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > > YEAR(add_months(CURRENT_DATE(), -5))) > ) > group by platform > {code} > I get: > {code} > java.lang.RuntimeException: Distinct columns cannot exist in Aggregate > operator containing aggregate functions which don't support partial > aggregation. > {code} > IT WORKED IN 1.6.2. I've read error 5 times, and read code once. I still > don't understand what I do incorrectly.
[jira] [Resolved] (SPARK-11918) Better error from WLS for cases like singular input
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-11918. - Resolution: Fixed > Better error from WLS for cases like singular input > --- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Sean Owen >Priority: Minor > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0-1 label used for classification, where the > equations are underdetermined), WLS fails (while "l-bfgs" can still train and > produce a model). The failure is caused by the underlying LAPACK library > returning an error value during the Cholesky decomposition. > This issue is easy to reproduce: you can train a LinearRegressionModel with the > "normal" solver on the example > dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. > at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
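The failure mode above can be reproduced outside Spark: Cholesky factorization is only defined for positive-definite matrices, and a singular (underdetermined) normal-equations matrix makes it break down partway through, which is what LAPACK's nonzero return code signals. Below is a minimal plain-Python sketch, for illustration only; it is not Spark's or LAPACK's code.

```python
import math

def cholesky(a):
    """Cholesky-Banachiewicz factorization of a symmetric matrix.

    Raises ValueError when the matrix is not positive definite, which is
    the situation that makes LAPACK's dpotrf/dpotrs report a nonzero
    info code for singular normal equations.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(d)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

# A well-conditioned matrix factors fine...
cholesky([[4.0, 2.0], [2.0, 3.0]])

# ...but a singular one (second row is twice the first, i.e. an
# underdetermined least-squares problem) does not.
try:
    cholesky([[1.0, 2.0], [2.0, 4.0]])
except ValueError as e:
    print("decomposition failed:", e)
```

This is why a better error message (rather than a bare `lapack.dpotrs returned 1` assertion) is the improvement being asked for.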
[jira] [Updated] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce
[ https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17618: --- Description: We were getting incorrect results from the DataFrame except method - all rows were being returned instead of the ones that intersected. Calling subtract on the underlying RDD returned the correct result. We tracked it down to the use of coalesce - the following is the simplest example case we created that reproduces the issue: {code} val schema = new StructType().add("test", types.IntegerType ) val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> Row(i)), schema) val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> Row(i)), schema) val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") println("Count using normal except = " + t1.except(t3).count()) println("Count using coalesce = " + t1.coalesce(8).except(t3.coalesce(8)).count()) {code} We should get the same result from both uses of except, but the one using coalesce returns 100 instead of 94. was: We were getting incorrect results from the DataFrame except method - all rows were being returned instead of the ones that intersected. Calling subtract on the underlying RDD returned the correct result. We tracked it down to the use of coalesce - the following is the simplest example case we created that reproduces the issue: val schema = new StructType().add("test", types.IntegerType ) val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> Row(i)), schema) val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> Row(i)), schema) val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") println("Count using normal except = " + t1.except(t3).count()) println("Count using coalesce = " + t1.coalesce(8).except(t3.coalesce(8)).count()) We should get the same result from both uses of except, but the one using coalesce returns 100 instead of 94. 
> Dataframe except returns incorrect results when combined with coalesce > -- > > Key: SPARK-17618 > URL: https://issues.apache.org/jira/browse/SPARK-17618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Graeme Edwards >Priority: Minor > > We were getting incorrect results from the DataFrame except method - all rows > were being returned instead of the ones that intersected. Calling subtract on > the underlying RDD returned the correct result. > We tracked it down to the use of coalesce - the following is the simplest > example case we created that reproduces the issue: > {code} > val schema = new StructType().add("test", types.IntegerType ) > val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> > Row(i)), schema) > val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> > Row(i)), schema) > val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") > println("Count using normal except = " + t1.except(t3).count()) > println("Count using coalesce = " + > t1.coalesce(8).except(t3.coalesce(8)).count()) > {code} > We should get the same result from both uses of except, but the one using > coalesce returns 100 instead of 94. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17592: Labels: (was: correctness) > SQL: CAST string as INT inconsistent with Hive > -- > > Key: SPARK-17592 > URL: https://issues.apache.org/jira/browse/SPARK-17592 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin > > Hello, > there seems to be an inconsistency between Spark and Hive when casting a > string into an Int. > With Hive: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 0 > select cast("0.6" as INT) ; > > 0 > {code} > With Spark-SQL: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 1 > select cast("0.6" as INT) ; > > 1 > {code} > Hive seems to perform a floor(string.toDouble), while Spark seems to perform > a round(string.toDouble). > I'm not sure there is any ISO standard for this; MySQL has the same behavior > as Hive, while PostgreSQL performs a string.toInt and throws a > NumberFormatException. > Personally I think Hive is right, hence my posting this here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
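The three behaviors reported above can be put side by side in a plain-Python sketch. This is illustration only: the function names are hypothetical, and the bodies mirror the reporter's description of each engine rather than any engine's actual source.

```python
import math

def cast_hive_style(s):
    """floor(string.toDouble), per the report's description of Hive."""
    return math.floor(float(s))

def cast_spark_2_0_style(s):
    """round-half-up of the double value, matching Spark 2.0's reported output."""
    return math.floor(float(s) + 0.5)

def cast_postgres_style(s):
    """Strict integer parse; '0.5' raises, like PostgreSQL's NumberFormatException."""
    return int(s)  # ValueError for non-integral strings

for s in ("0.4", "0.5", "0.6"):
    print(s, "hive-style:", cast_hive_style(s),
          "spark-style:", cast_spark_2_0_style(s))
```

Running it shows the divergence exactly at "0.5" and "0.6": the floor-based cast yields 0 for all three, while the rounding cast yields 0, 1, 1.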
[jira] [Updated] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-17592: Fix Version/s: (was: 2.0.1) (was: 2.1.0) > SQL: CAST string as INT inconsistent with Hive > -- > > Key: SPARK-17592 > URL: https://issues.apache.org/jira/browse/SPARK-17592 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin > > Hello, > there seems to be an inconsistency between Spark and Hive when casting a > string into an Int. > With Hive: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 0 > select cast("0.6" as INT) ; > > 0 > {code} > With Spark-SQL: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 1 > select cast("0.6" as INT) ; > > 1 > {code} > Hive seems to perform a floor(string.toDouble), while Spark seems to perform > a round(string.toDouble). > I'm not sure there is any ISO standard for this; MySQL has the same behavior > as Hive, while PostgreSQL performs a string.toInt and throws a > NumberFormatException. > Personally I think Hive is right, hence my posting this here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17019) Expose off-heap memory usage in various places
[ https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17019: --- Target Version/s: 2.1.0 (was: 2.0.1, 2.1.0) > Expose off-heap memory usage in various places > -- > > Key: SPARK-17019 > URL: https://issues.apache.org/jira/browse/SPARK-17019 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Saisai Shao >Priority: Minor > > With SPARK-13992, Spark supports persisting data into off-heap memory, but > the usage of off-heap is not exposed currently, it is not so convenient for > user to monitor and profile, so here propose to expose off-heap memory as > well as on-heap memory usage in various places: > 1. Spark UI's executor page will display both on-heap and off-heap memory > usage. > 2. REST request returns both on-heap and off-heap memory. > 3. Also these two memory usage can be obtained programmatically from > SparkListener. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce
[ https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17618: --- Affects Version/s: 1.6.2 > Dataframe except returns incorrect results when combined with coalesce > -- > > Key: SPARK-17618 > URL: https://issues.apache.org/jira/browse/SPARK-17618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 1.6.2 >Reporter: Graeme Edwards >Priority: Minor > > We were getting incorrect results from the DataFrame except method - all rows > were being returned instead of the ones that intersected. Calling subtract on > the underlying RDD returned the correct result. > We tracked it down to the use of coalesce - the following is the simplest > example case we created that reproduces the issue: > {code} > val schema = new StructType().add("test", types.IntegerType ) > val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> > Row(i)), schema) > val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> > Row(i)), schema) > val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") > println("Count using normal except = " + t1.except(t3).count()) > println("Count using coalesce = " + > t1.coalesce(8).except(t3.coalesce(8)).count()) > {code} > We should get the same result from both uses of except, but the one using > coalesce returns 100 instead of 94. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce
[ https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17618: --- Labels: correctness (was: ) > Dataframe except returns incorrect results when combined with coalesce > -- > > Key: SPARK-17618 > URL: https://issues.apache.org/jira/browse/SPARK-17618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 1.6.2 >Reporter: Graeme Edwards >Priority: Minor > Labels: correctness > > We were getting incorrect results from the DataFrame except method - all rows > were being returned instead of the ones that intersected. Calling subtract on > the underlying RDD returned the correct result. > We tracked it down to the use of coalesce - the following is the simplest > example case we created that reproduces the issue: > {code} > val schema = new StructType().add("test", types.IntegerType ) > val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> > Row(i)), schema) > val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> > Row(i)), schema) > val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") > println("Count using normal except = " + t1.except(t3).count()) > println("Count using coalesce = " + > t1.coalesce(8).except(t3.coalesce(8)).count()) > {code} > We should get the same result from both uses of except, but the one using > coalesce returns 100 instead of 94. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce
[ https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17618: --- Priority: Blocker (was: Minor) > Dataframe except returns incorrect results when combined with coalesce > -- > > Key: SPARK-17618 > URL: https://issues.apache.org/jira/browse/SPARK-17618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 1.6.2 >Reporter: Graeme Edwards >Priority: Blocker > Labels: correctness > > We were getting incorrect results from the DataFrame except method - all rows > were being returned instead of the ones that intersected. Calling subtract on > the underlying RDD returned the correct result. > We tracked it down to the use of coalesce - the following is the simplest > example case we created that reproduces the issue: > {code} > val schema = new StructType().add("test", types.IntegerType ) > val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> > Row(i)), schema) > val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> > Row(i)), schema) > val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") > println("Count using normal except = " + t1.except(t3).count()) > println("Count using coalesce = " + > t1.coalesce(8).except(t3.coalesce(8)).count()) > {code} > We should get the same result from both uses of except, but the one using > coalesce returns 100 instead of 94. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce
[ https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17618: --- Target Version/s: 1.6.3 > Dataframe except returns incorrect results when combined with coalesce > -- > > Key: SPARK-17618 > URL: https://issues.apache.org/jira/browse/SPARK-17618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 1.6.2 >Reporter: Graeme Edwards >Priority: Blocker > Labels: correctness > > We were getting incorrect results from the DataFrame except method - all rows > were being returned instead of the ones that intersected. Calling subtract on > the underlying RDD returned the correct result. > We tracked it down to the use of coalesce - the following is the simplest > example case we created that reproduces the issue: > {code} > val schema = new StructType().add("test", types.IntegerType ) > val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> > Row(i)), schema) > val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> > Row(i)), schema) > val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") > println("Count using normal except = " + t1.except(t3).count()) > println("Count using coalesce = " + > t1.coalesce(8).except(t3.coalesce(8)).count()) > {code} > We should get the same result from both uses of except, but the one using > coalesce returns 100 instead of 94. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce
[ https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510934#comment-15510934 ] Josh Rosen commented on SPARK-17618: Yep, the problem is that {{Coalesce}} advertises that it accepts Unsafe rows but misdeclares its row output format as being regular rows. Comparing an UnsafeRow to any other row type for equality always returns false (its {{equals()}} implementation is compatible with Java universal equality, so it doesn't throw when performing a comparison against a different type). As a result, the Except operator compares safe and unsafe rows, making the comparisons incorrect and leading to the wrong answer you saw here. I'm marking this as a blocker for 1.6.3 and am working on a fix. > Dataframe except returns incorrect results when combined with coalesce > -- > > Key: SPARK-17618 > URL: https://issues.apache.org/jira/browse/SPARK-17618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 1.6.2 >Reporter: Graeme Edwards >Priority: Blocker > Labels: correctness > > We were getting incorrect results from the DataFrame except method - all rows > were being returned instead of the ones that intersected. Calling subtract on > the underlying RDD returned the correct result. 
> We tracked it down to the use of coalesce - the following is the simplest > example case we created that reproduces the issue: > {code} > val schema = new StructType().add("test", types.IntegerType ) > val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> > Row(i)), schema) > val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> > Row(i)), schema) > val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") > println("Count using normal except = " + t1.except(t3).count()) > println("Count using coalesce = " + > t1.coalesce(8).except(t3.coalesce(8)).count()) > {code} > We should get the same result from both uses of except, but the one using > coalesce returns 100 instead of 94. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
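The mechanism described in the comment above, an equality test that silently returns false across two row representations, can be sketched in a few lines of plain Python. Illustration only: these classes are hypothetical stand-ins, not Spark internals.

```python
# An "except" built on equality silently returns everything when the two
# sides use different row representations whose equals() never match,
# which is the failure mode described for safe rows vs UnsafeRow.

class SafeRow:
    def __init__(self, values):
        self.values = tuple(values)
    def __eq__(self, other):
        # Like UnsafeRow.equals: comparing against a different row type
        # returns False instead of raising.
        return type(other) is SafeRow and self.values == other.values
    def __hash__(self):
        return hash(self.values)

class UnsafeRow(SafeRow):
    def __eq__(self, other):
        return type(other) is UnsafeRow and self.values == other.values
    def __hash__(self):
        return hash(self.values)

left = [SafeRow([i]) for i in range(1, 11)]     # rows 1..10, one format
right = [UnsafeRow([i]) for i in range(5, 11)]  # rows 5..10, the other format

result = [r for r in left if r not in right]
print(len(result))  # 10, not the expected 4: no comparison ever matched
```

With both sides in the same format the difference would correctly contain only rows 1 through 4, matching the 100-vs-94 count discrepancy in the report.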
[jira] [Commented] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce
[ https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510874#comment-15510874 ] Josh Rosen commented on SPARK-17618: It looks like this affects 1.6.2 as well, but I was unable to reproduce in 2.x. Comparing the two physical plans, I wonder if the issue has to do with Tungsten vs. regular internal row formats. For {{t1.except(t3).explain(true)}}: {code} == Physical Plan == Except :- Scan ExistingRDD[test#35] +- ConvertToSafe +- LeftSemiJoinHash [test#35], [test#36], None :- TungstenExchange hashpartitioning(test#35,200), None : +- ConvertToUnsafe : +- Scan ExistingRDD[test#35] +- TungstenExchange hashpartitioning(test#36,200), None +- ConvertToUnsafe +- Scan ExistingRDD[test#36] {code} whereas {{t1.coalesce(8).except(t3.coalesce(8)).explain(true)}} produces {code} Except :- Coalesce 8 : +- Scan ExistingRDD[test#35] +- Coalesce 8 +- LeftSemiJoinHash [test#35], [test#36], None :- TungstenExchange hashpartitioning(test#35,200), None : +- ConvertToUnsafe : +- Scan ExistingRDD[test#35] +- TungstenExchange hashpartitioning(test#36,200), None +- ConvertToUnsafe +- Scan ExistingRDD[test#36] {code} My hunch is that Except is inappropriately mixing Tungsten and non-Tungsten row formats due to a bug in the row format conversion rules. > Dataframe except returns incorrect results when combined with coalesce > -- > > Key: SPARK-17618 > URL: https://issues.apache.org/jira/browse/SPARK-17618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 1.6.2 >Reporter: Graeme Edwards >Priority: Minor > Labels: correctness > > We were getting incorrect results from the DataFrame except method - all rows > were being returned instead of the ones that intersected. Calling subtract on > the underlying RDD returned the correct result. 
> We tracked it down to the use of coalesce - the following is the simplest > example case we created that reproduces the issue: > {code} > val schema = new StructType().add("test", types.IntegerType ) > val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> > Row(i)), schema) > val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> > Row(i)), schema) > val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") > println("Count using normal except = " + t1.except(t3).count()) > println("Count using coalesce = " + > t1.coalesce(8).except(t3.coalesce(8)).count()) > {code} > We should get the same result from both uses of except, but the one using > coalesce returns 100 instead of 94. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce
[ https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511034#comment-15511034 ] Apache Spark commented on SPARK-17618: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15185 > Dataframe except returns incorrect results when combined with coalesce > -- > > Key: SPARK-17618 > URL: https://issues.apache.org/jira/browse/SPARK-17618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 1.6.2 >Reporter: Graeme Edwards >Assignee: Josh Rosen >Priority: Blocker > Labels: correctness > > We were getting incorrect results from the DataFrame except method - all rows > were being returned instead of the ones that intersected. Calling subtract on > the underlying RDD returned the correct result. > We tracked it down to the use of coalesce - the following is the simplest > example case we created that reproduces the issue: > {code} > val schema = new StructType().add("test", types.IntegerType ) > val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> > Row(i)), schema) > val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> > Row(i)), schema) > val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") > println("Count using normal except = " + t1.except(t3).count()) > println("Count using coalesce = " + > t1.coalesce(8).except(t3.coalesce(8)).count()) > {code} > We should get the same result from both uses of except, but the one using > coalesce returns 100 instead of 94. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce
[ https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17618: Assignee: Apache Spark (was: Josh Rosen) > Dataframe except returns incorrect results when combined with coalesce > -- > > Key: SPARK-17618 > URL: https://issues.apache.org/jira/browse/SPARK-17618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 1.6.2 >Reporter: Graeme Edwards >Assignee: Apache Spark >Priority: Blocker > Labels: correctness > > We were getting incorrect results from the DataFrame except method - all rows > were being returned instead of the ones that intersected. Calling subtract on > the underlying RDD returned the correct result. > We tracked it down to the use of coalesce - the following is the simplest > example case we created that reproduces the issue: > {code} > val schema = new StructType().add("test", types.IntegerType ) > val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> > Row(i)), schema) > val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> > Row(i)), schema) > val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") > println("Count using normal except = " + t1.except(t3).count()) > println("Count using coalesce = " + > t1.coalesce(8).except(t3.coalesce(8)).count()) > {code} > We should get the same result from both uses of except, but the one using > coalesce returns 100 instead of 94. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce
[ https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17618: Assignee: Josh Rosen (was: Apache Spark) > Dataframe except returns incorrect results when combined with coalesce > -- > > Key: SPARK-17618 > URL: https://issues.apache.org/jira/browse/SPARK-17618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 1.6.2 >Reporter: Graeme Edwards >Assignee: Josh Rosen >Priority: Blocker > Labels: correctness > > We were getting incorrect results from the DataFrame except method - all rows > were being returned instead of the ones that intersected. Calling subtract on > the underlying RDD returned the correct result. > We tracked it down to the use of coalesce - the following is the simplest > example case we created that reproduces the issue: > {code} > val schema = new StructType().add("test", types.IntegerType ) > val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> > Row(i)), schema) > val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> > Row(i)), schema) > val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") > println("Count using normal except = " + t1.except(t3).count()) > println("Count using coalesce = " + > t1.coalesce(8).except(t3.coalesce(8)).count()) > {code} > We should get the same result from both uses of except, but the one using > coalesce returns 100 instead of 94. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce
[ https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-17618: -- Assignee: Josh Rosen > Dataframe except returns incorrect results when combined with coalesce > -- > > Key: SPARK-17618 > URL: https://issues.apache.org/jira/browse/SPARK-17618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 1.6.2 >Reporter: Graeme Edwards >Assignee: Josh Rosen >Priority: Blocker > Labels: correctness > > We were getting incorrect results from the DataFrame except method - all rows > were being returned instead of the ones that intersected. Calling subtract on > the underlying RDD returned the correct result. > We tracked it down to the use of coalesce - the following is the simplest > example case we created that reproduces the issue: > {code} > val schema = new StructType().add("test", types.IntegerType ) > val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> > Row(i)), schema) > val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> > Row(i)), schema) > val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi") > println("Count using normal except = " + t1.except(t3).count()) > println("Count using coalesce = " + > t1.coalesce(8).except(t3.coalesce(8)).count()) > {code} > We should get the same result from both uses of except, but the one using > coalesce returns 100 instead of 94. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17628) Name of "object StreamingExamples" should be more self-explanatory
[ https://issues.apache.org/jira/browse/SPARK-17628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17628: Assignee: Apache Spark > Name of "object StreamingExamples" should be more self-explanatory > --- > > Key: SPARK-17628 > URL: https://issues.apache.org/jira/browse/SPARK-17628 > Project: Spark > Issue Type: Bug > Components: Examples, Streaming >Affects Versions: 2.0.0 >Reporter: Xin Ren >Assignee: Apache Spark >Priority: Minor > > `object StreamingExamples` is more of a utility object; the name is so > general that at first I thought it was an actual streaming example. > {code} > /** Utility functions for Spark Streaming examples. */ > object StreamingExamples extends Logging { > /** Set reasonable logging levels for streaming if the user has not > configured log4j. */ > def setStreamingLogLevels() { > val log4jInitialized = > Logger.getRootLogger.getAllAppenders.hasMoreElements > if (!log4jInitialized) { > // We first log something to initialize Spark's default logging, then > we override the > // logging level. > logInfo("Setting log level to [WARN] for streaming example." + > " To override add a custom log4j.properties to the classpath.") > Logger.getRootLogger.setLevel(Level.WARN) > } > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)
[ https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suresh Thalamati reopened SPARK-14536: -- SPARK-10186 added array data type support for Postgres in 1.6. The NPE issue still exists; I was able to reproduce it on master. > NPE in JDBCRDD when array column contains nulls (postgresql) > > > Key: SPARK-14536 > URL: https://issues.apache.org/jira/browse/SPARK-14536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Jeremy Smith > Labels: NullPointerException > > At > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453 > it is assumed that the JDBC driver will definitely return a non-null `Array` > object from the call to `getArray`, and that in the event of a null array it > will return a non-null `Array` object with a null underlying array. But as > you can see here > https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387 > that isn't the case, at least for PostgreSQL. This causes a > `NullPointerException` whenever an array column contains null values. It > seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but > even so there should be a null check in JDBCRDD. I'm happy to submit a PR if > that would be helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
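The missing null check described above can be sketched in plain Python. Illustration only: `convert_array_column` and `FakeJdbcArray` are hypothetical stand-ins for the JDBCRDD conversion and `java.sql.Array`, not the actual source.

```python
def convert_array_column(jdbc_array):
    """Convert a JDBC Array-like value to a plain list, or None.

    The reported bug is the absence of the None check: the conversion
    assumed getArray() always returns a wrapper object, but PostgreSQL's
    driver returns null for a NULL array column, so dereferencing it
    blindly raises the NPE.
    """
    if jdbc_array is None:  # the missing check behind the reported NPE
        return None
    return list(jdbc_array.get_array())

class FakeJdbcArray:
    """Minimal stand-in for java.sql.Array's getArray() accessor."""
    def __init__(self, data):
        self._data = data
    def get_array(self):
        return self._data

# Second "row" models a NULL array column as returned by the driver.
rows = [FakeJdbcArray([1, 2, 3]), None]
print([convert_array_column(r) for r in rows])  # [[1, 2, 3], None]
```

Without the `None` guard, the second row would raise an AttributeError here, analogous to the `NullPointerException` in JDBCRDD.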
[jira] [Commented] (SPARK-15717) Cannot perform RDD operations on a checkpointed VertexRDD.
[ https://issues.apache.org/jira/browse/SPARK-15717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511727#comment-15511727 ]

Asher Krim commented on SPARK-15717:
------------------------------------

Any update on this issue? We are experiencing ClassCastExceptions when using checkpointing and LDA with the EM optimizer.

> Cannot perform RDD operations on a checkpointed VertexRDD.
> ----------------------------------------------------------
>
>                 Key: SPARK-15717
>                 URL: https://issues.apache.org/jira/browse/SPARK-15717
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 1.6.1
>            Reporter: Anderson de Andrade
>
> A checkpointed (materialized) VertexRDD throws the following exception when
> collected:
> bq. java.lang.ArrayStoreException: org.apache.spark.graphx.impl.ShippableVertexPartition
> Can be replicated by running:
> {code:java}
> graph.vertices.checkpoint()
> graph.vertices.count() // materialize
> graph.vertices.collect()
> {code}
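For context on the exception itself: `ArrayStoreException` is the JVM's runtime check that fires when an element of the wrong class is stored into an array, which is consistent with `collect()` receiving `ShippableVertexPartition` objects where vertex tuples were expected. The snippet below is a plain-JVM illustration of that mechanism only, unrelated to Spark's internals; the class and method names are invented for the example.

```java
public class ArrayStoreDemo {
    // Returns true if storing a non-String into a String[] (viewed through an
    // Object[] reference) throws; the JVM checks the runtime element type on
    // every array store, not the static type.
    static boolean storeMismatched() {
        Object[] slots = new String[1]; // runtime type is String[]
        try {
            slots[0] = Integer.valueOf(42); // not a String
            return false;
        } catch (ArrayStoreException e) {
            return true; // JVM rejects the store at runtime
        }
    }

    public static void main(String[] args) {
        System.out.println(storeMismatched()); // prints "true"
    }
}
```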
[jira] [Commented] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)
[ https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511837#comment-15511837 ]

Hyukjin Kwon commented on SPARK-14536:
--------------------------------------

I see. I rushed to read this and didn't notice that it is actually a PostgreSQL-specific issue (I thought this JIRA described a general JDBC problem). Yes, {{ArrayType}} seems to be supported only for {{PostgreSQL}} in Spark. Maybe we should link those JIRAs to SPARK-8500 to prevent confusion.

> NPE in JDBCRDD when array column contains nulls (postgresql)
> ------------------------------------------------------------
>
>                 Key: SPARK-14536
>                 URL: https://issues.apache.org/jira/browse/SPARK-14536
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Jeremy Smith
>              Labels: NullPointerException
>
> At
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
> it is assumed that the JDBC driver will definitely return a non-null `Array`
> object from the call to `getArray`, and that in the event of a null array it
> will return a non-null `Array` object with a null underlying array. But as
> you can see here
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
> that isn't the case, at least for PostgreSQL. This causes a
> `NullPointerException` whenever an array column contains null values. It
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but
> even so there should be a null check in JDBCRDD. I'm happy to submit a PR if
> that would be helpful.
[jira] [Assigned] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "
[ https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17616:
------------------------------------

    Assignee: Herman van Hovell  (was: Apache Spark)

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-17616
>                 URL: https://issues.apache.org/jira/browse/SPARK-17616
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Egor Pahomov
>            Assignee: Herman van Hovell
>            Priority: Minor
>
> I execute:
> {code}
> select platform,
>        collect_set(user_auth) as paid_types,
>        count(distinct sessionid) as sessions
> from non_hss.session
> where
>     event = 'stop' and platform != 'testplatform' and
>     not (month = MONTH(current_date()) AND year = YEAR(current_date())
>          and day = day(current_date())) and
>     (
>       (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = YEAR(add_months(CURRENT_DATE(), -5)))
>       OR
>       (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > YEAR(add_months(CURRENT_DATE(), -5)))
>     )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate
> operator containing aggregate functions which don't support partial
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times, and read the code once. I
> still don't understand what I'm doing incorrectly.
[jira] [Assigned] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "
[ https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17616:
------------------------------------

    Assignee: Apache Spark  (was: Herman van Hovell)

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-17616
>                 URL: https://issues.apache.org/jira/browse/SPARK-17616
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Egor Pahomov
>            Assignee: Apache Spark
>            Priority: Minor
>
> I execute:
> {code}
> select platform,
>        collect_set(user_auth) as paid_types,
>        count(distinct sessionid) as sessions
> from non_hss.session
> where
>     event = 'stop' and platform != 'testplatform' and
>     not (month = MONTH(current_date()) AND year = YEAR(current_date())
>          and day = day(current_date())) and
>     (
>       (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = YEAR(add_months(CURRENT_DATE(), -5)))
>       OR
>       (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > YEAR(add_months(CURRENT_DATE(), -5)))
>     )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate
> operator containing aggregate functions which don't support partial
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times, and read the code once. I
> still don't understand what I'm doing incorrectly.
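A workaround sometimes used for this class of error (offered here as an untested sketch, not a confirmed fix for this ticket) is to avoid mixing a `count(distinct ...)` with `collect_set` in the same aggregate by expressing the distinct count as the size of a collected set:

```sql
-- Sketch: same query as above, with count(distinct sessionid) rewritten
-- using Spark SQL's size() over collect_set(); WHERE clause unchanged.
select platform,
       collect_set(user_auth) as paid_types,
       size(collect_set(sessionid)) as sessions
from non_hss.session
-- ... same WHERE clause as in the original query ...
group by platform
```

Note that `collect_set` buffers all distinct session ids per group in memory, so this trades the analyzer error for higher memory use on high-cardinality groups.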