[jira] [Updated] (SPARK-17617) Remainder(%) expression.eval returns incorrect result

2016-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17617:

Fix Version/s: 1.6.3

> Remainder(%) expression.eval returns incorrect result
> -
>
> Key: SPARK-17617
> URL: https://issues.apache.org/jira/browse/SPARK-17617
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>  Labels: correctness
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> h2.Problem
> The Remainder(%) expression returns an incorrect result when expression.eval 
> is used to compute it. expression.eval is called in cases like constant 
> folding.
> {code}
> scala> -5083676433652386516D % 10
> res19: Double = -6.0
> // Wrong answer with eval!!!
> scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |         0.0|
> +------------+
> // Triggers codegen, which does not do constant folding
> scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |        -6.0|
> +------------+
> {code}
> Behavior of postgres:
> {code}
> seanzhong=# select -5083676433652386516.0  % 10;
>  ?column? 
> --
>  -6.0
> (1 row)
> {code}
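A quick way to compare both evaluation paths, sketched under the assumption of a stock Spark 2.x spark-shell (with {{spark}} and {{sc}} in scope): the literal-only expression is constant-folded, so it exercises Remainder.eval, while the RDD-backed column goes through codegen.
{code}
import org.apache.spark.sql.functions.lit
import spark.implicits._

// eval path: both operands are literals, so the optimizer folds the expression
val viaEval = Seq(1).toDF("i").select(lit(-5083676433652386516D) % 10).head.getDouble(0)

// codegen path: the value comes from an RDD-backed column, so no folding happens
val viaCodegen = sc.makeRDD(Seq("-5083676433652386516D")).toDF("value")
  .select($"value" % 10).head.getDouble(0)

println(s"eval: $viaEval, codegen: $viaCodegen")  // both should print -6.0 once fixed
{code}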



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17621) Accumulator value is doubled when using DataFrame.orderBy()

2016-09-21 Thread Sreelal S L (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509301#comment-15509301
 ] 

Sreelal S L commented on SPARK-17621:
-

Hi. 

Our actual code is a bit different from what I have given. We use streaming, and 
use transform() to reuse a couple of the DataFrame operations from another part of 
our code base. I don't have much control to change the code there (worst case I 
would have to make changes there), but something feels wrong here. 

I hit the issue there, and was trying out samples to figure out exactly where 
the issue is coming from. 
It looks like slightly unexpected behaviour: if it works for groupBy(), the 
behaviour should be the same for orderBy(). 

Also, the map() which increments the accumulator is invoked only once, so it has 
something to do with the stage result being calculated twice. 

I could understand the map() incrementing the accumulator twice if some task 
failure happened, but that's not the case here. All tasks are successful and the 
map() doing the accumulator addition is called only once. 



> Accumulator value is doubled when using DataFrame.orderBy()
> ---
>
> Key: SPARK-17621
> URL: https://issues.apache.org/jira/browse/SPARK-17621
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, SQL
>Affects Versions: 2.0.0
> Environment: Development environment. (Eclipse . Single process) 
>Reporter: Sreelal S L
>Priority: Minor
>
> We are tracing the records read by our source using an accumulator. We do an 
> orderBy on the DataFrame before the output operation. When the job is 
> completed, the accumulator value becomes double the expected value. 
> Below is the sample code I ran. 
> {code} 
>  val sqlContext = SparkSession.builder() 
>   .config("spark.sql.retainGroupColumns", 
> false).config("spark.sql.warehouse.dir", "file:///C:/Test").master("local[*]")
>   .getOrCreate()
> val sc = sqlContext.sparkContext
> val accumulator1 = sc.accumulator(0, "accumulator1")
> val usersDF = sqlContext.read.json("C:\\users.json") //  single row 
> {"name":"sreelal" ,"country":"IND"}
> val usersDFwithCount = usersDF.rdd.map(x => { accumulator1 += 1; x });
> val counterDF = sqlContext.createDataFrame(usersDFwithCount, 
> usersDF.schema);
> val oderedDF = counterDF.orderBy("name")
> val collected = oderedDF.collect()
> collected.foreach { x => println(x) }
> println("accumulator1 : " + accumulator1.value)
> println("Done");
> {code}
> I have only one row in the users.json file. I expect accumulator1 to have 
> value 1, but it comes out as 2. 
> In the Spark SQL UI, I see two jobs generated for this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17599) Folder deletion after globbing may fail StructuredStreaming jobs

2016-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17599.
-
   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 2.1.0

> Folder deletion after globbing may fail StructuredStreaming jobs
> 
>
> Key: SPARK-17599
> URL: https://issues.apache.org/jira/browse/SPARK-17599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.1.0
>
>
> The FileStreamSource used by Structured Streaming first resolves globs, and 
> then creates a ListingFileCatalog, which lists files with the resolved glob 
> patterns. If a folder is deleted after glob resolution but before the 
> ListingFileCatalog can list the files, we can run into a 
> 'FileNotFoundException'.
> This should not be a fatal exception for a streaming job. However, we should 
> log a warning message.
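A rough sketch of the proposed behavior (not the actual Spark code; names are illustrative, and a real implementation would use the logger), assuming the Hadoop FileSystem API: if a directory disappears between glob resolution and listing, warn and skip it instead of failing the query.
{code}
import java.io.FileNotFoundException
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

def listResolvedDir(fs: FileSystem, dir: Path): Seq[FileStatus] =
  try {
    fs.listStatus(dir).toSeq
  } catch {
    case e: FileNotFoundException =>
      // the folder was deleted after the glob was resolved; warn and move on
      println(s"WARN: $dir no longer exists, skipping: ${e.getMessage}")
      Seq.empty
  }
{code}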



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17219:
--
Assignee: Vincent

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>Assignee: Vincent
> Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17583) Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV

2016-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17583.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15138
[https://github.com/apache/spark/pull/15138]

> Remove unused rowSeparator variable and set auto-expanding buffer as default 
> for maxCharsPerColumn option in CSV
> 
>
> Key: SPARK-17583
> URL: https://issues.apache.org/jira/browse/SPARK-17583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> This JIRA includes the following changes:
> 1. Upgrade the Univocity library from 2.1.1 to 2.2.1
> This includes some performance improvements and also enables an auto-expanding 
> buffer for the {{maxCharsPerColumn}} option in CSV. Please refer to the [release 
> notes|https://github.com/uniVocity/univocity-parsers/releases].
> 2. Remove the unused {{rowSeparator}} variable in {{CSVOptions}}
> We have this variable in 
> [CSVOptions|https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127]
>  but it can cause confusion because it does not actually handle 
> {{\r\n}}. For example, we have an open issue, SPARK-17227, describing this 
> variable.
> This option is effectively unused because we rely on Hadoop's 
> {{LineRecordReader}}, which handles both {{\n}} and {{\r\n}}.
> 3. Set the default value of {{maxCharsPerColumn}} to auto-expanding
> We currently set 100 for the length of each column. It'd be more sensible 
> to allow auto-expanding rather than a fixed length by default.
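For context, this is roughly how the option is exercised from the DataFrame reader today (illustrative, assuming a SparkSession named {{spark}}; with the univocity 2.2.1 upgrade a negative value lets the buffer grow as needed, which is what this ticket proposes as the default behavior):
{code}
val df = spark.read
  .option("header", "true")
  .option("maxCharsPerColumn", "-1")   // -1: let the parser buffer auto-expand
  .csv("path/to/data.csv")
{code}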



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17583) Remove unused rowSeparator variable and set auto-expanding buffer as default for maxCharsPerColumn option in CSV

2016-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17583:
--
Assignee: Hyukjin Kwon

> Remove unused rowSeparator variable and set auto-expanding buffer as default 
> for maxCharsPerColumn option in CSV
> 
>
> Key: SPARK-17583
> URL: https://issues.apache.org/jira/browse/SPARK-17583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.1.0
>
>
> This JIRA includes the following changes:
> 1. Upgrade the Univocity library from 2.1.1 to 2.2.1
> This includes some performance improvements and also enables an auto-expanding 
> buffer for the {{maxCharsPerColumn}} option in CSV. Please refer to the [release 
> notes|https://github.com/uniVocity/univocity-parsers/releases].
> 2. Remove the unused {{rowSeparator}} variable in {{CSVOptions}}
> We have this variable in 
> [CSVOptions|https://github.com/apache/spark/blob/29952ed096fd2a0a19079933ff691671d6f00835/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L127]
>  but it can cause confusion because it does not actually handle 
> {{\r\n}}. For example, we have an open issue, SPARK-17227, describing this 
> variable.
> This option is effectively unused because we rely on Hadoop's 
> {{LineRecordReader}}, which handles both {{\n}} and {{\r\n}}.
> 3. Set the default value of {{maxCharsPerColumn}} to auto-expanding
> We currently set 100 for the length of each column. It'd be more sensible 
> to allow auto-expanding rather than a fixed length by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17621) Accumulator value is doubled when using DataFrame.orderBy()

2016-09-21 Thread Sreelal S L (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509213#comment-15509213
 ] 

Sreelal S L commented on SPARK-17621:
-

Hi Sean, 
Thanks for your quick reply. 
I didn't understand what you meant by "evaluating usersDFwithCount twice". Does 
creating a DataFrame from an existing RDD fire an extra job? 

One more catch: I am observing this only for orderBy(). 
If I try a groupBy, i.e. counterDF.groupBy("name").count().collect(), the 
accumulator value is correct. 
In the groupBy case I also create the DataFrame from the RDD. 

What could be the difference here? 

 


> Accumulator value is doubled when using DataFrame.orderBy()
> ---
>
> Key: SPARK-17621
> URL: https://issues.apache.org/jira/browse/SPARK-17621
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, SQL
>Affects Versions: 2.0.0
> Environment: Development environment. (Eclipse . Single process) 
>Reporter: Sreelal S L
>Priority: Minor
>
> We are tracing the records read by our source using an accumulator. We do an 
> orderBy on the DataFrame before the output operation. When the job is 
> completed, the accumulator value becomes double the expected value. 
> Below is the sample code I ran. 
> {code} 
>  val sqlContext = SparkSession.builder() 
>   .config("spark.sql.retainGroupColumns", 
> false).config("spark.sql.warehouse.dir", "file:///C:/Test").master("local[*]")
>   .getOrCreate()
> val sc = sqlContext.sparkContext
> val accumulator1 = sc.accumulator(0, "accumulator1")
> val usersDF = sqlContext.read.json("C:\\users.json") //  single row 
> {"name":"sreelal" ,"country":"IND"}
> val usersDFwithCount = usersDF.rdd.map(x => { accumulator1 += 1; x });
> val counterDF = sqlContext.createDataFrame(usersDFwithCount, 
> usersDF.schema);
> val oderedDF = counterDF.orderBy("name")
> val collected = oderedDF.collect()
> collected.foreach { x => println(x) }
> println("accumulator1 : " + accumulator1.value)
> println("Done");
> {code}
> I have only one row in the users.json file. I expect accumulator1 to have 
> value 1, but it comes out as 2. 
> In the Spark SQL UI, I see two jobs generated for this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17617) Remainder(%) expression.eval returns incorrect result

2016-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17617:

Fix Version/s: 2.0.1

> Remainder(%) expression.eval returns incorrect result
> -
>
> Key: SPARK-17617
> URL: https://issues.apache.org/jira/browse/SPARK-17617
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> h2.Problem
> The Remainder(%) expression returns an incorrect result when expression.eval 
> is used to compute it. expression.eval is called in cases like constant 
> folding.
> {code}
> scala> -5083676433652386516D % 10
> res19: Double = -6.0
> // Wrong answer with eval!!!
> scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |         0.0|
> +------------+
> // Triggers codegen, which does not do constant folding
> scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |        -6.0|
> +------------+
> {code}
> Behavior of postgres:
> {code}
> seanzhong=# select -5083676433652386516.0  % 10;
>  ?column? 
> --
>  -6.0
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17017) Add a chiSquare Selector based on False Positive Rate (FPR) test

2016-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17017:
--
Assignee: Peng Meng

> Add a chiSquare Selector based on False Positive Rate (FPR) test
> 
>
> Key: SPARK-17017
> URL: https://issues.apache.org/jira/browse/SPARK-17017
> Project: Spark
>  Issue Type: New Feature
>Reporter: Peng Meng
>Assignee: Peng Meng
>Priority: Minor
> Fix For: 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Univariate feature selection works by selecting the best features based on 
> univariate statistical tests. False Positive Rate (FPR) is a popular 
> univariate statistical test for feature selection. Is it necessary to add a 
> chiSquare Selector based on a False Positive Rate (FPR) test, like the one 
> implemented in scikit-learn? 
> http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
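A rough sketch of the FPR idea using the existing spark.mllib chi-square test (this is not the API the ticket adds; it only illustrates the selection rule): keep every feature whose p-value falls below a chosen false positive rate alpha.
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

def selectByFpr(data: RDD[LabeledPoint], alpha: Double): Array[Int] =
  Statistics.chiSqTest(data)        // one ChiSqTestResult per feature
    .zipWithIndex
    .collect { case (result, featureIndex) if result.pValue < alpha => featureIndex }
{code}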



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17621) Accumulator value is doubled when using DataFrame.orderBy()

2016-09-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509168#comment-15509168
 ] 

Sean Owen commented on SPARK-17621:
---

I think you've found the issue. You're actually evaluating usersDFwithCount 
twice here. I think the other one has to do with creating the data frame. So 
the accumulator is incremented twice.
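If the double increment really comes from the mapped RDD being computed by more than one job, one hedged workaround (illustrative, based on the sample code in the quoted issue) is to cache that RDD so the map closure runs only once per partition:
{code}
val usersDFwithCount = usersDF.rdd.map(x => { accumulator1 += 1; x }).cache()
val counterDF = sqlContext.createDataFrame(usersDFwithCount, usersDF.schema)
// whichever job runs first materializes the cache; later jobs read the cached
// partitions, so the accumulator is only incremented once per input row
{code}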

> Accumulator value is doubled when using DataFrame.orderBy()
> ---
>
> Key: SPARK-17621
> URL: https://issues.apache.org/jira/browse/SPARK-17621
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, SQL
>Affects Versions: 2.0.0
> Environment: Development environment. (Eclipse . Single process) 
>Reporter: Sreelal S L
>Priority: Minor
>
> We are tracing the records read by our source using an accumulator. We do an 
> orderBy on the DataFrame before the output operation. When the job is 
> completed, the accumulator value becomes double the expected value. 
> Below is the sample code I ran. 
> {code} 
>  val sqlContext = SparkSession.builder() 
>   .config("spark.sql.retainGroupColumns", 
> false).config("spark.sql.warehouse.dir", "file:///C:/Test").master("local[*]")
>   .getOrCreate()
> val sc = sqlContext.sparkContext
> val accumulator1 = sc.accumulator(0, "accumulator1")
> val usersDF = sqlContext.read.json("C:\\users.json") //  single row 
> {"name":"sreelal" ,"country":"IND"}
> val usersDFwithCount = usersDF.rdd.map(x => { accumulator1 += 1; x });
> val counterDF = sqlContext.createDataFrame(usersDFwithCount, 
> usersDF.schema);
> val oderedDF = counterDF.orderBy("name")
> val collected = oderedDF.collect()
> collected.foreach { x => println(x) }
> println("accumulator1 : " + accumulator1.value)
> println("Done");
> {code}
> I have only one row in the users.json file. I expect accumulator1 to have 
> value 1, but it comes out as 2. 
> In the Spark SQL UI, I see two jobs generated for this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11918) Better error from WLS for cases like singular input

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11918:


Assignee: Apache Spark  (was: Sean Owen)

> Better error from WLS for cases like singular input
> ---
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (such as a 0/1 label used for classification with an 
> underdetermined system), WLS fails (while "l-bfgs" can train and produce a 
> model). The failure is caused by the underlying LAPACK library returning an 
> error value during the Cholesky decomposition.
> This issue is easy to reproduce: you can train a LinearRegressionModel with 
> the "normal" solver on the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}
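As the description notes, the iterative solver copes with this data; a minimal sketch of that workaround (assuming {{training}} is the loaded sample_libsvm_data DataFrame):
{code}
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setSolver("l-bfgs")      // avoids the WLS/Cholesky "normal" path that fails here
val model = lr.fit(training)
{code}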



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11918) Better error from WLS for cases like singular input

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11918:


Assignee: Sean Owen  (was: Apache Spark)

> Better error from WLS for cases like singular input
> ---
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Sean Owen
>Priority: Minor
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (such as a 0/1 label used for classification with an 
> underdetermined system), WLS fails (while "l-bfgs" can train and produce a 
> model). The failure is caused by the underlying LAPACK library returning an 
> error value during the Cholesky decomposition.
> This issue is easy to reproduce: you can train a LinearRegressionModel with 
> the "normal" solver on the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11918) Better error from WLS for cases like singular input

2016-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11918:
--
Assignee: Sean Owen
  Labels:   (was: starter)
 Summary: Better error from WLS for cases like singular input  (was: WLS 
can not resolve some kinds of equation)

> Better error from WLS for cases like singular input
> ---
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Sean Owen
>Priority: Minor
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (such as a 0/1 label used for classification with an 
> underdetermined system), WLS fails (while "l-bfgs" can train and produce a 
> model). The failure is caused by the underlying LAPACK library returning an 
> error value during the Cholesky decomposition.
> This issue is easy to reproduce: you can train a LinearRegressionModel with 
> the "normal" solver on the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17556:


Assignee: Apache Spark

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.
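For reference, a minimal sketch of the driver-side pattern the ticket wants to avoid (illustrative names, not the SQL planner code; assumes a SparkSession named {{spark}} and that the first column is the join key): the small side is pulled to the driver with collect() and only then broadcast back out to executors.
{code}
val smallSide = smallDF.collect()                          // all rows travel to the driver
val broadcasted = spark.sparkContext.broadcast(smallSide)  // then back out to every executor

val joined = largeDF.rdd.mapPartitions { rows =>
  val lookup = broadcasted.value.map(r => r.getString(0) -> r).toMap
  rows.filter(r => lookup.contains(r.getString(0)))        // hash-join against the broadcast side
}
{code}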



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17596) Streaming job lacks Scala runtime methods

2016-09-21 Thread Evgeniy Tsvigun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509229#comment-15509229
 ] 

Evgeniy Tsvigun commented on SPARK-17596:
-

Thanks Sean! One more check revealed I had SPARK_HOME pointing to a 1.6.2 Spark 
distro in my profile.

> Streaming job lacks Scala runtime methods
> -
>
> Key: SPARK-17596
> URL: https://issues.apache.org/jira/browse/SPARK-17596
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
> Environment: Linux 4.4.20 x86_64 GNU/Linux
> openjdk version "1.8.0_102"
> Scala 2.11.8
>Reporter: Evgeniy Tsvigun
>  Labels: kafka-0.8, streaming
>
> When using -> in Spark Streaming 2.0.0 jobs, or using 
> spark-streaming-kafka-0-8_2.11 v2.0.0, and submitting it with spark-submit, I 
> get the following error:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted 
> due to stage failure: Task 0 in stage 72.0 failed 1 times, most recent 
> failure: Lost task 0.0 in stage 72.0 (TID 37, localhost): 
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> This only happens with spark-streaming; using ArrowAssoc in plain 
> non-streaming Spark jobs works fine.
> I put a brief illustration of this phenomenon in a GitHub repo: 
> https://github.com/utgarda/spark-2-streaming-nosuchmethod-arrowassoc
> With only provided dependencies in build.sbt:
> "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
> "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided"
> using -> anywhere in the driver code, packaging it with sbt-assembly and 
> submitting the job results in an error. This isn't a big problem by itself, 
> since using ArrowAssoc can be avoided, but spark-streaming-kafka-0-8_2.11 v2.0.0 
> has it somewhere inside and generates the same error.
> When packaging with scala-library, I can see the class in the jar after packing, 
> but it's still reported missing at runtime.
> The issue reported on StackOverflow: 
> http://stackoverflow.com/questions/39395521/spark-2-0-0-streaming-job-packed-with-sbt-assembly-lacks-scala-runtime-methods



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10835) Change Output of NGram to Array(String, True)

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509249#comment-15509249
 ] 

Apache Spark commented on SPARK-10835:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15179

> Change Output of NGram to Array(String, True)
> -
>
> Key: SPARK-10835
> URL: https://issues.apache.org/jira/browse/SPARK-10835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Sumit Chawla
>Assignee: yuhao yang
>Priority: Minor
>
> Currently the output type of NGram is Array(String, false), which is not 
> compatible with LDA, since its input type is Array(String, true). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10835) Change Output of NGram to Array(String, True)

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10835:


Assignee: Apache Spark  (was: yuhao yang)

> Change Output of NGram to Array(String, True)
> -
>
> Key: SPARK-10835
> URL: https://issues.apache.org/jira/browse/SPARK-10835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Sumit Chawla
>Assignee: Apache Spark
>Priority: Minor
>
> Currently the output type of NGram is Array(String, false), which is not 
> compatible with LDA, since its input type is Array(String, true). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10835) Change Output of NGram to Array(String, True)

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10835:


Assignee: yuhao yang  (was: Apache Spark)

> Change Output of NGram to Array(String, True)
> -
>
> Key: SPARK-10835
> URL: https://issues.apache.org/jira/browse/SPARK-10835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Sumit Chawla
>Assignee: yuhao yang
>Priority: Minor
>
> Currently the output type of NGram is Array(String, false), which is not 
> compatible with LDA, since its input type is Array(String, true). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509228#comment-15509228
 ] 

Apache Spark commented on SPARK-17556:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15178

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17556:


Assignee: (was: Apache Spark)

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17595) Inefficient selection in Word2VecModel.findSynonyms

2016-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17595.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15150
[https://github.com/apache/spark/pull/15150]

> Inefficient selection in Word2VecModel.findSynonyms
> ---
>
> Key: SPARK-17595
> URL: https://issues.apache.org/jira/browse/SPARK-17595
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: William Benton
>Priority: Minor
> Fix For: 2.1.0
>
>
> The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements 
> with the highest similarity to the query vector currently sorts the 
> similarities for every vocabulary element.  This involves making multiple 
> copies of the collection of similarities while doing a (relatively) expensive 
> sort.  It would be more efficient to find the best matches by maintaining a 
> bounded priority queue and populating it with a single pass over the 
> vocabulary.
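A sketch of the single-pass top-k selection the description proposes (plain Scala, standing in for whatever bounded-queue utility the actual patch uses): a fixed-size min-heap keyed on similarity replaces the full sort.
{code}
import scala.collection.mutable

def topSynonyms(similarities: Iterator[(String, Double)], k: Int): Seq[(String, Double)] = {
  // min-heap on the similarity score: the head is always the weakest of the current top k
  val heap = mutable.PriorityQueue.empty[(String, Double)](Ordering.by[(String, Double), Double](p => -p._2))
  similarities.foreach { candidate =>
    if (heap.size < k) heap.enqueue(candidate)
    else if (candidate._2 > heap.head._2) { heap.dequeue(); heap.enqueue(candidate) }
  }
  Seq.fill(heap.size)(heap.dequeue()).reverse   // strongest similarity first
}
{code}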



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17595) Inefficient selection in Word2VecModel.findSynonyms

2016-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17595:
--
Assignee: William Benton

> Inefficient selection in Word2VecModel.findSynonyms
> ---
>
> Key: SPARK-17595
> URL: https://issues.apache.org/jira/browse/SPARK-17595
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: William Benton
>Assignee: William Benton
>Priority: Minor
> Fix For: 2.1.0
>
>
> The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements 
> with the highest similarity to the query vector currently sorts the 
> similarities for every vocabulary element.  This involves making multiple 
> copies of the collection of similarities while doing a (relatively) expensive 
> sort.  It would be more efficient to find the best matches by maintaining a 
> bounded priority queue and populating it with a single pass over the 
> vocabulary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17617) Remainder(%) expression.eval returns incorrect result

2016-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17617:

Assignee: Sean Zhong

> Remainder(%) expression.eval returns incorrect result
> -
>
> Key: SPARK-17617
> URL: https://issues.apache.org/jira/browse/SPARK-17617
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>Assignee: Sean Zhong
>  Labels: correctness
> Fix For: 2.0.1, 2.1.0
>
>
> h2.Problem
> The Remainder(%) expression returns an incorrect result when expression.eval 
> is used to compute it. expression.eval is called in cases like constant 
> folding.
> {code}
> scala> -5083676433652386516D % 10
> res19: Double = -6.0
> // Wrong answer with eval!!!
> scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |         0.0|
> +------------+
> // Triggers codegen, which does not do constant folding
> scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |        -6.0|
> +------------+
> {code}
> Behavior of postgres:
> {code}
> seanzhong=# select -5083676433652386516.0  % 10;
>  ?column? 
> --
>  -6.0
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17617) Remainder(%) expression.eval returns incorrect result

2016-09-21 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17617.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15171
[https://github.com/apache/spark/pull/15171]

> Remainder(%) expression.eval returns incorrect result
> -
>
> Key: SPARK-17617
> URL: https://issues.apache.org/jira/browse/SPARK-17617
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sean Zhong
>  Labels: correctness
> Fix For: 2.1.0
>
>
> h2.Problem
> The Remainder(%) expression returns an incorrect result when expression.eval 
> is used to compute it. expression.eval is called in cases like constant 
> folding.
> {code}
> scala> -5083676433652386516D % 10
> res19: Double = -6.0
> // Wrong answer with eval!!!
> scala> Seq("-5083676433652386516D").toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |         0.0|
> +------------+
> // Triggers codegen, which does not do constant folding
> scala> sc.makeRDD(Seq("-5083676433652386516D")).toDF.select($"value" % 10).show
> +------------+
> |(value % 10)|
> +------------+
> |        -6.0|
> +------------+
> {code}
> Behavior of postgres:
> {code}
> seanzhong=# select -5083676433652386516.0  % 10;
>  ?column? 
> --
>  -6.0
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17017) Add a chiSquare Selector based on False Positive Rate (FPR) test

2016-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17017.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14597
[https://github.com/apache/spark/pull/14597]

> Add a chiSquare Selector based on False Positive Rate (FPR) test
> 
>
> Key: SPARK-17017
> URL: https://issues.apache.org/jira/browse/SPARK-17017
> Project: Spark
>  Issue Type: New Feature
>Reporter: Peng Meng
>Priority: Minor
> Fix For: 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Univariate feature selection works by selecting the best features based on 
> univariate statistical tests. False Positive Rate (FPR) is a popular 
> univariate statistical test for feature selection. Is it necessary to add a 
> chiSquare Selector based on a False Positive Rate (FPR) test, like the one 
> implemented in scikit-learn? 
> http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively

2016-09-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-17585:
---

Assignee: Yanbo Liang

> PySpark SparkContext.addFile supports adding files recursively
> --
>
> Key: SPARK-17585
> URL: https://issues.apache.org/jira/browse/SPARK-17585
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Users would like to add a directory as a dependency in some cases. In Scala they 
> can use {{SparkContext.addFile}} with the argument {{recursive=true}} to 
> recursively add all files under the directory, but Python users can only add a 
> file, not a directory; we should support this as well.
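For reference, the existing Scala behavior that PySpark would mirror (a minimal sketch; the paths and file names are illustrative):
{code}
// Scala already accepts a directory when recursive = true
sc.addFile("/path/to/conf-dir", recursive = true)

// files under the directory are then available on every node
val localCopy = org.apache.spark.SparkFiles.get("conf-dir/app.conf")  // hypothetical file inside the dir
{code}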



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17585) PySpark SparkContext.addFile supports adding files recursively

2016-09-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-17585.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> PySpark SparkContext.addFile supports adding files recursively
> --
>
> Key: SPARK-17585
> URL: https://issues.apache.org/jira/browse/SPARK-17585
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> Users would like to add a directory as a dependency in some cases. In Scala they 
> can use {{SparkContext.addFile}} with the argument {{recursive=true}} to 
> recursively add all files under the directory, but Python users can only add a 
> file, not a directory; we should support this as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11918) Better error from WLS for cases like singular input

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509222#comment-15509222
 ] 

Apache Spark commented on SPARK-11918:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15177

> Better error from WLS for cases like singular input
> ---
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Sean Owen
>Priority: Minor
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (such as a 0/1 label used for classification with an 
> underdetermined system), WLS fails (while "l-bfgs" can train and produce a 
> model). The failure is caused by the underlying LAPACK library returning an 
> error value during the Cholesky decomposition.
> This issue is easy to reproduce: you can train a LinearRegressionModel with 
> the "normal" solver on the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17622) Cannot run SparkR function on Windows- Spark 2.0.0

2016-09-21 Thread renzhi he (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renzhi he updated SPARK-17622:
--
Description: 
sc <- sparkR.session(master="local[*]", appName="sparkR", sparkConfig = 
list(spark.driver.memory = "2g"))

df <- as.DataFrame(faithful)

get error below:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedSt


on spark 1.6.1 and spark 1.6.2 can run the corresponding codes.
sc1 <- sparkR.init(master = "local[*]", sparkEnvir = 
list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc1)
df <- as.DataFrame(sqlContext,faithful)

  was:
sc <- sparkR.session(master="spark://spark01.cmua.dom:7077", appName="sparkR", 
sparkConfig = list(spark.driver.memory = "2g"))

df <- as.DataFrame(faithful)


get error below:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
   at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedSt


> Cannot run SparkR function on Windows- Spark 2.0.0
> --
>
> Key: SPARK-17622
> URL: https://issues.apache.org/jira/browse/SPARK-17622
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.0
> Environment: windows 10
> R 3.3.1
> RStudio 1.0.20
>Reporter: renzhi he
>  Labels: windows
> Fix For: 1.6.1, 1.6.2
>
>
> sc <- sparkR.session(master="local[*]", appName="sparkR", sparkConfig = 
> list(spark.driver.memory = "2g"))
> df <- as.DataFrame(faithful)
> get error below:
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
> at org.apache.spark.sql.hive.HiveSharedSt
> on spark 1.6.1 and spark 1.6.2 can run the corresponding codes.
> sc1 <- sparkR.init(master = "local[*]", sparkEnvir = 
> list(spark.driver.memory="2g"))
> sqlContext <- sparkRSQL.init(sc1)
> df <- as.DataFrame(sqlContext,faithful)



--
This message was sent by 

[jira] [Updated] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2016-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17614:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Priority: Minor
>  Labels: cassandra-jdbc, sql
>
> I have code like the following with the Cassandra JDBC wrapper 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
> final String dbTable = "sql_demo";
> Dataset<Row> jdbcDF = sparkSession.read()
>     .jdbc(CASSANDRA_CONNECTION_URL, dbTable, connectionProperties);
> List<Row> rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.<init>(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark JDBC code uses the SQL syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> I'm not sure how this issue can be resolved... this is because CQL is not 
> standard SQL. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17621) Accumulator value is doubled when using DataFrame.orderBy()

2016-09-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509234#comment-15509234
 ] 

Sean Owen commented on SPARK-17621:
---

I think you're generally relying on the RDD being evaluated once, but that's 
not the case in some of your examples. Why not just use count()?
You said you see two jobs generated, and that will tell you what is running 
that may evaluate the RDD.
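A minimal sketch of that suggestion against the sample in the quoted issue (counting with an action instead of relying on the accumulator being bumped exactly once):
{code}
// count() is computed by its own action, so it is not affected by how many
// jobs later re-evaluate the lineage
val recordsRead = usersDF.count()
println("records read: " + recordsRead)
{code}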

> Accumulator value is doubled when using DataFrame.orderBy()
> ---
>
> Key: SPARK-17621
> URL: https://issues.apache.org/jira/browse/SPARK-17621
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, SQL
>Affects Versions: 2.0.0
> Environment: Development environment. (Eclipse . Single process) 
>Reporter: Sreelal S L
>Priority: Minor
>
> We are tracing the records read by our source using an accumulator. We do an 
> orderBy on the DataFrame before the output operation. When the job is 
> completed, the accumulator value becomes double the expected value. 
> Below is the sample code I ran. 
> {code} 
>  val sqlContext = SparkSession.builder() 
>   .config("spark.sql.retainGroupColumns", 
> false).config("spark.sql.warehouse.dir", "file:///C:/Test").master("local[*]")
>   .getOrCreate()
> val sc = sqlContext.sparkContext
> val accumulator1 = sc.accumulator(0, "accumulator1")
> val usersDF = sqlContext.read.json("C:\\users.json") //  single row 
> {"name":"sreelal" ,"country":"IND"}
> val usersDFwithCount = usersDF.rdd.map(x => { accumulator1 += 1; x });
> val counterDF = sqlContext.createDataFrame(usersDFwithCount, 
> usersDF.schema);
> val oderedDF = counterDF.orderBy("name")
> val collected = oderedDF.collect()
> collected.foreach { x => println(x) }
> println("accumulator1 : " + accumulator1.value)
> println("Done");
> {code}
> I have only one row in the users.json file. I expect accumulator1 to have 
> value 1, but it comes out as 2. 
> In the Spark SQL UI, I see two jobs generated for this. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-17596) Streaming job lacks Scala runtime methods

2016-09-21 Thread Evgeniy Tsvigun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Evgeniy Tsvigun closed SPARK-17596.
---
Resolution: Not A Problem

Found that my SPARK_HOME environment variable was pointing to a wrong Spark 
version.

> Streaming job lacks Scala runtime methods
> -
>
> Key: SPARK-17596
> URL: https://issues.apache.org/jira/browse/SPARK-17596
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
> Environment: Linux 4.4.20 x86_64 GNU/Linux
> openjdk version "1.8.0_102"
> Scala 2.11.8
>Reporter: Evgeniy Tsvigun
>  Labels: kafka-0.8, streaming
>
> When using -> in Spark Streaming 2.0.0 jobs, or using 
> spark-streaming-kafka-0-8_2.11 v2.0.0, and submitting it with spark-submit, I 
> get the following error:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted 
> due to stage failure: Task 0 in stage 72.0 failed 1 times, most recent 
> failure: Lost task 0.0 in stage 72.0 (TID 37, localhost): 
> java.lang.NoSuchMethodError: 
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> This only happens with spark-streaming, using ArrowAssoc in plain 
> non-streaming Spark jobs works fine.
> I put a brief illustration of this phenomenon to a GitHub repo: 
> https://github.com/utgarda/spark-2-streaming-nosuchmethod-arrowassoc
> Putting only provided dependencies to build.sbt
> "org.apache.spark" %% "spark-core" % "2.0.0" % "provided",
> "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided"
> using -> anywhere in the driver code, packing it with sbt-assembly and 
> submitting the job results in an error. This isn't a big problem by itself, 
> using ArrowAssoc can be avoided, but spark-streaming-kafka-0-8_2.11 v2.0.0 
> has it somewhere inside and generates the same error.
> Packing scala-library into the assembly, I can see the class in the jar, but 
> it is still reported missing at runtime.
> The issue reported on StackOverflow: 
> http://stackoverflow.com/questions/39395521/spark-2-0-0-streaming-job-packed-with-sbt-assembly-lacks-scala-runtime-methods
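
As an aside (not from the report): `->` is not syntax but a call into scala.Predef, which is why it is sensitive to the Scala runtime shipped with the Spark under SPARK_HOME. A minimal sketch of the desugaring, assuming Scala 2.11:

{code}
// Illustration only: what `->` expands to. The NoSuchMethodError above
// typically means the scala.Predef on the executor classpath does not match
// the Scala 2.11 the job was compiled against.
val pair = "a" -> 1                              // sugar for the call below
val same = scala.Predef.ArrowAssoc("a").->(1)    // scala.Predef$.ArrowAssoc(...)
assert(pair == same)
{code}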



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17622) Cannot run SparkR function on Windows- Spark 2.0.0

2016-09-21 Thread renzhi he (JIRA)
renzhi he created SPARK-17622:
-

 Summary: Cannot run SparkR function on Windows- Spark 2.0.0
 Key: SPARK-17622
 URL: https://issues.apache.org/jira/browse/SPARK-17622
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.0.0
 Environment: windows 10
R 3.3.1
RStudio 1.0.20
Reporter: renzhi he
 Fix For: 1.6.2, 1.6.1


sc <- sparkR.session(master="spark://spark01.cmua.dom:7077", appName="sparkR", 
sparkConfig = list(spark.driver.memory = "2g"))

df <- as.DataFrame(faithful)


get error below:
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
   at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
   at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedSt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17219.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14858
[https://github.com/apache/spark/pull/14858]

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
> Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?
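
For illustration, a hedged Scala sketch of one way to give NaN values a bucket of their own today; this is not the committed fix. It fits on the non-NaN rows and unions the NaN rows back with a reserved bucket id. `titanic` is an assumed DataFrame with an "age" column, as in the example above.

{code}
// Hedged sketch: keep NaN ages out of the quantile fit and assign them a
// dedicated bucket afterwards.
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.functions.{col, isnan, lit}

val withAge = titanic.filter(!isnan(col("age")))
val nanAge  = titanic.filter(isnan(col("age")))

val discretizer = new QuantileDiscretizer()
  .setInputCol("age")
  .setOutputCol("ageBucket")
  .setNumBuckets(10)

val bucketed = discretizer.fit(withAge).transform(withAge)
  .union(nanAge.withColumn("ageBucket", lit(10.0)))   // reserved bucket id for NaN
{code}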



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17057) ProbabilisticClassifierModels' thresholds should be > 0

2016-09-21 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17057:
--
Summary: ProbabilisticClassifierModels' thresholds should be > 0  (was: 
ProbabilisticClassifierModels' thresholds should be > 0 and sum < 1 to match 
randomForest cutoff)

> ProbabilisticClassifierModels' thresholds should be > 0
> ---
>
> Key: SPARK-17057
> URL: https://issues.apache.org/jira/browse/SPARK-17057
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: zhengruifeng
>Assignee: Sean Owen
>Priority: Minor
>
> {code}
> val path = "./data/mllib/sample_multiclass_classification_data.txt"
> val data = spark.read.format("libsvm").load(path)
> val rfm = rf.fit(data)
> scala> rfm.setThresholds(Array(0.0,0.0,0.0))
> res4: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_cbe640b0eccc) with 20 trees
> scala> rfm.transform(data).show(5)
> +-++--+-+--+
> |label|features| rawPrediction|  probability|prediction|
> +-++--+-+--+
> |  1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]|   0.0|
> |  1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]|   0.0|
> |  1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]|   0.0|
> |  1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]|   0.0|
> |  0.0|(4,[0,1,2,3],[0.1...|[20.0,0.0,0.0]|[1.0,0.0,0.0]|   0.0|
> +-++--+-+--+
> only showing top 5 rows
> {code}
> If multiple thresholds are set to zero, the prediction of 
> {{ProbabilisticClassificationModel}} is the first index whose corresponding 
> threshold is 0. 
> However, in this case, the index with the max {{probability}} among the indices 
> with a 0 threshold would be the more reasonable value to mark as
> {{prediction}}.
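
For context, a hedged sketch of the selection rule under discussion (scale each class probability by its threshold and take the argmax), which is exactly what becomes ill-defined when a threshold is 0; the function name is illustrative, not Spark API.

{code}
// Illustrative only: the thresholded prediction rule, which requires t > 0.
def predictWithThresholds(probability: Array[Double], thresholds: Array[Double]): Int = {
  require(thresholds.forall(_ > 0), "thresholds must be > 0")
  val scaled = probability.zip(thresholds).map { case (p, t) => p / t }
  scaled.indexOf(scaled.max)
}

// predictWithThresholds(Array(0.2, 0.5, 0.3), Array(0.5, 0.5, 0.5))  // -> 1
{code}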



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15071) Check the result of all TPCDS queries

2016-09-21 Thread Nirman Narang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509559#comment-15509559
 ] 

Nirman Narang commented on SPARK-15071:
---

Started working on this.

> Check the result of all TPCDS queries
> -
>
> Key: SPARK-15071
> URL: https://issues.apache.org/jira/browse/SPARK-15071
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Davies Liu
>
> We should compare the results of all TPCDS queries against other databases that 
> can support all of them (for example, IBM Big SQL, PostgreSQL).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17590) Analyze CTE definitions at once and allow CTE subquery to define CTE

2016-09-21 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17590.
---
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.1.0

> Analyze CTE definitions at once and allow CTE subquery to define CTE
> 
>
> Key: SPARK-17590
> URL: https://issues.apache.org/jira/browse/SPARK-17590
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
> Fix For: 2.1.0
>
>
> We substitute logical plan with CTE definitions in the analyzer rule 
> CTESubstitution. A CTE definition can be used in the logical plan 
> multiple times, and its analyzed logical plan should be the same each time. We should 
> not analyze CTE definitions multiple times when they are reused in the query.
> By analyzing CTE definitions before substitution, we can support defining CTE 
> in subquery.
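
A syntax sketch (assuming a SparkSession named `spark`) of the kind of query this change is meant to allow: a CTE whose own definition contains another CTE.

{code}
// Sketch only: a CTE subquery that defines its own CTE.
val df = spark.sql(
  """
    |WITH t AS (
    |  WITH t2 AS (SELECT 1 AS id)
    |  SELECT * FROM t2
    |)
    |SELECT * FROM t
  """.stripMargin)
{code}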



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9686) Spark Thrift server doesn't return correct JDBC metadata

2016-09-21 Thread Shawn Lavelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510098#comment-15510098
 ] 

Shawn Lavelle commented on SPARK-9686:
--

It's been a few months, any progress on this bug?

> Spark Thrift server doesn't return correct JDBC metadata 
> -
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2
>Reporter: pin_zhang
>Assignee: Cheng Lian
>Priority: Critical
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start  start-thriftserver.sh
> 2. connect with beeline
> 3. create table
> 4. show tables; the newly created table is returned
> 5.
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:1/default";
>Properties info = new Properties();
> Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>null, null, null);
> Problem:
>   No tables are returned by this API; this worked in Spark 1.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0

2016-09-21 Thread renzhi he (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renzhi he updated SPARK-17622:
--
Description: 
Under Spark 2.0.0 on Windows, when I try to load or create data with code 
similar to the snippets below, I get the error message below and cannot execute 
the functions.
|sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = 
"2g")) |
|df <- as.DataFrame(faithful) |


Here is the error message:
#Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
#java.lang.reflect.InvocationTargetException
#at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
#at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
#at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
#at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
#at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
#at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
#at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
#at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
#at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
#at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
#at org.apache.spark.sql.hive.HiveSharedSt


However, under Spark 1.6.1 or 1.6.2, the functionally equivalent code below 
runs without any problem.
|sc1 <- sparkR.init(master = "local", sparkEnvir = 
list(spark.driver.memory="2g"))|
|sqlContext <- sparkRSQL.init(sc1)|
|df <- as.DataFrame(sqlContext,faithful|

  was:
Under spark2.0.0- on Windows- when try to load or create data with the similar 
codes below, I also get error message and cannot execute the functions.
|sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = 
"2g")) |
|df <- as.DataFrame(faithful) |


Here is the error message:
#Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
#java.lang.reflect.InvocationTargetException
#at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
#at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
#at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
#at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
#at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
#at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
#at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
#at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
#at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
#at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
#at org.apache.spark.sql.hive.HiveSharedSt


However, under spark1.6.1 or spark1.6.2, run the same functional functions, 
there will be no problem.
|sc1 <- sparkR.init(master = "local", sparkEnvir = 
list(spark.driver.memory="2g"))|
|sqlContext <- sparkRSQL.init(sc1)|
|df <- as.DataFrame(sqlContext,faithful|



> Cannot run create or load DF on Windows- Spark 2.0.0
> 
>
> Key: SPARK-17622
> URL: https://issues.apache.org/jira/browse/SPARK-17622
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.0
> Environment: windows 10
> R 3.3.1
> RStudio 1.0.20
>Reporter: renzhi he
>  Labels: windows
> Fix For: 1.6.1, 1.6.2
>
>
> Under spark2.0.0- on Windows- when try to load or create data with the 
> similar codes below, I also get error message and cannot execute the 
> functions.
> |sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = 
> "2g")) |
> |df <- as.DataFrame(faithful) |
> Here is the error message:
> #Error in invokeJava(isStatic = TRUE, className, methodName, ...) :   
>  
> #java.lang.reflect.InvocationTargetException
> #at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> #at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> #at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> #at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> #at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> #at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> 

[jira] [Closed] (SPARK-17610) The failed stage caused by FetchFailed may never be resubmitted

2016-09-21 Thread Tao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Wang closed SPARK-17610.

Resolution: Not A Problem

> The failed stage caused by FetchFailed may never be resubmitted
> ---
>
> Key: SPARK-17610
> URL: https://issues.apache.org/jira/browse/SPARK-17610
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Tao Wang
>Priority: Critical
>
> We have a problem in our environment in which a failed stage is never 
> resubmitted. Because it is caused by a FetchFailed exception, I took a 
> look at the corresponding code segment and found some issues:
> In DAGScheduler.handleTaskCompletion, it first checks whether `failedStages` is 
> empty, and does two steps when the answer is true:
> 1. send `ResubmitFailedStages` to eventProcessLoop 
> 2. add the failed stages into `failedStages`
> In `eventProcessLoop`, it first takes all elements in `failedStages` to 
> resubmit them, then clears the set.
> If the events happen as below, there can be a problem (assume t1 < t2 < t3):
> at t1, failed stage 1 is handled and ResubmitFailedStages is sent to 
> eventProcessLoop
> at t2, eventProcessLoop handles ResubmitFailedStages and clears the still-empty 
> `failedStages`
> at t3, failed stage 1 is added into `failedStages`
> Now failed stage 1 has not been resubmitted.
> From t3 on, `failedStages` is never empty, so even when new stages fail with 
> FetchFailed, no new ResubmitFailedStages event is scheduled, because 
> `failedStages` (still containing failed stage 1) is not empty.
> The code is below: 
> {code}
> } else if (failedStages.isEmpty) {
> // Don't schedule an event to resubmit failed stages if failed 
> isn't empty, because
> // in that case the event will already have been scheduled.
> // TODO: Cancel running tasks in the stage
> logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " +
>   s"$failedStage (${failedStage.name}) due to fetch failure")
> messageScheduler.schedule(new Runnable {
>   override def run(): Unit = 
> eventProcessLoop.post(ResubmitFailedStages)
> }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
>   }
>   failedStages += failedStage
>   failedStages += mapStage
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0

2016-09-21 Thread renzhi he (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renzhi he updated SPARK-17622:
--
Description: (was: sc <- sparkR.session(master="local[*]",  sparkConfig 
= list(spark.driver.memory = "2g"))

df <- as.DataFrame(faithful)

get error below:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedSt


on spark 1.6.1 and spark 1.6.2 can run the corresponding codes.
sc1 <- sparkR.init(master = "local[*]", sparkEnvir = 
list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc1)
df <- as.DataFrame(sqlContext,faithful))

> Cannot run create or load DF on Windows- Spark 2.0.0
> 
>
> Key: SPARK-17622
> URL: https://issues.apache.org/jira/browse/SPARK-17622
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.0
> Environment: windows 10
> R 3.3.1
> RStudio 1.0.20
>Reporter: renzhi he
>  Labels: windows
> Fix For: 1.6.1, 1.6.2
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17610) The failed stage caused by FetchFailed may never be resubmitted

2016-09-21 Thread Tao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509712#comment-15509712
 ] 

Tao Wang commented on SPARK-17610:
--

For the reasons mentioned in https://github.com/apache/spark/pull/15176, this is 
not a bug, so I am closing it.

> The failed stage caused by FetchFailed may never be resubmitted
> ---
>
> Key: SPARK-17610
> URL: https://issues.apache.org/jira/browse/SPARK-17610
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0
>Reporter: Tao Wang
>Priority: Critical
>
> We have a problem in our environment in which a failed stage is never 
> resubmitted. Because it is caused by a FetchFailed exception, I took a 
> look at the corresponding code segment and found some issues:
> In DAGScheduler.handleTaskCompletion, it first checks whether `failedStages` is 
> empty, and does two steps when the answer is true:
> 1. send `ResubmitFailedStages` to eventProcessLoop 
> 2. add the failed stages into `failedStages`
> In `eventProcessLoop`, it first takes all elements in `failedStages` to 
> resubmit them, then clears the set.
> If the events happen as below, there can be a problem (assume t1 < t2 < t3):
> at t1, failed stage 1 is handled and ResubmitFailedStages is sent to 
> eventProcessLoop
> at t2, eventProcessLoop handles ResubmitFailedStages and clears the still-empty 
> `failedStages`
> at t3, failed stage 1 is added into `failedStages`
> Now failed stage 1 has not been resubmitted.
> From t3 on, `failedStages` is never empty, so even when new stages fail with 
> FetchFailed, no new ResubmitFailedStages event is scheduled, because 
> `failedStages` (still containing failed stage 1) is not empty.
> The code is below: 
> {code}
> } else if (failedStages.isEmpty) {
> // Don't schedule an event to resubmit failed stages if failed 
> isn't empty, because
> // in that case the event will already have been scheduled.
> // TODO: Cancel running tasks in the stage
> logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " +
>   s"$failedStage (${failedStage.name}) due to fetch failure")
> messageScheduler.schedule(new Runnable {
>   override def run(): Unit = 
> eventProcessLoop.post(ResubmitFailedStages)
> }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
>   }
>   failedStages += failedStage
>   failedStages += mapStage
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17607) --driver-url doesn't point to my master_ip.

2016-09-21 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509773#comment-15509773
 ] 

Sasi commented on SPARK-17607:
--

It's different, because I was able to start my master with IP 10.5.5.2 and I 
saw that each worker registered on 10.5.5.2.
I was also able to see that each worker has its own private IP, e.g. worker1 - 
10.5.5.3, worker3 - 10.5.5.4, etc.
My spark-env.sh contained both the master IP and the worker IP, as it should.

The problem in my report is that once I requested data from the workers, the 
driverUrl was set to 10.0.42.230 instead of the expected 10.5.5.2.

Could you guide me on which logs/info you need for this issue?
Thanks,
Sasi
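
Not a confirmed fix for this report, but for illustration: a sketch that pins the address the driver advertises (spark.driver.host), so the --driver-url handed to executors uses the 10.5.5.x interface. IPs and the app name are taken from the report; SPARK_LOCAL_IP on the driver host is the environment-variable equivalent.

{code}
// Hedged sketch: advertise the driver on the master-facing interface.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://10.5.5.2:7077")
  .setAppName("Spark-DataAccessor-JBoss")
  .set("spark.driver.host", "10.5.5.2")   // address embedded in --driver-url
val sc = new SparkContext(conf)
{code}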

> --driver-url doesn't point to my master_ip.
> ---
>
> Key: SPARK-17607
> URL: https://issues.apache.org/jira/browse/SPARK-17607
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
>Reporter: Sasi
>Priority: Critical
>
> Hi,
> I have master machine and slave machine.
> My master machine contains 2 interfaces.
> First interface has the following ip 10.5.5.2, and the other interface has 
> the following ip 10.0.42.230.
> I configured the MASTER_IP to be 10.5.5.2, so once the master and its workers 
> come up I see the following INFO lines:
> {code}
> 16/09/20 12:32:32 INFO Worker: Successfully registered with master 
> spark://10.5.5.2:7077
> 16/09/20 12:39:15 INFO Worker: Asked to launch executor 
> app-20160920123915-/0 for Spark-DataAccessor-JBoss
> {code}
> I set the SPARK_LOCAL_IP on each worker to be its own ip, e.g 10.5.5.5.
> Both constants were configured on spark-env.sh.
> The problem started when I tried to get data from my workers.
> I got the following INFO line in each worker log.
> {code} 
> "--driver-url" 
> "akka.tcp://sparkDriver@10.0.42.230:43683/user/CoarseGrainedScheduler" "
> {code}
> As you can see, the master IP is different from the driver-url IP.
> The master IP is 10.5.5.2 but the driver-url is 10.0.42.230, therefore I'm getting 
> the following errors:
> {code}
> 16/09/20 12:17:57 INFO Slf4jLogger: Slf4jLogger started
> 16/09/20 12:17:57 INFO Remoting: Starting remoting
> 16/09/20 12:17:57 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://driverPropsFetcher@10.5.5.5:34961]
> 16/09/20 12:17:57 INFO Utils: Successfully started service 
> 'driverPropsFetcher' on port 34961.
> 16/09/20 12:19:00 WARN ReliableDeliverySupervisor: Association with remote 
> system [akka.tcp://sparkDriver@10.0.42.230:36711] has failed, address is now 
> gated for [5000] ms. Reason: [Association failed with 
> [akka.tcp://sparkDriver@10.0.42.230:36711]] Caused by: [Connection timed out: 
> /10.0.42.230:36711]
> Exception in thread "main" akka.actor.ActorNotFound: Actor not found for: 
> ActorSelection[Anchor(akka.tcp://sparkDriver@10.0.42.230:36711/), 
> Path(/user/CoarseGrainedScheduler)]
> at
> {code}
> {code}
>  "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "akka.tcp://sparkDriver@10.0.42.230:43683/user/CoarseGrainedScheduler"
> {code}
> The master is listening and open for communication via 10.5.5.2, not 
> 10.0.42.230.
> It looks like the driver-url ignores the real MASTER_IP.
> Thanks,
> Sasi



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0

2016-09-21 Thread renzhi he (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renzhi he updated SPARK-17622:
--
Summary: Cannot run create or load DF on Windows- Spark 2.0.0  (was: Cannot 
run SparkR function on Windows- Spark 2.0.0)

> Cannot run create or load DF on Windows- Spark 2.0.0
> 
>
> Key: SPARK-17622
> URL: https://issues.apache.org/jira/browse/SPARK-17622
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.0
> Environment: windows 10
> R 3.3.1
> RStudio 1.0.20
>Reporter: renzhi he
>  Labels: windows
> Fix For: 1.6.1, 1.6.2
>
>
> sc <- sparkR.session(master="local[*]", appName="sparkR", sparkConfig = 
> list(spark.driver.memory = "2g"))
> df <- as.DataFrame(faithful)
> get error below:
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
> at org.apache.spark.sql.hive.HiveSharedSt
> on spark 1.6.1 and spark 1.6.2 can run the corresponding codes.
> sc1 <- sparkR.init(master = "local[*]", sparkEnvir = 
> list(spark.driver.memory="2g"))
> sqlContext <- sparkRSQL.init(sc1)
> df <- as.DataFrame(sqlContext,faithful)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0

2016-09-21 Thread renzhi he (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renzhi he updated SPARK-17622:
--
Description: 
Under Spark 2.0.0 on Windows, when I try to load or create data with code 
similar to the snippets below, I get the error message below and cannot execute 
the functions.
|sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = 
"2g")) |
|df <- as.DataFrame(faithful) |


Here is the error message:
#Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
#java.lang.reflect.InvocationTargetException
#at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
#at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
#at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
#at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
#at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
#at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
#at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
#at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
#at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
#at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
#at org.apache.spark.sql.hive.HiveSharedSt


However, under Spark 1.6.1 or 1.6.2, the functionally equivalent code below 
runs without any problem.
|sc1 <- sparkR.init(master = "local", sparkEnvir = 
list(spark.driver.memory="2g"))|
|sqlContext <- sparkRSQL.init(sc1)|
|df <- as.DataFrame(sqlContext,faithful|


> Cannot run create or load DF on Windows- Spark 2.0.0
> 
>
> Key: SPARK-17622
> URL: https://issues.apache.org/jira/browse/SPARK-17622
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.0
> Environment: windows 10
> R 3.3.1
> RStudio 1.0.20
>Reporter: renzhi he
>  Labels: windows
> Fix For: 1.6.1, 1.6.2
>
>
> Under spark2.0.0- on Windows- when try to load or create data with the 
> similar codes below, I also get error message and cannot execute the 
> functions.
> |sc <- sparkR.session(master="local",sparkConfig = list(spark.driver.memory = 
> "2g")) |
> |df <- as.DataFrame(faithful) |
> Here is the error message:
> #Error in invokeJava(isStatic = TRUE, className, methodName, ...) :   
>  
> #java.lang.reflect.InvocationTargetException
> #at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> #at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> #at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> #at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> #at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> #at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> #at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> #at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
> #at org.apache.spark.sql.hive.HiveSharedSt
> However, under spark1.6.1 or spark1.6.2, run the same functional functions, 
> there will be no problem.
> |sc1 <- sparkR.init(master = "local", sparkEnvir = 
> list(spark.driver.memory="2g"))|
> |sqlContext <- sparkRSQL.init(sc1)|
> |df <- as.DataFrame(sqlContext,faithful|



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17606) New batches are not created when there are 1000 created after restarting streaming from checkpoint.

2016-09-21 Thread etienne (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15509821#comment-15509821
 ] 

etienne commented on SPARK-17606:
-

Sorry, I asked Ops for the logs, but they have been lost. 
I have to wait for another long break in my Spark streaming job to get them. 
The best would be to reproduce this in a test environment and keep the logs.

> New batches are not created when there are 1000 created after restarting 
> streaming from checkpoint.
> ---
>
> Key: SPARK-17606
> URL: https://issues.apache.org/jira/browse/SPARK-17606
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.1
>Reporter: etienne
>
> When Spark restarts from a checkpoint after being down for a while,
> it recreates the batches missed during the downtime.
> When there are only a few missing batches, Spark creates a new incoming batch 
> every batchTime, but when enough time has been missed to create 1000 batches, 
> no new batch is created.
> So once all of those batches are completed, the stream is idle ...
> I think there is a rigid limit set somewhere.
> I was expecting Spark to keep recreating the missed batches, maybe not all 
> at once (because that looks like it causes driver memory problems), and then 
> create batches every batchTime.
> Another solution would be to not create the missing batches but still restart 
> the direct input.
> Right now the only way for me to restart a stream after a long break is 
> to remove the checkpoint so a new stream is created, but that loses 
> all my state.
> ps : I'm speaking about direct Kafka input because it's the source I'm 
> currently using, I don't know what happens with other sources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17623) Failed tasks end reason is always a TaskFailedReason, types should reflect this

2016-09-21 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-17623:


 Summary: Failed tasks end reason is always a TaskFailedReason, 
types should reflect this
 Key: SPARK-17623
 URL: https://issues.apache.org/jira/browse/SPARK-17623
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Spark Core
Affects Versions: 2.0.0
Reporter: Imran Rashid
Assignee: Imran Rashid
Priority: Minor


Minor code cleanup.  In TaskResultGetter, enqueueFailedTask currently 
deserializes the result as a TaskEndReason.  But the type is actually more 
specific: it's a TaskFailedReason.  This just leads to more blind casting later 
on -- it would be clearer if the msg were cast to the right type immediately, 
so method parameter types could be tightened.
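
A simplified sketch of the tightening described above, not the actual TaskResultGetter code: deserialize straight to TaskFailedReason so later code does not need a blind cast from TaskEndReason.

{code}
// Hedged sketch; the helper name and wiring are illustrative.
import java.nio.ByteBuffer
import org.apache.spark.TaskFailedReason
import org.apache.spark.serializer.SerializerInstance

def deserializeFailure(ser: SerializerInstance, data: ByteBuffer): TaskFailedReason =
  ser.deserialize[TaskFailedReason](data)
{code}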



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17623) Failed tasks end reason is always a TaskFailedReason, types should reflect this

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17623:


Assignee: Apache Spark  (was: Imran Rashid)

> Failed tasks end reason is always a TaskFailedReason, types should reflect 
> this
> ---
>
> Key: SPARK-17623
> URL: https://issues.apache.org/jira/browse/SPARK-17623
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Assignee: Apache Spark
>Priority: Minor
>
> Minor code cleanup.  In TaskResultGetter, enqueueFailedTask currently 
> deserializes the result as a TaskEndReason.  But the type is actually more 
> specific, its a TaskFailedReason.  This just leads to more blind casting 
> later on -- it would be more clear if the msg was cast to the right type 
> immediately, so method parameter types could be tightened.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17623) Failed tasks end reason is always a TaskFailedReason, types should reflect this

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510377#comment-15510377
 ] 

Apache Spark commented on SPARK-17623:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/15181

> Failed tasks end reason is always a TaskFailedReason, types should reflect 
> this
> ---
>
> Key: SPARK-17623
> URL: https://issues.apache.org/jira/browse/SPARK-17623
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
>
> Minor code cleanup.  In TaskResultGetter, enqueueFailedTask currently 
> deserializes the result as a TaskEndReason.  But the type is actually more 
> specific, its a TaskFailedReason.  This just leads to more blind casting 
> later on -- it would be more clear if the msg was cast to the right type 
> immediately, so method parameter types could be tightened.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17623) Failed tasks end reason is always a TaskFailedReason, types should reflect this

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17623:


Assignee: Imran Rashid  (was: Apache Spark)

> Failed tasks end reason is always a TaskFailedReason, types should reflect 
> this
> ---
>
> Key: SPARK-17623
> URL: https://issues.apache.org/jira/browse/SPARK-17623
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
>
> Minor code cleanup.  In TaskResultGetter, enqueueFailedTask currently 
> deserializes the result as a TaskEndReason.  But the type is actually more 
> specific, its a TaskFailedReason.  This just leads to more blind casting 
> later on -- it would be more clear if the msg was cast to the right type 
> immediately, so method parameter types could be tightened.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17044) Add window function test in SQLQueryTestSuite

2016-09-21 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510464#comment-15510464
 ] 

Dongjoon Hyun commented on SPARK-17044:
---

Hi, [~rxin]
Could you review this issue?

> Add window function test in SQLQueryTestSuite
> -
>
> Key: SPARK-17044
> URL: https://issues.apache.org/jira/browse/SPARK-17044
> Project: Spark
>  Issue Type: Improvement
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue adds a SQL query test for Window functions for new 
> `SQLQueryTestSuite`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17624) Flaky test? StateStoreSuite maintenance

2016-09-21 Thread Adam Roberts (JIRA)
Adam Roberts created SPARK-17624:


 Summary: Flaky test? StateStoreSuite maintenance
 Key: SPARK-17624
 URL: https://issues.apache.org/jira/browse/SPARK-17624
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.0.1
Reporter: Adam Roberts
Priority: Minor


I've noticed this test failing consistently (25x in a row) on a two-core 
machine but not on an eight-core machine.

If we increase the spark.rpc.numRetries value used in the test from 1 to 2 (3 
being the default in Spark), the test reliably passes; we can also gain 
reliability by setting the master to anything other than just local.

Is there a reason spark.rpc.numRetries is set to be 1?

I see this failure is also mentioned here, so it's been flaky for a while: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-td18367.html
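
A hedged sketch of the tweak described above; the SparkConf wiring is illustrative and not the actual StateStoreSuite fixture code.

{code}
// Sketch only: bump spark.rpc.numRetries for the test's local context.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("StateStoreSuite-maintenance")
  .set("spark.rpc.numRetries", "2")   // the test currently uses 1; Spark's default is 3
val sc = new SparkContext(conf)
{code}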

If we run without the "quietly" code so we get debug info:
{code}
16:26:15.213 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error sending 
message [message = 
VerifyIfInstanceActive(StateStoreId(/home/aroberts/Spark-DK/sql/core/target/tmp/spark-cc44f5fa-b675-426f-9440-76785c365507/ૺꎖ鮎衲넅-28e9196f-8b2d-43ba-8421-44a5c5e98ceb,0,0),driver)]
 in 1 attempts
org.apache.spark.SparkException: Exception thrown in awaitResult
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:91)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$verifyIfStoreInstanceActive(StateStore.scala:227)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:199)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:197)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance(StateStore.scala:197)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$$anon$1.run(StateStore.scala:180)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:785)
Caused by: org.apache.spark.SparkException: Could not find 
StateStoreCoordinator.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
at 
org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:129)
at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:225)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:508)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
... 19 more
16:26:15.217 WARN org.apache.spark.sql.execution.streaming.state.StateStore: 
Error managing StateStore[id = (op=0, part=0), dir = 

[jira] [Updated] (SPARK-17622) Cannot run create or load DF on Windows- Spark 2.0.0

2016-09-21 Thread renzhi he (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

renzhi he updated SPARK-17622:
--
Description: 
sc <- sparkR.session(master="local[*]",  sparkConfig = list(spark.driver.memory 
= "2g"))

df <- as.DataFrame(faithful)

get error below:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedSt


on spark 1.6.1 and spark 1.6.2 can run the corresponding codes.
sc1 <- sparkR.init(master = "local[*]", sparkEnvir = 
list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc1)
df <- as.DataFrame(sqlContext,faithful)

  was:
sc <- sparkR.session(master="local[*]", appName="sparkR", sparkConfig = 
list(spark.driver.memory = "2g"))

df <- as.DataFrame(faithful)

get error below:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
at 
org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
at 
org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
at 
org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
at org.apache.spark.sql.hive.HiveSharedSt


on spark 1.6.1 and spark 1.6.2 can run the corresponding codes.
sc1 <- sparkR.init(master = "local[*]", sparkEnvir = 
list(spark.driver.memory="2g"))
sqlContext <- sparkRSQL.init(sc1)
df <- as.DataFrame(sqlContext,faithful)


> Cannot run create or load DF on Windows- Spark 2.0.0
> 
>
> Key: SPARK-17622
> URL: https://issues.apache.org/jira/browse/SPARK-17622
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.0
> Environment: windows 10
> R 3.3.1
> RStudio 1.0.20
>Reporter: renzhi he
>  Labels: windows
> Fix For: 1.6.1, 1.6.2
>
>
> sc <- sparkR.session(master="local[*]",  sparkConfig = 
> list(spark.driver.memory = "2g"))
> df <- as.DataFrame(faithful)
> get error below:
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:263)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:46)
> at org.apache.spark.sql.hive.HiveSharedSt
> on spark 1.6.1 and spark 1.6.2 can run the corresponding codes.
> sc1 <- sparkR.init(master = "local[*]", sparkEnvir = 
> list(spark.driver.memory="2g"))
> sqlContext <- 

[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2016-09-21 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510198#comment-15510198
 ] 

Seth Hendrickson commented on SPARK-17134:
--

Hmm, it would be nice to see this vs. the old MLOR in the RDD API, just as a sanity 
check. I conducted performance testing against mllib initially, though, so 
there shouldn't be any regressions.

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.
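
For context, a hedged illustration of what a level 2 BLAS formulation looks like here: computing all class margins with one matrix-vector product (dgemv) instead of a loop of dot products. The shapes and values are made up, and this is not the LogisticAggregator code.

{code}
// Illustrative only: margins := coefficients (numClasses x numFeatures) * features.
import com.github.fommil.netlib.BLAS

val blas = BLAS.getInstance()

val numClasses   = 3
val numFeatures  = 4
val coefficients = Array.fill(numClasses * numFeatures)(0.1) // column-major
val features     = Array(0.5, -1.0, 2.0, 0.3)
val margins      = new Array[Double](numClasses)

// margins := 1.0 * coefficients * features + 0.0 * margins
blas.dgemv("N", numClasses, numFeatures, 1.0, coefficients, numClasses,
  features, 1, 0.0, margins, 1)
{code}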



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17625) expectedOutputAttributes should be set when converting SimpleCatalogRelation to LogicalRelation

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17625:


Assignee: Apache Spark

> expectedOutputAttributes should be set when converting SimpleCatalogRelation 
> to LogicalRelation
> ---
>
> Key: SPARK-17625
> URL: https://issues.apache.org/jira/browse/SPARK-17625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>Priority: Minor
>
> expectedOutputAttributes should be set when converting SimpleCatalogRelation 
> to LogicalRelation, otherwise the outputs of LogicalRelation are different 
> from outputs of SimpleCatalogRelation - they have different exprId's.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17625) expectedOutputAttributes should be set when converting SimpleCatalogRelation to LogicalRelation

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510658#comment-15510658
 ] 

Apache Spark commented on SPARK-17625:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/15182

> expectedOutputAttributes should be set when converting SimpleCatalogRelation 
> to LogicalRelation
> ---
>
> Key: SPARK-17625
> URL: https://issues.apache.org/jira/browse/SPARK-17625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Zhenhua Wang
>Priority: Minor
>
> expectedOutputAttributes should be set when converting SimpleCatalogRelation 
> to LogicalRelation, otherwise the outputs of LogicalRelation are different 
> from outputs of SimpleCatalogRelation - they have different exprId's.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2016-09-21 Thread Paul Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Wu updated SPARK-17614:

Comment: was deleted

(was: Create pull request: https://github.com/apache/spark/pull/15183)

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>  Labels: cassandra-jdbc, sql
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset<Row> jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List<Row> rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.<init>(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2016-09-21 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510626#comment-15510626
 ] 

Paul Wu edited comment on SPARK-17614 at 9/21/16 5:42 PM:
--

No, a custom JdbcDialect won't resolve the problem, since DataFrameReader uses 
JDBCRDD and the latter has a hard-coded line

val statement = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0")

for checking that the table exists. See line 61 at

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala

Line 61 needs to use the dialect's table-existence query rather than hard-coding the 
query there.
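
A minimal sketch of what that line could look like if it deferred to the registered 
dialect instead of always emitting "WHERE 1=0" (an illustration only, not the actual 
patch; the probeTable helper is made up):

{code}
// Sketch: let the dialect chosen by JdbcDialects.get(url) decide how to probe the
// table, so a Cassandra dialect can return CQL-friendly SQL such as
// "SELECT * FROM sql_demo LIMIT 1".
import java.sql.Connection
import org.apache.spark.sql.jdbc.JdbcDialects

def probeTable(conn: Connection, url: String, table: String): Unit = {
  val dialect = JdbcDialects.get(url)              // falls back to a no-op dialect if nothing matches
  val query   = dialect.getTableExistsQuery(table) // overridable per dialect
  val stmt    = conn.prepareStatement(query)
  try stmt.executeQuery() finally stmt.close()
}
{code}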


was (Author: zwu@gmail.com):
No, a custom JdbcDialect won't resolve the problem, since DataFrameReader uses 
JDBCRDD and the latter has a hard-coded line

val statement = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0")

for checking that the table exists. See line 61 at

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>  Labels: cassandra-jdbc, sql
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17625) expectedOutputAttributes should be set when converting SimpleCatalogRelation to LogicalRelation

2016-09-21 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-17625:


 Summary: expectedOutputAttributes should be set when converting 
SimpleCatalogRelation to LogicalRelation
 Key: SPARK-17625
 URL: https://issues.apache.org/jira/browse/SPARK-17625
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Zhenhua Wang
Priority: Minor


expectedOutputAttributes should be set when converting SimpleCatalogRelation to 
LogicalRelation, otherwise the outputs of LogicalRelation are different from 
outputs of SimpleCatalogRelation - they have different exprId's.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2016-09-21 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510672#comment-15510672
 ] 

DB Tsai commented on SPARK-17134:
-

I'll try the old RDD-based MLOR tonight when the cluster is not busy. This is a 
very large training dataset, around 160GB in memory. Since there are 22,533 classes 
and 100 features, the total number of parameters is about 2.2M. I expect that level 2 
BLAS will help significantly in this case.
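
To make the point concrete, here is a rough sketch of the shape of the computation 
(sizes taken from the numbers above; this is not the LogisticAggregator code):

{code}
// With K = 22,533 classes and D = 100 features, computing all class margins for one
// example is a K x D matrix-vector product, i.e. a single level-2 BLAS gemv call,
// instead of K separate level-1 dot products.
import breeze.linalg.{DenseMatrix, DenseVector}

val numClasses   = 22533
val numFeatures  = 100
val coefficients = DenseMatrix.zeros[Double](numClasses, numFeatures)
val features     = DenseVector.rand(numFeatures)

// Breeze dispatches this product to gemv (native BLAS when netlib-java finds one).
val margins: DenseVector[Double] = coefficients * features
{code}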

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2016-09-21 Thread Ioana Delaney (JIRA)
Ioana Delaney created SPARK-17626:
-

 Summary: TPC-DS performance improvements using star-schema 
heuristics
 Key: SPARK-17626
 URL: https://issues.apache.org/jira/browse/SPARK-17626
 Project: Spark
  Issue Type: Umbrella
  Components: SQL
Affects Versions: 2.1.0
Reporter: Ioana Delaney
Priority: Critical


*TPC-DS performance improvements using star-schema heuristics*
\\
\\
TPC-DS consists of multiple snowflake schemas, which are star schemas with 
dimensions linking to other dimensions. A star schema consists of a fact table 
referencing a number of dimension tables. The fact table holds the main data about 
a business; a dimension table, usually a smaller table, describes data reflecting 
a dimension/attribute of the business.
\\
\\
As part of the benchmark performance investigation, we observed a pattern of 
sub-optimal execution plans for joins involving large fact tables. Manually rewriting 
some of the queries into selective fact-dimension joins resulted in significant 
performance improvement. This prompted us to develop a simple join reordering 
algorithm based on star-schema detection. Performance testing with the *1TB 
TPC-DS workload* shows an overall improvement of *19%*. 
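
For illustration, the kind of rewrite the heuristic targets looks like the following 
(a made-up TPC-DS-style query, not one of the benchmark queries): the fact table is 
reduced by its selective dimension joins before taking part in any larger join.

{code}
// Hypothetical star-schema query: store_sales is the fact table; date_dim and item
// are dimensions with selective predicates. Joining the fact table to these
// dimensions first shrinks it before any fact-to-fact join.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("star-schema-sketch").getOrCreate()

val reordered = spark.sql("""
  SELECT ss.ss_ticket_number, d.d_date, i.i_item_desc
  FROM store_sales ss
  JOIN date_dim d ON ss.ss_sold_date_sk = d.d_date_sk AND d.d_year = 2000
  JOIN item i     ON ss.ss_item_sk = i.i_item_sk AND i.i_category = 'Music'
""")
{code}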
\\
\\
*Summary of the results:*
{code}
Passed                 99
Failed                  0
Total q time (s)   14,962
Max time            1,467
Min time                3
Mean time             145
Geomean                44
{code}

*Compared to baseline* (Negative = improvement; Positive = Degradation):
{code}
End to end improved (%)              -19%
Mean time improved (%)               -19%
Geomean improved (%)                 -24%
End to end improved (seconds)      -3,603
Number of queries improved (>10%)      45
Number of queries degraded (>10%)       6
Number of queries unchanged            48
Top 10 queries improved (%)          -20%
{code}

Cluster: 20-node cluster with each node having:
* 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 
@ 2.80GHz processors, 128 GB RAM, 10 Gigabit Ethernet.
* Total memory for the cluster: 2.5TB
* Total storage: 400TB
* Total CPU cores: 480

Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA

Database info:
* Schema: TPCDS 
* Scale factor: 1TB total space
* Storage format: Parquet with Snappy compression

Our investigation and results are included in the attached document.

There are two parts to this improvement:
# Join reordering using star schema detection
# New selectivity hint to specify the selectivity of the predicates over base 
tables.
\\
\\





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2016-09-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510707#comment-15510707
 ] 

Sean Owen commented on SPARK-17614:
---

Yup, that much is clearly a bug. Go for a fix, anyone who wants to - or I'll 
fix that to try to unblock further experimentation.

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>  Labels: cassandra-jdbc, sql
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2016-09-21 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510709#comment-15510709
 ] 

Paul Wu commented on SPARK-17614:
-

Create pull request: https://github.com/apache/spark/pull/15183

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>  Labels: cassandra-jdbc, sql
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510704#comment-15510704
 ] 

Apache Spark commented on SPARK-17614:
--

User 'paulzwu' has created a pull request for this issue:
https://github.com/apache/spark/pull/15183

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>  Labels: cassandra-jdbc, sql
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17614:


Assignee: (was: Apache Spark)

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>  Labels: cassandra-jdbc, sql
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17614:


Assignee: Apache Spark

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Assignee: Apache Spark
>  Labels: cassandra-jdbc, sql
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2016-09-21 Thread Ioana Delaney (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ioana Delaney updated SPARK-17626:
--
Description: 
*TPC-DS performance improvements using star-schema heuristics*
\\
\\
TPC-DS consists of multiple snowflake schema, which are multiple star schema 
with dimensions linking to dimensions. A star schema consists of a fact table 
referencing a number of dimension tables. Fact table holds the main data about 
a business. Dimension table, a usually smaller table, describes data reflecting 
the dimension/attribute of a business.
\\
\\
As part of the benchmark performance investigation, we observed a pattern of 
sub-optimal execution plans of large fact tables joins. Manual rewrite of some 
of the queries into selective fact-dimensions joins resulted in significant 
performance improvement. This prompted us to develop a simple join reordering 
algorithm based on star schema detection. The performance testing using *1TB 
TPC-DS workload* shows an overall improvement of *19%*. 
\\
\\
*Summary of the results:*
{code}
Passed                 99
Failed                  0
Total q time (s)   14,962
Max time            1,467
Min time                3
Mean time             145
Geomean                44
{code}

*Compared to baseline* (Negative = improvement; Positive = Degradation):
{code}
End to end improved (%)              -19%
Mean time improved (%)               -19%
Geomean improved (%)                 -24%
End to end improved (seconds)      -3,603
Number of queries improved (>10%)      45
Number of queries degraded (>10%)       6
Number of queries unchanged            48
Top 10 queries improved (%)          -20%
{code}

Cluster: 20-node cluster with each node having:
* 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 
@ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
* Total memory for the cluster: 2.5TB
* Total storage: 400TB
* Total CPU cores: 480

Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA

Database info:
* Schema: TPCDS 
* Scale factor: 1TB total space
* Storage format: Parquet with Snappy compression

Our investigation and results are included in the attached document.

There are two parts to this improvement:
# Join reordering using star schema detection
# New selectivity hint to specify the selectivity of the predicates over base 
tables. Selectivity hint is optional and it was not used in the above TPC-DS 
tests. 
\\



  was:
*TPC-DS performance improvements using star-schema heuristics*
\\
\\
TPC-DS consists of multiple snowflake schema, which are multiple star schema 
with dimensions linking to dimensions. A star schema consists of a fact table 
referencing a number of dimension tables. Fact table holds the main data about 
a business. Dimension table, a usually smaller table, describes data reflecting 
the dimension/attribute of a business.
\\
\\
As part of the benchmark performance investigation, we observed a pattern of 
sub-optimal execution plans of large fact tables joins. Manual rewrite of some 
of the queries into selective fact-dimensions joins resulted in significant 
performance improvement. This prompted us to develop a simple join reordering 
algorithm based on star schema detection. The performance testing using *1TB 
TPC-DS workload* shows an overall improvement of *19%*. 
\\
\\
*Summary of the results:*
{code}
Passed                 99
Failed                  0
Total q time (s)   14,962
Max time            1,467
Min time                3
Mean time             145
Geomean                44
{code}

*Compared to baseline* (Negative = improvement; Positive = Degradation):
{code}
End to end improved (%)              -19%
Mean time improved (%)               -19%
Geomean improved (%)                 -24%
End to end improved (seconds)      -3,603
Number of queries improved (>10%)      45
Number of queries degraded (>10%)       6
Number of queries unchanged            48
Top 10 queries improved (%)          -20%
{code}

Cluster: 20-node cluster with each node having:
* 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 
@ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
* Total memory for the cluster: 2.5TB
* Total storage: 400TB
* Total CPU cores: 480

Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA

Database info:
* Schema: TPCDS 
* Scale factor: 1TB total space
* Storage format: Parquet with Snappy compression

Our investigation and results are included in the attached document.

There are two parts to this improvement:
# Join reordering using star schema detection
# New selectivity hint to specify the selectivity of the predicates over base 
tables.
\\
\\




> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: 

[jira] [Commented] (SPARK-11702) Guava ClassLoading Issue When Using Different Hive Metastore Version

2016-09-21 Thread Joey Paskhay (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510748#comment-15510748
 ] 

Joey Paskhay commented on SPARK-11702:
--

Apologies for the super late response, Sabs. In case you or anyone else is 
still having issues with the work-around, the guava JAR needs to be in both the 
spark.driver.extraClassPath and spark.executor.extraClassPath properties.

So our spark-defaults.conf ended up containing something like the following:

{code}
...
spark.driver.extraClassPath=/usr/lib/hive/lib/guava-15.0.jar:
spark.executor.extraClassPath=/usr/lib/hive/lib/guava-15.0.jar:
...
{code}
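
If the error persists, a quick way to check which jar the driver classloader actually 
serves the Guava class from (an illustrative spark-shell snippet, not part of the 
original work-around):

{code}
// Prints where com.google.common.base.Predicate is resolved from on the driver; if it
// is missing or comes from an unexpected jar, the isolated metastore client loader
// will fail as described in this issue.
val predicateUrl = Option(getClass.getClassLoader.getResource("com/google/common/base/Predicate.class"))
println(predicateUrl.map(_.toString).getOrElse("Predicate not found on the driver classpath"))
{code}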

Hope that helps,
Joey

> Guava ClassLoading Issue When Using Different Hive Metastore Version
> 
>
> Key: SPARK-11702
> URL: https://issues.apache.org/jira/browse/SPARK-11702
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Joey Paskhay
>
> A Guava classloading error can occur when using a different version of the 
> Hive metastore.
> Running the latest version of Spark at this time (1.5.1) and patched versions 
> of Hadoop 2.2.0 and Hive 1.0.0. We set "spark.sql.hive.metastore.version" to 
> "1.0.0" and "spark.sql.hive.metastore.jars" to 
> "/lib/*:". When trying to 
> launch the spark-shell, the sqlContext would fail to initialize with:
> {code}
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> com/google/common/base/Predicate when creating Hive client using classpath: 
> 
> Please make sure that jars for your version of hive and hadoop are included 
> in the paths passed to SQLConfEntry(key = spark.sql.hive.metastore.jars, 
> defaultValue=builtin, doc=...
> {code}
> We verified the Guava libraries are in the huge list of the included jars, 
> but we saw that in the 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method it 
> seems to assume that *all* "com.google" (excluding "com.google.cloud") 
> classes should be loaded from the base class loader. The Spark libraries seem 
> to have *some* "com.google.common.base" classes shaded in but not all.
> See 
> [https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAB51Vx4ipV34e=eishlg7bzldm0uefd_mpyqfe4dodbnbv9...@mail.gmail.com%3E]
>  and its replies.
> The work-around is to add the guava JAR to the "spark.driver.extraClassPath" 
> property.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14849) shuffle broken when accessing standalone cluster through NAT

2016-09-21 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510745#comment-15510745
 ] 

Shixiong Zhu commented on SPARK-14849:
--

[~skyluc] do you still see the error in Spark 2.0.0?

> shuffle broken when accessing standalone cluster through NAT
> 
>
> Key: SPARK-14849
> URL: https://issues.apache.org/jira/browse/SPARK-14849
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
>Reporter: Luc Bourlier
>  Labels: nat, network
>
> I have the following network configuration:
> {code}
>  ++
>  ||
>  |  spark-shell   |
>  ||
>  +- ip: 10.110.101.2 -+
>|
>|
>  +- ip: 10.110.101.1 -+
>  || NAT + routing
>  |  spark-master  | configured
>  ||
>  +- ip: 10.110.100.1 -+
>|
>   ++
>   ||
> +- ip: 10.110.101.2 -++- ip: 10.110.101.3 -+
> ||||
> |  spark-worker 1||  spark-worker 2|
> ||||
> ++++
> {code}
> I have NAT, DNS and routing correctly configure such as each machine can 
> communicate with each other.
> Launch spark-shell against the cluster works well. Simple map operations work 
> too:
> {code}
> scala> sc.makeRDD(1 to 5).map(_ * 5).collect
> res0: Array[Int] = Array(5, 10, 15, 20, 25)
> {code}
> But operations requiring shuffling fail:
> {code}
> scala> sc.makeRDD(1 to 5).map(i => (i,1)).reduceByKey(_ + _).collect
> 16/04/22 15:33:17 WARN TaskSetManager: Lost task 4.0 in stage 2.0 (TID 19, 
> 10.110.101.1): FetchFailed(BlockManagerId(0, 10.110.101.1, 42842), 
> shuffleId=0, mapId=6, reduceId=4, message=
> org.apache.spark.shuffle.FetchFailedException: Failed to connect to 
> /10.110.101.1:42842
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
> [ ... ]
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to /10.110.101.1:42842
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> [ ... ]
>   at org.apache.spark.network.shuffle.RetryingBlockFetcher.access
> [ ... ]
> {code}
> It makes sense that a connection to 10.110.101.1:42842 would fail, no part of 
> the system should have a direct knowledge of the IP address 10.110.101.1.
> So a part of the system is wrongly discovering this IP address.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11918) Better error from WLS for cases like singular input

2016-09-21 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510792#comment-15510792
 ] 

DB Tsai commented on SPARK-11918:
-

+1 on QR decomposition. We may add a feature that uses LBFGS/OWLQN to optimize 
the objective function once AtA is computed. That way we can do one-pass LiR with 
elastic net, and this approach will not suffer from ill-conditioning issues.
+[~sethah], who is interested in one-pass LiR with elastic net using OWLQN.
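
A rough sketch of the one-pass idea being discussed (assumed shapes and a plain dense 
solve for illustration; this is not Spark's WeightedLeastSquares code):

{code}
// One-pass idea: AtA = A^T A (d x d) and Atb = A^T b (d) are accumulated in a single
// pass over the data, then the small local system is solved on the driver. The
// discussion above is about replacing the Cholesky solve of this system with a more
// robust alternative (QR, or LBFGS/OWLQN when an L1/elastic-net penalty is present).
import breeze.linalg.{DenseMatrix, DenseVector}

def solveNormalEquations(ata: DenseMatrix[Double], atb: DenseVector[Double]): DenseVector[Double] =
  ata \ atb  // plain LAPACK-backed dense solve, shown only to make the structure concrete

// Tiny usage example with made-up sufficient statistics:
val ata = DenseMatrix((4.0, 2.0), (2.0, 3.0))
val atb = DenseVector(1.0, 2.0)
val coefficients = solveNormalEquations(ata, atb)
{code}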

> Better error from WLS for cases like singular input
> ---
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Sean Owen
>Priority: Minor
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization method for solve 
> Linear Regression (when #feature < 4096). But if the dataset is very ill 
> condition (such as 0-1 based label used for classification and the equation 
> is underdetermined), the WLS failed (But "l-bfgs" can train and get the 
> model). The failure is caused by the underneath lapack library return error 
> value when Cholesky decomposition.
> This issue is easy to reproduce, you can train a LinearRegressionModel by 
> "normal" solver with the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2016-09-21 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510525#comment-15510525
 ] 

Paul Wu commented on SPARK-17614:
-

Thanks. I tried to register my custom dialect as follows, but it does not 
reach the getTableExistsQuery() method. Could anyone help?

{code}
import org.apache.spark.sql.jdbc.JdbcDialect;

public class NRSCassandraDialect extends JdbcDialect {

    @Override
    public boolean canHandle(String url) {
        System.out.println("came here.." + url.startsWith("jdbc:cassandra"));
        return url.startsWith("jdbc:cassandra");
    }

    @Override
    public String getTableExistsQuery(String table) {
        System.out.println("query?");
        return "SELECT * from " + table + " LIMIT 1";
    }
}
{code}

--

{code}
import java.io.Serializable;
import java.util.Properties;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.jdbc.JdbcDialects;

public class CassJDBC implements Serializable {

    private static final org.apache.log4j.Logger LOGGER =
            org.apache.log4j.Logger.getLogger(CassJDBC.class);

    private static final String _CONNECTION_URL =
            "jdbc:cassandra://ulpd326..com/test?loadbalancing=DCAwareRoundRobinPolicy(%22datacenter1%22)";
    private static final String _USERNAME = "";
    private static final String _PWD = "";

    private static final SparkSession sparkSession = SparkSession.builder()
            .config("spark.sql.warehouse.dir", "file:///home/zw251y/tmp")
            .master("local[*]")
            .appName("Spark2JdbcDs")
            .getOrCreate();

    public static void main(String[] args) {
        // Register the custom dialect before reading through JDBC.
        JdbcDialects.registerDialect(new NRSCassandraDialect());
        final Properties connectionProperties = new Properties();

        final String dbTable = "sql_demo";

        Dataset<Row> jdbcDF = sparkSession.read()
                .jdbc(_CONNECTION_URL, dbTable, connectionProperties);

        jdbcDF.show();
    }
}
{code}


Error message:
{code}
came here..true
parameters = "datacenter1"
Exception in thread "main" java.sql.SQLTransientException: 
com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
	at com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.<init>(CassandraPreparedStatement.java:108)
	at com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
	at com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
	at com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
{code}

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Priority: Minor
>  Labels: cassandra-jdbc, sql
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2016-09-21 Thread Paul Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Wu updated SPARK-17614:

Priority: Major  (was: Minor)

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>  Labels: cassandra-jdbc, sql
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17614) sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra does not support

2016-09-21 Thread Paul Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510626#comment-15510626
 ] 

Paul Wu commented on SPARK-17614:
-

No, a custom JdbcDialect won't resolve the problem, since DataFrameReader uses 
JDBCRDD and the latter has a hard-coded line

val statement = conn.prepareStatement(s"SELECT * FROM $table WHERE 1=0")

for checking that the table exists. See line 61 at

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala

> sparkSession.read() .jdbc(***) use the sql syntax "where 1=0" that Cassandra 
> does not support
> -
>
> Key: SPARK-17614
> URL: https://issues.apache.org/jira/browse/SPARK-17614
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: Any Spark Runtime 
>Reporter: Paul Wu
>Priority: Minor
>  Labels: cassandra-jdbc, sql
>
> I have the code like the following with Cassandra JDBC 
> (https://github.com/adejanovski/cassandra-jdbc-wrapper):
>  final String dbTable= "sql_demo";
> Dataset jdbcDF
> = sparkSession.read()
> .jdbc(CASSANDRA_CONNECTION_URL, dbTable, 
> connectionProperties);
> List rows = jdbcDF.collectAsList();
> It threw the error:
> Exception in thread "main" java.sql.SQLTransientException: 
> com.datastax.driver.core.exceptions.SyntaxError: line 1:29 no viable 
> alternative at input '1' (SELECT * FROM sql_demo WHERE [1]...)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraPreparedStatement.(CassandraPreparedStatement.java:108)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:371)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:348)
>   at 
> com.github.adejanovski.cassandra.jdbc.CassandraConnection.prepareStatement(CassandraConnection.java:48)
> The reason is that the Spark jdbc code uses the sql syntax "where 1=0" 
> somewhere (to get the schema?), but Cassandra does not support this syntax. 
> Not sure how this issue can be resolved...this is because CQL is not standard 
> sql. 
> The following log shows more information:
> 16/09/20 13:16:35 INFO CassandraConnection  138: Datacenter: %s; Host: %s; 
> Rack: %s
> 16/09/20 13:16:35 TRACE CassandraPreparedStatement  98: CQL: SELECT * FROM 
> sql_demo WHERE 1=0
> 16/09/20 13:16:35 TRACE RequestHandler  71: [19400322] 
> com.datastax.driver.core.Statement$1@41ccb3b9
> 16/09/20 13:16:35 TRACE RequestHandler  272: [19400322-1] Starting



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17625) expectedOutputAttributes should be set when converting SimpleCatalogRelation to LogicalRelation

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17625:


Assignee: (was: Apache Spark)

> expectedOutputAttributes should be set when converting SimpleCatalogRelation 
> to LogicalRelation
> ---
>
> Key: SPARK-17625
> URL: https://issues.apache.org/jira/browse/SPARK-17625
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Zhenhua Wang
>Priority: Minor
>
> expectedOutputAttributes should be set when converting SimpleCatalogRelation 
> to LogicalRelation, otherwise the outputs of LogicalRelation are different 
> from outputs of SimpleCatalogRelation - they have different exprId's.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

2016-09-21 Thread Ioana Delaney (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ioana Delaney updated SPARK-17626:
--
Attachment: StarSchemaJoinReordering.pptx

> TPC-DS performance improvements using star-schema heuristics
> 
>
> Key: SPARK-17626
> URL: https://issues.apache.org/jira/browse/SPARK-17626
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Ioana Delaney
>Priority: Critical
> Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schema, which are multiple star schema 
> with dimensions linking to dimensions. A star schema consists of a fact table 
> referencing a number of dimension tables. Fact table holds the main data 
> about a business. Dimension table, a usually smaller table, describes data 
> reflecting the dimension/attribute of a business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of 
> sub-optimal execution plans of large fact tables joins. Manual rewrite of 
> some of the queries into selective fact-dimensions joins resulted in 
> significant performance improvement. This prompted us to develop a simple 
> join reordering algorithm based on star schema detection. The performance 
> testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed                 99
> Failed                  0
> Total q time (s)   14,962
> Max time            1,467
> Min time                3
> Mean time             145
> Geomean                44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)              -19%
> Mean time improved (%)               -19%
> Geomean improved (%)                 -24%
> End to end improved (seconds)      -3,603
> Number of queries improved (>10%)      45
> Number of queries degraded (>10%)       6
> Number of queries unchanged            48
> Top 10 queries improved (%)          -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 
> v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base 
> tables.
> \\
> \\



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11702) Guava ClassLoading Issue When Using Different Hive Metastore Version

2016-09-21 Thread Joey Paskhay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joey Paskhay updated SPARK-11702:
-
Description: 
A Guava classloading error can occur when using a different version of the Hive 
metastore.

Running the latest version of Spark at this time (1.5.1) and patched versions 
of Hadoop 2.2.0 and Hive 1.0.0. We set "spark.sql.hive.metastore.version" to 
"1.0.0" and "spark.sql.hive.metastore.jars" to 
"/lib/*:". When trying to launch 
the spark-shell, the sqlContext would fail to initialize with:

{code}
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
com/google/common/base/Predicate when creating Hive client using classpath: 

Please make sure that jars for your version of hive and hadoop are included in 
the paths passed to SQLConfEntry(key = spark.sql.hive.metastore.jars, 
defaultValue=builtin, doc=...
{code}

We verified the Guava libraries are in the huge list of the included jars, but 
we saw that in the 
org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method it 
seems to assume that *all* "com.google" (excluding "com.google.cloud") classes 
should be loaded from the base class loader. The Spark libraries seem to have 
*some* "com.google.common.base" classes shaded in but not all.

See 
[https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAB51Vx4ipV34e=eishlg7bzldm0uefd_mpyqfe4dodbnbv9...@mail.gmail.com%3E]
 and its replies.

The work-around is to add the guava JAR to the "spark.driver.extraClassPath" 
and "spark.executor.extraClassPath" properties.

  was:
A Guava classloading error can occur when using a different version of the Hive 
metastore.

Running the latest version of Spark at this time (1.5.1) and patched versions 
of Hadoop 2.2.0 and Hive 1.0.0. We set "spark.sql.hive.metastore.version" to 
"1.0.0" and "spark.sql.hive.metastore.jars" to 
"/lib/*:". When trying to launch 
the spark-shell, the sqlContext would fail to initialize with:

{code}
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
com/google/common/base/Predicate when creating Hive client using classpath: 

Please make sure that jars for your version of hive and hadoop are included in 
the paths passed to SQLConfEntry(key = spark.sql.hive.metastore.jars, 
defaultValue=builtin, doc=...
{code}

We verified the Guava libraries are in the huge list of the included jars, but 
we saw that in the 
org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method it 
seems to assume that *all* "com.google" (excluding "com.google.cloud") classes 
should be loaded from the base class loader. The Spark libraries seem to have 
*some* "com.google.common.base" classes shaded in but not all.

See 
[https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAB51Vx4ipV34e=eishlg7bzldm0uefd_mpyqfe4dodbnbv9...@mail.gmail.com%3E]
 and its replies.

The work-around is to add the guava JAR to the "spark.driver.extraClassPath" 
property.


> Guava ClassLoading Issue When Using Different Hive Metastore Version
> 
>
> Key: SPARK-11702
> URL: https://issues.apache.org/jira/browse/SPARK-11702
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Joey Paskhay
>
> A Guava classloading error can occur when using a different version of the 
> Hive metastore.
> Running the latest version of Spark at this time (1.5.1) and patched versions 
> of Hadoop 2.2.0 and Hive 1.0.0. We set "spark.sql.hive.metastore.version" to 
> "1.0.0" and "spark.sql.hive.metastore.jars" to 
> "/lib/*:". When trying to 
> launch the spark-shell, the sqlContext would fail to initialize with:
> {code}
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> com/google/common/base/Predicate when creating Hive client using classpath: 
> 
> Please make sure that jars for your version of hive and hadoop are included 
> in the paths passed to SQLConfEntry(key = spark.sql.hive.metastore.jars, 
> defaultValue=builtin, doc=...
> {code}
> We verified the Guava libraries are in the huge list of the included jars, 
> but we saw that in the 
> org.apache.spark.sql.hive.client.IsolatedClientLoader.isSharedClass method it 
> seems to assume that *all* "com.google" (excluding "com.google.cloud") 
> classes should be loaded from the base class loader. The Spark libraries seem 
> to have *some* "com.google.common.base" classes shaded in but not all.
> See 
> [https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAB51Vx4ipV34e=eishlg7bzldm0uefd_mpyqfe4dodbnbv9...@mail.gmail.com%3E]
>  and its replies.
> The work-around is to add the guava JAR to the "spark.driver.extraClassPath" 
> and "spark.executor.extraClassPath" properties.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-21 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510787#comment-15510787
 ] 

Michael Armbrust commented on SPARK-16407:
--

I'm still a little unclear on the use cases we are trying to enable, so the dev 
list sounds like a good place to me.

> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as format, 
> however it could be easier for people to directly pass in a specific provider 
> instance - e.g. for user equivalent of ForeachSink or other sink with 
> non-string parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17418) Spark release must NOT distribute Kinesis related assembly artifact

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17418.

   Resolution: Fixed
 Assignee: Josh Rosen
Fix Version/s: 2.1.0
   2.0.1
   1.6.3

Fixed by my PR for master, branch-2.0, and branch-1.6.

> Spark release must NOT distribute Kinesis related assembly artifact
> ---
>
> Key: SPARK-17418
> URL: https://issues.apache.org/jira/browse/SPARK-17418
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Streaming
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Luciano Resende
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> The Kinesis streaming connector is based on the Amazon Software License, and 
> based on the Apache Legal resolved issues 
> (http://www.apache.org/legal/resolved.html#category-x) it's not allowed to be 
> distributed by Apache projects.
> More details is available in LEGAL-198



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17616.

Resolution: Duplicate

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in 
> Aggregate "
> -
>
> Key: SPARK-17616
> URL: https://issues.apache.org/jira/browse/SPARK-17616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Minor
>
> I execute:
> {code}
> select platform, 
> collect_set(user_auth) as paid_types,
> count(distinct sessionid) as sessions
> from non_hss.session
> where
> event = 'stop' and platform != 'testplatform' and
> not (month = MONTH(current_date()) AND year = YEAR(current_date()) 
> and day = day(current_date())) and
> (
> (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = 
> YEAR(add_months(CURRENT_DATE(), -5)))
> OR
> (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > 
> YEAR(add_months(CURRENT_DATE(), -5)))
> )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate 
> operator containing aggregate functions which don't support partial 
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read error 5 times, and read code once. I still 
> don't understand what I do incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11918) Better error from WLS for cases like singular input

2016-09-21 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-11918.
-
Resolution: Fixed

> Better error from WLS for cases like singular input
> ---
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Sean Owen
>Priority: Minor
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (such as a 0-1 label used for classification, where the 
> equations are underdetermined), WLS fails (while "l-bfgs" can still train and 
> produce a model). The failure is caused by the underlying lapack library 
> returning an error value during the Cholesky decomposition.
> This issue is easy to reproduce: train a LinearRegressionModel with the 
> "normal" solver on the example 
> dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}
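To make the failure mode concrete, here is a self-contained plain-Scala sketch 
(illustrative only; it is not Spark's WeightedLeastSquares/LAPACK code, and the 
object and method names are made up) of how a Cholesky factorization of a 
singular normal-equations matrix hits a non-positive pivot, which is the kind of 
condition the LAPACK routine signals with a non-zero return value:

{code}
object CholeskySketch {
  /** In-place lower-triangular Cholesky; returns the 1-based index of the first
    * non-positive pivot (mirroring LAPACK's "info" convention), or 0 on success. */
  def choleskyInfo(a: Array[Array[Double]]): Int = {
    val n = a.length
    var j = 0
    while (j < n) {
      var d = a(j)(j)
      var k = 0
      while (k < j) { d -= a(j)(k) * a(j)(k); k += 1 }
      if (d <= 0.0) return j + 1          // leading minor is not positive definite
      a(j)(j) = math.sqrt(d)
      var i = j + 1
      while (i < n) {
        var s = a(i)(j)
        k = 0
        while (k < j) { s -= a(i)(k) * a(j)(k); k += 1 }
        a(i)(j) = s / a(j)(j)
        i += 1
      }
      j += 1
    }
    0
  }

  def main(args: Array[String]): Unit = {
    // Two perfectly collinear columns => the normal-equations matrix A^T A is singular.
    val ata = Array(Array(2.0, 2.0), Array(2.0, 2.0))
    println(s"info = ${choleskyInfo(ata)}")   // non-zero, analogous to the lapack error above
  }
}
{code}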



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17618:
---
Description: 
We were getting incorrect results from the DataFrame except method - all rows 
were being returned instead of the ones that intersected. Calling subtract on 
the underlying RDD returned the correct result.

We tracked it down to the use of coalesce - the following is the simplest 
example case we created that reproduces the issue:

{code}
val schema = new StructType().add("test", types.IntegerType )
val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
Row(i)), schema)
val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
Row(i)), schema)
val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
println("Count using normal except = " + t1.except(t3).count())
println("Count using coalesce = " + 
t1.coalesce(8).except(t3.coalesce(8)).count())
{code}

We should get the same result from both uses of except, but the one using 
coalesce returns 100 instead of 94.

  was:
We were getting incorrect results from the DataFrame except method - all rows 
were being returned instead of the ones that intersected. Calling subtract on 
the underlying RDD returned the correct result.

We tracked it down to the use of coalesce - the following is the simplest 
example case we created that reproduces the issue:

val schema = new StructType().add("test", types.IntegerType )
val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
Row(i)), schema)
val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
Row(i)), schema)
val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
println("Count using normal except = " + t1.except(t3).count())
println("Count using coalesce = " + 
t1.coalesce(8).except(t3.coalesce(8)).count())

We should get the same result from both uses of except, but the one using 
coalesce returns 100 instead of 94.


> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Graeme Edwards
>Priority: Minor
>
> We were getting incorrect results from the DataFrame except method - all rows 
> were being returned instead of the ones that intersected. Calling subtract on 
> the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce - the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType )
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
> Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
> Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + 
> t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2016-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17592:

Labels:   (was: correctness)

> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string to an Int. 
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform floor(string.toDouble), while Spark seems to perform 
> round(string.toDouble).
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs string.toInt and throws a 
> NumberFormatException.
> Personally, I think Hive is right, hence my posting this here.
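As an illustration of the two reported behaviours, here is a plain-Scala sketch 
(this only mimics the observed semantics; it is not the actual cast 
implementation of either engine):

{code}
val s = "0.5"
val hiveLike  = math.floor(s.toDouble).toInt   // 0: floor, matching the Hive results above
val sparkLike = math.round(s.toDouble).toInt   // 1: round half up, matching the Spark 2.0.0 results above
println(s"hiveLike=$hiveLike, sparkLike=$sparkLike")
{code}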



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2016-09-21 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17592:

Fix Version/s: (was: 2.0.1)
   (was: 2.1.0)

> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string to an Int. 
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform floor(string.toDouble), while Spark seems to perform 
> round(string.toDouble).
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs string.toInt and throws a 
> NumberFormatException.
> Personally, I think Hive is right, hence my posting this here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17019) Expose off-heap memory usage in various places

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17019:
---
Target Version/s: 2.1.0  (was: 2.0.1, 2.1.0)

> Expose off-heap memory usage in various places
> --
>
> Key: SPARK-17019
> URL: https://issues.apache.org/jira/browse/SPARK-17019
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Saisai Shao
>Priority: Minor
>
> With SPARK-13992, Spark supports persisting data into off-heap memory, but 
> off-heap usage is currently not exposed, which makes it inconvenient for 
> users to monitor and profile. This issue proposes exposing off-heap as well 
> as on-heap memory usage in various places:
> 1. Spark UI's executor page will display both on-heap and off-heap memory 
> usage.
> 2. REST requests will return both on-heap and off-heap memory.
> 3. Both memory usages can also be obtained programmatically from 
> SparkListener.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17618:
---
Affects Version/s: 1.6.2

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Priority: Minor
>
> We were getting incorrect results from the DataFrame except method - all rows 
> were being returned instead of the ones that intersected. Calling subtract on 
> the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce - the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType )
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
> Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
> Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + 
> t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17618:
---
Labels: correctness  (was: )

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Priority: Minor
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method - all rows 
> were being returned instead of the ones that intersected. Calling subtract on 
> the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce - the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType )
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
> Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
> Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + 
> t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17618:
---
Priority: Blocker  (was: Minor)

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method - all rows 
> were being returned instead of the ones that intersected. Calling subtract on 
> the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce - the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType )
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
> Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
> Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + 
> t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17618:
---
Target Version/s: 1.6.3

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method - all rows 
> were being returned instead of the ones that intersected. Calling subtract on 
> the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce - the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType )
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
> Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
> Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + 
> t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510934#comment-15510934
 ] 

Josh Rosen commented on SPARK-17618:


Yep, the problem is that {{Coalesce}} advertises that it accepts Unsafe rows 
but misdeclares its row output format as being regular rows. Comparing an 
UnsafeRow to any other row type for equality always returns false (its 
{{equals()}} implementation is compatible with Java universal equality, so it 
doesn't throw when performing a comparison against a different type). As a 
result, the Except compares safe and unsafe rows, causing the comparisons to be 
incorrect and leading to the wrong answer that you saw here.
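
A minimal sketch of that equality behaviour (assumes Spark's catalyst internal 
classes on the classpath; the demo object is made up for illustration):

{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeProjection}
import org.apache.spark.sql.types.{DataType, IntegerType}

object RowEqualityDemo {
  def main(args: Array[String]): Unit = {
    // A "safe" row and its unsafe (Tungsten binary) counterpart with the same contents.
    val safeRow: InternalRow = new GenericInternalRow(Array[Any](42))
    val toUnsafe = UnsafeProjection.create(Array[DataType](IntegerType))
    val unsafeRow = toUnsafe(safeRow).copy()

    println(safeRow.getInt(0) == unsafeRow.getInt(0)) // true: same logical value
    println(unsafeRow == safeRow)                     // false: UnsafeRow.equals only matches other UnsafeRows
  }
}
{code}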

I'm marking this as a blocker for 1.6.3 and am working on a fix.

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method - all rows 
> were being returned instead of the ones that intersected. Calling subtract on 
> the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce - the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType )
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
> Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
> Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + 
> t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15510874#comment-15510874
 ] 

Josh Rosen commented on SPARK-17618:


It looks like this affects 1.6.2 as well, but I was unable to reproduce in 2.x.

Comparing the two physical plans, I wonder if the issue has to do with Tungsten 
vs. regular internal row formats.

For {{t1.except(t3).explain(true)}}:

{code}
== Physical Plan ==
Except
:- Scan ExistingRDD[test#35] 
+- ConvertToSafe
   +- LeftSemiJoinHash [test#35], [test#36], None
  :- TungstenExchange hashpartitioning(test#35,200), None
  :  +- ConvertToUnsafe
  : +- Scan ExistingRDD[test#35] 
  +- TungstenExchange hashpartitioning(test#36,200), None
 +- ConvertToUnsafe
+- Scan ExistingRDD[test#36]
{code}

whereas {{t1.coalesce(8).except(t3.coalesce(8)).explain(true)}} produces

{code}
Except
:- Coalesce 8
:  +- Scan ExistingRDD[test#35] 
+- Coalesce 8
   +- LeftSemiJoinHash [test#35], [test#36], None
  :- TungstenExchange hashpartitioning(test#35,200), None
  :  +- ConvertToUnsafe
  : +- Scan ExistingRDD[test#35] 
  +- TungstenExchange hashpartitioning(test#36,200), None
 +- ConvertToUnsafe
+- Scan ExistingRDD[test#36]
{code}

My hunch is that Except is inappropriately mixing Tungsten and non-Tungsten row 
formats due to a bug in the row format conversion rules.

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Priority: Minor
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method - all rows 
> were being returned instead of the ones that intersected. Calling subtract on 
> the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce - the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType )
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
> Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
> Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + 
> t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511034#comment-15511034
 ] 

Apache Spark commented on SPARK-17618:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15185

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Assignee: Josh Rosen
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method - all rows 
> were being returned instead of the ones that intersected. Calling subtract on 
> the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce - the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType )
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
> Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
> Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + 
> t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17618:


Assignee: Apache Spark  (was: Josh Rosen)

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Assignee: Apache Spark
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method - all rows 
> were being returned instead of the ones that intersected. Calling subtract on 
> the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce - the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType )
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
> Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
> Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + 
> t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17618:


Assignee: Josh Rosen  (was: Apache Spark)

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Assignee: Josh Rosen
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method - all rows 
> were being returned instead of the ones that intersected. Calling subtract on 
> the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce - the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType )
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
> Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
> Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + 
> t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17618) Dataframe except returns incorrect results when combined with coalesce

2016-09-21 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-17618:
--

Assignee: Josh Rosen

> Dataframe except returns incorrect results when combined with coalesce
> --
>
> Key: SPARK-17618
> URL: https://issues.apache.org/jira/browse/SPARK-17618
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 1.6.2
>Reporter: Graeme Edwards
>Assignee: Josh Rosen
>Priority: Blocker
>  Labels: correctness
>
> We were getting incorrect results from the DataFrame except method - all rows 
> were being returned instead of the ones that intersected. Calling subtract on 
> the underlying RDD returned the correct result.
> We tracked it down to the use of coalesce - the following is the simplest 
> example case we created that reproduces the issue:
> {code}
> val schema = new StructType().add("test", types.IntegerType )
> val t1 = sql.createDataFrame(sql.sparkContext.parallelize(1 to 100).map(i=> 
> Row(i)), schema)
> val t2 = sql.createDataFrame(sql.sparkContext.parallelize(5 to 10).map(i=> 
> Row(i)), schema)
> val t3 = t1.join(t2, t1.col("test").equalTo(t2.col("test")), "leftsemi")
> println("Count using normal except = " + t1.except(t3).count())
> println("Count using coalesce = " + 
> t1.coalesce(8).except(t3.coalesce(8)).count())
> {code}
> We should get the same result from both uses of except, but the one using 
> coalesce returns 100 instead of 94.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17628) Name of "object StreamingExamples" should be more self-explanatory

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17628:


Assignee: Apache Spark

> Name of "object StreamingExamples" should be more self-explanatory 
> ---
>
> Key: SPARK-17628
> URL: https://issues.apache.org/jira/browse/SPARK-17628
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, Streaming
>Affects Versions: 2.0.0
>Reporter: Xin Ren
>Assignee: Apache Spark
>Priority: Minor
>
> `object StreamingExamples` is more of a utility object; the name is too 
> general, and at first I thought it was an actual streaming example.
> {code}
> /** Utility functions for Spark Streaming examples. */
> object StreamingExamples extends Logging {
>   /** Set reasonable logging levels for streaming if the user has not 
> configured log4j. */
>   def setStreamingLogLevels() {
> val log4jInitialized = 
> Logger.getRootLogger.getAllAppenders.hasMoreElements
> if (!log4jInitialized) {
>   // We first log something to initialize Spark's default logging, then 
> we override the
>   // logging level.
>   logInfo("Setting log level to [WARN] for streaming example." +
> " To override add a custom log4j.properties to the classpath.")
>   Logger.getRootLogger.setLevel(Level.WARN)
> }
>   }
> }
> {code}
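For context, a hedged sketch of how the example programs typically call this 
utility (assumes the spark-examples `StreamingExamples` object shown above is on 
the classpath; the enclosing object name is made up):

{code}
object MyExample {
  def main(args: Array[String]): Unit = {
    // Only quiets default log4j output; it is not itself a streaming example.
    StreamingExamples.setStreamingLogLevels()
    // ... set up the StreamingContext and run the real example here ...
  }
}
{code}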



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2016-09-21 Thread Suresh Thalamati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suresh Thalamati reopened SPARK-14536:
--

SPARK-10186 added array data type support for Postgres in 1.6. The NPE issue 
still exists; I was able to reproduce it on master. 

> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>  Labels: NullPointerException
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return a non-null `Array` object with a null underlying array. But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.
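A hedged sketch of the kind of null check being suggested (an illustrative 
helper, not the actual JDBCRDD conversion code):

{code}
import java.sql.ResultSet

// Returns None when the driver (e.g. the PostgreSQL one) hands back a null
// java.sql.Array for a SQL NULL, instead of dereferencing it and throwing an NPE.
def readNullableArray(rs: ResultSet, pos: Int): Option[AnyRef] = {
  Option(rs.getArray(pos)).map(_.getArray)
}
{code}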



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15717) Cannot perform RDD operations on a checkpointed VertexRDD.

2016-09-21 Thread Asher Krim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511727#comment-15511727
 ] 

Asher Krim commented on SPARK-15717:


Any update on this issue? We are experiencing ClassCastExceptions when using 
checkpointing and LDA with the EM optimizer.

> Cannot perform RDD operations on a checkpointed VertexRDD.
> --
>
> Key: SPARK-15717
> URL: https://issues.apache.org/jira/browse/SPARK-15717
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.1
>Reporter: Anderson de Andrade
>
> A checkpointed (materialized) VertexRDD throws the following exception when 
> collected:
> bq. java.lang.ArrayStoreException: 
> org.apache.spark.graphx.impl.ShippableVertexPartition
> Can be replicated by running:
> {code:java}
> graph.vertices.checkpoint()
> graph.vertices.count() // materialize
> graph.vertices.collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)

2016-09-21 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15511837#comment-15511837
 ] 

Hyukjin Kwon commented on SPARK-14536:
--

I see. I read this too quickly and didn't notice that this is actually a 
PostgreSQL-specific issue (I thought this JIRA described a general JDBC 
problem).
Yes, {{ArrayType}} seems to be supported only for {{PostgreSQL}} in Spark. Maybe we 
should link those JIRAs to SPARK-8500 to prevent 
confusion.

> NPE in JDBCRDD when array column contains nulls (postgresql)
> 
>
> Key: SPARK-14536
> URL: https://issues.apache.org/jira/browse/SPARK-14536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Jeremy Smith
>  Labels: NullPointerException
>
> At 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
>  it is assumed that the JDBC driver will definitely return a non-null `Array` 
> object from the call to `getArray`, and that in the event of a null array it 
> will return a non-null `Array` object with a null underlying array. But as 
> you can see here 
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
>  that isn't the case, at least for PostgreSQL.  This causes a 
> `NullPointerException` whenever an array column contains null values. It 
> seems like the PostgreSQL JDBC driver is probably doing the wrong thing, but 
> even so there should be a null check in JDBCRDD.  I'm happy to submit a PR if 
> that would be helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17616:


Assignee: Herman van Hovell  (was: Apache Spark)

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in 
> Aggregate "
> -
>
> Key: SPARK-17616
> URL: https://issues.apache.org/jira/browse/SPARK-17616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Herman van Hovell
>Priority: Minor
>
> I execute:
> {code}
> select platform, 
> collect_set(user_auth) as paid_types,
> count(distinct sessionid) as sessions
> from non_hss.session
> where
> event = 'stop' and platform != 'testplatform' and
> not (month = MONTH(current_date()) AND year = YEAR(current_date()) 
> and day = day(current_date())) and
> (
> (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = 
> YEAR(add_months(CURRENT_DATE(), -5)))
> OR
> (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > 
> YEAR(add_months(CURRENT_DATE(), -5)))
> )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate 
> operator containing aggregate functions which don't support partial 
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times and the code once, and I still 
> don't understand what I'm doing incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17616) Getting "java.lang.RuntimeException: Distinct columns cannot exist in Aggregate "

2016-09-21 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17616:


Assignee: Apache Spark  (was: Herman van Hovell)

> Getting "java.lang.RuntimeException: Distinct columns cannot exist in 
> Aggregate "
> -
>
> Key: SPARK-17616
> URL: https://issues.apache.org/jira/browse/SPARK-17616
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Apache Spark
>Priority: Minor
>
> I execute:
> {code}
> select platform, 
> collect_set(user_auth) as paid_types,
> count(distinct sessionid) as sessions
> from non_hss.session
> where
> event = 'stop' and platform != 'testplatform' and
> not (month = MONTH(current_date()) AND year = YEAR(current_date()) 
> and day = day(current_date())) and
> (
> (month >= MONTH(add_months(CURRENT_DATE(), -5)) AND year = 
> YEAR(add_months(CURRENT_DATE(), -5)))
> OR
> (month <= MONTH(add_months(CURRENT_DATE(), -5)) AND year > 
> YEAR(add_months(CURRENT_DATE(), -5)))
> )
> group by platform
> {code}
> I get:
> {code}
> java.lang.RuntimeException: Distinct columns cannot exist in Aggregate 
> operator containing aggregate functions which don't support partial 
> aggregation.
> {code}
> IT WORKED IN 1.6.2. I've read the error 5 times and the code once, and I still 
> don't understand what I'm doing incorrectly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


