[jira] [Commented] (SPARK-14525) DataFrameWriter's save method should delegate to jdbc for jdbc datasource

2016-04-12 Thread Justin Pihony (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238621#comment-15238621
 ] 

Justin Pihony commented on SPARK-14525:
---

I don't mind putting together a PR for this, but I am curious whether there is 
an opinion on the implementation. I see two options: have the save method 
redirect to the jdbc method, or move the logic from the jdbc method into 
jdbc.DefaultSource so that DataFrameWriter no longer has to be responsible for 
it; jdbc would delegate to save, which would delegate to DataSource.write, 
which would delegate to a new method in jdbc.DefaultSource.

After mulling over the seemingly unclean choice of having save redirect to 
jdbc, I am leaning towards the second option. I think it is the better design 
choice.
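
For illustration, a rough sketch of what the second option might look like, 
assuming jdbc.DefaultSource implements the CreatableRelationProvider interface; 
the option handling and the returned relation here are assumptions, not the 
agreed design:

{code}
import java.util.Properties

import org.apache.spark.sql.{DataFrame, SQLContext, SaveMode}
import org.apache.spark.sql.sources.{BaseRelation, CreatableRelationProvider, RelationProvider}

// Hypothetical sketch: let the jdbc data source handle the write path itself,
// so df.write.format("jdbc")...save() can delegate to it via DataSource.write
// instead of failing with "does not allow create table as select".
class DefaultSource extends RelationProvider with CreatableRelationProvider {

  // Read path: unchanged from the existing jdbc.DefaultSource (stubbed here).
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    sys.error("read path elided in this sketch")

  // Write path: the new method that save() would reach through DataSource.write.
  override def createRelation(
      sqlContext: SQLContext,
      mode: SaveMode,
      parameters: Map[String, String],
      data: DataFrame): BaseRelation = {
    // Option names mirror what the existing jdbc() method expects.
    val url = parameters.getOrElse("url", sys.error("Option 'url' is required"))
    val table = parameters.getOrElse("dbtable", sys.error("Option 'dbtable' is required"))
    val props = new Properties()
    parameters.foreach { case (k, v) => props.setProperty(k, v) }

    // Reuse the existing JDBC write path.
    data.write.mode(mode).jdbc(url, table, props)

    // Hand back a relation over the freshly written table
    // (the real read path would be used here, not the stub above).
    createRelation(sqlContext, parameters)
  }
}
{code}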

> DataFrameWriter's save method should delegate to jdbc for jdbc datasource
> -
>
> Key: SPARK-14525
> URL: https://issues.apache.org/jira/browse/SPARK-14525
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Justin Pihony
>Priority: Minor
>
> If you call {code}df.write.format("jdbc")...save(){code} then you get an 
> error  
> bq. org.apache.spark.sql.execution.datasources.jdbc.DefaultSource does not 
> allow create table as select
> save is a more intuitive guess on the appropriate method to call, so the user 
> should not be punished for not knowing about the jdbc method. 
> Obviously, this will require the caller to have set up the correct parameters 
> for jdbc to work :)






[jira] [Commented] (SPARK-14540) Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner

2016-04-12 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238602#comment-15238602
 ] 

Josh Rosen commented on SPARK-14540:


I found a problem which seems to prevent the cleaning / serialization of 
closures that contain local defs; see 
https://gist.github.com/JoshRosen/8aacdee0162da430868e7f73247d45d8 for a 
write-up describing the problem.
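
For context, a minimal hypothetical example of the kind of closure described in 
the write-up (a closure that calls a local def); the class and method names are 
illustrative only:

{code}
// Hypothetical illustration: a local def is compiled to a method on the
// enclosing class, so a lambda that calls it may capture the (typically
// non-serializable) enclosing instance, and serialization then fails even
// after cleaning.
class Transformer(sc: org.apache.spark.SparkContext) {
  def run(): Long = {
    def addOne(x: Int): Int = x + 1                     // local def
    sc.parallelize(1 to 10).map(x => addOne(x)).count() // closure calls the local def
  }
}
{code}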

> Support Scala 2.12 closures and Java 8 lambdas in ClosureCleaner
> 
>
> Key: SPARK-14540
> URL: https://issues.apache.org/jira/browse/SPARK-14540
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Josh Rosen
>
> Using https://github.com/JoshRosen/spark/tree/build-for-2.12, I tried running 
> ClosureCleanerSuite with Scala 2.12 and ran into two bad test failures:
> {code}
> [info] - toplevel return statements in closures are identified at cleaning 
> time *** FAILED *** (32 milliseconds)
> [info]   Expected exception 
> org.apache.spark.util.ReturnStatementInClosureException to be thrown, but no 
> exception was thrown. (ClosureCleanerSuite.scala:57)
> {code}
> and
> {code}
> [info] - user provided closures are actually cleaned *** FAILED *** (56 
> milliseconds)
> [info]   Expected ReturnStatementInClosureException, but got 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task not 
> serializable: java.io.NotSerializableException: java.lang.Object
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class 
> org.apache.spark.util.TestUserClosuresActuallyCleaned$, 
> functionalInterfaceMethod=scala/runtime/java8/JFunction1$mcII$sp.apply$mcII$sp:(I)I,
>  implementation=invokeStatic 
> org/apache/spark/util/TestUserClosuresActuallyCleaned$.org$apache$spark$util$TestUserClosuresActuallyCleaned$$$anonfun$69:(Ljava/lang/Object;I)I,
>  instantiatedMethodType=(I)I, numCaptured=1])
> [info]- element of array (index: 0)
> [info]- array (class "[Ljava.lang.Object;", size: 1)
> [info]- field (class "java.lang.invoke.SerializedLambda", name: 
> "capturedArgs", type: "class [Ljava.lang.Object;")
> [info]- object (class "java.lang.invoke.SerializedLambda", 
> SerializedLambda[capturingClass=class org.apache.spark.rdd.RDD, 
> functionalInterfaceMethod=scala/Function3.apply:(Ljava/lang/Object;Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object;,
>  implementation=invokeStatic 
> org/apache/spark/rdd/RDD.org$apache$spark$rdd$RDD$$$anonfun$20$adapted:(Lscala/Function1;Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  
> instantiatedMethodType=(Lorg/apache/spark/TaskContext;Ljava/lang/Object;Lscala/collection/Iterator;)Lscala/collection/Iterator;,
>  numCaptured=1])
> [info]- field (class "org.apache.spark.rdd.MapPartitionsRDD", name: 
> "f", type: "interface scala.Function3")
> [info]- object (class "org.apache.spark.rdd.MapPartitionsRDD", 
> MapPartitionsRDD[2] at apply at Transformer.scala:22)
> [info]- field (class "scala.Tuple2", name: "_1", type: "class 
> java.lang.Object")
> [info]- root object (class "scala.Tuple2", (MapPartitionsRDD[2] at 
> apply at 
> Transformer.scala:22,org.apache.spark.SparkContext$$Lambda$957/431842435@6e803685)).
> [info]   This means the closure provided by user is not actually cleaned. 
> (ClosureCleanerSuite.scala:78)
> {code}
> We'll need to figure out a closure cleaning strategy which works for 2.12 
> lambdas.






[jira] [Created] (SPARK-14592) Create table like

2016-04-12 Thread Yin Huai (JIRA)
Yin Huai created SPARK-14592:


 Summary: Create table like
 Key: SPARK-14592
 URL: https://issues.apache.org/jira/browse/SPARK-14592
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai









[jira] [Created] (SPARK-14591) DDLParser should accept decimal(precision)

2016-04-12 Thread Yin Huai (JIRA)
Yin Huai created SPARK-14591:


 Summary: DDLParser should accept decimal(precision)
 Key: SPARK-14591
 URL: https://issues.apache.org/jira/browse/SPARK-14591
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai


Right now, our DDLParser does not support {{decimal(precision)}} (where the 
scale defaults to 0). We should support it.
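
For illustration, the two forms side by side (hypothetical table and column 
names; the exact DDL accepted by DDLParser may differ):

{code}
// decimal(10) should be accepted and treated as decimal(10, 0).
sqlContext.sql("CREATE TEMPORARY TABLE t_a (amount DECIMAL(10, 2)) USING json OPTIONS (path 'a.json')")
sqlContext.sql("CREATE TEMPORARY TABLE t_b (amount DECIMAL(10))    USING json OPTIONS (path 'b.json')")
{code}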






[jira] [Commented] (SPARK-14127) [Table related commands] Describe table

2016-04-12 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238567#comment-15238567
 ] 

Xiao Li commented on SPARK-14127:
-

Most of the work duplicates `show table extended`. Thus, [~dkbiswal] will 
submit a PR to handle both cases. Thanks! 
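
For reference, the statements under discussion (illustrative only; `my_table` 
is a made-up name, and the exact set of supported variants is what the PR will 
pin down):

{code}
// Forms mentioned in the issue:
sqlContext.sql("DESCRIBE my_table")
sqlContext.sql("DESCRIBE EXTENDED my_table")
// Hive syntax with largely overlapping implementation work, handled in the same PR:
sqlContext.sql("SHOW TABLE EXTENDED LIKE 'my_table'")
{code}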

> [Table related commands] Describe table
> ---
>
> Key: SPARK-14127
> URL: https://issues.apache.org/jira/browse/SPARK-14127
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> TOK_DESCTABLE
> Describe a column/table/partition (see here and here). Seems we support 
> DESCRIBE and DESCRIBE EXTENDED. It will be good to also support other 
> syntaxes (and check if we are missing anything).






[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14586:

Description: 
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't 
have a similar parsing behavior for decimals. I wouldn't say it is a bug per 
se, but it looks like a necessary improvement for the two engines to converge. 
Hive version is 1.5.1

Not sure if relevant, but Scala does parse numbers with leading space correctly

{code}
scala> "2.0".toDouble
res21: Double = 2.0

scala> " 2.0".toDouble
res22: Double = 2.0
{code}

  was:
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1

Not sure if relevant, but Scala does parse numbers with leading space correctly

{code}
scala> "2.0".toDouble
res21: Double = 2.0

scala> " 2.0".toDouble
res22: Double = 2.0
{code}


> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +++
> |column_1|column_2|
> +++
> |   a|null|
> |null|3.00|
> +++
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark don't 
> have a similar parsing behavior for decimals. I wouldn't say it is a bug per 
> se, but it looks like a necessary improvement for the two engines to 
> converge. Hive version is 1.5.1
> Not sure if relevant, but Scala does parse numbers with leading space 
> correctly
> {code}
> scala> "2.0".toDouble
> res21: Double = 2.0
> scala> " 2.0".toDouble
> res22: Double = 2.0
> {code}





[jira] [Commented] (SPARK-14499) Add tests to make sure drop partitions of an external table will not delete data

2016-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238552#comment-15238552
 ] 

Apache Spark commented on SPARK-14499:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/12350

> Add tests to make sure drop partitions of an external table will not delete 
> data
> 
>
> Key: SPARK-14499
> URL: https://issues.apache.org/jira/browse/SPARK-14499
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> This is a follow-up of SPARK-14132 
> (https://github.com/apache/spark/pull/12220#issuecomment-207625166) to 
> address https://github.com/apache/spark/pull/12220#issuecomment-207612627.






[jira] [Assigned] (SPARK-14499) Add tests to make sure drop partitions of an external table will not delete data

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14499:


Assignee: Apache Spark

> Add tests to make sure drop partitions of an external table will not delete 
> data
> 
>
> Key: SPARK-14499
> URL: https://issues.apache.org/jira/browse/SPARK-14499
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> This is a follow-up of SPARK-14132 
> (https://github.com/apache/spark/pull/12220#issuecomment-207625166) to 
> address https://github.com/apache/spark/pull/12220#issuecomment-207612627.






[jira] [Assigned] (SPARK-14499) Add tests to make sure drop partitions of an external table will not delete data

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14499:


Assignee: (was: Apache Spark)

> Add tests to make sure drop partitions of an external table will not delete 
> data
> 
>
> Key: SPARK-14499
> URL: https://issues.apache.org/jira/browse/SPARK-14499
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> This is a follow-up of SPARK-14132 
> (https://github.com/apache/spark/pull/12220#issuecomment-207625166) to 
> address https://github.com/apache/spark/pull/12220#issuecomment-207612627.






[jira] [Assigned] (SPARK-14590) Update pull request template with link to jira

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14590:


Assignee: (was: Apache Spark)

> Update pull request template with link to jira
> --
>
> Key: SPARK-14590
> URL: https://issues.apache.org/jira/browse/SPARK-14590
> Project: Spark
>  Issue Type: Improvement
>Reporter: Luciano Resende
>Priority: Minor
>
> Update pull request template to have a link to the current jira issue to 
> facilitate navigation between the two.






[jira] [Commented] (SPARK-14590) Update pull request template with link to jira

2016-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14590?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238532#comment-15238532
 ] 

Apache Spark commented on SPARK-14590:
--

User 'lresende' has created a pull request for this issue:
https://github.com/apache/spark/pull/12349

> Update pull request template with link to jira
> --
>
> Key: SPARK-14590
> URL: https://issues.apache.org/jira/browse/SPARK-14590
> Project: Spark
>  Issue Type: Improvement
>Reporter: Luciano Resende
>Priority: Minor
>
> Update pull request template to have a link to the current jira issue to 
> facilitate navigation between the two.






[jira] [Assigned] (SPARK-14590) Update pull request template with link to jira

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14590:


Assignee: Apache Spark

> Update pull request template with link to jira
> --
>
> Key: SPARK-14590
> URL: https://issues.apache.org/jira/browse/SPARK-14590
> Project: Spark
>  Issue Type: Improvement
>Reporter: Luciano Resende
>Assignee: Apache Spark
>Priority: Minor
>
> Update pull request template to have a link to the current jira issue to 
> facilitate navigation between the two.






[jira] [Created] (SPARK-14590) Update pull request template with link to jira

2016-04-12 Thread Luciano Resende (JIRA)
Luciano Resende created SPARK-14590:
---

 Summary: Update pull request template with link to jira
 Key: SPARK-14590
 URL: https://issues.apache.org/jira/browse/SPARK-14590
 Project: Spark
  Issue Type: Improvement
Reporter: Luciano Resende
Priority: Minor


Update pull request template to have a link to the current jira issue to 
facilitate navigation between the two.






[jira] [Assigned] (SPARK-14589) Enhance DB2 JDBC Dialect docker tests

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14589:


Assignee: (was: Apache Spark)

> Enhance DB2 JDBC Dialect docker tests
> -
>
> Key: SPARK-14589
> URL: https://issues.apache.org/jira/browse/SPARK-14589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Luciano Resende
>







[jira] [Commented] (SPARK-14589) Enhance DB2 JDBC Dialect docker tests

2016-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238521#comment-15238521
 ] 

Apache Spark commented on SPARK-14589:
--

User 'lresende' has created a pull request for this issue:
https://github.com/apache/spark/pull/12348

> Enhance DB2 JDBC Dialect docker tests
> -
>
> Key: SPARK-14589
> URL: https://issues.apache.org/jira/browse/SPARK-14589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Luciano Resende
>







[jira] [Assigned] (SPARK-14589) Enhance DB2 JDBC Dialect docker tests

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14589:


Assignee: Apache Spark

> Enhance DB2 JDBC Dialect docker tests
> -
>
> Key: SPARK-14589
> URL: https://issues.apache.org/jira/browse/SPARK-14589
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Luciano Resende
>Assignee: Apache Spark
>







[jira] [Created] (SPARK-14589) Enhance DB2 JDBC Dialect docker tests

2016-04-12 Thread Luciano Resende (JIRA)
Luciano Resende created SPARK-14589:
---

 Summary: Enhance DB2 JDBC Dialect docker tests
 Key: SPARK-14589
 URL: https://issues.apache.org/jira/browse/SPARK-14589
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Luciano Resende









[jira] [Commented] (SPARK-14311) Model persistence in SparkR

2016-04-12 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238500#comment-15238500
 ] 

Yanbo Liang commented on SPARK-14311:
-

Sure, I can give it a try.
Another issue is that the R model object has feature names, but we do not 
store them in the Scala PipelineModel. I think this should be done in follow-up 
tasks, but how do we handle this issue in the current implementation? Should 
we record the feature and label names in the extraMetadata of PipelineModel?

> Model persistence in SparkR
> ---
>
> Key: SPARK-14311
> URL: https://issues.apache.org/jira/browse/SPARK-14311
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, 
> naive Bayes, and AFT survival regression. Users can fit models, get summary, 
> and make predictions. However, they cannot save/load the models yet.
> ML models in SparkR are wrappers around ML pipelines. So it should be 
> straightforward to implement model persistence. We need to think more about 
> the API. R uses save/load for objects and datasets (also objects). It is 
> possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But 
> I'm not sure whether load can be overloaded easily. I propose the following 
> API:
> {code}
> model <- glm(formula, data = df)
> ml.save(model, path, mode = "overwrite")
> model2 <- ml.load(path)
> {code}
> We defined wrappers as S4 classes. So `ml.save` is an S4 method and ml.load 
> is a S3 method (correct me if I'm wrong).






[jira] [Created] (SPARK-14588) Consider getting column stats from files (wherever feasible) to get better stats for joins

2016-04-12 Thread Rajesh Balamohan (JIRA)
Rajesh Balamohan created SPARK-14588:


 Summary: Consider getting column stats from files (wherever 
feasible) to get better stats for joins
 Key: SPARK-14588
 URL: https://issues.apache.org/jira/browse/SPARK-14588
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rajesh Balamohan


Whether a broadcast join is used is determined by 
"spark.sql.autoBroadcastJoinThreshold". The stats for this are estimated from 
the files and the projected columns (internally, string columns are assumed to 
be 20 bytes each). However, the estimate can be inaccurate if the dataset's 
string columns are larger than 20 bytes, and in such instances the broadcast 
join is not invoked.

File formats like ORC can provide the raw data size for the projected columns. 
It would be good to use those sizes (whenever available) to determine accurate 
stats for the broadcast threshold.
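
As a point of reference, a hedged illustration of the knob involved (the value 
shown is just an example, not a recommendation):

{code}
// The estimated size of the projected columns is compared against this
// threshold (in bytes); estimates that land above it disable the broadcast
// join, which is why overly pessimistic string-size estimates matter.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)
{code}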






[jira] [Created] (SPARK-14587) abstract class Receiver should be explicit about the return type of its methods

2016-04-12 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-14587:
---

 Summary: abstract class Receiver should be explicit about the 
return type of its methods
 Key: SPARK-14587
 URL: https://issues.apache.org/jira/browse/SPARK-14587
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 2.0.0
Reporter: Jacek Laskowski
Priority: Trivial


The 
[org.apache.spark.streaming.receiver.Receiver|https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/receiver/Receiver.scala#L102]
 abstract class defines its API without specifying return types explicitly, e.g.

{code}
def onStart()
def onStop()
...
{code}
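
A sketch of how those declarations would look with explicit return types:

{code}
// Proposed form: make the Unit return type explicit.
def onStart(): Unit
def onStop(): Unit
{code}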






[jira] [Assigned] (SPARK-14441) Consolidate DDL tests

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14441:


Assignee: (was: Apache Spark)

> Consolidate DDL tests
> -
>
> Key: SPARK-14441
> URL: https://issues.apache.org/jira/browse/SPARK-14441
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> Today we have DDLSuite, DDLCommandSuite, HiveDDLCommandSuite. It's confusing 
> whether a test should exist in one or the other. It also makes it less clear 
> whether our test coverage is comprehensive. Ideally we should consolidate 
> these files as much as possible.






[jira] [Commented] (SPARK-14441) Consolidate DDL tests

2016-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238480#comment-15238480
 ] 

Apache Spark commented on SPARK-14441:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/12347

> Consolidate DDL tests
> -
>
> Key: SPARK-14441
> URL: https://issues.apache.org/jira/browse/SPARK-14441
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>
> Today we have DDLSuite, DDLCommandSuite, HiveDDLCommandSuite. It's confusing 
> whether a test should exist in one or the other. It also makes it less clear 
> whether our test coverage is comprehensive. Ideally we should consolidate 
> these files as much as possible.






[jira] [Assigned] (SPARK-14441) Consolidate DDL tests

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14441:


Assignee: Apache Spark

> Consolidate DDL tests
> -
>
> Key: SPARK-14441
> URL: https://issues.apache.org/jira/browse/SPARK-14441
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>
> Today we have DDLSuite, DDLCommandSuite, HiveDDLCommandSuite. It's confusing 
> whether a test should exist in one or the other. It also makes it less clear 
> whether our test coverage is comprehensive. Ideally we should consolidate 
> these files as much as possible.






[jira] [Commented] (SPARK-14554) disable whole stage codegen if there are too many input columns

2016-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238469#comment-15238469
 ] 

Apache Spark commented on SPARK-14554:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/12346

> disable whole stage codegen if there are too many input columns
> ---
>
> Key: SPARK-14554
> URL: https://issues.apache.org/jira/browse/SPARK-14554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-14409) Investigate adding a RankingEvaluator to ML

2016-04-12 Thread Yong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238462#comment-15238462
 ] 

Yong Tang commented on SPARK-14409:
---

Thanks [~mlnick] for the review. I was planning to add MRR to RankingMetrics 
and then wrap that as a first step. But if you think it makes sense, I can 
reimplement from scratch. Please let me know which way would be better and I 
will move forward with it. Thanks.
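
For context, a rough sketch of the existing mllib API that would be wrapped 
(the data here is made up, and MRR is not part of RankingMetrics today):

{code}
import org.apache.spark.mllib.evaluation.RankingMetrics

// Each record pairs a predicted ranking with the set of relevant items.
val predictionAndLabels = sc.parallelize(Seq(
  (Array(1, 2, 3), Array(1, 3)),
  (Array(4, 5, 6), Array(5))
))
val metrics = new RankingMetrics(predictionAndLabels)
println(metrics.precisionAt(3))
println(metrics.meanAveragePrecision)
// Adding MRR here first, then wrapping it in an ml RankingEvaluator,
// is the first option discussed above.
{code}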

> Investigate adding a RankingEvaluator to ML
> ---
>
> Key: SPARK-14409
> URL: https://issues.apache.org/jira/browse/SPARK-14409
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Nick Pentreath
>Priority: Minor
>
> {{mllib.evaluation}} contains a {{RankingMetrics}} class, while there is no 
> {{RankingEvaluator}} in {{ml.evaluation}}. Such an evaluator can be useful 
> for recommendation evaluation (and can be useful in other settings 
> potentially).
> Should be thought about in conjunction with adding the "recommendAll" methods 
> in SPARK-13857, so that top-k ranking metrics can be used in cross-validators.






[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14586:

Description: 
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1

Not sure if relevant, but Scala does parse numbers with leading space correctly

scala> "2.0".toDouble
res21: Double = 2.0

scala> " 2.0".toDouble
res22: Double = 2.0


  was:
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1


> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +++
> |column_1|column_2|
> +++
> |   a|null|
> |null|3.00|
> +++
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
> similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
> it looks like a necessary improvement for the two engines to converge. Hive 
> version is 1.5.1
> Not sure if relevant, but Scala does parse numbers with leading space 
> correctly
> scala> "2.0".toDouble
> res21: Double = 2.0
> scala> " 2.0".toDouble
> res22: Double = 2.0






[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14586:

Description: 
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1

Not sure if relevant, but Scala does parse numbers with leading space correctly

{code}
scala> "2.0".toDouble
res21: Double = 2.0

scala> " 2.0".toDouble
res22: Double = 2.0
{code}

  was:
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1

Not sure if relevant, but Scala does parse numbers with leading space correctly

scala> "2.0".toDouble
res21: Double = 2.0

scala> " 2.0".toDouble
res22: Double = 2.0



> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +++
> |column_1|column_2|
> +++
> |   a|null|
> |null|3.00|
> +++
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
> similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
> it looks like a necessary improvement for the two engines to converge. Hive 
> version is 1.5.1
> Not sure if relevant, but Scala does parse numbers with leading space 
> correctly
> {code}
> scala> "2.0".toDouble
> res21: Double = 2.0
> scala> " 2.0".toDouble
> res22: Double = 2.0
> {code}




[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14586:

Description: 
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge. Hive 
version is 1.5.1

  was:
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge


> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +++
> |column_1|column_2|
> +++
> |   a|null|
> |null|3.00|
> +++
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
> similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
> it looks like a necessary improvement for the two engines to converge. Hive 
> version is 1.5.1






[jira] [Updated] (SPARK-14447) Speed up TungstenAggregate w/ keys using AggregateHashMap

2016-04-12 Thread Sameer Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sameer Agarwal updated SPARK-14447:
---
Summary: Speed up TungstenAggregate w/ keys using AggregateHashMap  (was: 
Integrate AggregateHashMap in Aggregates with Keys)

> Speed up TungstenAggregate w/ keys using AggregateHashMap
> -
>
> Key: SPARK-14447
> URL: https://issues.apache.org/jira/browse/SPARK-14447
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>







[jira] [Commented] (SPARK-14447) Integrate AggregateHashMap in Aggregates with Keys

2016-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238401#comment-15238401
 ] 

Apache Spark commented on SPARK-14447:
--

User 'sameeragarwal' has created a pull request for this issue:
https://github.com/apache/spark/pull/12345

> Integrate AggregateHashMap in Aggregates with Keys
> --
>
> Key: SPARK-14447
> URL: https://issues.apache.org/jira/browse/SPARK-14447
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Sameer Agarwal
>







[jira] [Updated] (SPARK-14583) SparkSQL doesn't read hive table properly after MSCK REPAIR

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14583:

Summary: SparkSQL doesn't read hive table properly after MSCK REPAIR  (was: 
Spark doesn't read hive table properly after MSCK REPAIR)

> SparkSQL doesn't read hive table properly after MSCK REPAIR
> ---
>
> Key: SPARK-14583
> URL: https://issues.apache.org/jira/browse/SPARK-14583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> it seems that Spark forgets or fails to read the metadata tblproperties after 
> a MSCK REPAIR is issued from within HIVE
> Here are the steps to reproduce:
> create test_data.csv with the following content:
> {code:none}
> a,2
> ,3
> {code}
> move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> run the following hive statements:
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a   2   a   b
> NULL3   a   b
> {code}
> (you can see the NULL)
> now onto Spark:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv").show()
> +++--+--+
> |column_1|column_2|part_a|part_b|
> +++--+--+
> |   a|   2| a| b|
> ||   3| a| b|
> +++--+--+
> {code}
> As you can see, SPARK can't detect the null. 
> I don't know if it affects future versions of SPARK and I can't test it in my 
> company's environment. Steps are easy to reproduce though so can be tested in 
> other environments. My hive version is 1.2.1
> Let me know if you have any questions. To me that's a big issue because data 
> isn't read correctly. 






[jira] [Updated] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14586:

Description: 
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements
{code:sql}

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge

  was:
create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge


> SparkSQL doesn't parse decimal like Hive
> 
>
> Key: SPARK-14586
> URL: https://issues.apache.org/jira/browse/SPARK-14586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> create a test_data.csv with the following
> {code:none}
> a, 2.0
> ,3.0
> {code}
> (the space is intended before the 2)
> copy the test_data.csv to hdfs:///spark_testing_2
> go in hive, run the following statements
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv_2;
> CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
>   column_1 varchar(10),
>   column_2 decimal(4,2))
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing_2'
> TBLPROPERTIES('serialization.null.format'='');
> select * from spark_testing.test_csv_2;
> OK
> a   2
> NULL3
> {code}
> As you can see, the value " 2" gets parsed correctly to 2
> Now onto Spark-shell:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv_2").show()
> +++
> |column_1|column_2|
> +++
> |   a|null|
> |null|3.00|
> +++
> {code}
> As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
> similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
> it looks like a necessary improvement for the two engines to converge






[jira] [Created] (SPARK-14586) SparkSQL doesn't parse decimal like Hive

2016-04-12 Thread Stephane Maarek (JIRA)
Stephane Maarek created SPARK-14586:
---

 Summary: SparkSQL doesn't parse decimal like Hive
 Key: SPARK-14586
 URL: https://issues.apache.org/jira/browse/SPARK-14586
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1
Reporter: Stephane Maarek


create a test_data.csv with the following
{code:none}
a, 2.0
,3.0
{code}

(the space is intended before the 2)

copy the test_data.csv to hdfs:///spark_testing_2

go in hive, run the following statements

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv_2;
CREATE EXTERNAL TABLE `spark_testing.test_csv_2`(
  column_1 varchar(10),
  column_2 decimal(4,2))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing_2'
TBLPROPERTIES('serialization.null.format'='');
select * from spark_testing.test_csv_2;
OK
a   2
NULL3

{code}

As you can see, the value " 2" gets parsed correctly to 2

Now onto Spark-shell:

{code:java}

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv_2").show()

+++
|column_1|column_2|
+++
|   a|null|
|null|3.00|
+++

{code}

As you can see, the " 2" got parsed to null. Therefore Hive and Spark have a 
similar parsing behavior for decimals. I wouldn't say it is a bug per se, but 
it looks like a necessary improvement for the two engines to converge






[jira] [Updated] (SPARK-14375) Unit test for spark.ml KMeansSummary

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14375:
--
Shepherd: Joseph K. Bradley
Assignee: Yanbo Liang
Target Version/s: 2.0.0

> Unit test for spark.ml KMeansSummary
> 
>
> Key: SPARK-14375
> URL: https://issues.apache.org/jira/browse/SPARK-14375
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> There is no unit test for KMeansSummary in spark.ml.
> Other items which could be fixed here:
> * Add Since version to KMeansSummary class
> * Modify clusterSizes method to match GMM method, to be robust to empty 
> clusters (in case we support that sometime)  (See PR for [SPARK-13538])






[jira] [Created] (SPARK-14585) Provide accessor methods for Pipeline stages

2016-04-12 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-14585:
-

 Summary: Provide accessor methods for Pipeline stages
 Key: SPARK-14585
 URL: https://issues.apache.org/jira/browse/SPARK-14585
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley


It is currently hard to access particular stages in a Pipeline or 
PipelineModel.  Some accessor methods would help.

Scala:
{code}
class Pipeline {
  /** Returns stage at index i in Pipeline */
  def getStage[T <: PipelineStage](i: Int): T

  /** Returns all stages of this type */
  def getStagesOfType[T <: PipelineStage]: Array[T]
}

class PipelineModel {
  /** Returns stage at index i in PipelineModel */
  def getStage[T <: Transformer](i: Int): T

  /**
   * Returns stage given its parent or generating instance in PipelineModel.
   * E.g., if this PipelineModel was created from a Pipeline containing a stage 
   * {{myStage}}, then passing {{myStage}} to this method will return the
   * corresponding stage in this PipelineModel.
   */
  def getStage[T <: Transformer, E <: PipelineStage](stage: E): T

  /** Returns all stages of this type */
  def getStagesOfType[T <: Transformer]: Array[T]
}
{code}

These methods should not be recursive for now.  I.e., if a Pipeline A contains 
another Pipeline B, then calling {{getStage}} on the outer Pipeline A should 
not search for the stage within Pipeline B.
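
For illustration, a hypothetical usage sketch of the proposed accessors (nothing
below exists yet; tokenizer, hashingTF, lr, trainingData and fittedModel are
placeholder names):

{code:scala}
// Hypothetical usage of the proposed accessors; these methods do not exist yet.
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val secondStage = pipeline.getStage[HashingTF](1)               // by position
val estimators  = pipeline.getStagesOfType[LogisticRegression]  // by type

val fittedModel = pipeline.fit(trainingData)
val lrModel = fittedModel.getStage[LogisticRegressionModel](lr) // by generating stage
{code}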




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14583:

Description: 
it seems that Spark forgets or fails to read the metadata tblproperties after a 
MSCK REPAIR is issued from within HIVE

Here are the steps to reproduce:
create test_data.csv with the following content:
{code:none}
a,2
,3
{code}

move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/

run the following hive statements:
{code:sql}
CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv;
CREATE EXTERNAL TABLE `spark_testing.test_csv`(
  column_1 varchar(10),
  column_2 int)
PARTITIONED BY (
  `part_a` string,
  `part_b` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing'
TBLPROPERTIES('serialization.null.format'='');
MSCK REPAIR TABLE spark_testing.test_csv;
select * from spark_testing.test_csv;


OK
a   2   a   b
NULL3   a   b
{code}
(you can see the NULL)

now onto Spark:


{code:java}
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv").show()
+++--+--+
|column_1|column_2|part_a|part_b|
+++--+--+
|   a|   2| a| b|
||   3| a| b|
+++--+--+

{code}

As you can see, Spark can't detect the null.
I don't know whether it affects later versions of Spark and I can't test it in my
company's environment. The steps are easy to reproduce though, so they can be
tested in other environments. My Hive version is 1.2.1.

Let me know if you have any questions. To me that's a big issue because data 
isn't read correctly. 


  was:
it seems that Spark forgets or fails to read the metadata tblproperties after a 
MSCK REPAIR is issued from within HIVE

Here are the steps to reproduce:
create test_data.csv with the following content:
{code:csv}
a,2
,3
{code}

move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/

run the following hive statements:
{code:sql}
CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv;
CREATE EXTERNAL TABLE `spark_testing.test_csv`(
  column_1 varchar(10),
  column_2 int)
PARTITIONED BY (
  `part_a` string,
  `part_b` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing'
TBLPROPERTIES('serialization.null.format'='');
MSCK REPAIR TABLE spark_testing.test_csv;
select * from spark_testing.test_csv;


OK
a   2   a   b
NULL3   a   b
{code}
(you can see the NULL)

now onto Spark:


{code:scala}
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv").show()
+++--+--+
|column_1|column_2|part_a|part_b|
+++--+--+
|   a|   2| a| b|
||   3| a| b|
+++--+--+

{code}

As you can see, SPARK can't detect the null. 
I don't know if it affects future versions of SPARK and I can't test it in my 
company's environment. Steps are easy to reproduce though so can be tested in 
other environments. My hive version is 1.2.1

Let me know if you have any questions. To me that's a big issue because data 
isn't read correctly. 



> Spark doesn't read hive table properly after MSCK REPAIR
> 
>
> Key: SPARK-14583
> URL: https://issues.apache.org/jira/browse/SPARK-14583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> it seems that Spark forgets or fails to read the metadata tblproperties after 
> a MSCK REPAIR is issued from within HIVE
> Here are the steps to reproduce:
> create test_data.csv with the following content:
> {code:none}
> a,2
> ,3
> {code}
> move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> run the following hive statements:
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a   2   a   b
> NULL3   a   b
> {code}
> (you can see the NULL)
> now onto Spark:
> {code:java}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv").show()
> +++--+--+
> |column_1|column_2|part_a|part_b|
> 

[jira] [Updated] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR

2016-04-12 Thread Stephane Maarek (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephane Maarek updated SPARK-14583:

Description: 
it seems that Spark forgets or fails to read the metadata tblproperties after a 
MSCK REPAIR is issued from within HIVE

Here are the steps to reproduce:
create test_data.csv with the following content:
{code:csv}
a,2
,3
{code}

move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/

run the following hive statements:
{code:sql}
CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv;
CREATE EXTERNAL TABLE `spark_testing.test_csv`(
  column_1 varchar(10),
  column_2 int)
PARTITIONED BY (
  `part_a` string,
  `part_b` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing'
TBLPROPERTIES('serialization.null.format'='');
MSCK REPAIR TABLE spark_testing.test_csv;
select * from spark_testing.test_csv;


OK
a   2   a   b
NULL3   a   b
{code}
(you can see the NULL)

now onto Spark:


{code:scala}
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("select * from spark_testing.test_csv").show()
+++--+--+
|column_1|column_2|part_a|part_b|
+++--+--+
|   a|   2| a| b|
||   3| a| b|
+++--+--+

{code}

As you can see, Spark can't detect the null.
I don't know whether it affects later versions of Spark and I can't test it in my
company's environment. The steps are easy to reproduce though, so they can be
tested in other environments. My Hive version is 1.2.1.

Let me know if you have any questions. To me that's a big issue because data 
isn't read correctly. 


  was:
it seems that Spark forgets or fails to read the metadata tblproperties after a 
MSCK REPAIR is issued from within HIVE

Here are the steps to reproduce:
create test_data.csv with the following content:
a,2
,3

move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/

run the following hive statements:

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv;
CREATE EXTERNAL TABLE `spark_testing.test_csv`(
  column_1 varchar(10),
  column_2 int)
PARTITIONED BY (
  `part_a` string,
  `part_b` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing'
TBLPROPERTIES('serialization.null.format'='');
MSCK REPAIR TABLE spark_testing.test_csv;
select * from spark_testing.test_csv;


OK
a   2   a   b
NULL3   a   b

(you can see the NULL)

now onto Spark:
+++--+--+
|column_1|column_2|part_a|part_b|
+++--+--+
|   a|   2| a| b|
||   3| a| b|
+++--+--+


As you can see, SPARK can't detect the null. 
I don't know if it affects future versions of SPARK and I can't test it in my 
company's environment. Steps are easy to reproduce though so can be tested in 
other environments. My hive version is 1.2.1

Let me know if you have any questions. To me that's a big issue because data 
isn't read correctly. 



> Spark doesn't read hive table properly after MSCK REPAIR
> 
>
> Key: SPARK-14583
> URL: https://issues.apache.org/jira/browse/SPARK-14583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> it seems that Spark forgets or fails to read the metadata tblproperties after 
> a MSCK REPAIR is issued from within HIVE
> Here are the steps to reproduce:
> create test_data.csv with the following content:
> {code:csv}
> a,2
> ,3
> {code}
> move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> run the following hive statements:
> {code:sql}
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a   2   a   b
> NULL3   a   b
> {code}
> (you can see the NULL)
> now onto Spark:
> {code:scala}
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
> sqlContext.sql("select * from spark_testing.test_csv").show()
> +++--+--+
> |column_1|column_2|part_a|part_b|
> +++--+--+
> |   a|   2| a| b|
> ||   3| a| b|
> +++--+--+
> {code}
> As you can see, SPARK can't detect the 

[jira] [Updated] (SPARK-14084) Parallel training jobs in model selection

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14084:
--
Target Version/s: 2.1.0  (was: 2.0.0)

> Parallel training jobs in model selection
> -
>
> Key: SPARK-14084
> URL: https://issues.apache.org/jira/browse/SPARK-14084
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> In CrossValidator and TrainValidationSplit, we run training jobs one by one. 
> If users have a big cluster, they might see speed-ups if we parallelize the 
> job submission on the driver. The trade-off is that we might need to make 
> multiple copies of the training data, which could be expensive. It is worth 
> testing and figure out the best way to implement it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14584) Improve recognition of non-nullability in Dataset transformations

2016-04-12 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-14584:
--

 Summary: Improve recognition of non-nullability in Dataset 
transformations
 Key: SPARK-14584
 URL: https://issues.apache.org/jira/browse/SPARK-14584
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Josh Rosen


There are many cases where we can statically know that a field will never be 
null. For instance, a field in a case class with a primitive type will never 
return null. However, there are currently several cases in the Dataset API 
where we do not properly recognize this non-nullability. For instance:

{code}
case class MyCaseClass(foo: Int)
sc.parallelize(Seq(0)).toDS.map(MyCaseClass).printSchema
{code}

claims that the {{foo}} field is nullable even though this is impossible.

I believe that this is due to the way that we reason about nullability when 
constructing serializer expressions in ExpressionEncoders. The following 
assertion will currently fail if added to ExpressionEncoder:

{code}

  require(schema.size == serializer.size)
  schema.fields.zip(serializer).foreach { case (field, fieldSerializer) =>
    require(field.dataType == fieldSerializer.dataType,
      s"Field ${field.name}'s data type is " +
        s"${field.dataType} in the schema but ${fieldSerializer.dataType} in its serializer")
    require(field.nullable == fieldSerializer.nullable,
      s"Field ${field.name}'s nullability is " +
        s"${field.nullable} in the schema but ${fieldSerializer.nullable} in its serializer")
  }
{code}

Most often, the schema claims that a field is non-nullable while the encoder 
allows for nullability, but occasionally we see a mismatch in the datatypes due 
to disagreements over the nullability of nested structs' fields (or fields of 
structs in arrays).

I think the problem is that when we're reasoning about nullability in a 
struct's schema we consider its fields' nullability to be independent of the 
nullability of the struct itself, whereas in the serializer expressions we are 
considering those field extraction expressions to be nullable if the input 
objects themselves can be nullable.

I'm not sure what's the simplest way to fix this. One proposal would be to 
leave the serializers unchanged and have ObjectOperator derive its output 
attributes from an explicitly-passed schema rather than using the serializers' 
attributes. However, I worry that this might introduce bugs in case the 
serializer and schema disagree.
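
For anyone digging into this, a small sketch that surfaces the mismatch directly
(it assumes the current ExpressionEncoder API with a serializer: Seq[Expression]
field, as referenced above):

{code:scala}
// Sketch: print schema vs. serializer nullability side by side for a simple case class.
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class MyCaseClass(foo: Int)

val enc = ExpressionEncoder[MyCaseClass]()
enc.schema.fields.zip(enc.serializer).foreach { case (field, fieldSerializer) =>
  println(s"${field.name}: schema nullable = ${field.nullable}, " +
    s"serializer nullable = ${fieldSerializer.nullable}")
}
{code}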




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatic genetared text

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-13982.
---
   Resolution: Fixed
 Assignee: Yanbo Liang
Fix Version/s: 2.0.0

Resolving b/c fixed in the linked JIRA issue

> SparkR - KMeans predict: Output column name of features is an unclear, 
> automatic genetared text
> ---
>
> Key: SPARK-13982
> URL: https://issues.apache.org/jira/browse/SPARK-13982
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently KMean-predict's features' output column name is set to something 
> like this: "vecAssembler_522ba59ea239__output", which is the default output 
> column name of the "VectorAssembler".
> Example: 
> showDF(predict(model, training)) shows something like this:
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double,**vecAssembler_522ba59ea239__output:**vector, 
> prediction:int]
> This name is automatically generated and very unclear from user perspective.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatic generated text

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13982:
--
Summary: SparkR - KMeans predict: Output column name of features is an 
unclear, automatic generated text  (was: SparkR - KMeans predict: Output column 
name of features is an unclear, automatic genetared text)

> SparkR - KMeans predict: Output column name of features is an unclear, 
> automatic generated text
> ---
>
> Key: SPARK-13982
> URL: https://issues.apache.org/jira/browse/SPARK-13982
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently KMean-predict's features' output column name is set to something 
> like this: "vecAssembler_522ba59ea239__output", which is the default output 
> column name of the "VectorAssembler".
> Example: 
> showDF(predict(model, training)) shows something like this:
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double,**vecAssembler_522ba59ea239__output:**vector, 
> prediction:int]
> This name is automatically generated and very unclear from user perspective.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14059) Define R wrappers under org.apache.spark.ml.r

2016-04-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238358#comment-15238358
 ] 

Joseph K. Bradley commented on SPARK-14059:
---

This task looks complete.  Can I resolve it?

> Define R wrappers under org.apache.spark.ml.r
> -
>
> Key: SPARK-14059
> URL: https://issues.apache.org/jira/browse/SPARK-14059
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Affects Versions: 1.6.1
>Reporter: Xiangrui Meng
>Priority: Minor
>
> Currently, the wrapper files are under .../ml/r but the wrapper classes are 
> defined under ...ml.api.r, which doesn't follow package convention. We should 
> move all wrappers under ml.r.
> This should happen after we merged other MLlib/R wrappers to avoid merge 
> conflicts.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR

2016-04-12 Thread Stephane Maarek (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238352#comment-15238352
 ] 

Stephane Maarek commented on SPARK-14583:
-

Pretty much the same behavior appears if, instead of MSCK REPAIR, we run ALTER TABLE
spark_testing.test_csv ADD PARTITION (part_a="a", part_b="b");
this makes me believe it's the partitioning that makes Spark fail.
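
A quick way to narrow it down further (sketch only, reusing the paths and table
from the description): compare the raw file contents with what the partitioned
table returns through HiveContext.

{code:scala}
// Sketch: check the raw text vs. the partitioned table read, using the repro paths above.
sc.textFile("hdfs:///spark_testing/part_a=a/part_b=b/").collect().foreach(println)

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql(
  "select * from spark_testing.test_csv where part_a = 'a' and part_b = 'b'").show()
{code}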

> Spark doesn't read hive table properly after MSCK REPAIR
> 
>
> Key: SPARK-14583
> URL: https://issues.apache.org/jira/browse/SPARK-14583
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1
>Reporter: Stephane Maarek
>
> it seems that Spark forgets or fails to read the metadata tblproperties after 
> a MSCK REPAIR is issued from within HIVE
> Here are the steps to reproduce:
> create test_data.csv with the following content:
> a,2
> ,3
> move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/
> run the following hive statements:
> CREATE SCHEMA IF NOT EXISTS spark_testing;
> DROP TABLE IF EXISTS spark_testing.test_csv;
> CREATE EXTERNAL TABLE `spark_testing.test_csv`(
>   column_1 varchar(10),
>   column_2 int)
> PARTITIONED BY (
>   `part_a` string,
>   `part_b` string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE LOCATION '/spark_testing'
> TBLPROPERTIES('serialization.null.format'='');
> MSCK REPAIR TABLE spark_testing.test_csv;
> select * from spark_testing.test_csv;
> OK
> a   2   a   b
> NULL3   a   b
> (you can see the NULL)
> now onto Spark:
> +++--+--+
> |column_1|column_2|part_a|part_b|
> +++--+--+
> |   a|   2| a| b|
> ||   3| a| b|
> +++--+--+
> As you can see, SPARK can't detect the null. 
> I don't know if it affects future versions of SPARK and I can't test it in my 
> company's environment. Steps are easy to reproduce though so can be tested in 
> other environments. My hive version is 1.2.1
> Let me know if you have any questions. To me that's a big issue because data 
> isn't read correctly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatic genetared text

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-13982:
--
Target Version/s: 2.0.0

> SparkR - KMeans predict: Output column name of features is an unclear, 
> automatic genetared text
> ---
>
> Key: SPARK-13982
> URL: https://issues.apache.org/jira/browse/SPARK-13982
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Currently KMean-predict's features' output column name is set to something 
> like this: "vecAssembler_522ba59ea239__output", which is the default output 
> column name of the "VectorAssembler".
> Example: 
> showDF(predict(model, training)) shows something like this:
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double,**vecAssembler_522ba59ea239__output:**vector, 
> prediction:int]
> This name is automatically generated and very unclear from user perspective.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14154) Simplify the implementation for Kolmogorov–Smirnov test

2016-04-12 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238346#comment-15238346
 ] 

yuhao yang commented on SPARK-14154:


I see your concern. I'll run some benchmarks.
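
For reference, a rough benchmark sketch (sizes and seed are arbitrary; it only
times the existing public API):

{code:scala}
// Rough benchmark sketch: time the one-sample KS test against a standard normal.
import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.mllib.stat.Statistics

val data = RandomRDDs.normalRDD(sc, 10000000L, 100, 42L)  // size, partitions, seed
val start = System.nanoTime()
val result = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
println(s"statistic = ${result.statistic}, elapsed = ${(System.nanoTime() - start) / 1e9} s")
{code}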

> Simplify the implementation for Kolmogorov–Smirnov test
> ---
>
> Key: SPARK-14154
> URL: https://issues.apache.org/jira/browse/SPARK-14154
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Critical
> Fix For: 2.0.0
>
>
> I just read the code for KolmogorovSmirnovTest and find it could be much 
> simplified following the original definition.
> Send a PR for discussion



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14509) Add python CountVectorizerExample

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14509:
--
Shepherd: Joseph K. Bradley
Assignee: zhengruifeng
Target Version/s: 2.0.0
 Component/s: PySpark
  ML

> Add python CountVectorizerExample
> -
>
> Key: SPARK-14509
> URL: https://issues.apache.org/jira/browse/SPARK-14509
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, PySpark
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> Add the missing python example for CountVectorizer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14577) spark.sql.codegen.maxCaseBranches config option

2016-04-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238344#comment-15238344
 ] 

Reynold Xin commented on SPARK-14577:
-

Yea we shouldn't change the architecture.


> spark.sql.codegen.maxCaseBranches config option
> ---
>
> Key: SPARK-14577
> URL: https://issues.apache.org/jira/browse/SPARK-14577
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> We currently disable codegen for CaseWhen if the number of branches is 
> greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better 
> if this value is a non-public config defined in SQLConf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14583) Spark doesn't read hive table properly after MSCK REPAIR

2016-04-12 Thread Stephane Maarek (JIRA)
Stephane Maarek created SPARK-14583:
---

 Summary: Spark doesn't read hive table properly after MSCK REPAIR
 Key: SPARK-14583
 URL: https://issues.apache.org/jira/browse/SPARK-14583
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.5.1
Reporter: Stephane Maarek


it seems that Spark forgets or fails to read the metadata tblproperties after a 
MSCK REPAIR is issued from within HIVE

Here are the steps to reproduce:
create test_data.csv with the following content:
a,2
,3

move test_data.csv to hdfs:///spark_testing/part_a=a/part_b=b/

run the following hive statements:

CREATE SCHEMA IF NOT EXISTS spark_testing;
DROP TABLE IF EXISTS spark_testing.test_csv;
CREATE EXTERNAL TABLE `spark_testing.test_csv`(
  column_1 varchar(10),
  column_2 int)
PARTITIONED BY (
  `part_a` string,
  `part_b` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE LOCATION '/spark_testing'
TBLPROPERTIES('serialization.null.format'='');
MSCK REPAIR TABLE spark_testing.test_csv;
select * from spark_testing.test_csv;


OK
a   2   a   b
NULL3   a   b

(you can see the NULL)

now onto Spark:
+++--+--+
|column_1|column_2|part_a|part_b|
+++--+--+
|   a|   2| a| b|
||   3| a| b|
+++--+--+


As you can see, Spark can't detect the null.
I don't know whether it affects later versions of Spark and I can't test it in my
company's environment. The steps are easy to reproduce though, so they can be
tested in other environments. My Hive version is 1.2.1.

Let me know if you have any questions. To me that's a big issue because data 
isn't read correctly. 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14577) spark.sql.codegen.maxCaseBranches config option

2016-04-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238323#comment-15238323
 ] 

Dongjoon Hyun commented on SPARK-14577:
---

In the current Spark architecture, the `sql/core` module is designed to depend on
the `catalyst` module, and the `catalyst` module does not access `SQLConf` in
`sql/core`. What this issue needs is a way to use the configuration value without
changing that architectural design, right?
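
For reference, a rough sketch of what a non-public entry in SQLConf might look
like (builder names follow the current SQLConfigBuilder helper and are
illustrative, not a final patch):

{code:scala}
// Sketch of an internal SQLConf entry; exact name, doc and default to be settled in the PR.
val MAX_CASES_BRANCHES = SQLConfigBuilder("spark.sql.codegen.maxCaseBranches")
  .internal()
  .doc("The maximum number of branches of a CaseWhen expression for which codegen is enabled.")
  .intConf
  .createWithDefault(20)
{code}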

> spark.sql.codegen.maxCaseBranches config option
> ---
>
> Key: SPARK-14577
> URL: https://issues.apache.org/jira/browse/SPARK-14577
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> We currently disable codegen for CaseWhen if the number of branches is 
> greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better 
> if this value is a non-public config defined in SQLConf.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14582) Increase the parallelism for small tables

2016-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238321#comment-15238321
 ] 

Apache Spark commented on SPARK-14582:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12344

> Increase the parallelism for small tables
> -
>
> Key: SPARK-14582
> URL: https://issues.apache.org/jira/browse/SPARK-14582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14582) Increase the parallelism for small tables

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14582:


Assignee: Apache Spark  (was: Davies Liu)

> Increase the parallelism for small tables
> -
>
> Key: SPARK-14582
> URL: https://issues.apache.org/jira/browse/SPARK-14582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14582) Increase the parallelism for small tables

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14582:


Assignee: Davies Liu  (was: Apache Spark)

> Increase the parallelism for small tables
> -
>
> Key: SPARK-14582
> URL: https://issues.apache.org/jira/browse/SPARK-14582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14579) Fix a race condition in StreamExecution.processAllAvailable

2016-04-12 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-14579.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> Fix a race condition in StreamExecution.processAllAvailable
> ---
>
> Key: SPARK-14579
> URL: https://issues.apache.org/jira/browse/SPARK-14579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> See the PR description for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14582) Increase the parallelism for small tables

2016-04-12 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14582:
--

 Summary: Increase the parallelism for small tables
 Key: SPARK-14582
 URL: https://issues.apache.org/jira/browse/SPARK-14582
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10386) Model import/export for PrefixSpan

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10386:
--
Shepherd: Joseph K. Bradley  (was: Xiangrui Meng)

> Model import/export for PrefixSpan
> --
>
> Key: SPARK-10386
> URL: https://issues.apache.org/jira/browse/SPARK-10386
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> Support save/load for PrefixSpanModel. Should be similar to save/load for 
> FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14578) Can't load a json dataset with nested wide schema

2016-04-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14578.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 12338
[https://github.com/apache/spark/pull/12338]

> Can't load a json dataset with nested wide schema
> -
>
> Key: SPARK-14578
> URL: https://issues.apache.org/jira/browse/SPARK-14578
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> The generated code from CreateExternalRow can't be compiled.
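
A possible repro sketch (not verified; the width is arbitrary and may need to be
increased before the generated code hits JVM limits):

{code:scala}
// Sketch: build a very wide nested JSON record and try to read and collect it.
val width = 3000
val inner = (1 to width).map(i => s""""f$i": $i""").mkString("{", ", ", "}")
val json = s"""{"nested": $inner}"""

val df = sqlContext.read.json(sc.parallelize(Seq(json)))
df.printSchema()
df.collect()  // may fail if the generated CreateExternalRow code exceeds JVM limits
{code}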



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8514) LU factorization on BlockMatrix

2016-04-12 Thread Jerome (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238300#comment-15238300
 ] 

Jerome commented on SPARK-8514:
---

Hello Joseph:

Is this JIRA still under consideration?

Best, Jerome



> LU factorization on BlockMatrix
> ---
>
> Key: SPARK-8514
> URL: https://issues.apache.org/jira/browse/SPARK-8514
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Critical
>  Labels: advanced
> Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, 
> BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, Matrix 
> Factorization - M...ark 1.5.0 Documentation.pdf, testImplementation.scala, 
> testScript.scala
>
>
> LU is the most common method to solve a general linear system or inverse a 
> general matrix. A distributed version could in implemented block-wise with 
> pipelining. A reference implementation is provided in ScaLAPACK:
> http://netlib.org/scalapack/slug/node178.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14529) Consolidate mllib and mllib-local into one mllib folder

2016-04-12 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238291#comment-15238291
 ] 

DB Tsai commented on SPARK-14529:
-

We still can make graphx depend on mllib-local, and I plan to do so for 
https://issues.apache.org/jira/browse/SPARK-11496 ([~yraimond])

> Consolidate mllib and mllib-local into one mllib folder
> ---
>
> Key: SPARK-14529
> URL: https://issues.apache.org/jira/browse/SPARK-14529
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>
> In the 2.0 QA period (to avoid the conflict of other PRs), this task will 
> consolidate `mllib/src` into `mllib/mllib/src` and `mllib-local/src` into 
> `mllib/mllib-local/src`. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5992:
-
Target Version/s: 2.1.0  (was: 2.0.0)

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12942) Provide option to allow control the precision of numerical type for DataFrameWriter

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12942:
--
Target Version/s: 2.0.0
 Component/s: (was: ML)

> Provide option to allow control the precision of numerical type for 
> DataFrameWriter
> ---
>
> Key: SPARK-12942
> URL: https://issues.apache.org/jira/browse/SPARK-12942
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jeff Zhang
>
> This is useful for libsvm, where most features are doubles; it can reduce the
> file size and give some performance improvement when reading the libsvm file in
> training.
> It could be a libsvm-specific or a global setting that applies to other data
> sources.   We can discuss it here.  \cc [~mengxr]
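
For illustration, a hypothetical sketch of how such an option might surface on
the writer ("precision" does not exist today; the name is made up for this
discussion):

{code:scala}
// Hypothetical API sketch only: "precision" is a proposed option, not an existing one.
df.write
  .format("libsvm")
  .option("precision", "6")  // proposed: significant digits to keep for feature values
  .save("/tmp/libsvm-features")
{code}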



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12942) Provide option to allow control the precision of numerical type for DataFrameWriter

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12942:
--
Target Version/s:   (was: 2.0.0)

> Provide option to allow control the precision of numerical type for 
> DataFrameWriter
> ---
>
> Key: SPARK-12942
> URL: https://issues.apache.org/jira/browse/SPARK-12942
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Jeff Zhang
>
> This is useful for libsvm where most of features are double, and this can 
> reduce the file size and gain some performance improvement when reading the 
> libsvm file in training. 
> So it can be libsvm specific or global setting which apply to other data 
> sources.   We can discuss it here.  \cc [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12942) Provide option to allow control the precision of numerical type for DataFrameWriter

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12942:
--
Target Version/s:   (was: 2.0.0)

> Provide option to allow control the precision of numerical type for 
> DataFrameWriter
> ---
>
> Key: SPARK-12942
> URL: https://issues.apache.org/jira/browse/SPARK-12942
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL
>Reporter: Jeff Zhang
>
> This is useful for libsvm where most of features are double, and this can 
> reduce the file size and gain some performance improvement when reading the 
> libsvm file in training. 
> So it can be libsvm specific or global setting which apply to other data 
> sources.   We can discuss it here.  \cc [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9478) Add class weights to Random Forest

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9478:
-
Target Version/s: 2.1.0  (was: 2.0.0)

> Add class weights to Random Forest
> --
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.1
>Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8514) LU factorization on BlockMatrix

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8514:
-
Target Version/s:   (was: 2.0.0)

> LU factorization on BlockMatrix
> ---
>
> Key: SPARK-8514
> URL: https://issues.apache.org/jira/browse/SPARK-8514
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Critical
>  Labels: advanced
> Attachments: BlockMatrixSolver.pdf, BlockPartitionMethods.py, 
> BlockPartitionMethods.scala, LUBlockDecompositionBasic.pdf, Matrix 
> Factorization - M...ark 1.5.0 Documentation.pdf, testImplementation.scala, 
> testScript.scala
>
>
> LU is the most common method to solve a general linear system or inverse a 
> general matrix. A distributed version could in implemented block-wise with 
> pipelining. A reference implementation is provided in ScaLAPACK:
> http://netlib.org/scalapack/slug/node178.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10078) Vector-free L-BFGS

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10078:
--
Target Version/s:   (was: 2.0.0)

> Vector-free L-BFGS
> --
>
> Key: SPARK-10078
> URL: https://issues.apache.org/jira/browse/SPARK-10078
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This is to implement a scalable version of vector-free L-BFGS 
> (http://papers.nips.cc/paper/5333-large-scale-l-bfgs-using-mapreduce.pdf).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-04-12 Thread Martin Brandt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238264#comment-15238264
 ] 

Martin Brandt edited comment on SPARK-13116 at 4/12/16 11:46 PM:
-

I am seeing what looks like the issue described here, in Spark 1.6.1 when 
querying a fairly wide table:

java.lang.UnsupportedOperationException
at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:238)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:89)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:60)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.createNewAggregationBuffer(TungstenAggregationIterator.scala:248)

This reproduces consistently on my 5-node cluster using the following code.
Please note that with 2700 columns or fewer the problem does NOT occur; at 2800
columns and above it always occurs.

{code:borderStyle=solid}

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val c = 1 to 2800
val r = 1 to 10

val rows = sc.parallelize(r.map(i =>
  Row.fromSeq(c.map(_ + i))))

val fields = c.map(i => StructField("t" + i.toString, IntegerType, true))
val schema = StructType(fields)

val df = sqlContext.createDataFrame(rows, schema)

df.registerTempTable("temp")
val sql = c.map(i=>"avg(t" + i.toString + ")").mkString(",")
val sel = sqlContext.sql("SELECT " + sql + " FROM temp")
sel.collect()

{code}



was (Author: mbrandt):
I am seeing what looks like the issue described here, in Spark 1.6.1 when 
querying a fairly wide table:

java.lang.UnsupportedOperationException
at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:238)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:89)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:60)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.createNewAggregationBuffer(TungstenAggregationIterator.scala:248)

This reproduces consistently on my 5 node cluster using the following code.  
Please note if I change the number of columns to 2700 or less the problem does 
NOT occur, at 2800 columns and above it always occurs.

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val c = 1 to 2800
val r = 1 to 10

val rows = sc.parallelize(r.map(i=>
  Row.fromSeq(c.map(_+i

val fields = c.map(i => StructField("t" + i.toString, IntegerType, true))
val schema = StructType(fields)

val df = sqlContext.createDataFrame(rows, schema)

df.registerTempTable("temp")
val sql = c.map(i=>"avg(t"+i.toString+")").mkString(",")
val sel = sqlContext.sql("SELECT " + sql + " FROM temp")
sel.collect()



> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
> Attachments: SPARK_13116_Test.scala
>
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow, while the target is 
> an UnsafeRow ,  the current code will try to set the fields in the UnsafeRow 
> using the update method in UnSafeRow. 
> This method is called via TunsgtenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I did for our forked branch , on the class 
> InterpretedProjection is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +  

[jira] [Commented] (SPARK-13116) TungstenAggregate though it is supposedly capable of all processing unsafe & safe rows, fails if the input is safe rows

2016-04-12 Thread Martin Brandt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238264#comment-15238264
 ] 

Martin Brandt commented on SPARK-13116:
---

I am seeing what looks like the issue described here, in Spark 1.6.1 when 
querying a fairly wide table:

java.lang.UnsupportedOperationException
at 
org.apache.spark.sql.catalyst.expressions.UnsafeRow.update(UnsafeRow.java:238)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:89)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:60)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.createNewAggregationBuffer(TungstenAggregationIterator.scala:248)

This reproduces consistently on my 5-node cluster using the following code.
Please note that with 2700 columns or fewer the problem does NOT occur; at 2800
columns and above it always occurs.

import org.apache.spark.sql._
import org.apache.spark.sql.types._

val c = 1 to 2800
val r = 1 to 10

val rows = sc.parallelize(r.map(i =>
  Row.fromSeq(c.map(_ + i))))

val fields = c.map(i => StructField("t" + i.toString, IntegerType, true))
val schema = StructType(fields)

val df = sqlContext.createDataFrame(rows, schema)

df.registerTempTable("temp")
val sql = c.map(i=>"avg(t"+i.toString+")").mkString(",")
val sel = sqlContext.sql("SELECT " + sql + " FROM temp")
sel.collect()



> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows
> ---
>
> Key: SPARK-13116
> URL: https://issues.apache.org/jira/browse/SPARK-13116
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Asif Hussain Shahid
> Attachments: SPARK_13116_Test.scala
>
>
> TungstenAggregate though it is supposedly capable of all processing unsafe & 
> safe rows, fails if the input is safe rows.
> If the input to TungstenAggregateIterator is a SafeRow, while the target is 
> an UnsafeRow ,  the current code will try to set the fields in the UnsafeRow 
> using the update method in UnSafeRow. 
> This method is called via TunsgtenAggregateIterator on the 
> InterpretedMutableProjection. The target row in the 
> InterpretedMutableProjection is an UnsafeRow, while the current row is a 
> SafeRow.
> In the InterpretedMutableProjection's apply method, it invokes
>  mutableRow(i) = exprArray(i).eval(input)
> Now for UnsafeRow, the update method throws UnsupportedOperationException.
> The proposed fix I did for our forked branch , on the class 
> InterpretedProjection is:
> +  private var targetUnsafe = false
>  +  type UnsafeSetter = (UnsafeRow,  Any ) => Unit
>  +  private var setters : Array[UnsafeSetter] = _
> private[this] val exprArray = expressions.toArray
> private[this] var mutableRow: MutableRow = new 
> GenericMutableRow(exprArray.length)
> def currentValue: InternalRow = mutableRow
>   
>  +
> override def target(row: MutableRow): MutableProjection = {
>   mutableRow = row
>  +targetUnsafe = row match {
>  +  case _:UnsafeRow =>{
>  +if(setters == null) {
>  +  setters = Array.ofDim[UnsafeSetter](exprArray.length)
>  +  for(i <- 0 until exprArray.length) {
>  +setters(i) = exprArray(i).dataType match {
>  +  case IntegerType => (target: UnsafeRow,  value: Any ) =>
>  +target.setInt(i,value.asInstanceOf[Int])
>  +  case LongType => (target: UnsafeRow,  value: Any ) =>
>  +target.setLong(i,value.asInstanceOf[Long])
>  +  case DoubleType => (target: UnsafeRow,  value: Any ) =>
>  +target.setDouble(i,value.asInstanceOf[Double])
>  +  case FloatType => (target: UnsafeRow, value: Any ) =>
>  +target.setFloat(i,value.asInstanceOf[Float])
>  +
>  +  case NullType => (target: UnsafeRow,  value: Any ) =>
>  +target.setNullAt(i)
>  +
>  +  case BooleanType => (target: UnsafeRow,  value: Any ) =>
>  +target.setBoolean(i,value.asInstanceOf[Boolean])
>  +
>  +  case ByteType => (target: UnsafeRow,  value: Any ) =>
>  +target.setByte(i,value.asInstanceOf[Byte])
>  +  case ShortType => (target: UnsafeRow, value: Any ) =>
>  +target.setShort(i,value.asInstanceOf[Short])
>  +
>  +}
>  +  }
>  +}
>  +true
>  +  }
>  +  case _ => false
>  +}
>  +
>   this
> }
>   
> override def apply(input: InternalRow): InternalRow = {
>   var i = 0
>   while (i < exprArray.length) {
>  -   

[jira] [Commented] (SPARK-14581) Improve filter push down

2016-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238220#comment-15238220
 ] 

Apache Spark commented on SPARK-14581:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/12342

> Improve filter push down
> 
>
> Key: SPARK-14581
> URL: https://issues.apache.org/jira/browse/SPARK-14581
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, filter push down only works with Project, Aggregate, Generate, and
> Join; filters can't be pushed through many other plans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14581) Improve filter push down

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14581:


Assignee: Davies Liu  (was: Apache Spark)

> Improve filter push down
> 
>
> Key: SPARK-14581
> URL: https://issues.apache.org/jira/browse/SPARK-14581
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, filter push down only works with Project, Aggregate, Generate, and
> Join; filters can't be pushed through many other plans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14581) Improve filter push down

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14581:


Assignee: Apache Spark  (was: Davies Liu)

> Improve filter push down
> 
>
> Key: SPARK-14581
> URL: https://issues.apache.org/jira/browse/SPARK-14581
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Right now, filter push down only works with Project, Aggregate, Generate, and
> Join; filters can't be pushed through many other plans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14581) Improve filter push down

2016-04-12 Thread Davies Liu (JIRA)
Davies Liu created SPARK-14581:
--

 Summary: Improve filter push down
 Key: SPARK-14581
 URL: https://issues.apache.org/jira/browse/SPARK-14581
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


Right now, filter push down only works with Project, Aggregate, Generate, and
Join; filters can't be pushed through many other plans.
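
For anyone following along, a small sketch that makes the current behavior
visible in the optimized plan (the operators are chosen only as an illustration):

{code:scala}
// Sketch: check where the outer Filter ends up in the optimized logical plan.
val df = sqlContext.range(0, 100).toDF("id")
val q = df.except(df.filter("id % 7 = 0")).filter("id < 10")
q.explain(true)  // does the outer Filter get pushed below Except in the optimized plan?
{code}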



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14363) Executor OOM due to a memory leak in Sorter

2016-04-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-14363.

   Resolution: Fixed
Fix Version/s: 1.6.2
   2.0.0

Issue resolved by pull request 12285
[https://github.com/apache/spark/pull/12285]

> Executor OOM due to a memory leak in Sorter
> ---
>
> Key: SPARK-14363
> URL: https://issues.apache.org/jira/browse/SPARK-14363
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.1
>Reporter: Sital Kedia
> Fix For: 2.0.0, 1.6.2
>
>
> While running a Spark job, we see that the job fails because of executor OOM 
> with following stack trace - 
> {code}
> java.lang.OutOfMemoryError: Unable to acquire 76 bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:326)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:341)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:91)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:168)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:728)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> The issue is that there is a memory leak in the Sorter. When the 
> UnsafeExternalSorter spills the data to disk, it does not free up the 
> underlying pointer array. As a result, we see a lot of executor OOMs and also 
> memory under-utilization.
> This is a regression partially introduced in PR 
> https://github.com/apache/spark/pull/9241
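
A schematic sketch of the leak pattern described above; this is not the actual 
UnsafeExternalSorter code (the real fix is in the linked pull request), and the 
class and method names below are made up for illustration:
{code}
// Illustrative only: a "pointer array" that grows during inserts. If a spill
// only resets the logical size (spillLeaky), the large array stays allocated
// and the freed records never turn into freed memory; shrinking it back
// (spillAndReset) releases the memory, which is the behavior the fix restores.
class PointerArraySketch {
  private val initialSize = 1024
  private var pointers = new Array[Long](initialSize)
  private var used = 0

  def insert(p: Long): Unit = {
    if (used == pointers.length) {
      pointers = java.util.Arrays.copyOf(pointers, pointers.length * 2)
    }
    pointers(used) = p
    used += 1
  }

  def spillLeaky(): Unit = {
    used = 0                                  // records gone, memory still held
  }

  def spillAndReset(): Unit = {
    used = 0
    pointers = new Array[Long](initialSize)   // give the memory back
  }
}
{code}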



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14497) Use top instead of sortBy() to get top N frequent words as dict in CountVectorizer

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14497:
--
Summary: Use top instead of sortBy() to get top N frequent words as dict in 
CountVectorizer  (was: Use top instead of sortBy() to get top N frequent words 
as dict in ConutVectorizer)

> Use top instead of sortBy() to get top N frequent words as dict in 
> CountVectorizer
> --
>
> Key: SPARK-14497
> URL: https://issues.apache.org/jira/browse/SPARK-14497
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feng Wang
>Assignee: Feng Wang
> Fix For: 2.0.0
>
>
> It's not necessary to sort the whole RDD to get the top n frequent words.
> // Sort terms to select vocab
> wordCounts.sortBy(_._2, ascending = false).take(vocSize)
>
> We could use top() instead, since for a fixed vocabulary size:
> top - roughly O(n), using a bounded priority queue per partition
> sortBy - O(n*logn)
> A minor side effect of top() with the default implicit Ordering on Tuple2: 
> terms with the same TF end up in the dictionary in descending (reverse 
> alphabetical) order.
> (a:1), (b:1), (c:1)  => dict: [c, b, a]
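
A minimal runnable sketch of the two approaches on a plain RDD; the sample 
data and vocSize are illustrative:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: compares sortBy().take(n) with top(n) for picking the n most
// frequent terms out of an RDD[(String, Long)] of (term, count) pairs.
object TopVsSortBy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("top-vs-sortBy"))
    val wordCounts = sc.parallelize(Seq(("a", 3L), ("b", 1L), ("c", 5L), ("d", 1L)))
    val vocSize = 2

    // Sorts the entire RDD, then takes the first n elements.
    val viaSort = wordCounts.sortBy(_._2, ascending = false).take(vocSize)

    // Keeps a bounded number of candidates per partition and merges them on
    // the driver, so the full shuffle sort is avoided.
    val viaTop = wordCounts.top(vocSize)(Ordering.by[(String, Long), Long](_._2))

    println(viaSort.mkString(", "))
    println(viaTop.mkString(", "))
    sc.stop()
  }
}
{code}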



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12414) Remove closure serializer

2016-04-12 Thread Dubkov Mikhail (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238171#comment-15238171
 ] 

Dubkov Mikhail commented on SPARK-12414:


[~srowen], [~andrewor14],

As I understand it, spark.closure.serializer is now hard coded to the 
JavaSerializer implementation, which is why I have a question:

On our project we use a custom spark.serializer that has its own requirements 
for object serialization, which can differ from JavaSerializer's. For example, 
our serializer does not require "implements Serializable", whereas 
JavaSerializer does.

Will our application fail once we upgrade to Spark 2.0.0? Right now, if we 
don't set our serializer as "spark.closure.serializer", the application fails 
with 'Caused by: java.io.NotSerializableException'.

Could you please explain how this will work after the changes in scope of this 
task?

Thanks!

> Remove closure serializer
> -
>
> Key: SPARK-12414
> URL: https://issues.apache.org/jira/browse/SPARK-12414
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Sean Owen
> Fix For: 2.0.0
>
>
> There is a config `spark.closure.serializer` that accepts exactly one value: 
> the java serializer. This is because there are currently bugs in the Kryo 
> serializer that make it not a viable candidate. This was uncovered by an 
> unsuccessful attempt to make it work: SPARK-7708.
> My high level point is that the Java serializer has worked well for at least 
> 6 Spark versions now, and it is an incredibly complicated task to get other 
> serializers (not just Kryo) to work with Spark's closures. IMO the effort is 
> not worth it and we should just remove this documentation and all the code 
> associated with it.
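
As a small illustration of what keeping the Java closure serializer means in 
practice (a sketch with made-up class names, not tied to any particular custom 
serializer): anything a closure captures has to be java-serializable, otherwise 
the job fails at submission time.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: a closure that captures a non-serializable object fails with
// NotSerializableException under the Java closure serializer; marking the
// captured class Serializable (or capturing only serializable values) fixes it.
object ClosureCaptureSketch {
  class Helper { def bump(i: Int): Int = i + 1 }                          // not Serializable
  class SerializableHelper extends Serializable { def bump(i: Int): Int = i + 1 }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("closures"))
    val nums = sc.parallelize(1 to 3)

    // val bad = new Helper
    // nums.map(bad.bump).collect()      // Task not serializable / NotSerializableException

    val ok = new SerializableHelper
    println(nums.map(ok.bump).collect().mkString(","))                    // works: 2,3,4
    sc.stop()
  }
}
{code}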



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14154) Simplify the implementation for Kolmogorov–Smirnov test

2016-04-12 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238172#comment-15238172
 ] 

Xiangrui Meng commented on SPARK-14154:
---

Changed the priority to critical since we should decide before the feature 
freeze deadline.

> Simplify the implementation for Kolmogorov–Smirnov test
> ---
>
> Key: SPARK-14154
> URL: https://issues.apache.org/jira/browse/SPARK-14154
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Critical
> Fix For: 2.0.0
>
>
> I just read the code for KolmogorovSmirnovTest and found it could be much 
> simplified by following the original definition.
> Sending a PR for discussion.
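
For reference, the "original definition" referred to above is the supremum 
distance between the empirical CDF and the theoretical CDF, 
D_n = sup_x |F_n(x) - F(x)|. A minimal local sketch of the one-sample statistic 
(not the MLlib implementation):
{code}
// Sketch only: one-sample Kolmogorov-Smirnov statistic on a local sample.
// For the sorted sample x_(1) <= ... <= x_(n), D_n is the largest of
// F(x_(i)) - (i-1)/n and i/n - F(x_(i)) over all i.
object KsSketch {
  def ksStatistic(sample: Array[Double], cdf: Double => Double): Double = {
    val sorted = sample.sorted
    val n = sorted.length.toDouble
    sorted.zipWithIndex.map { case (x, i) =>
      val f = cdf(x)
      math.max(f - i / n, (i + 1) / n - f)
    }.max
  }

  def main(args: Array[String]): Unit = {
    val uniformCdf = (x: Double) => math.min(1.0, math.max(0.0, x))
    println(ksStatistic(Array(0.1, 0.4, 0.45, 0.9), uniformCdf))
  }
}
{code}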



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14568) Log instrumentation in logistic regression as a first task

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14568:
--
Component/s: ML

> Log instrumentation in logistic regression as a first task
> --
>
> Key: SPARK-14568
> URL: https://issues.apache.org/jira/browse/SPARK-14568
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14154) Simplify the implementation for Kolmogorov–Smirnov test

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14154:
--
Priority: Critical  (was: Minor)

> Simplify the implementation for Kolmogorov–Smirnov test
> ---
>
> Key: SPARK-14154
> URL: https://issues.apache.org/jira/browse/SPARK-14154
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Critical
> Fix For: 2.0.0
>
>
> I just read the code for KolmogorovSmirnovTest and found it could be much 
> simplified by following the original definition.
> Sending a PR for discussion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11157) Allow Spark to be built without assemblies

2016-04-12 Thread Sebastian Kochman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238151#comment-15238151
 ] 

Sebastian Kochman commented on SPARK-11157:
---

After this change, when I try to submit a Spark app to YARN on Windows (using 
spark-submit.cmd), the app fails with the following error:

Diagnostics: The command line has a length of 12046 exceeds maximum allowed 
length of 8191. Command starts with: @set 
SPARK_YARN_CACHE_FILES=[...]/.sparkStaging/application_1460496865345_0
Failing this attempt. Failing the application.

So basically, the large number of jars that need to be staged in YARN makes the 
command exceed the Windows command-line length limit.

Has anybody seen this? Is there a recommended workaround?

Marcelo: in the original description, you mentioned there would still be a Maven 
profile building a single assembly. I couldn't find it -- is there one?

> Allow Spark to be built without assemblies
> --
>
> Key: SPARK-11157
> URL: https://issues.apache.org/jira/browse/SPARK-11157
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Spark Core, YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 2.0.0
>
> Attachments: no-assemblies.pdf
>
>
> For reasoning, discussion of pros and cons, and other more detailed 
> information, please see attached doc.
> The idea is to be able to build a Spark distribution that has just a 
> directory full of jars instead of the huge assembly files we currently have.
> Getting there requires changes in a bunch of places; I'll try to list the 
> ones I identified in the document, in the order that I think is needed 
> to avoid breaking things:
> * make streaming backends not be assemblies
> Since people may depend on the current assembly artifacts in their 
> deployments, we can't really remove them; but we can make them be dummy jars 
> and rely on dependency resolution to download all the jars.
> PySpark tests would also need some tweaking here.
> * make examples jar not be an assembly
> Probably requires tweaks to the {{run-example}} script. The location of the 
> examples jar would have to change (it won't be able to live in the same place 
> as the main Spark jars anymore).
> * update YARN backend to handle a directory full of jars when launching apps
> Currently YARN localizes the Spark assembly (depending on the user 
> configuration); it needs to be modified so that it can localize all needed 
> libraries instead of a single jar.
> * Modify launcher library to handle the jars directory
> This should be trivial
> * Modify {{assembly/pom.xml}} to generate assembly or a {{libs}} directory 
> depending on which profile is enabled.
> We should keep the option to build with the assembly on by default, for 
> backwards compatibility, to give people time to prepare.
> Filing this bug as an umbrella; please file sub-tasks if you plan to work on 
> a specific part of the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14568) Log instrumentation in logistic regression as a first task

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14568:
--
Shepherd: Joseph K. Bradley

> Log instrumentation in logistic regression as a first task
> --
>
> Key: SPARK-14568
> URL: https://issues.apache.org/jira/browse/SPARK-14568
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14568) Log instrumentation in logistic regression as a first task

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14568:
--
Target Version/s: 2.0.0

> Log instrumentation in logistic regression as a first task
> --
>
> Key: SPARK-14568
> URL: https://issues.apache.org/jira/browse/SPARK-14568
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14568) Log instrumentation in logistic regression as a first task

2016-04-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-14568:
--
Assignee: Timothy Hunter

> Log instrumentation in logistic regression as a first task
> --
>
> Key: SPARK-14568
> URL: https://issues.apache.org/jira/browse/SPARK-14568
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14576) Spark console should display Web UI url

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14576:


Assignee: Apache Spark

> Spark console should display Web UI url
> ---
>
> Key: SPARK-14576
> URL: https://issues.apache.org/jira/browse/SPARK-14576
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Ergin Seyfe
>Assignee: Apache Spark
>Priority: Minor
>
> This is a suggestion to print the Spark Driver UI link when spark-shell is 
> launched. 
> For example:
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>       /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context Web UI available at 
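
A minimal sketch of how an application could print the same link itself, 
assuming a Spark version where SparkContext exposes uiWebUrl (an Option[String] 
in the 2.x line); on 1.6 the URL only shows up in the logs:
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: print the driver Web UI address, if the UI is enabled.
object PrintUiUrl {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("ui-url"))
    sc.uiWebUrl match {
      case Some(url) => println(s"Spark context Web UI available at $url")
      case None      => println("Web UI is disabled (spark.ui.enabled=false)")
    }
    sc.stop()
  }
}
{code}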



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14576) Spark console should display Web UI url

2016-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238137#comment-15238137
 ] 

Apache Spark commented on SPARK-14576:
--

User 'seyfe' has created a pull request for this issue:
https://github.com/apache/spark/pull/12341

> Spark console should display Web UI url
> ---
>
> Key: SPARK-14576
> URL: https://issues.apache.org/jira/browse/SPARK-14576
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Ergin Seyfe
>Priority: Minor
>
> This is a suggestion to print the Spark Driver UI link when spark-shell is 
> launched. 
> For example:
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>       /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context Web UI available at 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14576) Spark console should display Web UI url

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14576:


Assignee: (was: Apache Spark)

> Spark console should display Web UI url
> ---
>
> Key: SPARK-14576
> URL: https://issues.apache.org/jira/browse/SPARK-14576
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Ergin Seyfe
>Priority: Minor
>
> This is a suggestion to print the Spark Driver UI link when spark-shell is 
> launched. 
> For example:
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
>       /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context Web UI available at 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14564) Python Word2Vec missing setWindowSize method

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14564:
--
Component/s: (was: MLlib)

> Python Word2Vec missing setWindowSize method
> 
>
> Key: SPARK-14564
> URL: https://issues.apache.org/jira/browse/SPARK-14564
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.6.0, 1.6.1
> Environment: pyspark
>Reporter: Brad Willard
>Priority: Minor
>  Labels: ml, pyspark, python, word2vec
>
> The setWindowSize method for configuring the Word2Vec model is available in 
> Scala but missing in Python, so you're stuck with the default window of 5.
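
For comparison, a minimal sketch of the Scala side, which already exposes the 
setter; the column names and sizes are illustrative:
{code}
import org.apache.spark.ml.feature.Word2Vec

// Sketch only (spark-shell style): configure the window size in Scala, which
// is the knob the Python wrapper is missing in this report.
val word2Vec = new Word2Vec()
  .setInputCol("tokens")
  .setOutputCol("vectors")
  .setVectorSize(100)
  .setWindowSize(10)
{code}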



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14564) Python Word2Vec missing setWindowSize method

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14564:
--
Labels: ml pyspark python word2vec  (was: ml mllib pyspark python word2vec)

> Python Word2Vec missing setWindowSize method
> 
>
> Key: SPARK-14564
> URL: https://issues.apache.org/jira/browse/SPARK-14564
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.6.0, 1.6.1
> Environment: pyspark
>Reporter: Brad Willard
>Priority: Minor
>  Labels: ml, pyspark, python, word2vec
>
> The setWindowSize method for configuring the Word2Vec model is available in 
> Scala but missing in Python, so you're stuck with the default window of 5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14580) HiveTypeCoercion.IfCoercion should preserve original predicates.

2016-04-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14580:
--
Description: 
Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose 
return-type is null. However, some UDFs need evaluations because they are 
designed to throw exceptions.
*Hive*
{code}
hive> select if(assert_true(false),2,3);
OK
Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
ASSERT_TRUE(): assertion failed.
{code}

*Spark*
{code}
scala> sql("select if(assert_true(false),2,3)").head
res2: org.apache.spark.sql.Row = [3]
{code}

`IfCoercion` works like the following.
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$IfCoercion ===
!'Project [unresolvedalias(if (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFAssertTrue(false)) 2 else 3)]   'Project [unresolvedalias(if (null) 2 else 3)]
 +- OneRowRelation$                                                                                                            +- OneRowRelation$
{code}

This issue aims to fix this.

  was:
Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose 
return-type is null. However, some UDFs need evaluations because they are 
designed to throw exceptions.
*Hive*
{code}
hive> select if(assert_true(false),2,3);
OK
Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
ASSERT_TRUE(): assertion failed.
{code}

*Spark*
{code}
scala> sql("select if(assert_true(false),2,3)").head
res2: org.apache.spark.sql.Row = [3]
{code}

`IfCoercion` works like the following.
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$IfCoercion ===
!'Project [unresolvedalias(if (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFAssertTrue(false)) 2 else 3)]   'Project [unresolvedalias(if (null) 2 else 3)]
 +- OneRowRelation$                                                                                                            +- OneRowRelation$
{code}



> HiveTypeCoercion.IfCoercion should preserve original predicates.
> 
>
> Key: SPARK-14580
> URL: https://issues.apache.org/jira/browse/SPARK-14580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Dongjoon Hyun
>
> Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose 
> return-type is null. However, some UDFs need evaluations because they are 
> designed to throw exceptions.
> *Hive*
> {code}
> hive> select if(assert_true(false),2,3);
> OK
> Failed with exception 
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
> ASSERT_TRUE(): assertion failed.
> {code}
> 
> *Spark*
> {code}
> scala> sql("select if(assert_true(false),2,3)").head
> res2: org.apache.spark.sql.Row = [3]
> {code}
> `IfCoercion` works like the following.
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$IfCoercion ===
> !'Project [unresolvedalias(if (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFAssertTrue(false)) 2 else 3)]   'Project [unresolvedalias(if (null) 2 else 3)]
>  +- OneRowRelation$                                                                                                            +- OneRowRelation$
> {code}
> This issue aims to fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14580) HiveTypeCoercion.IfCoercion should preserve original predicates.

2016-04-12 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-14580:
--
Description: 
Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose 
return-type is null. However, some UDFs need evaluations because they are 
designed to throw exceptions.
*Hive*
{code}
hive> select if(assert_true(false),2,3);
OK
Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
ASSERT_TRUE(): assertion failed.
{code}

*Spark*
{code}
scala> sql("select if(assert_true(false),2,3)").head
res2: org.apache.spark.sql.Row = [3]
{code}

`IfCoercion` works like the following.
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$IfCoercion ===
!'Project [unresolvedalias(if (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFAssertTrue(false)) 2 else 3)]   'Project [unresolvedalias(if (null) 2 else 3)]
 +- OneRowRelation$                                                                                                            +- OneRowRelation$
{code}


  was:
Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose 
return-type is null. However, some UDFs need evaluations because they are 
designed to throw exceptions.
*Hive*
{code}
hive> select if(assert_true(false),2,3);
OK
Failed with exception 
java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
ASSERT_TRUE(): assertion failed.
{code}

*Before*
{code}
scala> sql("select if(assert_true(false),2,3)").head
res2: org.apache.spark.sql.Row = [3]
{code}

`IfCoercion` works like the following.
{code}
=== Applying Rule org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$IfCoercion ===
!'Project [unresolvedalias(if (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFAssertTrue(false)) 2 else 3)]   'Project [unresolvedalias(if (null) 2 else 3)]
 +- OneRowRelation$                                                                                                            +- OneRowRelation$
{code}



> HiveTypeCoercion.IfCoercion should preserve original predicates.
> 
>
> Key: SPARK-14580
> URL: https://issues.apache.org/jira/browse/SPARK-14580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Dongjoon Hyun
>
> Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose 
> return-type is null. However, some UDFs need evaluations because they are 
> designed to throw exceptions.
> *Hive*
> {code}
> hive> select if(assert_true(false),2,3);
> OK
> Failed with exception 
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
> ASSERT_TRUE(): assertion failed.
> {code}
> 
> *Spark*
> {code}
> scala> sql("select if(assert_true(false),2,3)").head
> res2: org.apache.spark.sql.Row = [3]
> {code}
> `IfCoercion` works like the following.
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$IfCoercion ===
> !'Project [unresolvedalias(if (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFAssertTrue(false)) 2 else 3)]   'Project [unresolvedalias(if (null) 2 else 3)]
>  +- OneRowRelation$                                                                                                            +- OneRowRelation$
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14577) spark.sql.codegen.maxCaseBranches config option

2016-04-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238129#comment-15238129
 ] 

Dongjoon Hyun commented on SPARK-14577:
---

Oh, sure!
Thank you for the guidance.

> spark.sql.codegen.maxCaseBranches config option
> ---
>
> Key: SPARK-14577
> URL: https://issues.apache.org/jira/browse/SPARK-14577
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> We currently disable codegen for CaseWhen if the number of branches is 
> greater than 20 (in CaseWhen.MAX_NUM_CASES_FOR_CODEGEN). It would be better 
> if this value is a non-public config defined in SQLConf.
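
From the user's side, the intended behavior would look roughly like the sketch 
below, assuming the option lands under the name in the summary; `spark` is a 
SparkSession and `numBranches` the branch count of the CaseWhen being planned, 
both assumed for illustration. How the value is actually registered inside 
SQLConf is left to the implementation:
{code}
// Sketch only (spark-shell style): consult the proposed option, falling back
// to the current hard-coded limit of 20, before deciding whether to codegen
// a CaseWhen expression.
val maxBranches = spark.conf.get("spark.sql.codegen.maxCaseBranches", "20").toInt
val useCodegenForCaseWhen = numBranches <= maxBranches
{code}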



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14580) HiveTypeCoercion.IfCoercion should preserve original predicates.

2016-04-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238127#comment-15238127
 ] 

Apache Spark commented on SPARK-14580:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/12340

> HiveTypeCoercion.IfCoercion should preserve original predicates.
> 
>
> Key: SPARK-14580
> URL: https://issues.apache.org/jira/browse/SPARK-14580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Dongjoon Hyun
>
> Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose 
> return-type is null. However, some UDFs need evaluations because they are 
> designed to throw exceptions.
> *Hive*
> {code}
> hive> select if(assert_true(false),2,3);
> OK
> Failed with exception 
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
> ASSERT_TRUE(): assertion failed.
> {code}
> 
> *Before*
> {code}
> scala> sql("select if(assert_true(false),2,3)").head
> res2: org.apache.spark.sql.Row = [3]
> {code}
> `IfCoercion` works like the following.
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$IfCoercion ===
> !'Project [unresolvedalias(if (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFAssertTrue(false)) 2 else 3)]   'Project [unresolvedalias(if (null) 2 else 3)]
>  +- OneRowRelation$                                                                                                            +- OneRowRelation$
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14580) HiveTypeCoercion.IfCoercion should preserve original predicates.

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14580:


Assignee: (was: Apache Spark)

> HiveTypeCoercion.IfCoercion should preserve original predicates.
> 
>
> Key: SPARK-14580
> URL: https://issues.apache.org/jira/browse/SPARK-14580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Dongjoon Hyun
>
> Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose 
> return-type is null. However, some UDFs need evaluations because they are 
> designed to throw exceptions.
> *Hive*
> {code}
> hive> select if(assert_true(false),2,3);
> OK
> Failed with exception 
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
> ASSERT_TRUE(): assertion failed.
> {code}
> 
> *Before*
> {code}
> scala> sql("select if(assert_true(false),2,3)").head
> res2: org.apache.spark.sql.Row = [3]
> {code}
> `IfCoercion` works like the following.
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$IfCoercion ===
> !'Project [unresolvedalias(if (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFAssertTrue(false)) 2 else 3)]   'Project [unresolvedalias(if (null) 2 else 3)]
>  +- OneRowRelation$                                                                                                            +- OneRowRelation$
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14580) HiveTypeCoercion.IfCoercion should preserve original predicates.

2016-04-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-14580:


Assignee: Apache Spark

> HiveTypeCoercion.IfCoercion should preserve original predicates.
> 
>
> Key: SPARK-14580
> URL: https://issues.apache.org/jira/browse/SPARK-14580
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> Currently, `HiveTypeCoercion.IfCoercion` removes all predicates whose 
> return-type is null. However, some UDFs need evaluations because they are 
> designed to throw exceptions.
> *Hive*
> {code}
> hive> select if(assert_true(false),2,3);
> OK
> Failed with exception 
> java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: 
> ASSERT_TRUE(): assertion failed.
> {code}
> 
> *Before*
> {code}
> scala> sql("select if(assert_true(false),2,3)").head
> res2: org.apache.spark.sql.Row = [3]
> {code}
> `IfCoercion` works like the following.
> {code}
> === Applying Rule org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$IfCoercion ===
> !'Project [unresolvedalias(if (HiveGenericUDF#org.apache.hadoop.hive.ql.udf.generic.GenericUDFAssertTrue(false)) 2 else 3)]   'Project [unresolvedalias(if (null) 2 else 3)]
>  +- OneRowRelation$                                                                                                            +- OneRowRelation$
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14547) Avoid DNS resolution for reusing connections

2016-04-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-14547.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Avoid DNS resolution for reusing connections
> 
>
> Key: SPARK-14547
> URL: https://issues.apache.org/jira/browse/SPARK-14547
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14550) OneHotEncoding wrapper in SparkR

2016-04-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238120#comment-15238120
 ] 

Joseph K. Bradley commented on SPARK-14550:
---

Please see comment on [SPARK-14546]

> OneHotEncoding wrapper in SparkR
> 
>
> Key: SPARK-14550
> URL: https://issues.apache.org/jira/browse/SPARK-14550
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Alok Singh
>
> Implement OneHotEncoding in R.
> In R, one can usually use model.matrix to do one-hot encoding; it accepts a 
> formula. I think we can support a simple formula here.
> model.matrix doc: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/model.matrix.html
> Here is an example that would be nice to have:
> http://stackoverflow.com/questions/16200241/recode-categorical-factor-with-n-categories-into-n-binary-columns



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14553) PCA wrapper for SparkR

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-14553.
-
Resolution: Later

Please see comment on [SPARK-14546]

> PCA wrapper for SparkR
> --
>
> Key: SPARK-14553
> URL: https://issues.apache.org/jira/browse/SPARK-14553
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Alok Singh
>
> Implement the SparkR wrapper for the PCA transformer
> https://spark.apache.org/docs/latest/ml-features.html#pca
> We should support an API similar to R's, i.e.
> {code}
> feature <- prcomp(df,
>   center = TRUE,
>   scale. = TRUE)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14552) ReValue wrapper for SparkR

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-14552.
-
Resolution: Later

Please see comment on [SPARK-14546]

> ReValue wrapper for SparkR
> --
>
> Key: SPARK-14552
> URL: https://issues.apache.org/jira/browse/SPARK-14552
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Alok Singh
>
> Implement the wrapper for VectorIndexer.
> The inspiring idea is the revalue function from the plyr package in R:
> {code}
> x <- c("a", "b", "c")
> revalue(x, c(a = "1", c = "2"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14546) Scale Wrapper in SparkR

2016-04-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238117#comment-15238117
 ] 

Joseph K. Bradley commented on SPARK-14546:
---

[~aloknsingh] Thanks for reporting these issues. We have not yet discussed how 
to support feature transformations in R, and we will need to design a good plan 
& API before we proceed with it. I'm going to close these issues as Later, and 
it will be great to revisit these questions in future releases.

> Scale Wrapper in SparkR
> ---
>
> Key: SPARK-14546
> URL: https://issues.apache.org/jira/browse/SPARK-14546
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Alok Singh
>
> ML has the StandardScaler, which seems to be very commonly used.
> This JIRA is to implement the SparkR wrapper for it.
> Here is the R scale command:
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html
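
For context, a minimal sketch of the Scala transformer such a wrapper would 
delegate to; `trainingDF` is an assumed DataFrame with a Vector column named 
"features":
{code}
import org.apache.spark.ml.feature.StandardScaler

// Sketch only (spark-shell style): the ML StandardScaler that a SparkR
// scale() wrapper would sit on top of.
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true)   // like center = TRUE in R's scale()
  .setWithStd(true)    // like scale. = TRUE
val scalerModel = scaler.fit(trainingDF)
val scaled = scalerModel.transform(trainingDF)
{code}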



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14546) Scale Wrapper in SparkR

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-14546.
-
Resolution: Later

> Scale Wrapper in SparkR
> ---
>
> Key: SPARK-14546
> URL: https://issues.apache.org/jira/browse/SPARK-14546
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Alok Singh
>
> ML has the StandardScaler, which seems to be very commonly used.
> This JIRA is to implement the SparkR wrapper for it.
> Here is the R scale command:
> https://stat.ethz.ch/R-manual/R-devel/library/base/html/scale.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-14550) OneHotEncoding wrapper in SparkR

2016-04-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-14550.
-
Resolution: Later

> OneHotEncoding wrapper in SparkR
> 
>
> Key: SPARK-14550
> URL: https://issues.apache.org/jira/browse/SPARK-14550
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Alok Singh
>
> Implement OneHotEncoding in R.
> In R, one can usually use model.matrix to do one-hot encoding; it accepts a 
> formula. I think we can support a simple formula here.
> model.matrix doc: 
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/model.matrix.html
> Here is an example that would be nice to have:
> http://stackoverflow.com/questions/16200241/recode-categorical-factor-with-n-categories-into-n-binary-columns



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14529) Consolidate mllib and mllib-local into one mllib folder

2016-04-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15238105#comment-15238105
 ] 

Joseph K. Bradley commented on SPARK-14529:
---

Will this be confusing if, at some point, we decide to make graphx depend on 
mllib-local?

> Consolidate mllib and mllib-local into one mllib folder
> ---
>
> Key: SPARK-14529
> URL: https://issues.apache.org/jira/browse/SPARK-14529
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: DB Tsai
>Assignee: DB Tsai
>
> In the 2.0 QA period (to avoid conflicts with other PRs), this task will 
> consolidate `mllib/src` into `mllib/mllib/src` and `mllib-local/src` into 
> `mllib/mllib-local/src`. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


