[jira] [Assigned] (SPARK-17660) DESC FORMATTED for VIEW Lacks View Definition

2016-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17660:


Assignee: (was: Apache Spark)

> DESC FORMATTED for VIEW Lacks View Definition
> -
>
> Key: SPARK-17660
> URL: https://issues.apache.org/jira/browse/SPARK-17660
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>
> Currently, DESC FORMATTED does not have a section for the view definition. We 
> should add it for permanent views, like what Hive does. Below is an example 
> with the desired view definition.
> {noformat}
> +----------------------------+---------------------------------------------------------+-------+
> |col_name                    |data_type                                                |comment|
> +----------------------------+---------------------------------------------------------+-------+
> |a                           |int                                                      |null   |
> |                            |                                                         |       |
> |# Detailed Table Information|                                                         |       |
> |Database:                   |default                                                  |       |
> |Owner:                      |xiaoli                                                   |       |
> |Create Time:                |Sat Sep 24 21:46:19 PDT 2016                             |       |
> |Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                             |       |
> |Location:                   |                                                         |       |
> |Table Type:                 |VIEW                                                     |       |
> |Table Parameters:           |                                                         |       |
> |  transient_lastDdlTime     |1474778779                                               |       |
> |                            |                                                         |       |
> |# Storage Information       |                                                         |       |
> |SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       |       |
> |InputFormat:                |org.apache.hadoop.mapred.SequenceFileInputFormat         |       |
> |OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat|       |
> |Compressed:                 |No                                                       |       |
> |Storage Desc Parameters:    |                                                         |       |
> |  serialization.format      |1                                                        |       |
> |                            |                                                         |       |
> |# View Information          |                                                         |       |

[jira] [Assigned] (SPARK-17660) DESC FORMATTED for VIEW Lacks View Definition

2016-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17660:


Assignee: Apache Spark

> DESC FORMATTED for VIEW Lacks View Definition
> -
>
> Key: SPARK-17660
> URL: https://issues.apache.org/jira/browse/SPARK-17660
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Currently, DESC FORMATTED does not have a section for the view definition. We 
> should add it for permanent views, like what Hive does. Below is an example 
> with the desired view definition.
> {noformat}
> +----------------------------+---------------------------------------------------------+-------+
> |col_name                    |data_type                                                |comment|
> +----------------------------+---------------------------------------------------------+-------+
> |a                           |int                                                      |null   |
> |                            |                                                         |       |
> |# Detailed Table Information|                                                         |       |
> |Database:                   |default                                                  |       |
> |Owner:                      |xiaoli                                                   |       |
> |Create Time:                |Sat Sep 24 21:46:19 PDT 2016                             |       |
> |Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                             |       |
> |Location:                   |                                                         |       |
> |Table Type:                 |VIEW                                                     |       |
> |Table Parameters:           |                                                         |       |
> |  transient_lastDdlTime     |1474778779                                               |       |
> |                            |                                                         |       |
> |# Storage Information       |                                                         |       |
> |SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       |       |
> |InputFormat:                |org.apache.hadoop.mapred.SequenceFileInputFormat         |       |
> |OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat|       |
> |Compressed:                 |No                                                       |       |
> |Storage Desc Parameters:    |                                                         |       |
> |  serialization.format      |1                                                        |       |
> |                            |                                                         |       |
> |# View Information          |                                                         |       |

[jira] [Commented] (SPARK-17660) DESC FORMATTED for VIEW Lacks View Definition

2016-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15520189#comment-15520189
 ] 

Apache Spark commented on SPARK-17660:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15234

> DESC FORMATTED for VIEW Lacks View Definition
> -
>
> Key: SPARK-17660
> URL: https://issues.apache.org/jira/browse/SPARK-17660
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>
> Currently, DESC FORMATTED does not have a section for the view definition. We 
> should add it for permanent views, like what Hive does. Below is an example 
> with the desired view definition.
> {noformat}
> +----------------------------+---------------------------------------------------------+-------+
> |col_name                    |data_type                                                |comment|
> +----------------------------+---------------------------------------------------------+-------+
> |a                           |int                                                      |null   |
> |                            |                                                         |       |
> |# Detailed Table Information|                                                         |       |
> |Database:                   |default                                                  |       |
> |Owner:                      |xiaoli                                                   |       |
> |Create Time:                |Sat Sep 24 21:46:19 PDT 2016                             |       |
> |Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                             |       |
> |Location:                   |                                                         |       |
> |Table Type:                 |VIEW                                                     |       |
> |Table Parameters:           |                                                         |       |
> |  transient_lastDdlTime     |1474778779                                               |       |
> |                            |                                                         |       |
> |# Storage Information       |                                                         |       |
> |SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       |       |
> |InputFormat:                |org.apache.hadoop.mapred.SequenceFileInputFormat         |       |
> |OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat|       |
> |Compressed:                 |No                                                       |       |
> |Storage Desc Parameters:    |                                                         |       |
> |  serialization.format      |1                                                        |       |
> |                            |                                                         |       |

[jira] [Created] (SPARK-17660) DESC FORMATTED for VIEW Lacks View Definition

2016-09-24 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17660:
---

 Summary: DESC FORMATTED for VIEW Lacks View Definition
 Key: SPARK-17660
 URL: https://issues.apache.org/jira/browse/SPARK-17660
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 2.1.0
Reporter: Xiao Li


Currently, DESC FORMATTED does not have a section for the view definition. We 
should add it for permanent views, like what Hive does. Below is an example 
with the desired view definition.

{noformat}
+----------------------------+---------------------------------------------------------+-------+
|col_name                    |data_type                                                |comment|
+----------------------------+---------------------------------------------------------+-------+
|a                           |int                                                      |null   |
|                            |                                                         |       |
|# Detailed Table Information|                                                         |       |
|Database:                   |default                                                  |       |
|Owner:                      |xiaoli                                                   |       |
|Create Time:                |Sat Sep 24 21:46:19 PDT 2016                             |       |
|Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                             |       |
|Location:                   |                                                         |       |
|Table Type:                 |VIEW                                                     |       |
|Table Parameters:           |                                                         |       |
|  transient_lastDdlTime     |1474778779                                               |       |
|                            |                                                         |       |
|# Storage Information       |                                                         |       |
|SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       |       |
|InputFormat:                |org.apache.hadoop.mapred.SequenceFileInputFormat         |       |
|OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat|       |
|Compressed:                 |No                                                       |       |
|Storage Desc Parameters:    |                                                         |       |
|  serialization.format      |1                                                        |       |
|                            |                                                         |       |
|# View Information          |                                                         |       |
|View Original Text:         |SELECT * FROM tbl                                        |       |
|View Expanded Text:         |SELECT
{noformat}

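For reference, a minimal way to produce output like the above (a hedged sketch; {{tbl}} comes from the `View Original Text` row, while the view name {{v}} and the column definition are assumptions):

{code}
spark.sql("CREATE TABLE tbl (a INT)")
spark.sql("CREATE VIEW v AS SELECT * FROM tbl")
// The "# View Information" section shown above is what DESC FORMATTED should include.
spark.sql("DESC FORMATTED v").show(50, truncate = false)
{code}
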
[jira] [Commented] (SPARK-17659) Partitioned View is Not Supported In SHOW CREATE TABLE

2016-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15520095#comment-15520095
 ] 

Apache Spark commented on SPARK-17659:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15233

> Partitioned View is Not Supported In SHOW CREATE TABLE
> --
>
> Key: SPARK-17659
> URL: https://issues.apache.org/jira/browse/SPARK-17659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>
> `Partitioned View` is not supported by Spark SQL. For a Hive partitioned view, 
> SHOW CREATE TABLE is unable to generate the right DDL. Thus, SHOW CREATE 
> TABLE should not support it, just like the other Hive-only features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17659) Partitioned View is Not Supported In SHOW CREATE TABLE

2016-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17659:


Assignee: Apache Spark

> Partitioned View is Not Supported In SHOW CREATE TABLE
> --
>
> Key: SPARK-17659
> URL: https://issues.apache.org/jira/browse/SPARK-17659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> `Partitioned View` is not supported by Spark SQL. For a Hive partitioned view, 
> SHOW CREATE TABLE is unable to generate the right DDL. Thus, SHOW CREATE 
> TABLE should not support it, just like the other Hive-only features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17659) Partitioned View is Not Supported In SHOW CREATE TABLE

2016-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17659:


Assignee: (was: Apache Spark)

> Partitioned View is Not Supported In SHOW CREATE TABLE
> --
>
> Key: SPARK-17659
> URL: https://issues.apache.org/jira/browse/SPARK-17659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>
> `Partitioned View` is not supported by Spark SQL. For a Hive partitioned view, 
> SHOW CREATE TABLE is unable to generate the right DDL. Thus, SHOW CREATE 
> TABLE should not support it, just like the other Hive-only features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17659) Partitioned View is Not Supported In SHOW CREATE TABLE

2016-09-24 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17659:
---

 Summary: Partitioned View is Not Supported In SHOW CREATE TABLE
 Key: SPARK-17659
 URL: https://issues.apache.org/jira/browse/SPARK-17659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 2.1.0
Reporter: Xiao Li


`Partitioned View` is not supported by Spark SQL. For a Hive partitioned view, 
SHOW CREATE TABLE is unable to generate the right DDL. Thus, SHOW CREATE TABLE 
should not support it, just like the other Hive-only features.
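For illustration, a hedged sketch of the situation (the view and table names are made up, not from this JIRA):

{code}
// Such a view can only be created through Hive, e.g. (Hive-only DDL, assumed example):
//   CREATE VIEW part_view PARTITIONED ON (ds) AS SELECT key, value, ds FROM src_part
// In Spark SQL, SHOW CREATE TABLE should now fail fast for such a view instead of
// emitting DDL that silently drops the PARTITIONED ON clause:
spark.sql("SHOW CREATE TABLE part_view").show(truncate = false)
{code}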




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17631) Structured Streaming - Add Http Stream Sink

2016-09-24 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519970#comment-15519970
 ] 

zhangxinyu commented on SPARK-17631:


h4. A short design for this feature

h5. Goal
Build an HTTP sink for Structured Streaming, so that streaming query results can 
be written out to HTTP servers.

h5. Usage
# The streaming query results should have a single string column.
# We should configure ".format("http").option("url", yourHttpUrl)" in our 
programs to create HTTP sinks, e.g.
  val query = counts.writeStream
    .outputMode("complete")
    .format("http")
    .option("url", "yourHttpUrl")
    .start()

h5. Design
# Add a class "HttpSink" that extends trait "Sink" and overrides function 
"addBatch": each Row in the DataFrame is written out by sending an HTTP POST 
request.
# Add a class "HttpStreamSink" that extends both trait "StreamSinkProvider" and 
trait "DataSourceRegister". It overrides two functions:
   - shortName: returns the string "http"
   - createSink: returns an HttpSink instance
A rough sketch of these two classes follows below.
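A minimal, hedged sketch assuming Spark 2.0's {{Sink}}, {{StreamSinkProvider}}, and {{DataSourceRegister}} interfaces; the plain {{HttpURLConnection}} usage and the {{url}} option name are illustrative choices, not settled parts of the proposal:

{code}
import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

// Writes every row of each micro-batch (a single string column) as one HTTP POST request.
class HttpSink(url: String) extends Sink {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    data.collect().foreach { row =>
      val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("POST")
      conn.setDoOutput(true)
      val out = new OutputStreamWriter(conn.getOutputStream)
      try out.write(row.getString(0)) finally out.close()
      conn.getResponseCode  // force the request; error handling omitted in this sketch
      conn.disconnect()
    }
  }
}

// Registers the sink under .format("http").
class HttpStreamSink extends StreamSinkProvider with DataSourceRegister {
  override def shortName(): String = "http"

  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new HttpSink(parameters("url"))
}
{code}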

h5. Other features to debate
# Should we support HTTPS too?
# Do we need to set any headers (e.g. maybe the batch id)?

> Structured Streaming - Add Http Stream Sink
> ---
>
> Key: SPARK-17631
> URL: https://issues.apache.org/jira/browse/SPARK-17631
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>Priority: Minor
>
> Streaming query results can be sent to an HTTP server through HTTP POST requests.
> GitHub: https://github.com/apache/spark/pull/15194



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-24 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519953#comment-15519953
 ] 

Xiao Li commented on SPARK-17653:
-

Yeah. You are right. It does not work. : )

> Optimizer should remove unnecessary distincts (in multiple unions)
> --
>
> Key: SPARK-17653
> URL: https://issues.apache.org/jira/browse/SPARK-17653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Reynold Xin
>
> Query:
> {code}
> select 1 a union select 2 b union select 3 c
> {code}
> Explain plan:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#13], functions=[])
> +- Exchange hashpartitioning(a#13, 200)
>+- *HashAggregate(keys=[a#13], functions=[])
>   +- Union
>  :- *HashAggregate(keys=[a#13], functions=[])
>  :  +- Exchange hashpartitioning(a#13, 200)
>  : +- *HashAggregate(keys=[a#13], functions=[])
>  :+- Union
>  :   :- *Project [1 AS a#13]
>  :   :  +- Scan OneRowRelation[]
>  :   +- *Project [2 AS b#14]
>  :  +- Scan OneRowRelation[]
>  +- *Project [3 AS c#15]
> +- Scan OneRowRelation[]
> {code}
> Only one distinct should be necessary. This makes a bunch of unions slower 
> than a bunch of union alls followed by a distinct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-24 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519950#comment-15519950
 ] 

Xiao Li commented on SPARK-17653:
-

I see. After rethinking it, Union is special: my PR is not applicable to it, and 
we are unable to eliminate the Distinct in this pattern. I think what you said is 
correct; we can do it for UNION. Do you want me to try it, or has somebody else 
already started on it? Thanks!

BTW, in traditional RDBMSs, many optimizer rules are based on unique 
constraints. However, Spark SQL does not have the concept of primary keys or 
unique constraints. If we allowed users to specify unique constraints using 
hints, we could further optimize the plan and the execution. Do you think adding 
such a HINT would be OK for Spark SQL? 

> Optimizer should remove unnecessary distincts (in multiple unions)
> --
>
> Key: SPARK-17653
> URL: https://issues.apache.org/jira/browse/SPARK-17653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Reynold Xin
>
> Query:
> {code}
> select 1 a union select 2 b union select 3 c
> {code}
> Explain plan:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#13], functions=[])
> +- Exchange hashpartitioning(a#13, 200)
>+- *HashAggregate(keys=[a#13], functions=[])
>   +- Union
>  :- *HashAggregate(keys=[a#13], functions=[])
>  :  +- Exchange hashpartitioning(a#13, 200)
>  : +- *HashAggregate(keys=[a#13], functions=[])
>  :+- Union
>  :   :- *Project [1 AS a#13]
>  :   :  +- Scan OneRowRelation[]
>  :   +- *Project [2 AS b#14]
>  :  +- Scan OneRowRelation[]
>  +- *Project [3 AS c#15]
> +- Scan OneRowRelation[]
> {code}
> Only one distinct should be necessary. This makes a bunch of unions slower 
> than a bunch of union alls followed by a distinct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519923#comment-15519923
 ] 

Reynold Xin commented on SPARK-17653:
-

[~smilegator] - I just took a quick look at #11930. It looks to me like it mainly 
propagates the uniqueness property up, whereas in this case we want to remove 
distincts further down a subtree. How would it work in your case?


> Optimizer should remove unnecessary distincts (in multiple unions)
> --
>
> Key: SPARK-17653
> URL: https://issues.apache.org/jira/browse/SPARK-17653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Reynold Xin
>
> Query:
> {code}
> select 1 a union select 2 b union select 3 c
> {code}
> Explain plan:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#13], functions=[])
> +- Exchange hashpartitioning(a#13, 200)
>+- *HashAggregate(keys=[a#13], functions=[])
>   +- Union
>  :- *HashAggregate(keys=[a#13], functions=[])
>  :  +- Exchange hashpartitioning(a#13, 200)
>  : +- *HashAggregate(keys=[a#13], functions=[])
>  :+- Union
>  :   :- *Project [1 AS a#13]
>  :   :  +- Scan OneRowRelation[]
>  :   +- *Project [2 AS b#14]
>  :  +- Scan OneRowRelation[]
>  +- *Project [3 AS c#15]
> +- Scan OneRowRelation[]
> {code}
> Only one distinct should be necessary. This makes a bunch of unions slower 
> than a bunch of union alls followed by a distinct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519710#comment-15519710
 ] 

Reynold Xin commented on SPARK-17653:
-

There are different ways to fix this, from fairly general ones to more surgical 
ones. The most surgical fix I can think of is to just match a bunch of 
Distinct(Union(Distinct(Union(...)))) trees and combine each of them into a 
single Distinct(Union(...)).

If the more general fix is simple enough, that could be a good idea too.

cc [~vssrinath]
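A hedged sketch of what the surgical rewrite could look like as a Catalyst rule (illustrative only, not an actual Spark rule; it assumes it runs before {{Distinct}} is rewritten into an {{Aggregate}}):

{code}
import org.apache.spark.sql.catalyst.plans.logical.{Distinct, LogicalPlan, Union}
import org.apache.spark.sql.catalyst.rules.Rule

// Collapse Distinct(Union(Distinct(Union(...)))) chains into one Distinct(Union(...)).
object CombineNestedDistinctUnions extends Rule[LogicalPlan] {
  // Gather all Union branches, looking through nested Distinct(Union(...)) children.
  private def flatten(plan: LogicalPlan): Seq[LogicalPlan] = plan match {
    case Union(children)           => children.flatMap(flatten)
    case Distinct(Union(children)) => children.flatMap(flatten)
    case other                     => Seq(other)
  }

  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Distinct(u: Union) => Distinct(Union(flatten(u)))
  }
}
{code}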

> Optimizer should remove unnecessary distincts (in multiple unions)
> --
>
> Key: SPARK-17653
> URL: https://issues.apache.org/jira/browse/SPARK-17653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Reynold Xin
>
> Query:
> {code}
> select 1 a union select 2 b union select 3 c
> {code}
> Explain plan:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#13], functions=[])
> +- Exchange hashpartitioning(a#13, 200)
>+- *HashAggregate(keys=[a#13], functions=[])
>   +- Union
>  :- *HashAggregate(keys=[a#13], functions=[])
>  :  +- Exchange hashpartitioning(a#13, 200)
>  : +- *HashAggregate(keys=[a#13], functions=[])
>  :+- Union
>  :   :- *Project [1 AS a#13]
>  :   :  +- Scan OneRowRelation[]
>  :   +- *Project [2 AS b#14]
>  :  +- Scan OneRowRelation[]
>  +- *Project [3 AS c#15]
> +- Scan OneRowRelation[]
> {code}
> Only one distinct should be necessary. This makes a bunch of unions slower 
> than a bunch of union alls followed by a distinct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17609) SessionCatalog.tableExists should not check temp view

2016-09-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17609:

Fix Version/s: (was: 2.0.2)
   2.0.1

> SessionCatalog.tableExists should not check temp view
> -
>
> Key: SPARK-17609
> URL: https://issues.apache.org/jira/browse/SPARK-17609
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17640) Avoid using -1 as the default batchId for FileStreamSource.FileEntry

2016-09-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17640:

Fix Version/s: (was: 2.0.2)
   2.0.1

> Avoid using -1 as the default batchId for FileStreamSource.FileEntry
> 
>
> Key: SPARK-17640
> URL: https://issues.apache.org/jira/browse/SPARK-17640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17502) Multiple Bugs in DDL Statements on Temporary Views

2016-09-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17502:

Fix Version/s: (was: 2.0.2)
   2.0.1

> Multiple Bugs in DDL Statements on Temporary Views 
> ---
>
> Key: SPARK-17502
> URL: https://issues.apache.org/jira/browse/SPARK-17502
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.1, 2.1.0
>
>
> - When the permanent tables/views do not exist but the temporary view exists, 
> the expected error should be `NoSuchTableException` for partition-related 
> ALTER TABLE commands. However, it always reports a confusing error message. 
> For example, 
> {noformat}
> Partition spec is invalid. The spec (a, b) must match the partition spec () 
> defined in table '`testview`';
> {noformat}
> - When the permanent tables/views do not exist but the temporary view exists, 
> the expected error for `ALTER TABLE ... UNSET TBLPROPERTIES` should also be 
> `NoSuchTableException`. However, it reports a missing table property instead. 
> For example, 
> {noformat}
> Attempted to unset non-existent property 'p' in table '`testView`';
> {noformat}
> - When `ANALYZE TABLE` is called on a view or a temporary view, we should 
> issue an error message. However, it reports a strange error:
> {noformat}
> ANALYZE TABLE is not supported for Project
> {noformat}
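For reference, a hedged sketch of how the first case above can be hit (the view name and partition spec are made up for illustration):

{code}
spark.sql("CREATE TEMPORARY VIEW testview AS SELECT 1 AS a, 2 AS b")
// Expected: NoSuchTableException for the non-existent permanent table;
// actual (before the fix): the confusing partition-spec error quoted above.
spark.sql("ALTER TABLE testview DROP PARTITION (a='1', b='2')")
{code}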



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17210) sparkr.zip is not distributed to executors when run sparkr in RStudio

2016-09-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17210:

Fix Version/s: (was: 2.0.2)
   2.0.1

> sparkr.zip is not distributed to executors when run sparkr in RStudio
> -
>
> Key: SPARK-17210
> URL: https://issues.apache.org/jira/browse/SPARK-17210
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 2.0.1, 2.1.0
>
>
> Here's the code to reproduce this issue. 
> {code}
> Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
> .libPaths(c(file.path(Sys.getenv(), "R", "lib"), .libPaths()))
> library(SparkR)
> sparkR.session(master="yarn-client", sparkConfig = 
> list(spark.executor.instances="1"))
> df <- as.DataFrame(mtcars)
> head(df)
> {code}
> And this is the exception in executor log.
> {noformat}
> 16/08/24 15:33:45 INFO BufferedStreamThread: Fatal error: cannot open file 
> '/Users/jzhang/Temp/hadoop_tmp/nm-local-dir/usercache/jzhang/appcache/application_1471846125517_0022/container_1471846125517_0022_01_02/sparkr/SparkR/worker/daemon.R':
>  No such file or directory
> 16/08/24 15:33:55 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 6)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
> at java.net.ServerSocket.implAccept(ServerSocket.java:545)
> at java.net.ServerSocket.accept(ServerSocket.java:513)
> at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:367)
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA

2016-09-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16240:

Fix Version/s: (was: 2.0.2)
   2.0.1

> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
>  Issue Type: Bug
>Reporter: yuhao yang
>Assignee: Gayathri Murali
> Fix For: 2.0.1, 2.1.0
>
>
> After resolving the matrix conversion issue, the LDA model still cannot load 
> 1.6 models, as one of the parameter names was changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17651) Automate Spark version update for documentations

2016-09-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-17651:

Fix Version/s: (was: 2.0.2)
   2.0.1

> Automate Spark version update for documentations
> 
>
> Key: SPARK-17651
> URL: https://issues.apache.org/jira/browse/SPARK-17651
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Documentation
>Reporter: Reynold Xin
>Assignee: Shivaram Venkataraman
> Fix For: 2.0.1, 2.1.0
>
>
> Both the Jekyll generated docs and SparkR API reference docs have a version 
> number in them. It would be great to automate those in the release script 
> without having to manually update using a commit.
> cc [~shivaram]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-24 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15519497#comment-15519497
 ] 

Yan commented on SPARK-17556:
-

For 2), I think BitTorrent won't help in the case of all-to-all transfers, 
unlike one-to-all transfers such as the driver-to-cluster broadcast, or 
few-to-all transfers. Thanks.

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17631) Structured Streaming - Add Http Stream Sink

2016-09-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17631:
--
Fix Version/s: (was: 2.0.0)

> Structured Streaming - Add Http Stream Sink
> ---
>
> Key: SPARK-17631
> URL: https://issues.apache.org/jira/browse/SPARK-17631
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>Priority: Minor
>
> Streaming query results can be sent to an HTTP server through HTTP POST requests.
> GitHub: https://github.com/apache/spark/pull/15194



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17499) make the default params in sparkR spark.mlp consistent with MultilayerPerceptronClassifier

2016-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518938#comment-15518938
 ] 

Apache Spark commented on SPARK-17499:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15232

> make the default params in sparkR spark.mlp consistent with 
> MultilayerPerceptronClassifier
> --
>
> Key: SPARK-17499
> URL: https://issues.apache.org/jira/browse/SPARK-17499
> Project: Spark
>  Issue Type: Improvement
>Reporter: Weichen Xu
>Assignee: Weichen Xu
> Fix For: 2.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Several default params in SparkR spark.mlp are wrong:
> layers should be null
> tol should be 1e-6
> stepSize should be 0.03
> seed should be -763139545



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17658) write.df API requires path which is not actually always necessary in SparkR

2016-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17658:


Assignee: (was: Apache Spark)

> write.df API requires path which is not actually always necessary in SparkR
> ---
>
> Key: SPARK-17658
> URL: https://issues.apache.org/jira/browse/SPARK-17658
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It seems {{write.df}} in SparkR always requires taking {{path}}. This is 
> actually not always necessary.
> For example, if we have a datasource extending {{CreatableRelationProvider}}, 
> it might not request {{path}}. 
> FWIW, Python/Scala do not require this in the API already.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17658) write.df API requires path which is not actually always necessary in SparkR

2016-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518856#comment-15518856
 ] 

Apache Spark commented on SPARK-17658:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15231

> write.df API requires path which is not actually always necessary in SparkR
> ---
>
> Key: SPARK-17658
> URL: https://issues.apache.org/jira/browse/SPARK-17658
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It seems {{write.df}} in SparkR always requires taking {{path}}. This is 
> actually not always necessary.
> For example, if we have a datasource extending {{CreatableRelationProvider}}, 
> it might not request {{path}}. 
> FWIW, Python/Scala do not require this in the API already.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17658) write.df API requires path which is not actually always necessary in SparkR

2016-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17658:


Assignee: Apache Spark

> write.df API requires path which is not actually always necessary in SparkR
> ---
>
> Key: SPARK-17658
> URL: https://issues.apache.org/jira/browse/SPARK-17658
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> It seems {{write.df}} in SparkR always requires taking {{path}}. This is 
> actually not always necessary.
> For example, if we have a datasource extending {{CreatableRelationProvider}}, 
> it might not request {{path}}. 
> FWIW, Python/Scala do not require this in the API already.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17658) write.df API requires path which is not actually always necessary in SparkR

2016-09-24 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-17658:


 Summary: write.df API requires path which is not actually always 
necessary in SparkR
 Key: SPARK-17658
 URL: https://issues.apache.org/jira/browse/SPARK-17658
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.0.0
Reporter: Hyukjin Kwon
Priority: Minor


It seems {{write.df}} in SparkR always requires taking {{path}}. This is 
actually not always necessary.

For example, if we have a datasource extending {{CreatableRelationProvider}}, 
it might not request {{path}}. 

FWIW, Python/Scala do not require this in the API already.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-24 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518790#comment-15518790
 ] 

Liang-Chi Hsieh commented on SPARK-17556:
-

For 1), it is true only if your driver is outside of the cluster, so you can 
avoid uploading data from the driver to the cluster. If it is in cluster mode, 
then I think there is no obvious difference between uploading data from the 
driver and from any executor.

For 2), I think that is not exactly correct. Basically we use a BitTorrent-like 
approach to fetch blocks, so the slaves do need to connect to all the others in 
the end.

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-24 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh updated SPARK-17556:

Attachment: executor-side-broadcast.pdf

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf, executor-side-broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17556) Executor side broadcast for broadcast joins

2016-09-24 Thread Yan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518598#comment-15518598
 ] 

Yan commented on SPARK-17556:
-

A few comments of mine are as follows:

1) The "one-executor collection" approach is different from the driver-side 
collection and broadcasting in that it avoids uploading data from the driver 
back to the cluster. The primary concern with the "one-executor collection" 
approach, as pointed out, is that the sole executor could get bottlenecked, 
similar, to a large degree, to the latency issue with the "driver-side 
collection" approach;
2) The "all-executor collection" approach is more balanced and scalable, but it 
might suffer from network storming, since all slaves need to connect to all the 
others.
3) The real issue is the repeated, and thus wasted, work of collecting pieces 
of the broadcast data by multiple collectors/broadcasters, versus the extended 
latency if the collection/broadcasting is performed once and for all. This is 
actually not very different from the scenario of multiple vs. single reducers 
in a map/reduce execution: the final output from a single reducer is ready to 
use, while the outputs from multiple reducers require final assembly by the end 
users, particularly if the final result is to be organized, e.g., totally 
ordered. But using multiple reducers is more scalable, balanced, and likely 
faster. 
4) It's probably good to have a configurable # of executors acting as 
collectors/broadcasters, each of which just collects and broadcasts a portion 
of the broadcast table for the final join executions.

> Executor side broadcast for broadcast joins
> ---
>
> Key: SPARK-17556
> URL: https://issues.apache.org/jira/browse/SPARK-17556
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Reporter: Reynold Xin
> Attachments: executor broadcast.pdf
>
>
> Currently in Spark SQL, in order to perform a broadcast join, the driver must 
> collect the result of an RDD and then broadcast it. This introduces some 
> extra latency. It might be possible to broadcast directly from executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17057) ProbabilisticClassifierModels' thresholds should have at most one 0

2016-09-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17057.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15149
[https://github.com/apache/spark/pull/15149]

> ProbabilisticClassifierModels' thresholds should have at most one 0
> ---
>
> Key: SPARK-17057
> URL: https://issues.apache.org/jira/browse/SPARK-17057
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: zhengruifeng
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.1.0
>
>
> {code}
> val path = "./data/mllib/sample_multiclass_classification_data.txt"
> val data = spark.read.format("libsvm").load(path)
> val rfm = rf.fit(data)
> scala> rfm.setThresholds(Array(0.0,0.0,0.0))
> res4: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_cbe640b0eccc) with 20 trees
> scala> rfm.transform(data).show(5)
> +-++--+-+--+
> |label|features| rawPrediction|  probability|prediction|
> +-++--+-+--+
> |  1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]|   0.0|
> |  1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]|   0.0|
> |  1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]|   0.0|
> |  1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]|   0.0|
> |  0.0|(4,[0,1,2,3],[0.1...|[20.0,0.0,0.0]|[1.0,0.0,0.0]|   0.0|
> +-++--+-+--+
> only showing top 5 rows
> {code}
> If multiple thresholds are set to zero, the prediction of 
> {{ProbabilisticClassificationModel}} is the first index whose corresponding 
> threshold is 0. 
> However, in this case it would be more reasonable to mark the index with the 
> max {{probability}} among the 0-threshold indices as the {{prediction}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17057) ProbabilisticClassifierModels' thresholds should have at most one 0

2016-09-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17057:
--
Issue Type: Improvement  (was: Bug)
   Summary: ProbabilisticClassifierModels' thresholds should have at most 
one 0  (was: ProbabilisticClassifierModels' thresholds should be > 0)

> ProbabilisticClassifierModels' thresholds should have at most one 0
> ---
>
> Key: SPARK-17057
> URL: https://issues.apache.org/jira/browse/SPARK-17057
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: zhengruifeng
>Assignee: Sean Owen
>Priority: Minor
>
> {code}
> val path = "./data/mllib/sample_multiclass_classification_data.txt"
> val data = spark.read.format("libsvm").load(path)
> val rfm = rf.fit(data)
> scala> rfm.setThresholds(Array(0.0,0.0,0.0))
> res4: org.apache.spark.ml.classification.RandomForestClassificationModel = 
> RandomForestClassificationModel (uid=rfc_cbe640b0eccc) with 20 trees
> scala> rfm.transform(data).show(5)
> +-++--+-+--+
> |label|features| rawPrediction|  probability|prediction|
> +-++--+-+--+
> |  1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]|   0.0|
> |  1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]|   0.0|
> |  1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]|   0.0|
> |  1.0|(4,[0,1,2,3],[-0|[0.0,20.0,0.0]|[0.0,1.0,0.0]|   0.0|
> |  0.0|(4,[0,1,2,3],[0.1...|[20.0,0.0,0.0]|[1.0,0.0,0.0]|   0.0|
> +-++--+-+--+
> only showing top 5 rows
> {code}
> If multiple thresholds are set to zero, the prediction of 
> {{ProbabilisticClassificationModel}} is the first index whose corresponding 
> threshold is 0. 
> However, in this case it would be more reasonable to mark the index with the 
> max {{probability}} among the 0-threshold indices as the {{prediction}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17656) Decide on the variant of @scala.annotation.varargs and use consistently

2016-09-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17656:
--
Affects Version/s: (was: 2.0.2)
   2.0.0
 Priority: Trivial  (was: Major)

(Not Major, can't affect unreleased 2.0.2)

There is only one annotation; it's a question of how to import it. The normal 
thing to do is {{import scala.annotation.varargs}} and then {{@varargs}}. The 
{{_root_}} prefix has to be used where necessary to disambiguate the import, 
but it apparently isn't needed anywhere in the code right now.
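For illustration, a minimal example of that convention (the class and method names are made up):

{code}
import scala.annotation.varargs

class Functions {
  // @varargs makes this Scala varargs method callable from Java as a varargs method.
  @varargs
  def concat(parts: String*): String = parts.mkString(",")
}
{code}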

> Decide on the variant of @scala.annotation.varargs and use consistently
> ---
>
> Key: SPARK-17656
> URL: https://issues.apache.org/jira/browse/SPARK-17656
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> After the [discussion at 
> dev@spark|http://apache-spark-developers-list.1001551.n3.nabble.com/scala-annotation-varargs-or-root-scala-annotation-varargs-td18898.html]
>  it appears there's a consensus to review the use of 
> {{@scala.annotation.varargs}} throughout the codebase and use one variant and 
> use it consistently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10835) Word2Vec should accept non-null string array, in addition to existing null string array

2016-09-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10835.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

Issue resolved by pull request 15179
[https://github.com/apache/spark/pull/15179]

> Word2Vec should accept non-null string array, in addition to existing null 
> string array
> ---
>
> Key: SPARK-10835
> URL: https://issues.apache.org/jira/browse/SPARK-10835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Sumit Chawla
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> Currently the output type of NGram is Array(String, false), which is not 
> compatible with LDA, since its input type is Array(String, true). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17653) Optimizer should remove unnecessary distincts (in multiple unions)

2016-09-24 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518577#comment-15518577
 ] 

Xiao Li commented on SPARK-17653:
-

[~rxin] I submitted a PR https://github.com/apache/spark/pull/11930 for 
resolving a related issue. If you think that is a right direction, I will 
continue/enhance it and write the design doc. 

> Optimizer should remove unnecessary distincts (in multiple unions)
> --
>
> Key: SPARK-17653
> URL: https://issues.apache.org/jira/browse/SPARK-17653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Reynold Xin
>
> Query:
> {code}
> select 1 a union select 2 b union select 3 c
> {code}
> Explain plan:
> {code}
> == Physical Plan ==
> *HashAggregate(keys=[a#13], functions=[])
> +- Exchange hashpartitioning(a#13, 200)
>+- *HashAggregate(keys=[a#13], functions=[])
>   +- Union
>  :- *HashAggregate(keys=[a#13], functions=[])
>  :  +- Exchange hashpartitioning(a#13, 200)
>  : +- *HashAggregate(keys=[a#13], functions=[])
>  :+- Union
>  :   :- *Project [1 AS a#13]
>  :   :  +- Scan OneRowRelation[]
>  :   +- *Project [2 AS b#14]
>  :  +- Scan OneRowRelation[]
>  +- *Project [3 AS c#15]
> +- Scan OneRowRelation[]
> {code}
> Only one distinct should be necessary. This makes a bunch of unions slower 
> than a bunch of union alls followed by a distinct.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10835) Word2Vec should accept non-null string array, in addition to existing null string array

2016-09-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10835:
--
Summary: Word2Vec should accept non-null string array, in addition to 
existing null string array  (was: [SPARK-10835] [ML] Word2Vec should accept 
non-null string array, in addition to existing null string array)

> Word2Vec should accept non-null string array, in addition to existing null 
> string array
> ---
>
> Key: SPARK-10835
> URL: https://issues.apache.org/jira/browse/SPARK-10835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Sumit Chawla
>Assignee: yuhao yang
>Priority: Minor
>
> Currently the output type of NGram is Array(String, false), which is not 
> compatible with LDA, since its input type is Array(String, true). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10835) [SPARK-10835] [ML] Word2Vec should accept non-null string array, in addition to existing null string array

2016-09-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10835:
--
Summary: [SPARK-10835] [ML] Word2Vec should accept non-null string array, 
in addition to existing null string array  (was: Change Output of NGram to 
Array(String, True))

> [SPARK-10835] [ML] Word2Vec should accept non-null string array, in addition 
> to existing null string array
> --
>
> Key: SPARK-10835
> URL: https://issues.apache.org/jira/browse/SPARK-10835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Sumit Chawla
>Assignee: yuhao yang
>Priority: Minor
>
> Currently the output type of NGram is Array(String, false), which is not 
> compatible with LDA, since its input type is Array(String, true). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10835) Word2Vec should accept non-null string array, in addition to existing null string array

2016-09-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10835:
--
Shepherd: Sean Owen  (was: Joseph K. Bradley)

> Word2Vec should accept non-null string array, in addition to existing null 
> string array
> ---
>
> Key: SPARK-10835
> URL: https://issues.apache.org/jira/browse/SPARK-10835
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Sumit Chawla
>Assignee: yuhao yang
>Priority: Minor
>
> Currently the output type of NGram is Array(String, false), which is not 
> compatible with LDA, since its input type is Array(String, true). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17657) Disallow Users to Change Table Type

2016-09-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518556#comment-15518556
 ] 

Apache Spark commented on SPARK-17657:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15230

> Disallow Users to Change Table Type 
> 
>
> Key: SPARK-17657
> URL: https://issues.apache.org/jira/browse/SPARK-17657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>
> Hive allows users to change the table type from `Managed` to `External` or 
> from `External` to `Managed` by altering the table property `EXTERNAL`. See 
> the JIRA: https://issues.apache.org/jira/browse/HIVE-1329
> So far, Spark SQL does not correctly support it, although users can still do 
> it; many assumptions are broken in the implementation. Thus, this PR is to 
> disallow users from doing it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17657) Disallow Users to Change Table Type

2016-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17657:


Assignee: (was: Apache Spark)

> Disallow Users to Change Table Type 
> 
>
> Key: SPARK-17657
> URL: https://issues.apache.org/jira/browse/SPARK-17657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>
> Hive allows users to change the table type from `Managed` to `External` or 
> from `External` to `Managed` by altering the table property `EXTERNAL`. See 
> the JIRA: https://issues.apache.org/jira/browse/HIVE-1329
> So far, Spark SQL does not correctly support it, although users can still do 
> it; many assumptions are broken in the implementation. Thus, this PR is to 
> disallow users from doing it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17657) Disallow Users to Change Table Type

2016-09-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17657:


Assignee: Apache Spark

> Disallow Users to Change Table Type 
> 
>
> Key: SPARK-17657
> URL: https://issues.apache.org/jira/browse/SPARK-17657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Hive allows users to change the table type from `Managed` to `External` or 
> from `External` to `Managed` by altering the table property `EXTERNAL`. See 
> the JIRA: https://issues.apache.org/jira/browse/HIVE-1329
> So far, Spark SQL does not correctly support it, although users can still do 
> it; many assumptions are broken in the implementation. Thus, this PR is to 
> disallow users from doing it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17657) Disallow Users to Change Table Type

2016-09-24 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17657:
---

 Summary: Disallow Users to Change Table Type 
 Key: SPARK-17657
 URL: https://issues.apache.org/jira/browse/SPARK-17657
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 2.1.0
Reporter: Xiao Li


Hive allows users to change the table type from `Managed` to `External` or from 
`External` to `Managed` by altering the table property `EXTERNAL`. See the JIRA: 
https://issues.apache.org/jira/browse/HIVE-1329

So far, Spark SQL does not correctly support it, although users can still do it; 
many assumptions are broken in the implementation. Thus, this PR is to disallow 
users from doing it. 
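For illustration, a hedged sketch of the statement in question (the table name is made up):

{code}
spark.sql("CREATE TABLE t (a INT)")
// Flips the table from managed to external via the Hive EXTERNAL table property;
// this is the kind of change that Spark SQL should now reject.
spark.sql("ALTER TABLE t SET TBLPROPERTIES ('EXTERNAL' = 'TRUE')")
{code}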




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17656) Decide on the variant of @scala.annotation.varargs and use consistently

2016-09-24 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-17656:
---

 Summary: Decide on the variant of @scala.annotation.varargs and 
use consistently
 Key: SPARK-17656
 URL: https://issues.apache.org/jira/browse/SPARK-17656
 Project: Spark
  Issue Type: Improvement
Affects Versions: 2.0.2
Reporter: Jacek Laskowski


After the [discussion at 
dev@spark|http://apache-spark-developers-list.1001551.n3.nabble.com/scala-annotation-varargs-or-root-scala-annotation-varargs-td18898.html]
 it appears there's a consensus to review the use of 
{{@scala.annotation.varargs}} throughout the codebase and use one variant and 
use it consistently.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8987) Increase test coverage of DAGScheduler

2016-09-24 Thread OuyangJin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518537#comment-15518537
 ] 

OuyangJin commented on SPARK-8987:
--

 I'd like to work on this

> Increase test coverage of DAGScheduler
> --
>
> Key: SPARK-8987
> URL: https://issues.apache.org/jira/browse/SPARK-8987
> Project: Spark
>  Issue Type: Umbrella
>  Components: Scheduler, Tests
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>
> DAGScheduler is one of the most monstrous pieces of code in Spark. Every time 
> someone changes something there, something like the following happens:
> (1) Someone pings a committer
> (2) The committer pings a scheduler maintainer
> (3) Scheduler maintainer correctly points out bugs in the patch
> (4) Author of patch fixes bug but introduces more bugs
> (5) Repeat steps 3 - 4 N times
> (6) Other committers / contributors jump in and start debating
> (7) The patch goes stale for months
> All of this happens because no one, including the committers, has high 
> confidence that a particular change doesn't break some corner case in the 
> scheduler. I believe one of the main issues is the lack of sufficient test 
> coverage, which is not a luxury but a necessity for logic as complex as the 
> DAGScheduler.
> As of the writing of this JIRA, DAGScheduler has ~1500 lines, while the 
> DAGSchedulerSuite only has ~900 lines. I would argue that the suite line 
> count should actually be many multiples of that of the original code.
> If you wish to work on this, let me know and I will assign it to you. Anyone 
> is welcome. :)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17210) sparkr.zip is not distributed to executors when run sparkr in RStudio

2016-09-24 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15518507#comment-15518507
 ] 

Felix Cheung commented on SPARK-17210:
--

Got it, sorry about that, I should have noticed.

> sparkr.zip is not distributed to executors when run sparkr in RStudio
> -
>
> Key: SPARK-17210
> URL: https://issues.apache.org/jira/browse/SPARK-17210
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
> Fix For: 2.0.2, 2.1.0
>
>
> Here's the code to reproduce this issue. 
> {code}
> Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
> .libPaths(c(file.path(Sys.getenv(), "R", "lib"), .libPaths()))
> library(SparkR)
> sparkR.session(master="yarn-client", sparkConfig = 
> list(spark.executor.instances="1"))
> df <- as.DataFrame(mtcars)
> head(df)
> {code}
> And this is the exception in executor log.
> {noformat}
> 16/08/24 15:33:45 INFO BufferedStreamThread: Fatal error: cannot open file 
> '/Users/jzhang/Temp/hadoop_tmp/nm-local-dir/usercache/jzhang/appcache/application_1471846125517_0022/container_1471846125517_0022_01_02/sparkr/SparkR/worker/daemon.R':
>  No such file or directory
> 16/08/24 15:33:55 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 6)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:404)
> at java.net.ServerSocket.implAccept(ServerSocket.java:545)
> at java.net.ServerSocket.accept(ServerSocket.java:513)
> at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:367)
> at org.apache.spark.api.r.RRunner.compute(RRunner.scala:69)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org