[jira] [Comment Edited] (SPARK-13298) DAG visualization does not render correctly for jobs

2016-02-15 Thread Lucas Woltmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148232#comment-15148232
 ] 

Lucas Woltmann edited comment on SPARK-13298 at 2/16/16 7:57 AM:
-

Looks like .cache() breaks it.

DAG without cache(): !dag_full.png!
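For context, a hedged minimal sketch of the kind of job that seems to trigger this (purely illustrative; the input path and transformations are assumptions, not the reporter's actual code): a multi-stage RDD job whose intermediate result is cached before the action, so the cached RDD appears as its own cluster in the generated DOT graph.

{code}
// Hypothetical reproduction shape, not the reporter's job: cache an intermediate
// RDD in a multi-stage job, then open the job's DAG in the Web UI.
val lines  = sc.textFile("hdfs:///logs/input")        // illustrative path
val parsed = lines.map(_.split(",")).cache()          // .cache() appears to be the trigger
val counts = parsed.map(fields => (fields(0), 1)).reduceByKey(_ + _)
counts.count()                                        // run the job, then view its DAG
{code}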


was (Author: telemort):
Looks like .cache() breaks it.

DAG without cache(): !dag_full-png!

> DAG visualization does not render correctly for jobs
> 
>
> Key: SPARK-13298
> URL: https://issues.apache.org/jira/browse/SPARK-13298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Lucas Woltmann
> Attachments: dag_full.png, dag_viz.png
>
>
> Whenever I try to open the DAG for a job, I get something like this:
> !dag_viz.png!
> Obviously the svg doesn't get resized, but if I resize it manually, only the 
> first of four stages in the DAG is shown. 
> The js console says (variable v is null in peg$c34):
> {code:javascript}
> Uncaught TypeError: Cannot read property '3' of null
>   peg$c34 @ graphlib-dot.min.js:1
>   peg$parseidDef @ graphlib-dot.min.js:1
>   peg$parseaList @ graphlib-dot.min.js:1
>   peg$parseattrListBlock @ graphlib-dot.min.js:1
>   peg$parseattrList @ graphlib-dot.min.js:1
>   peg$parsenodeStmt @ graphlib-dot.min.js:1
>   peg$parsestmt @ graphlib-dot.min.js:1
>   peg$parsestmtList @ graphlib-dot.min.js:1
>   peg$parsesubgraphStmt @ graphlib-dot.min.js:1
>   peg$parsenodeIdOrSubgraph @ graphlib-dot.min.js:1
>   peg$parseedgeStmt @ graphlib-dot.min.js:1
>   peg$parsestmt @ graphlib-dot.min.js:1
>   peg$parsestmtList @ graphlib-dot.min.js:1
>   peg$parsesubgraphStmt @ graphlib-dot.min.js:1
>   peg$parsenodeIdOrSubgraph @ graphlib-dot.min.js:1
>   peg$parseedgeStmt @ graphlib-dot.min.js:1
>   peg$parsestmt @ graphlib-dot.min.js:1
>   peg$parsestmtList @ graphlib-dot.min.js:1
>   peg$parsegraphStmt @ graphlib-dot.min.js:1
>   parse @ graphlib-dot.min.js:2
>   readOne @ graphlib-dot.min.js:2
>   renderDot @ spark-dag-viz.js:281
>   (anonymous function) @ spark-dag-viz.js:248
>   (anonymous function) @ d3.min.js:3
>   Y @ d3.min.js:1
>   _a.each @ d3.min.js:3
>   renderDagVizForJob @ spark-dag-viz.js:207
>   renderDagViz @ spark-dag-viz.js:163
>   toggleDagViz @ spark-dag-viz.js:100
>   onclick @ ?id=2:153
> {code}
> (tested in Firefox 44.0.1 and Chromium 48.0.2564.103)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13298) DAG visualization does not render correctly for jobs

2016-02-15 Thread Lucas Woltmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lucas Woltmann updated SPARK-13298:
---
Attachment: dag_full.png

> DAG visualization does not render correctly for jobs
> 
>
> Key: SPARK-13298
> URL: https://issues.apache.org/jira/browse/SPARK-13298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Lucas Woltmann
> Attachments: dag_full.png, dag_viz.png
>
>
> Whenever I try to open the DAG for a job, I get something like this:
> !dag_viz.png!
> Obviously the svg doesn't get resized, but if I resize it manually, only the 
> first of four stages in the DAG is shown. 
> The js console says (variable v is null in peg$c34):
> {code:javascript}
> Uncaught TypeError: Cannot read property '3' of null
>   peg$c34 @ graphlib-dot.min.js:1
>   peg$parseidDef @ graphlib-dot.min.js:1
>   peg$parseaList @ graphlib-dot.min.js:1
>   peg$parseattrListBlock @ graphlib-dot.min.js:1
>   peg$parseattrList @ graphlib-dot.min.js:1
>   peg$parsenodeStmt @ graphlib-dot.min.js:1
>   peg$parsestmt @ graphlib-dot.min.js:1
>   peg$parsestmtList @ graphlib-dot.min.js:1
>   peg$parsesubgraphStmt @ graphlib-dot.min.js:1
>   peg$parsenodeIdOrSubgraph @ graphlib-dot.min.js:1
>   peg$parseedgeStmt @ graphlib-dot.min.js:1
>   peg$parsestmt @ graphlib-dot.min.js:1
>   peg$parsestmtList @ graphlib-dot.min.js:1
>   peg$parsesubgraphStmt @ graphlib-dot.min.js:1
>   peg$parsenodeIdOrSubgraph @ graphlib-dot.min.js:1
>   peg$parseedgeStmt @ graphlib-dot.min.js:1
>   peg$parsestmt @ graphlib-dot.min.js:1
>   peg$parsestmtList @ graphlib-dot.min.js:1
>   peg$parsegraphStmt @ graphlib-dot.min.js:1
>   parse @ graphlib-dot.min.js:2
>   readOne @ graphlib-dot.min.js:2
>   renderDot @ spark-dag-viz.js:281
>   (anonymous function) @ spark-dag-viz.js:248
>   (anonymous function) @ d3.min.js:3
>   Y @ d3.min.js:1
>   _a.each @ d3.min.js:3
>   renderDagVizForJob @ spark-dag-viz.js:207
>   renderDagViz @ spark-dag-viz.js:163
>   toggleDagViz @ spark-dag-viz.js:100
>   onclick @ ?id=2:153
> {code}
> (tested in Firefox 44.0.1 and Chromium 48.0.2564.103)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13298) DAG visualization does not render correctly for jobs

2016-02-15 Thread Lucas Woltmann (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148232#comment-15148232
 ] 

Lucas Woltmann commented on SPARK-13298:


Looks like .cache() breaks it.

DAG without cache(): !dag_full-png!

> DAG visualization does not render correctly for jobs
> 
>
> Key: SPARK-13298
> URL: https://issues.apache.org/jira/browse/SPARK-13298
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Lucas Woltmann
> Attachments: dag_viz.png
>
>
> Whenever I try to open the DAG for a job, I get something like this:
> !dag_viz.png!
> Obviously the svg doesn't get resized, but if I resize it manually, only the 
> first of four stages in the DAG is shown. 
> The js console says (variable v is null in peg$c34):
> {code:javascript}
> Uncaught TypeError: Cannot read property '3' of null
>   peg$c34 @ graphlib-dot.min.js:1
>   peg$parseidDef @ graphlib-dot.min.js:1
>   peg$parseaList @ graphlib-dot.min.js:1
>   peg$parseattrListBlock @ graphlib-dot.min.js:1
>   peg$parseattrList @ graphlib-dot.min.js:1
>   peg$parsenodeStmt @ graphlib-dot.min.js:1
>   peg$parsestmt @ graphlib-dot.min.js:1
>   peg$parsestmtList @ graphlib-dot.min.js:1
>   peg$parsesubgraphStmt @ graphlib-dot.min.js:1
>   peg$parsenodeIdOrSubgraph @ graphlib-dot.min.js:1
>   peg$parseedgeStmt @ graphlib-dot.min.js:1
>   peg$parsestmt @ graphlib-dot.min.js:1
>   peg$parsestmtList @ graphlib-dot.min.js:1
>   peg$parsesubgraphStmt @ graphlib-dot.min.js:1
>   peg$parsenodeIdOrSubgraph @ graphlib-dot.min.js:1
>   peg$parseedgeStmt @ graphlib-dot.min.js:1
>   peg$parsestmt @ graphlib-dot.min.js:1
>   peg$parsestmtList @ graphlib-dot.min.js:1
>   peg$parsegraphStmt @ graphlib-dot.min.js:1
>   parse @ graphlib-dot.min.js:2
>   readOne @ graphlib-dot.min.js:2
>   renderDot @ spark-dag-viz.js:281
>   (anonymous function) @ spark-dag-viz.js:248
>   (anonymous function) @ d3.min.js:3
>   Y @ d3.min.js:1
>   _a.each @ d3.min.js:3
>   renderDagVizForJob @ spark-dag-viz.js:207
>   renderDagViz @ spark-dag-viz.js:163
>   toggleDagViz @ spark-dag-viz.js:100
>   onclick @ ?id=2:153
> {code}
> (tested in Firefox 44.0.1 and Chromium 48.0.2564.103)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13183) Bytebuffers occupy a large amount of heap memory

2016-02-15 Thread dylanzhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148205#comment-15148205
 ] 

dylanzhou edited comment on SPARK-13183 at 2/16/16 7:48 AM:


@Sean Owen Maybe it is a memory leak problem; the job eventually runs out of heap 
memory with the error java.lang.OutOfMemoryError: Java heap space. When I try to 
increase driver memory, the streaming program just runs a little longer; in my 
opinion the byte[] objects cannot be reclaimed by the GC. Can you give me some 
advice? Here is my question, thank you!
http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html



was (Author: dylanzhou):
@Sean Owen Maybe it is a memory leak problem; the job eventually runs out of heap 
memory with the error java.lang.OutOfMemoryError: Java heap space. When I try to 
increase driver memory, the streaming program just runs a little longer; in my 
opinion the byte[] objects cannot be reclaimed by the GC. Here is my question; any 
advice would be appreciated, thank you!
http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html


> Bytebuffers occupy a large amount of heap memory
> 
>
> Key: SPARK-13183
> URL: https://issues.apache.org/jira/browse/SPARK-13183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: dylanzhou
>
> When I use Spark Streaming and Spark SQL and cache the table, I found that the old 
> generation grows very fast and full GCs become very frequent; after running for a 
> while the job runs out of memory. After analyzing the heap, I found a large number 
> of org.apache.spark.sql.columnar.ColumnBuilder[38] @ 0xd022a0b8 objects taking up 
> 90% of the space; looking at the source, these are HeapByteBuffer allocations. I 
> don't know why these objects are not released; they just sit there waiting for GC 
> to reclaim them. If I do not cache the table the problem does not occur, but I 
> need to query this table repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13183) Bytebuffers occupy a large amount of heap memory

2016-02-15 Thread dylanzhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148205#comment-15148205
 ] 

dylanzhou edited comment on SPARK-13183 at 2/16/16 7:46 AM:


@Sean Owen Maybe it is a memory leak problem; the job eventually runs out of heap 
memory with the error java.lang.OutOfMemoryError: Java heap space. When I try to 
increase driver memory, the streaming program just runs a little longer; in my 
opinion the byte[] objects cannot be reclaimed by the GC. Here is my question; any 
advice would be appreciated, thank you!
http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html



was (Author: dylanzhou):
There is a memory leak problem; the job eventually runs out of heap memory with the 
error java.lang.OutOfMemoryError: Java heap space. When I try to increase driver 
memory, the streaming program just runs a little longer; in my opinion the byte[] 
objects cannot be reclaimed by the GC. Here is my question; any advice would be 
appreciated, thank you!
http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html


> Bytebuffers occupy a large amount of heap memory
> 
>
> Key: SPARK-13183
> URL: https://issues.apache.org/jira/browse/SPARK-13183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: dylanzhou
>
> When I use Spark Streaming and Spark SQL and cache the table, I found that the old 
> generation grows very fast and full GCs become very frequent; after running for a 
> while the job runs out of memory. After analyzing the heap, I found a large number 
> of org.apache.spark.sql.columnar.ColumnBuilder[38] @ 0xd022a0b8 objects taking up 
> 90% of the space; looking at the source, these are HeapByteBuffer allocations. I 
> don't know why these objects are not released; they just sit there waiting for GC 
> to reclaim them. If I do not cache the table the problem does not occur, but I 
> need to query this table repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13183) Bytebuffers occupy a large amount of heap memory

2016-02-15 Thread dylanzhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148205#comment-15148205
 ] 

dylanzhou edited comment on SPARK-13183 at 2/16/16 7:45 AM:


There is a memory leak problem; the job eventually runs out of heap memory with the 
error java.lang.OutOfMemoryError: Java heap space. When I try to increase driver 
memory, the streaming program just runs a little longer; in my opinion the byte[] 
objects cannot be reclaimed by the GC. Here is my question; any advice would be 
appreciated, thank you!
http://apache-spark-user-list.1001560.n3.nabble.com/the-memory-leak-problem-of-use-sparkstreamimg-and-sparksql-with-kafka-in-spark-1-4-1-td26231.html



was (Author: dylanzhou):
There is a memory leak problem; the job eventually runs out of heap memory with the 
error java.lang.OutOfMemoryError: Java heap space. When I try to increase driver 
memory, the streaming program just runs a little longer; in my opinion the byte[] 
objects cannot be reclaimed by the GC. Here is my program; any advice would be 
appreciated, thank you!
import kafka.serializer.StringDecoder

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LogAnalyzerStreamingSQL {

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Log Analyzer Streaming in Scala")
    val sc = new SparkContext(sparkConf)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // 30-second batch interval
    val ssc = new StreamingContext(sc, Seconds(30))
    val topicSet = Set("applogs")
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "192.168.100.1:9092,192.168.100.2:9092,192.168.100.3:9092",
      "group.id" -> "app_group",
      "serializer.class" -> "kafka.serializer.StringEncoder")
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicSet)

    kafkaStream.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        val jsonRdd = rdd.map(_._2)
        val df = sqlContext.read.json(jsonRdd)
        df.registerTempTable("applogs")
        sqlContext.cacheTable("applogs")

        // Calculate statistics based on the content size.
        sqlContext
          .sql("SELECT SUM(contentSize), COUNT(*), MIN(contentSize), MAX(contentSize) FROM applogs")
          .show()

        // Compute response code counts.
        val responseCodeToCount = sqlContext
          .sql("SELECT responseCode, COUNT(*) FROM applogs GROUP BY responseCode")
          .map(row => (row.getInt(0), row.getLong(1)))
          .collect()

        // Any IP address that has accessed the server more than 10 times.
        val ipAddresses = sqlContext
          .sql("SELECT ipAddress, COUNT(*) AS total FROM applogs GROUP BY ipAddress HAVING total > 10")
          .map(row => row.getString(0))
          .take(100)

        // Top 10 endpoints by hit count.
        val topEndpoints = sqlContext
          .sql("SELECT endpoint, COUNT(*) AS total FROM applogs GROUP BY endpoint ORDER BY total DESC LIMIT 10")
          .map(row => (row.getString(0), row.getLong(1)))
          .collect()

        // ...many more SQL statements like these

        sqlContext.uncacheTable("applogs")
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

> Bytebuffers occupy a large amount of heap memory
> 
>
> Key: SPARK-13183
> URL: https://issues.apache.org/jira/browse/SPARK-13183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: dylanzhou
>
> When I use Spark Streaming and Spark SQL and cache the table, I found that the old 
> generation grows very fast and full GCs become very frequent; after running for a 
> while the job runs out of memory. After analyzing the heap, I found a large number 
> of org.apache.spark.sql.columnar.ColumnBuilder[38] @ 0xd022a0b8 objects taking up 
> 90% of the space; looking at the source, these are HeapByteBuffer allocations. I 
> don't know why these objects are not released; they just sit there waiting for GC 
> to reclaim them. If I do not cache the table the problem does not occur, but I 
> need to query this table repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13221) GroupingSets Returns an Incorrect Results

2016-02-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-13221.
-
Resolution: Fixed

> GroupingSets Returns an Incorrect Results
> -
>
> Key: SPARK-13221
> URL: https://issues.apache.org/jira/browse/SPARK-13221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The following query returns a wrong result:
> {code}
> sql("select course, sum(earnings) as sum from courseSales group by course, 
> earnings" +
>  " grouping sets((), (course), (course, earnings))" +
>  " order by course, sum").show()
> {code}
> Before the fix, the results are like
> {code}
> [null,null]
> [Java,null]
> [Java,2.0]
> [Java,3.0]
> [dotNET,null]
> [dotNET,5000.0]
> [dotNET,1.0]
> [dotNET,48000.0]
> {code}
> After the fix, the results are corrected:
> {code}
> [null,113000.0]
> [Java,2.0]
> [Java,3.0]
> [Java,5.0]
> [dotNET,5000.0]
> [dotNET,1.0]
> [dotNET,48000.0]
> [dotNET,63000.0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13221) GroupingSets Returns an Incorrect Results

2016-02-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-13221:
-

> GroupingSets Returns an Incorrect Results
> -
>
> Key: SPARK-13221
> URL: https://issues.apache.org/jira/browse/SPARK-13221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The following query returns a wrong result:
> {code}
> sql("select course, sum(earnings) as sum from courseSales group by course, 
> earnings" +
>  " grouping sets((), (course), (course, earnings))" +
>  " order by course, sum").show()
> {code}
> Before the fix, the results are like
> {code}
> [null,null]
> [Java,null]
> [Java,2.0]
> [Java,3.0]
> [dotNET,null]
> [dotNET,5000.0]
> [dotNET,1.0]
> [dotNET,48000.0]
> {code}
> After the fix, the results are corrected:
> {code}
> [null,113000.0]
> [Java,2.0]
> [Java,3.0]
> [Java,5.0]
> [dotNET,5000.0]
> [dotNET,1.0]
> [dotNET,48000.0]
> [dotNET,63000.0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13221) GroupingSets Returns an Incorrect Results

2016-02-15 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-13221.
-
   Resolution: Resolved
Fix Version/s: 2.0.0

> GroupingSets Returns an Incorrect Results
> -
>
> Key: SPARK-13221
> URL: https://issues.apache.org/jira/browse/SPARK-13221
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Xiao Li
>Priority: Critical
> Fix For: 2.0.0
>
>
> The following query returns a wrong result:
> {code}
> sql("select course, sum(earnings) as sum from courseSales group by course, 
> earnings" +
>  " grouping sets((), (course), (course, earnings))" +
>  " order by course, sum").show()
> {code}
> Before the fix, the results are like
> {code}
> [null,null]
> [Java,null]
> [Java,2.0]
> [Java,3.0]
> [dotNET,null]
> [dotNET,5000.0]
> [dotNET,1.0]
> [dotNET,48000.0]
> {code}
> After the fix, the results are corrected:
> {code}
> [null,113000.0]
> [Java,2.0]
> [Java,3.0]
> [Java,5.0]
> [dotNET,5000.0]
> [dotNET,1.0]
> [dotNET,48000.0]
> [dotNET,63000.0]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148212#comment-15148212
 ] 

Xiao Li commented on SPARK-13333:
-

I tried join, intersect, and except in 2.0. They all work fine. For example, 
{code}
val df4 = df1.join(df2)
df4.explain(true)
println("DF4")
df4.show()
{code}

The problem exists only if we use {{unionAll}}. : ) 

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13183) Bytebuffers occupy a large amount of heap memory

2016-02-15 Thread dylanzhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148205#comment-15148205
 ] 

dylanzhou edited comment on SPARK-13183 at 2/16/16 7:36 AM:


There is a memory leak problem; the job eventually runs out of heap memory with the 
error java.lang.OutOfMemoryError: Java heap space. When I try to increase driver 
memory, the streaming program just runs a little longer; in my opinion the byte[] 
objects cannot be reclaimed by the GC. Here is my program; any advice would be 
appreciated, thank you!
import kafka.serializer.StringDecoder

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LogAnalyzerStreamingSQL {

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Log Analyzer Streaming in Scala")
    val sc = new SparkContext(sparkConf)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // 30-second batch interval
    val ssc = new StreamingContext(sc, Seconds(30))
    val topicSet = Set("applogs")
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "192.168.100.1:9092,192.168.100.2:9092,192.168.100.3:9092",
      "group.id" -> "app_group",
      "serializer.class" -> "kafka.serializer.StringEncoder")
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicSet)

    kafkaStream.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        val jsonRdd = rdd.map(_._2)
        val df = sqlContext.read.json(jsonRdd)
        df.registerTempTable("applogs")
        sqlContext.cacheTable("applogs")

        // Calculate statistics based on the content size.
        sqlContext
          .sql("SELECT SUM(contentSize), COUNT(*), MIN(contentSize), MAX(contentSize) FROM applogs")
          .show()

        // Compute response code counts.
        val responseCodeToCount = sqlContext
          .sql("SELECT responseCode, COUNT(*) FROM applogs GROUP BY responseCode")
          .map(row => (row.getInt(0), row.getLong(1)))
          .collect()

        // Any IP address that has accessed the server more than 10 times.
        val ipAddresses = sqlContext
          .sql("SELECT ipAddress, COUNT(*) AS total FROM applogs GROUP BY ipAddress HAVING total > 10")
          .map(row => row.getString(0))
          .take(100)

        // Top 10 endpoints by hit count.
        val topEndpoints = sqlContext
          .sql("SELECT endpoint, COUNT(*) AS total FROM applogs GROUP BY endpoint ORDER BY total DESC LIMIT 10")
          .map(row => (row.getString(0), row.getLong(1)))
          .collect()

        // ...many more SQL statements like these

        sqlContext.uncacheTable("applogs")
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }
}


was (Author: dylanzhou):
There is indeed a memory leak problem; the heap memory is eventually exhausted with 
the error java.lang.OutOfMemoryError: Java heap space. When I try to increase driver 
memory, the streaming program just runs normally for a little longer; in my view the 
byte[] objects cannot be reclaimed by the GC. Here is my program; please advise 
whether the problem is in my code. Thank you!
import kafka.serializer.StringDecoder

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LogAnalyzerStreamingSQL {

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Log Analyzer Streaming in Scala")
    val sc = new SparkContext(sparkConf)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // 30-second batch interval
    val ssc = new StreamingContext(sc, Seconds(30))
    val topicSet = Set("applogs")
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "192.168.100.1:9092,192.168.100.2:9092,192.168.100.3:9092",
      "group.id" -> "app_group",
      "serializer.class" -> "kafka.serializer.StringEncoder")
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicSet)

    kafkaStream.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        val jsonRdd = rdd.map(_._2)
        val df = sqlContext.read.json(jsonRdd)
        df.registerTempTable("applogs")
        sqlContext.cacheTable("applogs")

        // Calculate statistics based on the content size.
        sqlContext
          .sql("SELECT SUM(contentSize), COUNT(*), MIN(contentSize), MAX(contentSize) FROM applogs")
          .show()

        // Compute response code counts.
        val responseCodeToCount = sqlContext
          .sql("SELECT responseCode, COUNT(*) FROM applogs GROUP BY responseCode")
          .map(row => (row.getInt(0), row.getLong(1)))
          .collect()

        // Any IP address that has accessed the server more than 10 times.
        val ipAddresses = sqlContext
          .sql("SELECT ipAddress, COUNT(*) AS total FROM applogs GROUP BY ipAddress HAVING total > 10")
          .map(row => row.getString(0))
          .take(100)

        // Top 10 endpoints by hit count.
        val topEndpoints = sqlContext
          .sql("SELECT endpoint, COUNT(*) AS total FROM applogs GROUP BY endpoint ORDER BY total DESC LIMIT 10")

[jira] [Reopened] (SPARK-13183) Bytebuffers occupy a large amount of heap memory

2016-02-15 Thread dylanzhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dylanzhou reopened SPARK-13183:
---

There is indeed a memory leak problem; the heap memory is eventually exhausted with 
the error java.lang.OutOfMemoryError: Java heap space. When I try to increase driver 
memory, the streaming program just runs normally for a little longer; in my view the 
byte[] objects cannot be reclaimed by the GC. Here is my program; please advise 
whether the problem is in my code. Thank you!
import kafka.serializer.StringDecoder

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object LogAnalyzerStreamingSQL {

  def main(args: Array[String]) {
    val sparkConf = new SparkConf().setAppName("Log Analyzer Streaming in Scala")
    val sc = new SparkContext(sparkConf)

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // 30-second batch interval
    val ssc = new StreamingContext(sc, Seconds(30))
    val topicSet = Set("applogs")
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "192.168.100.1:9092,192.168.100.2:9092,192.168.100.3:9092",
      "group.id" -> "app_group",
      "serializer.class" -> "kafka.serializer.StringEncoder")
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicSet)

    kafkaStream.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        val jsonRdd = rdd.map(_._2)
        val df = sqlContext.read.json(jsonRdd)
        df.registerTempTable("applogs")
        sqlContext.cacheTable("applogs")

        // Calculate statistics based on the content size.
        sqlContext
          .sql("SELECT SUM(contentSize), COUNT(*), MIN(contentSize), MAX(contentSize) FROM applogs")
          .show()

        // Compute response code counts.
        val responseCodeToCount = sqlContext
          .sql("SELECT responseCode, COUNT(*) FROM applogs GROUP BY responseCode")
          .map(row => (row.getInt(0), row.getLong(1)))
          .collect()

        // Any IP address that has accessed the server more than 10 times.
        val ipAddresses = sqlContext
          .sql("SELECT ipAddress, COUNT(*) AS total FROM applogs GROUP BY ipAddress HAVING total > 10")
          .map(row => row.getString(0))
          .take(100)

        // Top 10 endpoints by hit count.
        val topEndpoints = sqlContext
          .sql("SELECT endpoint, COUNT(*) AS total FROM applogs GROUP BY endpoint ORDER BY total DESC LIMIT 10")
          .map(row => (row.getString(0), row.getLong(1)))
          .collect()

        // ...many more SQL statements like these

        sqlContext.uncacheTable("applogs")
      }
    })

    ssc.start()
    ssc.awaitTermination()
  }
}

> Bytebuffers occupy a large amount of heap memory
> 
>
> Key: SPARK-13183
> URL: https://issues.apache.org/jira/browse/SPARK-13183
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: dylanzhou
>
> When I use Spark Streaming and Spark SQL and cache the table, I found that the old 
> generation grows very fast and full GCs become very frequent; after running for a 
> while the job runs out of memory. After analyzing the heap, I found a large number 
> of org.apache.spark.sql.columnar.ColumnBuilder[38] @ 0xd022a0b8 objects taking up 
> 90% of the space; looking at the source, these are HeapByteBuffer allocations. I 
> don't know why these objects are not released; they just sit there waiting for GC 
> to reclaim them. If I do not cache the table the problem does not occur, but I 
> need to query this table repeatedly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13336) Add non-numerical summaries to DataFrame.describe

2016-02-15 Thread Ian Hellstrom (JIRA)
Ian Hellstrom created SPARK-13336:
-

 Summary: Add non-numerical summaries to DataFrame.describe
 Key: SPARK-13336
 URL: https://issues.apache.org/jira/browse/SPARK-13336
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Ian Hellstrom
Priority: Minor


The DataFrame.describe method currently only returns statistics for numerical 
columns. It would be nice to see generic information that is non-statistical in 
nature yet often important to see, especially when assessing the quality of 
data: 
* MIN/MAX for any column, not just numerical ones
* (approximate) DISTINCT COUNT (or percentage of all rows or sample)
* (approximate) NULL COUNT (or percentage of all rows or sample)
* (approximate) MODE (i.e. most common value of all rows or sample)
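As a point of reference, a hedged sketch of how these summaries can be computed today with plain DataFrame aggregations (illustrative only, not a proposed implementation of describe(); the column name is an assumption):

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Per-column extended summary using existing aggregate functions.
def extendedSummary(df: DataFrame, c: String): DataFrame =
  df.agg(
    min(col(c)).as("min"),                                    // MIN works for non-numerical columns too
    max(col(c)).as("max"),                                    // MAX works for non-numerical columns too
    approxCountDistinct(col(c)).as("approx_distinct_count"),  // approximate DISTINCT COUNT
    sum(when(col(c).isNull, 1).otherwise(0)).as("null_count") // NULL COUNT
  )

// extendedSummary(df, "someColumn").show()
{code}

MODE is the odd one out: it needs a groupBy-count-orderBy rather than a single aggregate expression, so it is omitted from the sketch.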



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148202#comment-15148202
 ] 

Xiao Li commented on SPARK-13333:
-

Interesting. This query has specified the seed. Thus, it should return the same 
result. 

{code}
== Physical Plan ==
Union
:- WholeStageCodegen
:  :  +- Project [_1#0 AS id#1,randn(12345) AS b#2]
:  : +- Filter (_1#0 = 0)
:  :+- INPUT
:  +- LocalTableScan [_1#0], [[0],[1]]
+- WholeStageCodegen
   :  +- Project [_1#0 AS id#1,randn(12345) AS b#2]
   : +- Filter (_1#0 = 0)
   :+- INPUT
   +- LocalTableScan [_1#0], [[0],[1]]
{code}

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC

2016-02-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148200#comment-15148200
 ] 

Maciej Bryński edited comment on SPARK-13283 at 2/16/16 7:30 AM:
-

Yep.
For MySQL this could look like this:
{code}
sb.append(s", `$name` $typ $nullable")
{code}

For other RDBMS:
{code}
sb.append(s", \"$name\" $typ $nullable")
{code}


was (Author: maver1ck):
Yep.
For MySQL this could look like this:
{code}
sb.append(s", `$name` $typ $nullable")
{code}

For other RDBMS:
{code}
sb.append(s", "$name" $typ $nullable")
{code}

> Spark doesn't escape column names when creating table on JDBC
> -
>
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> Hi,
> I have the following problem.
> I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I save it to a MySQL database I get this error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an 
> error in your SQL syntax; check the manual that corresponds to your MySQL 
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at 
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign 
> when creating the table:
> {code}
> `from`
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC

2016-02-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148200#comment-15148200
 ] 

Maciej Bryński commented on SPARK-13283:


Yep.
For MySQL this could look like this:
{code}
sb.append(s", `$name` $typ $nullable")
{code}

For other RDBMS:
{code}
sb.append(s", \"$name\" $typ $nullable")
{code}

> Spark doesn't escape column names when creating table on JDBC
> -
>
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> Hi,
> I have the following problem.
> I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I save it to a MySQL database I get this error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an 
> error in your SQL syntax; check the manual that corresponds to your MySQL 
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at 
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign 
> when creating the table:
> {code}
> `from`
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC

2016-02-15 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148196#comment-15148196
 ] 

Adrian Wang commented on SPARK-13283:
-

So the problem here is that "from" is a reserved word in MySQL, but we failed 
to keep the backticks around it, didn't we?

> Spark doesn't escape column names when creating table on JDBC
> -
>
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> Hi,
> I have the following problem.
> I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I save it to a MySQL database I get this error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an 
> error in your SQL syntax; check the manual that corresponds to your MySQL 
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at 
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign 
> when creating the table:
> {code}
> `from`
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13335) Optimize Data Frames collect_list and collect_set with declarative aggregates

2016-02-15 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-13335:
---
Priority: Minor  (was: Major)

> Optimize Data Frames collect_list and collect_set with declarative aggregates
> -
>
> Key: SPARK-13335
> URL: https://issues.apache.org/jira/browse/SPARK-13335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Matt Cheah
>Priority: Minor
>
> Based on discussion from SPARK-9301, we can optimize collect_set and 
> collect_list with declarative aggregate expressions, as opposed to using Hive 
> UDAFs. The problem with Hive UDAFs is that they require converting the data 
> items from catalyst types back to external types repeatedly. We can get 
> around this by implementing declarative aggregate expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13335) Optimize Data Frames collect_list and collect_set with declarative aggregates

2016-02-15 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148199#comment-15148199
 ] 

Matt Cheah commented on SPARK-13335:


I have a prototype patch for this and can submit a PR accordingly.

> Optimize Data Frames collect_list and collect_set with declarative aggregates
> -
>
> Key: SPARK-13335
> URL: https://issues.apache.org/jira/browse/SPARK-13335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Matt Cheah
>
> Based on discussion from SPARK-9301, we can optimize collect_set and 
> collect_list with declarative aggregate expressions, as opposed to using Hive 
> UDAFs. The problem with Hive UDAFs is that they require converting the data 
> items from catalyst types back to external types repeatedly. We can get 
> around this by implementing declarative aggregate expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13335) Optimize Data Frames collect_list and collect_set with declarative aggregates

2016-02-15 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-13335:
---
Summary: Optimize Data Frames collect_list and collect_set with declarative 
aggregates  (was: Optimize collect_list and collect_set with declarative 
aggregates)

> Optimize Data Frames collect_list and collect_set with declarative aggregates
> -
>
> Key: SPARK-13335
> URL: https://issues.apache.org/jira/browse/SPARK-13335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Matt Cheah
>
> Based on discussion from SPARK-9301, we can optimize collect_set and 
> collect_list with declarative aggregate expressions, as opposed to using Hive 
> UDAFs. The problem with Hive UDAFs is that they require converting the data 
> items from catalyst types back to external types repeatedly. We can get 
> around this by implementing declarative aggregate expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13335) Optimize collect_list and collect_set with declarative aggregates

2016-02-15 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-13335:
---
Component/s: SQL

> Optimize collect_list and collect_set with declarative aggregates
> -
>
> Key: SPARK-13335
> URL: https://issues.apache.org/jira/browse/SPARK-13335
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Matt Cheah
>
> Based on discussion from SPARK-9301, we can optimize collect_set and 
> collect_list with declarative aggregate expressions, as opposed to using Hive 
> UDAFs. The problem with Hive UDAFs is that they require converting the data 
> items from catalyst types back to external types repeatedly. We can get 
> around this by implementing declarative aggregate expressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13335) Optimize collect_list and collect_set with declarative aggregates

2016-02-15 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-13335:
--

 Summary: Optimize collect_list and collect_set with declarative 
aggregates
 Key: SPARK-13335
 URL: https://issues.apache.org/jira/browse/SPARK-13335
 Project: Spark
  Issue Type: Improvement
Reporter: Matt Cheah


Based on discussion from SPARK-9301, we can optimize collect_set and 
collect_list with declarative aggregate expressions, as opposed to using Hive 
UDAFs. The problem with Hive UDAFs is that they require converting the data 
items from catalyst types back to external types repeatedly. We can get around 
this by implementing declarative aggregate expressions.
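For readers unfamiliar with the functions being discussed, a short illustrative usage sketch (hypothetical DataFrame and column names; in 1.6 these calls are backed by the Hive UDAFs this issue proposes to replace):

{code}
import org.apache.spark.sql.functions.{collect_list, collect_set}

// Group events per user and gather the visited pages into arrays.
val summary = events.groupBy("userId").agg(
  collect_list("page").as("pages_visited"),   // keeps duplicates
  collect_set("page").as("distinct_pages"))   // de-duplicates
summary.show()
{code}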



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC

2016-02-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148189#comment-15148189
 ] 

Maciej Bryński commented on SPARK-13283:


No, it's not fixed.

The problem is in:
https://github.com/apache/spark/blob/0d42292f6a2dbe626e8f6a50e6c61dd79533f235/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L255

The trick is that different RDBMSs use different identifier-quoting characters. Most of 
them use ", but MySQL uses `.
So we have to add the quote character to the dialect (see the sketch below).
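A minimal sketch of the idea, assuming a hypothetical helper rather than the actual JdbcDialects API (the dialect detection by URL prefix is illustrative):

{code}
// Hypothetical helper: choose the identifier-quoting character per dialect.
def quoteIdentifier(name: String, jdbcUrl: String): String =
  if (jdbcUrl.startsWith("jdbc:mysql")) s"`$name`"   // MySQL quotes identifiers with backticks
  else "\"" + name + "\""                            // most other RDBMSs use double quotes

// Used when building the CREATE TABLE column list, e.g.:
// sb.append(s", ${quoteIdentifier(name, url)} $typ $nullable")
{code}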

> Spark doesn't escape column names when creating table on JDBC
> -
>
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> Hi,
> I have the following problem.
> I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I save it to a MySQL database I get this error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an 
> error in your SQL syntax; check the manual that corresponds to your MySQL 
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at 
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign 
> when creating the table:
> {code}
> `from`
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148187#comment-15148187
 ] 

Xiao Li commented on SPARK-13333:
-

The current solution also has a performance penalty; that has been mentioned in 
another JIRA: https://issues.apache.org/jira/browse/SPARK-2183

In the current logical plan, we lose the actual data sources for each member 
after deduplication. That is a separate issue. 

Let me ping [~rxin] [~marmbrus] [~yhuai]

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148186#comment-15148186
 ] 

Xiao Li commented on SPARK-13333:
-

You will get the right result if you cache the first DF
{code}
// Removing the following filter() call makes this give the expected result.
val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)).cache()
println("DF1")
df1.show()

val df2 = df1.select("id", "b")
println("DF2")
df2.show()  // same as df1.show(), as expected

val df3 = df1.unionAll(df2)
println("DF3")
df3.show()  // NOT two copies of df1, which is unexpected
{code}

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148182#comment-15148182
 ] 

Xiao Li commented on SPARK-13333:
-

This is a known issue. The same issue exists in a CTE with a non-deterministic 
expression. 

For example, the following query could return a wrong result. 

{code}
With q as (select * from testData limit 10) select * from q as q1 inner join q as q2 where q1.key = q2.key
{code}

We need to materialize it before doing self-unionAll/join/intersect-like 
operations. 

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-13333
> URL: https://issues.apache.org/jira/browse/SPARK-13333
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails, where randn produces the same results on the original DataFrame 
> and the copy before unionAll but fails to do so after unionAll.  Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC

2016-02-15 Thread Adrian Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148171#comment-15148171
 ] 

Adrian Wang commented on SPARK-13283:
-

See the comments on SPARK-13297; this has been fixed in the master branch.

> Spark doesn't escape column names when creating table on JDBC
> -
>
> Key: SPARK-13283
> URL: https://issues.apache.org/jira/browse/SPARK-13283
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> Hi,
> I have the following problem.
> I have a DF where one of the columns is named 'from'.
> {code}
> root
>  |-- from: decimal(20,0) (nullable = true)
> {code}
> When I save it to a MySQL database I get this error:
> {code}
> Py4JJavaError: An error occurred while calling o183.jdbc.
> : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an 
> error in your SQL syntax; check the manual that corresponds to your MySQL 
> server version for the right syntax to use near 'from DECIMAL(20,0) , ' at 
> line 1
> {code}
> I think the problem is that Spark doesn't escape column names with the ` sign 
> when creating the table:
> {code}
> `from`
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13334) ML KMeansModel/BisectingKMeansModel should be set parent

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13334:


Assignee: Apache Spark

> ML KMeansModel/BisectingKMeansModel should be set parent
> 
>
> Key: SPARK-13334
> URL: https://issues.apache.org/jira/browse/SPARK-13334
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should have their 
> parent set.
> I have also checked all the other Estimators; they set the parent correctly. 
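To illustrate what "set parent" means here, a hedged sketch (hypothetical data and column names; the assertion shows the behaviour the fix should guarantee):

{code}
import org.apache.spark.ml.clustering.KMeans

// Fit a model; after the fix, the returned model should reference the estimator
// that produced it via Model.parent.
val kmeans = new KMeans().setK(2).setFeaturesCol("features")
val model  = kmeans.fit(trainingDF)   // trainingDF is an assumed DataFrame with a "features" column
assert(model.hasParent && model.parent == kmeans)
{code}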



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13334) ML KMeansModel/BisectingKMeansModel should be set parent

2016-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148168#comment-15148168
 ] 

Apache Spark commented on SPARK-13334:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/11214

> ML KMeansModel/BisectingKMeansModel should be set parent
> 
>
> Key: SPARK-13334
> URL: https://issues.apache.org/jira/browse/SPARK-13334
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
> parent.
> I have also checked all Estimators, others are set correctly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13334) ML KMeansModel/BisectingKMeansModel should be set parent

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13334:


Assignee: (was: Apache Spark)

> ML KMeansModel/BisectingKMeansModel should be set parent
> 
>
> Key: SPARK-13334
> URL: https://issues.apache.org/jira/browse/SPARK-13334
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should have their 
> parent set.
> I have also checked all the other Estimators; they set the parent correctly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13334) ML KMeansModel/BisectingKMeansModel/QuantileDiscretizerModel should be set parent

2016-02-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-13334:

Summary: ML KMeansModel/BisectingKMeansModel/QuantileDiscretizerModel 
should be set parent  (was: ML KMeansModel / BisectingKMeansModel should be set 
parent)

> ML KMeansModel/BisectingKMeansModel/QuantileDiscretizerModel should be set 
> parent
> -
>
> Key: SPARK-13334
> URL: https://issues.apache.org/jira/browse/SPARK-13334
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should have their 
> parent set.
> I have also checked all the other Estimators; they set the parent correctly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13334) ML KMeansModel/BisectingKMeansModel should be set parent

2016-02-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-13334:

Summary: ML KMeansModel/BisectingKMeansModel should be set parent  (was: ML 
KMeansModel/BisectingKMeansModel/QuantileDiscretizerModel should be set parent)

> ML KMeansModel/BisectingKMeansModel should be set parent
> 
>
> Key: SPARK-13334
> URL: https://issues.apache.org/jira/browse/SPARK-13334
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
> parent.
> I have also checked all Estimators, others are set correctly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13334) ML KMeansModel / BisectingKMeansModel should be set parent

2016-02-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-13334:

Description: 
ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
parent.
I have also checked all Estimators, others are set correctly. 

  was:
ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
parent.
I have also checked all Estimators, others are set properly. 


> ML KMeansModel / BisectingKMeansModel should be set parent
> --
>
> Key: SPARK-13334
> URL: https://issues.apache.org/jira/browse/SPARK-13334
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
> parent.
> I have also checked all Estimators, others are set correctly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13334) ML KMeansModel / BisectingKMeansModel should be set parent

2016-02-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-13334:

Description: 
ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
parent.
I have also checked all Estimators, others are set properly. 

  was:
ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
parent.
I have also checked other Estimators, no other ones lack up.


> ML KMeansModel / BisectingKMeansModel should be set parent
> --
>
> Key: SPARK-13334
> URL: https://issues.apache.org/jira/browse/SPARK-13334
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
> parent.
> I have also checked all Estimators, others are set properly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13334) ML KMeansModel / BisectingKMeansModel should be set parent

2016-02-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-13334:

Description: 
ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
parent.
I have also checked other Estimators, no other ones lack up.

  was:
ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
parent.
I have checked other Estimators, only these three lack up calling setParent.


> ML KMeansModel / BisectingKMeansModel should be set parent
> --
>
> Key: SPARK-13334
> URL: https://issues.apache.org/jira/browse/SPARK-13334
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
> parent.
> I have also checked other Estimators, no other ones lack up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13334) ML KMeansModel / BisectingKMeansModel should be set parent

2016-02-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-13334:

Description: 
ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
parent.
I have checked other Estimators, only these three lack up calling setParent.

  was:ML KMeansModel / BisectingKMeansModel should be set parent


> ML KMeansModel / BisectingKMeansModel should be set parent
> --
>
> Key: SPARK-13334
> URL: https://issues.apache.org/jira/browse/SPARK-13334
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML KMeansModel / BisectingKMeansModel / QuantileDiscretizer should be set 
> parent.
> I have checked other Estimators, only these three lack up calling setParent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148163#comment-15148163
 ] 

Xiao Li commented on SPARK-1:
-

Glad to work on this issue. Let me try it. Will keep you posted. Thanks!

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails: randn produces the same results on the original DataFrame and on 
> the copy before unionAll, but no longer does so after unionAll. Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13334) ML KMeansModel / BisectingKMeansModel should be set parent

2016-02-15 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-13334:
---

 Summary: ML KMeansModel / BisectingKMeansModel should be set parent
 Key: SPARK-13334
 URL: https://issues.apache.org/jira/browse/SPARK-13334
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Yanbo Liang
Priority: Minor


ML KMeansModel / BisectingKMeansModel should be set parent



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148135#comment-15148135
 ] 

Joseph K. Bradley commented on SPARK-1:
---

I haven't tested with 1.5 yet, but I assume it affects 1.5 as well.

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails: randn produces the same results on the original DataFrame and on 
> the copy before unionAll, but no longer does so after unionAll. Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-1:
--
Affects Version/s: 1.6.1
   1.4.2

> DataFrame filter + randn + unionAll has bad interaction
> ---
>
> Key: SPARK-1
> URL: https://issues.apache.org/jira/browse/SPARK-1
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2, 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>
> Buggy workflow
> * Create a DataFrame df0
> * Filter df0
> * Add a randn column
> * Create a copy of the DataFrame
> * unionAll the two DataFrames
> This fails: randn produces the same results on the original DataFrame and on 
> the copy before unionAll, but no longer does so after unionAll. Removing the 
> filter fixes the problem.
> The bug can be reproduced on master:
> {code}
> import org.apache.spark.sql.functions.randn
> val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")
> // Removing the following filter() call makes this give the expected result.
> val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
> println("DF1")
> df1.show()
> val df2 = df1.select("id", "b")
> println("DF2")
> df2.show()  // same as df1.show(), as expected
> val df3 = df1.unionAll(df2)
> println("DF3")
> df3.show()  // NOT two copies of df1, which is unexpected
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction

2016-02-15 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-1:
-

 Summary: DataFrame filter + randn + unionAll has bad interaction
 Key: SPARK-1
 URL: https://issues.apache.org/jira/browse/SPARK-1
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Joseph K. Bradley


Buggy workflow
* Create a DataFrame df0
* Filter df0
* Add a randn column
* Create a copy of the DataFrame
* unionAll the two DataFrames

This fails: randn produces the same results on the original DataFrame and on 
the copy before unionAll, but no longer does so after unionAll. Removing the 
filter fixes the problem.

The bug can be reproduced on master:
{code}
import org.apache.spark.sql.functions.randn

val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")

// Removing the following filter() call makes this give the expected result.
val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345))
println("DF1")
df1.show()

val df2 = df1.select("id", "b")
println("DF2")
df2.show()  // same as df1.show(), as expected

val df3 = df1.unionAll(df2)
println("DF3")
df3.show()  // NOT two copies of df1, which is unexpected
{code}
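
A user-level mitigation, offered here as an assumption rather than a fix for the underlying issue, is to materialize the DataFrame carrying the randn column before reusing it, so the non-deterministic expression is not re-evaluated on each branch of the union:

{code}
import org.apache.spark.sql.functions.{col, randn}

val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id")

// Cache and force df1 so the randn column is computed once and reused,
// instead of being re-evaluated inside the union.
val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)).cache()
df1.count()

val df3 = df1.unionAll(df1.select("id", "b"))
df3.show()  // with df1 cached, both halves should now match
{code}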




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2016-02-15 Thread Qian Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148099#comment-15148099
 ] 

Qian Huang commented on SPARK-4036:
---

Hi, I have created a Spark package, 
http://spark-packages.org/package/hqzizania/crf-spark. I co-worked on it with 
hujiayin.

[~josephkb] This package can be seen as a Spark-based re-implementation of 
CRF++. It shares the same limitations as CRF++, such as the feature generator 
design and being restricted to "segmenting/labeling sequential data", but it 
covers the basic requirements of NLP and can run in parallel on big data.

You are welcome to try it. If you encounter bugs, feel free to submit an issue 
or a pull request.




> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf, crf-spark.zip, 
> dig-hair-eye-train.model, features.hair-eye, sample-input, sample-output
>
>
> Conditional random fields (CRFs) are a class of statistical modelling method 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13332) Decimal datatype support for SQL pow

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13332:


Assignee: Apache Spark

> Decimal datatype support for SQL pow
> 
>
> Key: SPARK-13332
> URL: https://issues.apache.org/jira/browse/SPARK-13332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: yucai
>Assignee: Apache Spark
>
> In SQL pow:
> 1. when the base is a Decimal and the exponent is an integer (Byte, Short, 
> Int), calculate and return the result as Decimal
> 2. otherwise, calculate and return the result as Double



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13332) Decimal datatype support for SQL pow

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13332:


Assignee: (was: Apache Spark)

> Decimal datatype support for SQL pow
> 
>
> Key: SPARK-13332
> URL: https://issues.apache.org/jira/browse/SPARK-13332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: yucai
>
> In SQL pow:
> 1. when the base is a Decimal and the exponent is an integer (Byte, Short, 
> Int), calculate and return the result as Decimal
> 2. otherwise, calculate and return the result as Double



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13332) Decimal datatype support for SQL pow

2016-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148074#comment-15148074
 ] 

Apache Spark commented on SPARK-13332:
--

User 'yucai' has created a pull request for this issue:
https://github.com/apache/spark/pull/11212

> Decimal datatype support for SQL pow
> 
>
> Key: SPARK-13332
> URL: https://issues.apache.org/jira/browse/SPARK-13332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: yucai
>
> In SQL pow:
> 1. when the base is a Decimal and the exponent is an integer (Byte, Short, 
> Int), calculate and return the result as Decimal
> 2. otherwise, calculate and return the result as Double



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13332) Decimal datatype support for SQL pow

2016-02-15 Thread yucai (JIRA)
yucai created SPARK-13332:
-

 Summary: Decimal datatype support for SQL pow
 Key: SPARK-13332
 URL: https://issues.apache.org/jira/browse/SPARK-13332
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: yucai


In SQL pow:
1. when the base is a Decimal and the exponent is an integer (Byte, Short, Int), 
calculate and return the result as Decimal
2. otherwise, calculate and return the result as Double
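
To make the intended dispatch concrete, here is a small standalone sketch in plain Scala using java.math.BigDecimal. The helper name and the Either-based return type are illustrative only; this is not Spark's actual Pow expression.

{code}
import java.math.{BigDecimal => JBigDecimal, MathContext}

object DecimalPowSketch {
  // Left = decimal result, Right = double-precision fallback.
  def pow(base: JBigDecimal, exponent: Any): Either[JBigDecimal, Double] =
    exponent match {
      // Integral exponents: stay in decimal arithmetic to preserve precision.
      case b: Byte   => Left(base.pow(b.toInt, MathContext.DECIMAL128))
      case s: Short  => Left(base.pow(s.toInt, MathContext.DECIMAL128))
      case i: Int    => Left(base.pow(i, MathContext.DECIMAL128))
      // Everything else: fall back to double-precision math.
      case n: Number => Right(math.pow(base.doubleValue(), n.doubleValue()))
    }

  def main(args: Array[String]): Unit = {
    println(pow(new JBigDecimal("1.1"), 3))    // Left(1.331) -- decimal
    println(pow(new JBigDecimal("1.1"), 0.5))  // Right(1.0488...) -- double
  }
}
{code}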



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13018) Replace example code in mllib-pmml-model-export.md using include_example

2016-02-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13018.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 11126
[https://github.com/apache/spark/pull/11126]

> Replace example code in mllib-pmml-model-export.md using include_example
> 
>
> Key: SPARK-13018
> URL: https://issues.apache.org/jira/browse/SPARK-13018
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xin Ren
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> Goal is to move actual example code to spark/examples and test compilation in 
> Jenkins builds. Then in the markdown, we can reference part of the code to 
> show in the user guide. This requires adding a Jekyll tag that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}{% include_example 
> scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}{code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala`
>  and pick code blocks marked "example" and replace code block in 
> {code}{% highlight %}{code}
>  in the markdown. 
> See more sub-tasks in parent ticket: 
> https://issues.apache.org/jira/browse/SPARK-11337



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-13097) Extend Binarizer to allow Double AND Vector inputs

2016-02-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-13097.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10976
[https://github.com/apache/spark/pull/10976]

> Extend Binarizer to allow Double AND Vector inputs
> --
>
> Key: SPARK-13097
> URL: https://issues.apache.org/jira/browse/SPARK-13097
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Mike Seddon
>Assignee: Mike Seddon
>Priority: Minor
> Fix For: 2.0.0
>
>
> To enhance the existing SparkML Binarizer [SPARK-5891] to allow Vector in 
> addition to the existing Double input column type.
> https://github.com/apache/spark/pull/10976
> A use case for this enhancement is for when a user wants to Binarize many 
> similar feature columns at once using the same threshold value.
> A real-world example for this would be where the authors of one of the 
> leading MNIST handwriting character recognition entries converts 784 
> grayscale (0-255) pixels (28x28 pixel images) to binary if the pixel's 
> grayscale exceeds 127.5: (http://arxiv.org/abs/1003.0358). With this 
> modification the user is able to: VectorAssembler(784 
> columns)->Binarizer(127.5)->Classifier as all the pixels are of a similar 
> type. 
> This approach also allows much easier use of the ParamGridBuilder to test 
> multiple theshold values.
> I have already written the code and unit tests and have tested in a 
> Multilayer perceptron classifier workflow.
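
For readers of the archive, a short usage sketch of the enhancement. It assumes a build where the Vector input support from this change is available (it is marked fixed for 2.0.0); the column names and the rawImagesDF DataFrame are made up for illustration.

{code}
import org.apache.spark.ml.feature.{Binarizer, VectorAssembler}

// 784 hypothetical grayscale pixel columns named "p0" ... "p783".
val pixelCols = (0 until 784).map(i => s"p$i").toArray

val assembler = new VectorAssembler()
  .setInputCols(pixelCols)
  .setOutputCol("pixels")

// With Vector input support, one Binarizer thresholds every element of the
// assembled vector at once instead of needing 784 separate stages.
val binarizer = new Binarizer()
  .setInputCol("pixels")
  .setOutputCol("binaryPixels")
  .setThreshold(127.5)

// val prepared = binarizer.transform(assembler.transform(rawImagesDF))
{code}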



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2183) Avoid loading/shuffling data twice in self-join query

2016-02-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148065#comment-15148065
 ] 

Xiao Li commented on SPARK-2183:


This problem still exists, right? I guess it might hurt the performance of 
TPCDS-Q4 a lot.

https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDS_1_4_Queries.scala#L189-L190
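
As an aside, a user-level mitigation (my suggestion, not part of this ticket) is to cache the table once so the self-join at least avoids scanning the source twice; the two exchanges still happen, which is what the ticket proposes to eliminate.

{code}
import org.apache.spark.sql.functions.col

// Assumes a spark-shell style sqlContext with a table named "src".
val src = sqlContext.table("src").cache()
src.count()  // materialize the cache

val joined = src.as("a").join(src.as("b"), col("a.key") === col("b.key"))
joined.show()
{code}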

> Avoid loading/shuffling data twice in self-join query
> -
>
> Key: SPARK-2183
> URL: https://issues.apache.org/jira/browse/SPARK-2183
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Minor
>
> {code}
> scala> hql("select * from src a join src b on (a.key=b.key)")
> res2: org.apache.spark.sql.SchemaRDD = 
> SchemaRDD[3] at RDD at SchemaRDD.scala:100
> == Query Plan ==
> Project [key#3:0,value#4:1,key#5:2,value#6:3]
>  HashJoin [key#3], [key#5], BuildRight
>   Exchange (HashPartitioning [key#3:0], 200)
>HiveTableScan [key#3,value#4], (MetastoreRelation default, src, Some(a)), 
> None
>   Exchange (HashPartitioning [key#5:0], 200)
>HiveTableScan [key#5,value#6], (MetastoreRelation default, src, Some(b)), 
> None
> {code}
> The optimal execution strategy for the above example is to load the data only 
> once and repartition it once. 
> If we want to hyper-optimize it, we can also have a self-join operator that 
> builds the hash map and then simply traverses it ...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13331) Spark network encryption optimization

2016-02-15 Thread Dong Chen (JIRA)
Dong Chen created SPARK-13331:
-

 Summary: Spark network encryption optimization
 Key: SPARK-13331
 URL: https://issues.apache.org/jira/browse/SPARK-13331
 Project: Spark
  Issue Type: Improvement
Reporter: Dong Chen


In network/common, SASL encryption uses the DIGEST-MD5 mechanism, which 
supports 3DES, DES, and RC4.

3DES and RC4 are relatively slow. We could add AES support for better security 
and performance.
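
For illustration, a minimal standalone sketch of AES/CTR with the standard JCE API. It only shows the cipher primitive involved; it is not the SASL negotiation or Spark's transport code, and the key handling here is an assumption for the sketch.

{code}
import java.security.SecureRandom
import javax.crypto.{Cipher, KeyGenerator}
import javax.crypto.spec.IvParameterSpec

object AesCtrSketch {
  def main(args: Array[String]): Unit = {
    // A 128-bit AES session key; in SASL this key would come from the
    // negotiated security layer rather than being generated locally.
    val keyGen = KeyGenerator.getInstance("AES")
    keyGen.init(128)
    val key = keyGen.generateKey()

    // CTR mode turns AES into a stream cipher, which is what a transport
    // layer would wrap around its byte streams.
    val iv = new Array[Byte](16)
    new SecureRandom().nextBytes(iv)

    val encryptor = Cipher.getInstance("AES/CTR/NoPadding")
    encryptor.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv))
    val decryptor = Cipher.getInstance("AES/CTR/NoPadding")
    decryptor.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv))

    val message = "spark shuffle block".getBytes("UTF-8")
    val roundTrip = decryptor.doFinal(encryptor.doFinal(message))
    println(new String(roundTrip, "UTF-8"))  // prints "spark shuffle block"
  }
}
{code}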



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12316) Stack overflow with endless call of `Delegation token thread` when the application ends.

2016-02-15 Thread SaintBacchus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148023#comment-15148023
 ] 

SaintBacchus commented on SPARK-12316:
--

[~tgraves] The application would not hit the same condition, because one minute 
later all the non-daemon threads have exited.
[~hshreedharan] The driver and the ApplicationMaster will both try to delete the 
staging directory. If we want to make sure the ExecutorDelegationTokenUpdater is 
stopped before the ApplicationMaster exits, we would have to add an RPC call 
between these threads. So I think retrying after one minute may be an easy way 
to avoid this issue.

> Stack overflow with endless call of `Delegation token thread` when the 
> application ends.
> ---
>
> Key: SPARK-12316
> URL: https://issues.apache.org/jira/browse/SPARK-12316
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.0
>Reporter: SaintBacchus
>Assignee: SaintBacchus
> Attachments: 20151210045149.jpg, 20151210045533.jpg
>
>
> When the application ends, the AM cleans up the staging dir.
> But if the driver then triggers a delegation token update, it cannot find the 
> right token file and ends up calling the method 'updateCredentialsIfRequired' 
> in an endless cycle.
> This leads to a StackOverflowError.
> !https://issues.apache.org/jira/secure/attachment/12779495/20151210045149.jpg!
> !https://issues.apache.org/jira/secure/attachment/12779496/20151210045533.jpg!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13330) PYTHONHASHSEED is not propagated to executor

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13330:


Assignee: (was: Apache Spark)

> PYTHONHASHSEED is not propagated to executor
> ---
>
> Key: SPARK-13330
> URL: https://issues.apache.org/jira/browse/SPARK-13330
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>
> When using Python 3.3, PYTHONHASHSEED is only set on the driver but not 
> propagated to the executors, which causes the following error.
> {noformat}
>   File "/Users/jzhang/github/spark/python/pyspark/rdd.py", line 74, in 
> portable_hash
> raise Exception("Randomness of hash of string should be disabled via 
> PYTHONHASHSEED")
> Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13330) PYTHONHASHSEED is not propagated to executor

2016-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15148014#comment-15148014
 ] 

Apache Spark commented on SPARK-13330:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/11211

> PYTHONHASHSEED is not propagated to executor
> ---
>
> Key: SPARK-13330
> URL: https://issues.apache.org/jira/browse/SPARK-13330
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>
> When using Python 3.3, PYTHONHASHSEED is only set on the driver but not 
> propagated to the executors, which causes the following error.
> {noformat}
>   File "/Users/jzhang/github/spark/python/pyspark/rdd.py", line 74, in 
> portable_hash
> raise Exception("Randomness of hash of string should be disabled via 
> PYTHONHASHSEED")
> Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13330) PYTHONHASHSEED is not propagated to executor

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13330:


Assignee: Apache Spark

> PYTHONHASHSEED is not propagated to executor
> ---
>
> Key: SPARK-13330
> URL: https://issues.apache.org/jira/browse/SPARK-13330
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>
> When using Python 3.3, PYTHONHASHSEED is only set on the driver but not 
> propagated to the executors, which causes the following error.
> {noformat}
>   File "/Users/jzhang/github/spark/python/pyspark/rdd.py", line 74, in 
> portable_hash
> raise Exception("Randomness of hash of string should be disabled via 
> PYTHONHASHSEED")
> Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
>   at org.apache.spark.scheduler.Task.run(Task.scala:81)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13330) PYTHONHASHSEED is not propagated to executor

2016-02-15 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-13330:
--

 Summary: PYTHONHASHSEED is not propagated to executor
 Key: SPARK-13330
 URL: https://issues.apache.org/jira/browse/SPARK-13330
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.0
Reporter: Jeff Zhang


When using Python 3.3, PYTHONHASHSEED is only set on the driver but not 
propagated to the executors, which causes the following error.
{noformat}
  File "/Users/jzhang/github/spark/python/pyspark/rdd.py", line 74, in 
portable_hash
raise Exception("Randomness of hash of string should be disabled via 
PYTHONHASHSEED")
Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:342)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:77)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:45)
at org.apache.spark.scheduler.Task.run(Task.scala:81)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11381) Replace example code in mllib-linear-methods.md using include_example

2016-02-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147993#comment-15147993
 ] 

Xusen Yin commented on SPARK-11381:
---

[~somi...@us.ibm.com] Are you still interested in working on it? Even though 
https://issues.apache.org/jira/browse/SPARK-11399 is still unmerged, we can 
work around it by folding the lines between example snippets into comments.

> Replace example code in mllib-linear-methods.md using include_example
> -
>
> Key: SPARK-11381
> URL: https://issues.apache.org/jira/browse/SPARK-11381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>  Labels: starter
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-frequent-pattern-mining.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11337) Make example code in user guide testable

2016-02-15 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11337:
--
Description: 
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to automatically test them. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

One option I propose is to move actual example code to spark/examples and test 
compilation in Jenkins builds. Then in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag that is 
similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.

{code}
{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
{code}

Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and 
pick code blocks marked "example" and put them under `{% highlight %}` in the 
markdown. We can discuss the syntax for marker comments.

Sub-tasks are created to move example code from user guide to `examples/`.

*self-check list for contributors in this JIRA*
* Be sure to match Scala/Java/Python code style guide. If unsure of a code 
style, please refer to other merged example code under examples/.
* Remove useless imports
* It's better to have a side-effect operation at the end of each example code, 
usually it's a {code}print(...){code}
* Make sure the code example is runnable without error.
* After finishing code migration, use {code}cd docs; SKIP_API=1 jekyll 
serve{code} the check the webpage at http://127.0.0.1:4000 to see whether the 
generated html looks good.

  was:
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to automatically test them. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

One option I propose is to move actual example code to spark/examples and test 
compilation in Jenkins builds. Then in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag that is 
similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.

{code}
{% include_example ml.KMeansExample %}
{code}

Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and 
pick code blocks marked "example" and put them under `{% highlight %}` in the 
markdown. We can discuss the syntax for marker comments.

Sub-tasks are created to move example code from user guide to `examples/`.

*self-check list for contributors in this JIRA*
* Be sure to match Scala/Java/Python code style guide. If unsure of a code 
style, please refer to other merged example code under examples/.
* Remove useless imports
* It's better to have a side-effect operation at the end of each example code, 
usually it's a {code}print(...){code}
* Make sure the code example is runnable without error.
* After finishing code migration, use {code}cd docs; SKIP_API=1 jekyll 
serve{code} the check the webpage at http://127.0.0.1:4000 to see whether the 
generated html looks good.


> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.
> *self-check list for contributors in this JIRA*
> * Be sure to match Scala/Java/Python code style guide. If unsure of a code 
> style, please refer to other merged example code under examples/.
> * Remove useless imports
> * It's 

[jira] [Updated] (SPARK-11337) Make example code in user guide testable

2016-02-15 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11337:
--
Description: 
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to automatically test them. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

One option I propose is to move actual example code to spark/examples and test 
compilation in Jenkins builds. Then in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag that is 
similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.

{code}
{% include_example ml.KMeansExample %}
{code}

Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and 
pick code blocks marked "example" and put them under `{% highlight %}` in the 
markdown. We can discuss the syntax for marker comments.

Sub-tasks are created to move example code from user guide to `examples/`.

*self-check list for contributors in this JIRA*
* Be sure to match Scala/Java/Python code style guide. If unsure of a code 
style, please refer to other merged example code under examples/.
* Remove useless imports
* It's better to have a side-effect operation at the end of each example code, 
usually it's a {code}print(...){code}
* Make sure the code example is runnable without error.
* After finishing code migration, use {code}cd docs; SKIP_API=1 jekyll 
serve{code} the check the webpage at http://127.0.0.1:4000 to see whether the 
generated html looks good.

  was:
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to automatically test them. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

One option I propose is to move actual example code to spark/examples and test 
compilation in Jenkins builds. Then in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag that is 
similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.

{code}
{% include_example ml.KMeansExample %}
{code}

Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and 
pick code blocks marked "example" and put them under `{% highlight %}` in the 
markdown. We can discuss the syntax for marker comments.

Sub-tasks are created to move example code from user guide to `examples/`.

*self-check list for contributors in this JIRA*
* Be sure to match Scala/Java/Python code style guide. If unsure of a code 
style, please refer to other merged example code under examples/.
* Remove useless imports
* It's better to have a side-effect operation at the end of each example code, 
usually it's a {code}print(...){code}.
* Make sure the code example is runnable without error.
* After finishing code migration, use {code}cd docs; SKIP_API=1 jekyll 
serve{code} the check the webpage at http://127.0.0.1:4000 to see whether the 
generated html looks good.


> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example ml.KMeansExample %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.
> *self-check list for contributors in this JIRA*
> * Be sure to match Scala/Java/Python code style guide. If unsure of a code 
> style, please refer to other merged example code under examples/.
> * Remove useless imports
> * It's better to have a side-effect operation at the end of each example 
> code, 

[jira] [Updated] (SPARK-11337) Make example code in user guide testable

2016-02-15 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11337:
--
Description: 
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to automatically test them. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

One option I propose is to move actual example code to spark/examples and test 
compilation in Jenkins builds. Then in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag that is 
similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.

{code}
{% include_example ml.KMeansExample %}
{code}

Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and 
pick code blocks marked "example" and put them under `{% highlight %}` in the 
markdown. We can discuss the syntax for marker comments.

Sub-tasks are created to move example code from user guide to `examples/`.

*self-check list for contributors in this JIRA*
* Be sure to match Scala/Java/Python code style guide. If unsure of a code 
style, please refer to other merged example code under examples/.
* Remove useless imports
* It's better to have a side-effect operation at the end of each example code, 
usually it's a {code}print(...){code}.
* Make sure the code example is runnable without error.
* After finishing code migration, use {code}cd docs; SKIP_API=1 jekyll 
serve{code} the check the webpage at http://127.0.0.1:4000 to see whether the 
generated html looks good.

  was:
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to automatically test them. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

One option I propose is to move actual example code to spark/examples and test 
compilation in Jenkins builds. Then in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag that is 
similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.

{code}
{% include_example scala ml.KMeansExample guide %}
{code}

Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and 
pick code blocks marked "example" and put them under `{% highlight %}` in the 
markdown. We can discuss the syntax for marker comments.

Sub-tasks are created to move example code from user guide to `examples/`.

*self-check list for contributors in this JIRA*
* Be sure to match Scala/Java/Python code style guide. If unsure of a code 
style, please refer to other merged example code under examples/.
* Remove useless imports
* It's better to have a side-effect operation at the end of each example code, 
usually it's a {code}print(...){code}.
* Make sure the code example is runnable without error.
* After finishing code migration, use {code}cd docs; SKIP_API=1 jekyll 
serve{code} the check the webpage at http://127.0.0.1:4000 to see whether the 
generated html looks good.


> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example ml.KMeansExample %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.
> *self-check list for contributors in this JIRA*
> * Be sure to match Scala/Java/Python code style guide. If unsure of a code 
> style, please refer to other merged example code under examples/.
> * Remove useless imports
> * It's better to have a side-effect operation at the end of each example 

[jira] [Updated] (SPARK-11337) Make example code in user guide testable

2016-02-15 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11337:
--
Description: 
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to automatically test them. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

One option I propose is to move actual example code to spark/examples and test 
compilation in Jenkins builds. Then in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag that is 
similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.

{code}
{% include_example scala ml.KMeansExample guide %}
{code}

Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and 
pick code blocks marked "example" and put them under `{% highlight %}` in the 
markdown. We can discuss the syntax for marker comments.

Sub-tasks are created to move example code from user guide to `examples/`.

*self-check list for contributors in this JIRA*

# Be sure to match Scala/Java/Python code style guide. If unsure of a code 
style, please refer to other merged example code under examples/.

# Remove useless imports

# It's better to have a side-effect operation at the end of each example code, 
usually it's a {code}print(...){code}.

# Make sure the code example is runnable without error.

# After finishing code migration, use {code}cd docs; SKIP_API=1 jekyll 
serve{code} the check the webpage at http://127.0.0.1:4000 to see whether the 
generated html looks good.

  was:
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to automatically test them. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

One option I propose is to move actual example code to spark/examples and test 
compilation in Jenkins builds. Then in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag that is 
similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.

{code}
{% include_example scala ml.KMeansExample guide %}
{code}

Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and 
pick code blocks marked "example" and put them under `{% highlight %}` in the 
markdown. We can discuss the syntax for marker comments.

Sub-tasks are created to move example code from user guide to `examples/`.


> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.
> *self-check list for contributors in this JIRA*
> # Be sure to match Scala/Java/Python code style guide. If unsure of a code 
> style, please refer to other merged example code under examples/.
> # Remove useless imports
> # It's better to have a side-effect operation at the end of each example 
> code, usually it's a {code}print(...){code}.
> # Make sure the code example is runnable without error.
> # After finishing code migration, use {code}cd docs; SKIP_API=1 jekyll 
> serve{code} the check the webpage at http://127.0.0.1:4000 to see whether the 
> generated html looks good.
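
To make the mechanism concrete, here is a sketch of what such an example file could look like. The `// $example on$` / `// $example off$` markers and the data path are assumptions for illustration; as the description above notes, the exact marker-comment syntax is still open for discussion.

{code}
// Sketch of examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala
package org.apache.spark.examples.ml

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
// $example on$
import org.apache.spark.ml.clustering.KMeans
// $example off$

object KMeansExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KMeansExample"))
    val sqlContext = new SQLContext(sc)
    // Path assumed; any libsvm-formatted dataset with a "features" column works.
    val dataset = sqlContext.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

    // $example on$
    // Only the code between the markers would be pulled into the user guide.
    val kmeans = new KMeans().setK(2).setSeed(1L)
    val model = kmeans.fit(dataset)
    println(s"Cluster centers:\n${model.clusterCenters.mkString("\n")}")
    // $example off$

    sc.stop()
  }
}
{code}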



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Updated] (SPARK-11337) Make example code in user guide testable

2016-02-15 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11337:
--
Description: 
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to automatically test them. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

One option I propose is to move actual example code to spark/examples and test 
compilation in Jenkins builds. Then in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag that is 
similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.

{code}
{% include_example scala ml.KMeansExample guide %}
{code}

Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and 
pick code blocks marked "example" and put them under `{% highlight %}` in the 
markdown. We can discuss the syntax for marker comments.

Sub-tasks are created to move example code from user guide to `examples/`.

*self-check list for contributors in this JIRA*
* Be sure to match Scala/Java/Python code style guide. If unsure of a code 
style, please refer to other merged example code under examples/.
* Remove useless imports
* It's better to have a side-effect operation at the end of each example code, 
usually it's a {code}print(...){code}.
* Make sure the code example is runnable without error.
* After finishing code migration, use {code}cd docs; SKIP_API=1 jekyll 
serve{code} to check the webpage at http://127.0.0.1:4000 and see whether the 
generated html looks good.

  was:
The example code in the user guide is embedded in the markdown and hence it is 
not easy to test. It would be nice to automatically test them. This JIRA is to 
discuss options to automate example code testing and see what we can do in 
Spark 1.6.

One option I propose is to move actual example code to spark/examples and test 
compilation in Jenkins builds. Then in the markdown, we can reference part of 
the code to show in the user guide. This requires adding a Jekyll tag that is 
similar to 
https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, e.g., 
called include_example.

{code}
{% include_example scala ml.KMeansExample guide %}
{code}

Jekyll will find 
`examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` and 
pick code blocks marked "example" and put them under `{% highlight %}` in the 
markdown. We can discuss the syntax for marker comments.

Sub-tasks are created to move example code from user guide to `examples/`.

*self-check list for contributors in this JIRA*

# Be sure to match Scala/Java/Python code style guide. If unsure of a code 
style, please refer to other merged example code under examples/.

# Remove useless imports

# It's better to have a side-effect operation at the end of each example code, 
usually it's a {code}print(...){code}.

# Make sure the code example is runnable without error.

# After finishing code migration, use {code}cd docs; SKIP_API=1 jekyll 
serve{code} the check the webpage at http://127.0.0.1:4000 to see whether the 
generated html looks good.


> Make example code in user guide testable
> 
>
> Key: SPARK-11337
> URL: https://issues.apache.org/jira/browse/SPARK-11337
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation
>Reporter: Xiangrui Meng
>Assignee: Xusen Yin
>Priority: Critical
>
> The example code in the user guide is embedded in the markdown and hence it 
> is not easy to test. It would be nice to automatically test them. This JIRA 
> is to discuss options to automate example code testing and see what we can do 
> in Spark 1.6.
> One option I propose is to move actual example code to spark/examples and 
> test compilation in Jenkins builds. Then in the markdown, we can reference 
> part of the code to show in the user guide. This requires adding a Jekyll tag 
> that is similar to 
> https://github.com/jekyll/jekyll/blob/master/lib/jekyll/tags/include.rb, 
> e.g., called include_example.
> {code}
> {% include_example scala ml.KMeansExample guide %}
> {code}
> Jekyll will find 
> `examples/src/main/scala/org/apache/spark/examples/ml/KMeansExample.scala` 
> and pick code blocks marked "example" and put them under `{% highlight %}` in 
> the markdown. We can discuss the syntax for marker comments.
> Sub-tasks are created to move example code from user guide to `examples/`.
> *self-check list for contributors in this JIRA*
> * Be sure to match Scala/Java/Python code style guide. If unsure of a code 
> style, please refer to other merged example code under examples/.
> * Remove useless imports
> * It's better to have a side-effect 

[jira] [Assigned] (SPARK-13329) Considering output for statistics of logical plan

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13329:


Assignee: Davies Liu  (was: Apache Spark)

> Considering output for statistics of logical plan
> -
>
> Key: SPARK-13329
> URL: https://issues.apache.org/jira/browse/SPARK-13329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> The current implementation of statistics for UnaryNode does not consider the 
> output (for example, Project); we should consider it to make a better guess.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13329) Considering output for statistics of logical plan

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13329:


Assignee: Apache Spark  (was: Davies Liu)

> Considering output for statistics of logical plan
> -
>
> Key: SPARK-13329
> URL: https://issues.apache.org/jira/browse/SPARK-13329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> The current implementation of statistics for UnaryNode does not consider the 
> output (for example, Project); we should consider it to make a better guess.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13329) Considering output for statistics of logical plan

2016-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147974#comment-15147974
 ] 

Apache Spark commented on SPARK-13329:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/11210

> Considering output for statistics of logical plan
> -
>
> Key: SPARK-13329
> URL: https://issues.apache.org/jira/browse/SPARK-13329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> The current implementation of statistics for UnaryNode does not consider the 
> output (for example, Project); we should consider it to make a better guess.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13329) Considering output for statistics of logical plan

2016-02-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-13329:
--

 Summary: Considering output for statistics of logical plan
 Key: SPARK-13329
 URL: https://issues.apache.org/jira/browse/SPARK-13329
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu


The current implementation of statistics for UnaryNode does not consider the 
output (for example, Project); we should consider it to make a better guess.
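
To make the idea concrete, here is a rough, self-contained sketch (simplified 
stand-ins, not the actual Catalyst code) of scaling the child's size estimate by 
the ratio of the output row width to the child row width:

{code}
// Illustrative sketch only; Attr/Stats are simplified stand-ins for Catalyst classes.
case class Attr(name: String, defaultSizeInBytes: Long)
case class Stats(sizeInBytes: BigInt)

// Naive estimate: size scales with the fraction of the row that survives the projection.
def projectStats(childStats: Stats, childOutput: Seq[Attr], projectOutput: Seq[Attr]): Stats = {
  val childRowSize = childOutput.map(_.defaultSizeInBytes).sum max 1L
  val outputRowSize = projectOutput.map(_.defaultSizeInBytes).sum max 1L
  Stats(childStats.sizeInBytes * outputRowSize / childRowSize)
}

// Projecting a single 4-byte column out of a (4 + 8 + 20)-byte row shrinks a
// 32 GB child estimate to roughly 4 GB instead of keeping the full 32 GB.
val child = Seq(Attr("id", 4), Attr("ts", 8), Attr("name", 20))
val projected = Seq(Attr("id", 4))
println(projectStats(Stats(BigInt(32L) * 1024 * 1024 * 1024), child, projected).sizeInBytes)
{code}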



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13038) PySpark ml.pipeline support export/import

2016-02-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147941#comment-15147941
 ] 

Xusen Yin commented on SPARK-13038:
---

I've started working on it.

> PySpark ml.pipeline support export/import
> -
>
> Key: SPARK-13038
> URL: https://issues.apache.org/jira/browse/SPARK-13038
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add export/import for all estimators and transformers (which have a Scala 
> implementation) under pyspark/ml/pipeline.py. Please refer to the 
> implementation in SPARK-13032.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13323) Type cast support in type inference during merging types.

2016-02-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147929#comment-15147929
 ] 

Hyukjin Kwon commented on SPARK-13323:
--

{code}
sqlCtx.createDataFrame([["a"], [1]]).show()
{code}

This fails to infer the schema, emitting the exception below:

{code}
Traceback (most recent call last):
  File 
"/Users/hyukjinkwon/Desktop/workspace/local/forked/spark/python/pyspark/tests.py",
 line 2023, in 
sqlCtx.createDataFrame([["a"], [1]]).show()
  File 
"/Users/hyukjinkwon/Desktop/workspace/local/forked/spark/python/pyspark/sql/context.py",
 line 398, in createDataFrame
rdd, schema = self._createFromLocal(data, schema)
  File 
"/Users/hyukjinkwon/Desktop/workspace/local/forked/spark/python/pyspark/sql/context.py",
 line 314, in _createFromLocal
struct = self._inferSchemaFromList(data)
  File 
"/Users/hyukjinkwon/Desktop/workspace/local/forked/spark/python/pyspark/sql/context.py",
 line 241, in _inferSchemaFromList
schema = reduce(_merge_type, map(_infer_schema, data))
  File 
"/Users/hyukjinkwon/Desktop/workspace/local/forked/spark/python/pyspark/sql/types.py",
 line 862, in _merge_type
for f in a.fields]
  File 
"/Users/hyukjinkwon/Desktop/workspace/local/forked/spark/python/pyspark/sql/types.py",
 line 856, in _merge_type
raise TypeError("Can not merge type %s and %s" % (type(a), type(b)))
TypeError: Can not merge type  and 
{code}

I think we can set this type as {{StringType}}.
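
For illustration only, a small Scala sketch of the merging rule I have in mind 
(simplified stand-in types, not the PySpark {{_merge_type}} code): widen numerics 
to the tightest common type and fall back to string when either side is a string:

{code}
// Simplified illustration of the proposed rule; not PySpark code.
sealed trait DType
case object NullT extends DType
case object LongT extends DType
case object DoubleT extends DType
case object StringT extends DType

// Numeric widening order, loosely following HiveTypeCoercion.findTightestCommonTypeOfTwo.
val numericPrecedence = Seq(LongT, DoubleT)

def mergeType(a: DType, b: DType): DType = (a, b) match {
  case (x, y) if x == y            => x
  case (NullT, other)              => other
  case (other, NullT)              => other
  // If either side is a string, widen to string (this is the new part).
  case (StringT, _) | (_, StringT) => StringT
  // Otherwise pick the wider numeric type.
  case (x, y) if numericPrecedence.contains(x) && numericPrecedence.contains(y) =>
    if (numericPrecedence.indexOf(x) >= numericPrecedence.indexOf(y)) x else y
  case _                           => StringT // last resort for this sketch
}

// The failing example above would then merge to a string column instead of raising.
assert(mergeType(StringT, LongT) == StringT)
assert(mergeType(LongT, DoubleT) == DoubleT)
{code}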

> Type cast support in type inference during merging types.
> -
>
> Key: SPARK-13323
> URL: https://issues.apache.org/jira/browse/SPARK-13323
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> As described in {{types.py}}, there is a todo {{TODO: type cast (such as int 
> -> long)}}.
> Currently, PySpark infers types but does not try to find compatible types 
> when the given types are different during merging schemas.
> I think this can be done by resembling 
> {{HiveTypeCoercion.findTightestCommonTypeOfTwo}} for numbers, and when either 
> side is compared to {{StringType}}, just convert both into strings.
> It looks like the possible leaf data types are the ones below:
> {code}
> # Mapping Python types to Spark SQL DataType
> _type_mappings = {
> type(None): NullType,
> bool: BooleanType,
> int: LongType,
> float: DoubleType,
> str: StringType,
> bytearray: BinaryType,
> decimal.Decimal: DecimalType,
> datetime.date: DateType,
> datetime.datetime: TimestampType,
> datetime.time: TimestampType,
> }
> {code}
> and they are converted pretty well to string as below:
> {code}
> >>> print str(None)
> None
> >>> print str(True)
> True
> >>> print str(float(0.1))
> 0.1
> >>> str(bytearray([255]))
> '\xff'
> >>> str(decimal.Decimal())
> '0'
> >>> str(datetime.date(1,1,1))
> '0001-01-01'
> >>> str(datetime.datetime(1,1,1))
> '0001-01-01 00:00:00'
> >>> str(datetime.time(1,1,1))
> '01:01:01'
> {code}
> First, I tried to find a related issue for this but I couldn't. Please mark 
> this as a duplicate if one already exists.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13328) Possible Poor read performance for broadcast variables with dynamic resource allocation

2016-02-15 Thread Nezih Yigitbasi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nezih Yigitbasi updated SPARK-13328:

Summary: Possible Poor read performance for broadcast variables with 
dynamic resource allocation  (was: Poor read performance for broadcast 
variables with dynamic resource allocation)

> Possible Poor read performance for broadcast variables with dynamic resource 
> allocation
> ---
>
> Key: SPARK-13328
> URL: https://issues.apache.org/jira/browse/SPARK-13328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Nezih Yigitbasi
>
> When dynamic resource allocation is enabled, fetching broadcast variables from 
> removed executors was causing job failures, and SPARK-9591 fixed this problem 
> by trying all locations of a block before giving up. However, the locations 
> of a block are retrieved only once from the driver in this process, and the 
> locations in this list can be stale due to dynamic resource allocation. This 
> situation gets worse when running on a large cluster, as the size of this 
> location list can be on the order of several hundred entries, of which tens 
> may be stale. What we have observed is that, with the default settings of 3 
> max retries and 5s between retries (that's 15s per location), the time it 
> takes to read a broadcast variable can be as high as ~17m (the log below shows 
> the failed 70th block fetch attempt, where each attempt takes 15s)
> {code}
> ...
> 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block 
> broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 
> 60675) (failed attempt 70)
> ...
> 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 18 took 1051049 ms
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13328) Possible poor read performance for broadcast variables with dynamic resource allocation

2016-02-15 Thread Nezih Yigitbasi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nezih Yigitbasi updated SPARK-13328:

Summary: Possible poor read performance for broadcast variables with 
dynamic resource allocation  (was: Possible Poor read performance for broadcast 
variables with dynamic resource allocation)

> Possible poor read performance for broadcast variables with dynamic resource 
> allocation
> ---
>
> Key: SPARK-13328
> URL: https://issues.apache.org/jira/browse/SPARK-13328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Nezih Yigitbasi
>
> When dynamic resource allocation is enabled, fetching broadcast variables from 
> removed executors was causing job failures, and SPARK-9591 fixed this problem 
> by trying all locations of a block before giving up. However, the locations 
> of a block are retrieved only once from the driver in this process, and the 
> locations in this list can be stale due to dynamic resource allocation. This 
> situation gets worse when running on a large cluster, as the size of this 
> location list can be on the order of several hundred entries, of which tens 
> may be stale. What we have observed is that, with the default settings of 3 
> max retries and 5s between retries (that's 15s per location), the time it 
> takes to read a broadcast variable can be as high as ~17m (the log below shows 
> the failed 70th block fetch attempt, where each attempt takes 15s)
> {code}
> ...
> 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block 
> broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 
> 60675) (failed attempt 70)
> ...
> 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 18 took 1051049 ms
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13328) Poor read performance for broadcast variables with dynamic resource allocation

2016-02-15 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147896#comment-15147896
 ] 

Nezih Yigitbasi edited comment on SPARK-13328 at 2/15/16 11:30 PM:
---

Although this long delay can be reduced by decreasing the values of the 
{{spark.shuffle.io.maxRetries}} and {{spark.shuffle.io.retryWait}} parameters, 
it may not be desirable to reduce the number of retries globally, and reducing 
the retry wait may increase the load on the serving block manager.

I already have a fix where I added a new config parameter 
{{spark.block.failures.beforeLocationRefresh}} that determines when to refresh 
the list of block locations from the driver while going through all these 
locations. In my fix this parameter is honored only when dynamic allocation is 
enabled and I set its default value to Int.MaxValue so that it doesn't change 
the behavior even if dynamic alloc. is enabled (as refreshing the location may 
not be necessary in small clusters).

If you think such a fix is valuable I will be happy to create a PR.


was (Author: nezihyigitbasi):
Although this long time can be reduced by decreasing the values of the 
{{spark.shuffle.io.maxRetries}} and {{spark.shuffle.io.retryWait}} parameters 
it may not be desirable to reduce # of retries globally and also reducing retry 
wait may increase the load on the serving block manager. 

I already have a fix where I added a new config parameter 
{{spark.block.failures.beforeLocationRefresh}} that determines when to refresh 
the list of block locations from the driver while going through all these 
locations. In my fix this parameter is honored only when dynamic allocation is 
enabled and I set its default value to Int.MaxValue so that it doesn't change 
the behavior even if dynamic alloc. is enabled (as refreshing the location may 
not be necessary in small clusters).

> Poor read performance for broadcast variables with dynamic resource allocation
> --
>
> Key: SPARK-13328
> URL: https://issues.apache.org/jira/browse/SPARK-13328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Nezih Yigitbasi
>
> When dynamic resource allocation is enabled, fetching broadcast variables from 
> removed executors was causing job failures, and SPARK-9591 fixed this problem 
> by trying all locations of a block before giving up. However, the locations 
> of a block are retrieved only once from the driver in this process, and the 
> locations in this list can be stale due to dynamic resource allocation. This 
> situation gets worse when running on a large cluster, as the size of this 
> location list can be on the order of several hundred entries, of which tens 
> may be stale. What we have observed is that, with the default settings of 3 
> max retries and 5s between retries (that's 15s per location), the time it 
> takes to read a broadcast variable can be as high as ~17m (the log below shows 
> the failed 70th block fetch attempt, where each attempt takes 15s)
> {code}
> ...
> 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block 
> broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 
> 60675) (failed attempt 70)
> ...
> 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 18 took 1051049 ms
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13328) Poor read performance for broadcast variables with dynamic resource allocation

2016-02-15 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147896#comment-15147896
 ] 

Nezih Yigitbasi commented on SPARK-13328:
-

Although this long delay can be reduced by decreasing the values of the 
{{spark.shuffle.io.maxRetries}} and {{spark.shuffle.io.retryWait}} parameters, 
it may not be desirable to reduce the number of retries globally, and reducing 
the retry wait may increase the load on the serving block manager.

I already have a fix where I added a new config parameter 
{{spark.block.failures.beforeLocationRefresh}} that determines when to refresh 
the list of block locations from the driver while going through all these 
locations. In my fix this parameter is honored only when dynamic allocation is 
enabled and I set its default value to Int.MaxValue so that it doesn't change 
the behavior even if dynamic alloc. is enabled (as refreshing the location may 
not be necessary in small clusters).
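
For context, a minimal sketch of how the proposed setting could sit next to the 
existing retry knobs; {{spark.block.failures.beforeLocationRefresh}} is only the 
parameter name from the local fix described above, not a released configuration:

{code}
// Sketch only: the refresh threshold below is from the local patch described above,
// not part of any Spark release.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("broadcast-read-tuning")
  // Existing knobs: fewer/faster retries shorten the worst case but apply globally.
  .set("spark.shuffle.io.maxRetries", "3")
  .set("spark.shuffle.io.retryWait", "5s")
  // Proposed knob: refresh block locations from the driver after this many failed attempts.
  .set("spark.block.failures.beforeLocationRefresh", "5")
{code}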

> Poor read performance for broadcast variables with dynamic resource allocation
> --
>
> Key: SPARK-13328
> URL: https://issues.apache.org/jira/browse/SPARK-13328
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2
>Reporter: Nezih Yigitbasi
>
> When dynamic resource allocation is enabled, fetching broadcast variables from 
> removed executors was causing job failures, and SPARK-9591 fixed this problem 
> by trying all locations of a block before giving up. However, the locations 
> of a block are retrieved only once from the driver in this process, and the 
> locations in this list can be stale due to dynamic resource allocation. This 
> situation gets worse when running on a large cluster, as the size of this 
> location list can be on the order of several hundred entries, of which tens 
> may be stale. What we have observed is that, with the default settings of 3 
> max retries and 5s between retries (that's 15s per location), the time it 
> takes to read a broadcast variable can be as high as ~17m (the log below shows 
> the failed 70th block fetch attempt, where each attempt takes 15s)
> {code}
> ...
> 16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block 
> broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 
> 60675) (failed attempt 70)
> ...
> 16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 18 took 1051049 ms
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12675) Executor dies because of ClassCastException and causes timeout

2016-02-15 Thread William Dixon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147895#comment-15147895
 ] 

William Dixon commented on SPARK-12675:
---

I see this issue in Spark 1.5.2 running in local mode, accessing remote data via 
wasb:// with 25 partitions and a lot of memory. I agree that the 
ClassCastException is very suspicious.

This also seems to cause exceptions during shutdown in the cleaner threads.

> Executor dies because of ClassCastException and causes timeout
> --
>
> Key: SPARK-12675
> URL: https://issues.apache.org/jira/browse/SPARK-12675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
> Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz
>Reporter: Alexandru Rosianu
>Priority: Minor
>
> I'm trying to fit a Spark ML pipeline but my executor dies. Here's the script 
> which doesn't work (a bit simplified):
> {code:title=Script.scala}
> // Prepare data sets
> logInfo("Getting datasets")
> val emoTrainingData = 
> sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet")
> val trainingData = emoTrainingData
> // Configure the pipeline
> val pipeline = new Pipeline().setStages(Array(
>   new 
> FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"),
>   new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"),
>   new Tokenizer().setInputCol("text").setOutputCol("raw_words"),
>   new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"),
>   new HashingTF().setInputCol("words").setOutputCol("features"),
>   new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"),
>   new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", 
> "raw_words", "words", "features")
> ))
> // Fit the pipeline
> logInfo(s"Training model on ${trainingData.count()} rows")
> val model = pipeline.fit(trainingData)
> {code}
> It executes up to the last line. It prints "Training model on xx rows", then 
> it starts fitting, the executor dies, the drivers doesn't receive heartbeats 
> from the executor and it times out, then the script exits. It doesn't get 
> past that line.
> This is the exception that kills the executor:
> {code}
> java.io.IOException: java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.HashMap$SerializationProxy to field 
> org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type 
> scala.collection.immutable.Map in instance of 
> org.apache.spark.executor.TaskMetrics
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207)
>   at 
> org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at org.apache.spark.util.Utils$.deserialize(Utils.scala:92)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>   at 
> 

[jira] [Created] (SPARK-13328) Poor read performance for broadcast variables with dynamic resource allocation

2016-02-15 Thread Nezih Yigitbasi (JIRA)
Nezih Yigitbasi created SPARK-13328:
---

 Summary: Poor read performance for broadcast variables with 
dynamic resource allocation
 Key: SPARK-13328
 URL: https://issues.apache.org/jira/browse/SPARK-13328
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2
Reporter: Nezih Yigitbasi


When dynamic resource allocation is enabled, fetching broadcast variables from 
removed executors was causing job failures, and SPARK-9591 fixed this problem by 
trying all locations of a block before giving up. However, the locations of a 
block are retrieved only once from the driver in this process, and the locations 
in this list can be stale due to dynamic resource allocation. This situation gets 
worse when running on a large cluster, as the size of this location list can be 
on the order of several hundred entries, of which tens may be stale. What we have 
observed is that, with the default settings of 3 max retries and 5s between 
retries (that's 15s per location), the time it takes to read a broadcast variable 
can be as high as ~17m (the log below shows the failed 70th block fetch attempt, 
where each attempt takes 15s)

{code}
...
16/02/13 01:02:27 WARN storage.BlockManager: Failed to fetch remote block 
broadcast_18_piece0 from BlockManagerId(8, ip-10-178-77-38.ec2.internal, 60675) 
(failed attempt 70)
...
16/02/13 01:02:27 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
18 took 1051049 ms
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12583) spark shuffle fails with mesos after 2mins

2016-02-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-12583:
-
Target Version/s: 1.6.1

> spark shuffle fails with mesos after 2mins
> --
>
> Key: SPARK-12583
> URL: https://issues.apache.org/jira/browse/SPARK-12583
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.0
>Reporter: Adrian Bridgett
>
> See the user mailing list thread "Executor deregistered after 2mins" for more 
> details.
> As of 1.6, the driver registers with each shuffle manager via 
> MesosExternalShuffleClient.  Once this connection drops, the shuffle manager 
> automatically cleans up the data associated with that driver.
> However, the connection is terminated before this happens because it is idle. 
> Looking at a packet trace, after 120secs the shuffle manager sends a FIN 
> packet to the driver.   The only way to delay this is to increase 
> spark.shuffle.io.connectionTimeout=3600s on the shuffle manager.
> I patched the MesosExternalShuffleClient (and ExternalShuffleClient) with 
> newbie Scala skills to call TransportContext with closeIdleConnections set to 
> "false", but this didn't help (I hadn't done the network trace first).
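
A sketch of the workaround named above (it only delays the idle disconnect, it 
does not fix the cleanup behaviour); in practice the setting goes into the 
external shuffle service's configuration:

{code}
// Workaround sketch only: raise the idle timeout from the ~120s default to an hour.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.io.connectionTimeout", "3600s")
{code}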



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13323) Type cast support in type inference during merging types.

2016-02-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147862#comment-15147862
 ] 

Hyukjin Kwon commented on SPARK-13323:
--

Let me add some code here to reproduce this within an hour.

> Type cast support in type inference during merging types.
> -
>
> Key: SPARK-13323
> URL: https://issues.apache.org/jira/browse/SPARK-13323
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> As described in {{types.py}}, there is a todo {{TODO: type cast (such as int 
> -> long)}}.
> Currently, PySpark infers types but does not try to find compatible types 
> when the given types are different during merging schemas.
> I think this can be done by resembling 
> {{HiveTypeCoercion.findTightestCommonTypeOfTwo}} for numbers, and when either 
> side is compared to {{StringType}}, just convert both into strings.
> It looks like the possible leaf data types are the ones below:
> {code}
> # Mapping Python types to Spark SQL DataType
> _type_mappings = {
> type(None): NullType,
> bool: BooleanType,
> int: LongType,
> float: DoubleType,
> str: StringType,
> bytearray: BinaryType,
> decimal.Decimal: DecimalType,
> datetime.date: DateType,
> datetime.datetime: TimestampType,
> datetime.time: TimestampType,
> }
> {code}
> and they are converted pretty well to string as below:
> {code}
> >>> print str(None)
> None
> >>> print str(True)
> True
> >>> print str(float(0.1))
> 0.1
> >>> str(bytearray([255]))
> '\xff'
> >>> str(decimal.Decimal())
> '0'
> >>> str(datetime.date(1,1,1))
> '0001-01-01'
> >>> str(datetime.datetime(1,1,1))
> '0001-01-01 00:00:00'
> >>> str(datetime.time(1,1,1))
> '01:01:01'
> {code}
> First, I tried to find a related issue for this but I couldn't. Please mark 
> this as a duplicate if one already exists.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-13323) Type cast support in type inference during merging types.

2016-02-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147851#comment-15147851
 ] 

Hyukjin Kwon edited comment on SPARK-13323 at 2/15/16 10:43 PM:


[~davies]

Yes, it's complicated, but dealing with the numeric precedence is not that much 
work.

The problem is that it can't find a compatible type. Namely, if the types of the 
following rows are different from the types of the first row, it simply fails to 
infer the types, which CSV and JSON type inference do not.


was (Author: hyukjin.kwon):
[~davies]

Yes it's complicated but dealimg with numeric precedence is not super much.

The problem is that is can't find a compatible types. Namly, if the types of 
following rows are different with the types of the first row, it just simply 
fails to infer types, which CSV and JSON type inference do not.

> Type cast support in type inference during merging types.
> -
>
> Key: SPARK-13323
> URL: https://issues.apache.org/jira/browse/SPARK-13323
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> As described in {{types.py}}, there is a todo {{TODO: type cast (such as int 
> -> long)}}.
> Currently, PySpark infers types but does not try to find compatible types 
> when the given types are different during merging schemas.
> I think this can be done by resembling 
> {{HiveTypeCoercion.findTightestCommonTypeOfTwo}} for numbers, and when either 
> side is compared to {{StringType}}, just convert both into strings.
> It looks like the possible leaf data types are the ones below:
> {code}
> # Mapping Python types to Spark SQL DataType
> _type_mappings = {
> type(None): NullType,
> bool: BooleanType,
> int: LongType,
> float: DoubleType,
> str: StringType,
> bytearray: BinaryType,
> decimal.Decimal: DecimalType,
> datetime.date: DateType,
> datetime.datetime: TimestampType,
> datetime.time: TimestampType,
> }
> {code}
> and they are converted pretty well to string as below:
> {code}
> >>> print str(None)
> None
> >>> print str(True)
> True
> >>> print str(float(0.1))
> 0.1
> >>> str(bytearray([255]))
> '\xff'
> >>> str(decimal.Decimal())
> '0'
> >>> str(datetime.date(1,1,1))
> '0001-01-01'
> >>> str(datetime.datetime(1,1,1))
> '0001-01-01 00:00:00'
> >>> str(datetime.time(1,1,1))
> '01:01:01'
> {code}
> First, I tried to find a related issue for this but I couldn't. Please mark 
> this as a duplicate if one already exists.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13323) Type cast support in type inference during merging types.

2016-02-15 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147851#comment-15147851
 ] 

Hyukjin Kwon commented on SPARK-13323:
--

[~davies]

Yes, it's complicated, but dealing with the numeric precedence is not that much 
work.

The problem is that it can't find a compatible type. Namely, if the types of the 
following rows are different from the types of the first row, it simply fails to 
infer the types, which CSV and JSON type inference do not.

> Type cast support in type inference during merging types.
> -
>
> Key: SPARK-13323
> URL: https://issues.apache.org/jira/browse/SPARK-13323
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> As described in {{types.py}}, there is a todo {{TODO: type cast (such as int 
> -> long)}}.
> Currently, PySpark infers types but does not try to find compatible types 
> when the given types are different during merging schemas.
> I think this can be done by resembling 
> {{HiveTypeCoercion.findTightestCommonTypeOfTwo}} for numbers, and when either 
> side is compared to {{StringType}}, just convert both into strings.
> It looks like the possible leaf data types are the ones below:
> {code}
> # Mapping Python types to Spark SQL DataType
> _type_mappings = {
> type(None): NullType,
> bool: BooleanType,
> int: LongType,
> float: DoubleType,
> str: StringType,
> bytearray: BinaryType,
> decimal.Decimal: DecimalType,
> datetime.date: DateType,
> datetime.datetime: TimestampType,
> datetime.time: TimestampType,
> }
> {code}
> and they are converted pretty well to string as below:
> {code}
> >>> print str(None)
> None
> >>> print str(True)
> True
> >>> print str(float(0.1))
> 0.1
> >>> str(bytearray([255]))
> '\xff'
> >>> str(decimal.Decimal())
> '0'
> >>> str(datetime.date(1,1,1))
> '0001-01-01'
> >>> str(datetime.datetime(1,1,1))
> '0001-01-01 00:00:00'
> >>> str(datetime.time(1,1,1))
> '01:01:01'
> {code}
> First, I tried to find a related issue for this but I couldn't. Please mark 
> this as a duplicate if one already exists.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13327) colnames()<- allows invalid column names

2016-02-15 Thread Oscar D. Lara Yejas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147843#comment-15147843
 ] 

Oscar D. Lara Yejas commented on SPARK-13327:
-

I'm working on this one

> colnames()<- allows invalid column names
> 
>
> Key: SPARK-13327
> URL: https://issues.apache.org/jira/browse/SPARK-13327
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Oscar D. Lara Yejas
>
> colnames<- fails if:
> 1) Given colnames contain .
> 2) Given colnames contain NA
> 3) Given colnames are not character
> 4) Given colnames have a different length than the dataset's (a SparkSQL error 
> is thrown, but it is not user friendly)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13327) colnames()<- allows invalid column names

2016-02-15 Thread Oscar D. Lara Yejas (JIRA)
Oscar D. Lara Yejas created SPARK-13327:
---

 Summary: colnames()<- allows invalid column names
 Key: SPARK-13327
 URL: https://issues.apache.org/jira/browse/SPARK-13327
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Oscar D. Lara Yejas


colnames<- fails if:

1) Given colnames contain .
2) Given colnames contain NA
3) Given colnames are not character
4) Given colnames have a different length than the dataset's (a SparkSQL error 
is thrown, but it is not user friendly)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13326) Dataset in spark 2.0.0-SNAPSHOT missing columns

2016-02-15 Thread koert kuipers (JIRA)
koert kuipers created SPARK-13326:
-

 Summary: Dataset in spark 2.0.0-SNAPSHOT missing columns
 Key: SPARK-13326
 URL: https://issues.apache.org/jira/browse/SPARK-13326
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: koert kuipers
Priority: Minor


I noticed some things stopped working on Datasets in Spark 2.0.0-SNAPSHOT, and 
with a confusing error message (cannot resolve some column with input columns 
[]).

for example in 1.6.0-SNAPSHOT:
{noformat}
scala> val ds = sc.parallelize(1 to 10).toDS
ds: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> ds.map(x => Option(x))
res0: org.apache.spark.sql.Dataset[Option[Int]] = [value: int]
{noformat}

and the same commands in 2.0.0-SNAPSHOT:
{noformat}
scala> val ds = sc.parallelize(1 to 10).toDS
ds: org.apache.spark.sql.Dataset[Int] = [value: int]

scala> ds.map(x => Option(x))
org.apache.spark.sql.AnalysisException: cannot resolve 'value' given input 
columns: [];
  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:284)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:283)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:162)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:172)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:176)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:176)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:181)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
  at scala.collection.Iterator$class.foreach(Iterator.scala:742)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
  at scala.collection.AbstractIterator.to(Iterator.scala:1194)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:181)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:122)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:121)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:121)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:121)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:46)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolve(ExpressionEncoder.scala:322)
  at org.apache.spark.sql.Dataset.(Dataset.scala:81)
  at org.apache.spark.sql.Dataset.(Dataset.scala:92)
  at org.apache.spark.sql.Dataset.mapPartitions(Dataset.scala:339)
  at org.apache.spark.sql.Dataset.map(Dataset.scala:323)
  ... 43 elided
{noformat}

i 

[jira] [Commented] (SPARK-12969) Exception while casting a spark supported date formatted "string" to "date" data type.

2016-02-15 Thread Ankit Jindal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147829#comment-15147829
 ] 

Ankit Jindal commented on SPARK-12969:
--

Hi Jais,
I am running the Java program directly, and the following is the command I am running:
./spark-submit --class contribution.DateConversion --deploy-mode client 
/Users/ankitjindal/Documents/workspace/SparkWithJava8/bin/

Please let me know if I am doing anything wrong.


Thanks,
Ankit

> Exception while  casting a spark supported date formatted "string" to "date" 
> data type.
> ---
>
> Key: SPARK-12969
> URL: https://issues.apache.org/jira/browse/SPARK-12969
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.0
> Environment: Spark Java 
>Reporter: Jais Sebastian
>
> Getting exception while  converting a string column( column is having spark 
> supported date format -MM-dd ) to date data type. Below is the code 
> snippet 
> List jsonData = Arrays.asList( 
> "{\"d\":\"2015-02-01\",\"n\":1}");
> JavaRDD dataRDD = 
> this.getSparkContext().parallelize(jsonData);
> DataFrame data = this.getSqlContext().read().json(dataRDD);
> DataFrame newData = data.select(data.col("d").cast("date"));
> newData.show();
> Above code will give the error
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> generated.java, Line 95, Column 28: Expression "scala.Option < Long > 
> longOpt16" is not an lvalue
> This happens only if we execute the program in client mode , it works if we 
> execute through spark submit. Here is the sample project : 
> https://github.com/uhonnavarkar/spark_test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13325) Create a high-quality 64-bit hashcode expression

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13325:


Assignee: Apache Spark

> Create a high-quality 64-bit hashcode expression
> 
>
> Key: SPARK-13325
> URL: https://issues.apache.org/jira/browse/SPARK-13325
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> Spark currently lacks a high quality 64-bit hashcode. This is especially 
> useful for the HyperLogLog++ aggregate & potentially other probabilistic 
> datastructures.
> I have been looking at xxHash for a while now, and I'd like to see if we can 
> get the same (stellar) performance in Java/Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13325) Create a high-quality 64-bit hashcode expression

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13325:


Assignee: (was: Apache Spark)

> Create a high-quality 64-bit hashcode expression
> 
>
> Key: SPARK-13325
> URL: https://issues.apache.org/jira/browse/SPARK-13325
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> Spark currently lacks a high quality 64-bit hashcode. This is especially 
> useful for the HyperLogLog++ aggregate & potentially other probabilistic 
> datastructures.
> I have been looking at xxHash for a while now, and I'd like to see if we can 
> get the same (stellar) performance in Java/Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13325) Create a high-quality 64-bit hashcode expression

2016-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147824#comment-15147824
 ] 

Apache Spark commented on SPARK-13325:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/11209

> Create a high-quality 64-bit hashcode expression
> 
>
> Key: SPARK-13325
> URL: https://issues.apache.org/jira/browse/SPARK-13325
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Herman van Hovell
>
> Spark currently lacks a high quality 64-bit hashcode. This is especially 
> useful for the HyperLogLog++ aggregate & potentially other probabilistic 
> datastructures.
> I have been looking at xxHash for a while now, and I'd like to see if we can 
> get the same (stellar) performance in Java/Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13325) Create a high-quality 64-bit hashcode expression

2016-02-15 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-13325:
-

 Summary: Create a high-quality 64-bit hashcode expression
 Key: SPARK-13325
 URL: https://issues.apache.org/jira/browse/SPARK-13325
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Herman van Hovell


Spark currently lacks a high quality 64-bit hashcode. This is especially useful 
for the HyperLogLog++ aggregate & potentially other probabilistic 
datastructures.

I have been looking at xxHash for a while now, and I'd like to see if we can 
get the same (stellar) performance in Java/Scala.
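
As a starting point, a small sketch of calling an existing JVM xxHash64 
implementation; this assumes the lz4-java artifact (net.jpountz) that Spark 
already pulls in, and is only for trying out the idea, not the proposed 
expression itself:

{code}
// Sketch: hashing bytes with lz4-java's xxHash64 (assumes net.jpountz classes on the classpath).
import java.nio.charset.StandardCharsets
import net.jpountz.xxhash.XXHashFactory

val hasher = XXHashFactory.fastestInstance().hash64()

def xxhash64(bytes: Array[Byte], seed: Long = 42L): Long =
  hasher.hash(bytes, 0, bytes.length, seed)

println(xxhash64("spark".getBytes(StandardCharsets.UTF_8)))
{code}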



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12969) Exception while casting a spark supported date formatted "string" to "date" data type.

2016-02-15 Thread Jais Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147710#comment-15147710
 ] 

Jais Sebastian commented on SPARK-12969:


Hi Ankit,

Don't use spark-submit. Try the following:
1. Get the sample code from https://github.com/uhonnavarkar/spark_test
2. Run the Java program DateConversion directly. Don't create a JAR file or run 
it with spark-submit.



> Exception while  casting a spark supported date formatted "string" to "date" 
> data type.
> ---
>
> Key: SPARK-12969
> URL: https://issues.apache.org/jira/browse/SPARK-12969
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.0
> Environment: Spark Java 
>Reporter: Jais Sebastian
>
> Getting exception while  converting a string column( column is having spark 
> supported date format -MM-dd ) to date data type. Below is the code 
> snippet 
> List jsonData = Arrays.asList( 
> "{\"d\":\"2015-02-01\",\"n\":1}");
> JavaRDD dataRDD = 
> this.getSparkContext().parallelize(jsonData);
> DataFrame data = this.getSqlContext().read().json(dataRDD);
> DataFrame newData = data.select(data.col("d").cast("date"));
> newData.show();
> Above code will give the error
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> generated.java, Line 95, Column 28: Expression "scala.Option < Long > 
> longOpt16" is not an lvalue
> This happens only if we execute the program in client mode , it works if we 
> execute through spark submit. Here is the sample project : 
> https://github.com/uhonnavarkar/spark_test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12969) Exception while casting a spark supported date formatted "string" to "date" data type.

2016-02-15 Thread Ankit Jindal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147694#comment-15147694
 ] 

Ankit Jindal commented on SPARK-12969:
--

Hi Jais,

Yes, I have tested your program in client mode using spark-submit and it worked.


Regards,
Ankit

> Exception while  casting a spark supported date formatted "string" to "date" 
> data type.
> ---
>
> Key: SPARK-12969
> URL: https://issues.apache.org/jira/browse/SPARK-12969
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.0
> Environment: Spark Java 
>Reporter: Jais Sebastian
>
> Getting exception while  converting a string column( column is having spark 
> supported date format -MM-dd ) to date data type. Below is the code 
> snippet 
> List jsonData = Arrays.asList( 
> "{\"d\":\"2015-02-01\",\"n\":1}");
> JavaRDD dataRDD = 
> this.getSparkContext().parallelize(jsonData);
> DataFrame data = this.getSqlContext().read().json(dataRDD);
> DataFrame newData = data.select(data.col("d").cast("date"));
> newData.show();
> Above code will give the error
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> generated.java, Line 95, Column 28: Expression "scala.Option < Long > 
> longOpt16" is not an lvalue
> This happens only if we execute the program in client mode , it works if we 
> execute through spark submit. Here is the sample project : 
> https://github.com/uhonnavarkar/spark_test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12969) Exception while casting a spark supported date formatted "string" to "date" data type.

2016-02-15 Thread Jais Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147684#comment-15147684
 ] 

Jais Sebastian commented on SPARK-12969:


Hi Ankit,

Have you tested the program in client mode? E.g.: 
https://github.com/uhonnavarkar/spark_test . This happens only in client mode, 
and it works if you execute using spark-submit.

Regards,
Jais

> Exception while  casting a spark supported date formatted "string" to "date" 
> data type.
> ---
>
> Key: SPARK-12969
> URL: https://issues.apache.org/jira/browse/SPARK-12969
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.0
> Environment: Spark Java 
>Reporter: Jais Sebastian
>
> Getting exception while  converting a string column( column is having spark 
> supported date format -MM-dd ) to date data type. Below is the code 
> snippet 
> List jsonData = Arrays.asList( 
> "{\"d\":\"2015-02-01\",\"n\":1}");
> JavaRDD dataRDD = 
> this.getSparkContext().parallelize(jsonData);
> DataFrame data = this.getSqlContext().read().json(dataRDD);
> DataFrame newData = data.select(data.col("d").cast("date"));
> newData.show();
> Above code will give the error
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> generated.java, Line 95, Column 28: Expression "scala.Option < Long > 
> longOpt16" is not an lvalue
> This happens only if we execute the program in client mode , it works if we 
> execute through spark submit. Here is the sample project : 
> https://github.com/uhonnavarkar/spark_test



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12759) Spark should fail fast if --executor-memory is too small for spark to start

2016-02-15 Thread Daniel Jalova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147682#comment-15147682
 ] 

Daniel Jalova commented on SPARK-12759:
---

I will work on this, thanks.
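
A rough sketch of the kind of fail-fast check this could be (names are 
placeholders, and the 450MB minimum is the figure quoted in the description 
below, not a constant I verified in the code):

{code}
// Sketch only: validate the requested executor memory up front instead of
// letting containers die with an opaque exit code.
object ExecutorMemoryCheck {
  // Minimum taken from the UnifiedMemoryManager discussion in this issue (450 MB).
  val minSystemMemoryBytes: Long = 450L * 1024 * 1024

  def validate(executorMemoryBytes: Long): Unit = {
    require(
      executorMemoryBytes >= minSystemMemoryBytes,
      s"Executor memory ($executorMemoryBytes bytes) is below the required minimum of " +
        s"$minSystemMemoryBytes bytes; please increase --executor-memory.")
  }
}

// A 256 MB request fails fast with a clear message instead of an opaque exit code 1 on YARN.
try ExecutorMemoryCheck.validate(256L * 1024 * 1024)
catch { case e: IllegalArgumentException => println(e.getMessage) }
{code}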

> Spark should fail fast if --executor-memory is too small for spark to start
> ---
>
> Key: SPARK-12759
> URL: https://issues.apache.org/jira/browse/SPARK-12759
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit
>Affects Versions: 1.6.0
>Reporter: Imran Rashid
>Priority: Minor
>
> With the UnifiedMemoryManager, the minimum memory for executor and driver 
> JVMs was increased to 450MB.  There is code in {{UnifiedMemoryManager}} to 
> provide a helpful warning if less than that much memory is provided.
> However, if you set {{--executor-memory}} to something less than that, from 
> the driver process you just see executor failures with no warning, since the 
> more meaningful errors are buried in the executor logs. E.g., on YARN, you see
> {noformat}
> 16/01/11 13:59:32 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: 
> Container marked as failed: container_1452548703600_0001_01_02 on host: 
> imran-adhoc-2.vpc.cloudera.com. Exit status: 1. Diagnostics: Exception from 
> container-launch.
> Container id: container_1452548703600_0001_01_02
> Exit code: 1
> Stack trace: ExitCodeException exitCode=1: 
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:561)
>   at org.apache.hadoop.util.Shell.run(Shell.java:478)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:738)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:210)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Container exited with a non-zero exit code 1
> {noformat}
> Though there is already a message from {{UnifiedMemoryManager}} if there 
> isn't enough memory for the driver, as long as this is being changed it would 
> be nice if the message more clearly indicated the {{--driver-memory}} 
> configuration as well.
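A minimal sketch of the kind of fail-fast check being asked for, assuming a validation step in the submit/driver path before executors are requested. The object and method names below are illustrative, not Spark's actual code; the ~450MB floor simply mirrors the UnifiedMemoryManager minimum described above.

{code}
object ExecutorMemoryCheck {
  // Mirrors the numbers discussed above: ~300MB reserved, 1.5x margin => ~450MB minimum.
  private val reservedSystemMemoryBytes: Long = 300L * 1024 * 1024
  private val minSystemMemoryBytes: Long = (reservedSystemMemoryBytes * 1.5).toLong

  /** Fail fast with an actionable message instead of letting executors die with opaque exit codes. */
  def validate(executorMemoryBytes: Long, driverMemoryBytes: Long): Unit = {
    def check(name: String, flag: String, bytes: Long): Unit = {
      if (bytes < minSystemMemoryBytes) {
        throw new IllegalArgumentException(
          s"$name is $bytes bytes, below the required minimum of $minSystemMemoryBytes bytes. " +
          s"Please increase it via $flag.")
      }
    }
    check("Executor memory", "--executor-memory", executorMemoryBytes)
    check("Driver memory", "--driver-memory", driverMemoryBytes)
  }
}
{code}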



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13320) Confusing error message for Dataset API when using sum("*")

2016-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147663#comment-15147663
 ] 

Apache Spark commented on SPARK-13320:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/11208

> Confusing error message for Dataset API when using sum("*")
> ---
>
> Key: SPARK-13320
> URL: https://issues.apache.org/jira/browse/SPARK-13320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> {code}
> pagecounts4PartitionsDS
>   .map(line => (line._1, line._3))
>   .toDF()
>   .groupBy($"_1")
>   .agg(sum("*") as "sumOccurances")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input 
> columns _1, _2;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:57)
>   at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:213)
> {code}
> The error is with sum("*"), not the resolution of group by "_1".
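For comparison, a sketch of a form that does resolve, assuming the implicits and the {{pagecounts4PartitionsDS}} dataset from the snippet above are in scope: sum an explicit column instead of "*", which is the expression the (confusing) error is actually about.

{code}
import org.apache.spark.sql.functions.sum

// Aggregate an explicit column rather than "*"; toDF() keeps the default _1/_2 names.
val summed = pagecounts4PartitionsDS
  .map(line => (line._1, line._3))
  .toDF()
  .groupBy($"_1")
  .agg(sum($"_2") as "sumOccurances")
{code}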



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (SPARK-13320) Confusing error message for Dataset API when using sum("*")

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13320:


Assignee: Apache Spark

> Confusing error message for Dataset API when using sum("*")
> ---
>
> Key: SPARK-13320
> URL: https://issues.apache.org/jira/browse/SPARK-13320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> {code}
> pagecounts4PartitionsDS
>   .map(line => (line._1, line._3))
>   .toDF()
>   .groupBy($"_1")
>   .agg(sum("*") as "sumOccurances")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input 
> columns _1, _2;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:57)
>   at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:213)
> {code}
> The error is with sum("*"), not the resolution of group by "_1".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (SPARK-13320) Confusing error message for Dataset API when using sum("*")

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13320:


Assignee: (was: Apache Spark)

> Confusing error message for Dataset API when using sum("*")
> ---
>
> Key: SPARK-13320
> URL: https://issues.apache.org/jira/browse/SPARK-13320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>
> {code}
> pagecounts4PartitionsDS
>   .map(line => (line._1, line._3))
>   .toDF()
>   .groupBy($"_1")
>   .agg(sum("*") as "sumOccurances")
> {code}
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input 
> columns _1, _2;
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:319)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:318)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:107)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:117)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:121)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:121)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:125)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:125)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:105)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
>   at org.apache.spark.sql.GroupedData.toDF(GroupedData.scala:57)
>   at org.apache.spark.sql.GroupedData.agg(GroupedData.scala:213)
> {code}
> The error is with sum("*"), not the resolution of group by "_1".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-12583) spark shuffle fails with mesos after 2mins

2016-02-15 Thread Bertrand Bossy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147642#comment-15147642
 ] 

Bertrand Bossy commented on SPARK-12583:


[~marmbrus] If this could make it into 1.6.1, that would be awesome.

> spark shuffle fails with mesos after 2mins
> --
>
> Key: SPARK-12583
> URL: https://issues.apache.org/jira/browse/SPARK-12583
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.0
>Reporter: Adrian Bridgett
>
> See user mailing list "Executor deregistered after 2mins" for more details.
> As of 1.6, the driver registers with each shuffle manager via  
> MesosExternalShuffleClient.  Once this disconnects, the shuffle manager 
> automatically cleans up the data associated with that driver.
> However, the connection is terminated before this happens as it's idle. 
> Looking at a packet trace, after 120secs the shuffle manager is sending a FIN 
> packet to the driver.   The only way to delay this is to increase 
> spark.shuffle.io.connectionTimeout=3600s on the shuffle manager.
> I patched the MesosExternalShuffleClient (and ExternalShuffleClient) with 
> newbie Scala skills to call the TransportContext constructor with 
> closeIdleConnections "false" and this didn't help (hadn't done the network 
> trace first).
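A sketch of the workaround named above. The key is a real Spark setting, but per the report it has to be raised where the external shuffle service runs; setting it only in the application's SparkConf, as shown here for illustration, is not enough on its own.

{code}
import org.apache.spark.SparkConf

// Illustrative only: the same key must also be raised in the external shuffle
// service's configuration, which is the side that closes the idle driver connection.
val conf = new SparkConf()
  .setAppName("shuffle-timeout-workaround")
  .set("spark.shuffle.io.connectionTimeout", "3600s")
{code}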



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12583) spark shuffle fails with mesos after 2mins

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12583:


Assignee: (was: Apache Spark)

> spark shuffle fails with mesos after 2mins
> --
>
> Key: SPARK-12583
> URL: https://issues.apache.org/jira/browse/SPARK-12583
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.0
>Reporter: Adrian Bridgett
>
> See user mailing list "Executor deregistered after 2mins" for more details.
> As of 1.6, the driver registers with each shuffle manager via  
> MesosExternalShuffleClient.  Once this disconnects, the shuffle manager 
> automatically cleans up the data associated with that driver.
> However, the connection is terminated before this happens as it's idle. 
> Looking at a packet trace, after 120secs the shuffle manager is sending a FIN 
> packet to the driver.   The only way to delay this is to increase 
> spark.shuffle.io.connectionTimeout=3600s on the shuffle manager.
> I patched the MesosExternalShuffleClient (and ExternalShuffleClient) with 
> newbie Scala skills to call the TransportContext constructor with 
> closeIdleConnections "false" and this didn't help (hadn't done the network 
> trace first).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12583) spark shuffle fails with mesos after 2mins

2016-02-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12583:


Assignee: Apache Spark

> spark shuffle fails with mesos after 2mins
> --
>
> Key: SPARK-12583
> URL: https://issues.apache.org/jira/browse/SPARK-12583
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.0
>Reporter: Adrian Bridgett
>Assignee: Apache Spark
>
> See user mailing list "Executor deregistered after 2mins" for more details.
> As of 1.6, the driver registers with each shuffle manager via  
> MesosExternalShuffleClient.  Once this disconnects, the shuffle manager 
> automatically cleans up the data associated with that driver.
> However, the connection is terminated before this happens as it's idle. 
> Looking at a packet trace, after 120secs the shuffle manager is sending a FIN 
> packet to the driver.   The only way to delay this is to increase 
> spark.shuffle.io.connectionTimeout=3600s on the shuffle manager.
> I patched the MesosExternalShuffleClient (and ExternalShuffleClient) with 
> newbie Scala skills to call the TransportContext constructor with 
> closeIdleConnections "false" and this didn't help (hadn't done the network 
> trace first).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12583) spark shuffle fails with mesos after 2mins

2016-02-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147614#comment-15147614
 ] 

Apache Spark commented on SPARK-12583:
--

User 'bbossy' has created a pull request for this issue:
https://github.com/apache/spark/pull/11207

> spark shuffle fails with mesos after 2mins
> --
>
> Key: SPARK-12583
> URL: https://issues.apache.org/jira/browse/SPARK-12583
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.0
>Reporter: Adrian Bridgett
>
> See user mailing list "Executor deregistered after 2mins" for more details.
> As of 1.6, the driver registers with each shuffle manager via  
> MesosExternalShuffleClient.  Once this disconnects, the shuffle manager 
> automatically cleans up the data associated with that driver.
> However, the connection is terminated before this happens as it's idle. 
> Looking at a packet trace, after 120secs the shuffle manager is sending a FIN 
> packet to the driver.   The only way to delay this is to increase 
> spark.shuffle.io.connectionTimeout=3600s on the shuffle manager.
> I patched the MesosExternalShuffleClient (and ExternalShuffleClient) with 
> newbie Scala skills to call the TransportContext constructor with 
> closeIdleConnections "false" and this didn't help (hadn't done the network 
> trace first).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13323) Type cast support in type inference during merging types.

2016-02-15 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147591#comment-15147591
 ] 

Davies Liu commented on SPARK-13323:


HiveTypeCoercion is pretty complicated; we may not want to duplicate that in 
Python.

What's the problem right now? Or is it just because of the TODO?

> Type cast support in type inference during merging types.
> -
>
> Key: SPARK-13323
> URL: https://issues.apache.org/jira/browse/SPARK-13323
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Hyukjin Kwon
>
> As described in {{types.py}}, there is a todo {{TODO: type cast (such as int 
> -> long)}}.
> Currently, PySpark infers types but does not try to find compatible types 
> when the given types differ while merging schemas.
> I think this can be done with something resembling 
> {{HiveTypeCoercion.findTightestCommonTypeOfTwo}} for numbers; when either 
> type is compared with {{StringType}}, just convert both to string.
> It looks like the possible leaf data types are the ones below:
> {code}
> # Mapping Python types to Spark SQL DataType
> _type_mappings = {
> type(None): NullType,
> bool: BooleanType,
> int: LongType,
> float: DoubleType,
> str: StringType,
> bytearray: BinaryType,
> decimal.Decimal: DecimalType,
> datetime.date: DateType,
> datetime.datetime: TimestampType,
> datetime.time: TimestampType,
> }
> {code}
> and they are converted pretty well to string as below:
> {code}
> >>> print str(None)
> None
> >>> print str(True)
> True
> >>> print str(float(0.1))
> 0.1
> >>> str(bytearray([255]))
> '\xff'
> >>> str(decimal.Decimal())
> '0'
> >>> str(datetime.date(1,1,1))
> '0001-01-01'
> >>> str(datetime.datetime(1,1,1))
> '0001-01-01 00:00:00'
> >>> str(datetime.time(1,1,1))
> '01:01:01'
> {code}
> First, I tried to find an existing issue for this but couldn't. Please 
> mark this as a duplicate if there already is one.
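A rough Scala sketch of the kind of tightest-common-type rule being proposed, loosely modelled on what {{HiveTypeCoercion.findTightestCommonTypeOfTwo}} does. The widening order and the string fallback below are simplifying assumptions for illustration, not the Catalyst implementation (which also handles decimals and more).

{code}
import org.apache.spark.sql.types._

object SimpleTypeMerge {
  // Simplified numeric widening order; the real coercion rules also cover DecimalType precision.
  private val numericPrecedence: Seq[DataType] =
    Seq(ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType)

  /** Tightest common type of two leaf types, falling back to StringType. */
  def findTightestCommonType(t1: DataType, t2: DataType): DataType = (t1, t2) match {
    case (a, b) if a == b => a
    case (NullType, other) => other
    case (other, NullType) => other
    case (a, b) if numericPrecedence.contains(a) && numericPrecedence.contains(b) =>
      // Pick whichever type appears later in the widening order.
      if (numericPrecedence.indexOf(a) >= numericPrecedence.indexOf(b)) a else b
    case _ => StringType // simplification: anything else degrades to string, as the report suggests
  }
}
{code}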



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2016-02-15 Thread Paulo Villegas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147434#comment-15147434
 ] 

Paulo Villegas edited comment on SPARK-4563 at 2/15/16 2:59 PM:


Hi. I would have a use case for this functionality: when the driver is running 
behind a NAT barrier. In that case the driver can send messages to the cluster 
master & executors, but they cannot reach the driver, hence it fails (it 
launches executors, but since they cannot register back with the driver, the 
driver decides they are not working and terminates them).

Given that we can do NAT, we could fix the ports the driver is listening on 
(spark.driver.port, spark.fileserver.port, etc) and forward them through NAT, 
and the outside world could reach the driver easily. Except that it actually 
can't, because the driver advertises itself with the IP address of the internal 
network it binds to, not with the outside reachable IP. And both aspects (bind 
address & broadcast address) cannot be disentangled, currently.

Two real setups I've faced in which this would have been useful are:
 * machine in which the driver is running is in a home/private network, and the 
Spark processing cluster is reachable through the router, but not back due to 
the private addresses
 * driver is running in a virtual machine having a private address and trying 
to connect to a cluster in the host network. The VM cannot be put in bridge 
mode (which would solve the private address problem) due to corporate 
restrictions that preclude any non-authorized computer from obtaining a valid IP 
address in the host network.

I don't know if this is at all possible. I tried to look at the code, but it 
seemed to be hardcoded in the akka stack. SPARK-11638 SPARK-4389 seem to be 
related (or the same). The latter suggests that it is indeed not possible. 
Except that if Spark's RPC stack is no longer akka-based, then perhaps it is.


was (Author: paulovn):
Hi. I would have a use case for this functionality: when the driver is running 
behind a NAT barrier. In that case the driver can send messages to the cluster 
master & executors, but they cannot reach the driver, hence it fails (it 
launches executors, but since they cannot register back with the driver, the 
driver decides they are not working and terminates them).

Given that we can do NAT, we could fix the ports the driver is listening on 
(spark.driver.port, spark.fileserver.port, etc) and forward them through NAT, 
and the outside world could reach the driver easily. Except that it actually 
can't, because the driver advertises itself with the IP address of the internal 
network it binds to, not with the outside reachable IP. And both aspects (bind 
address & broadcast address) cannot be disentangled, currently.

Two real setups I've faced in which this would have been useful are:
 * machine in which the driver is running is in a home/private network, and the 
Spark processing cluster is reachable through the router, but not back due to 
the private addresses
 * driver is running in a virtual machine having a private address and trying 
to connect to a cluster in the host network. The VM cannot be put in bridge 
mode (which would solve the private address problem) due to corporate 
restrictions that preclude any non-authorized computer from obtaining a valid IP 
address in the host network.

I don't know if this is at all possible. I tried to look at the code, but it 
seems to be hardcoded in the akka stack. SPARK-11638 SPARK-4389 seem to be 
related (or the same). The latter suggests that it is indeed not possible.

> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Priority: Minor
>
> Spark driver bind ip and advertise is not configurable. spark.driver.host is 
> only bind ip. SPARK_PUBLIC_DNS does not work for spark driver. Allow option 
> to set advertised ip/hostname
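A sketch of the port-pinning setup described in the comments above, with example values. The keys are real Spark 1.x configuration keys, but as the comment explains, {{spark.driver.host}} currently controls only the bind address, so there is no separate way to advertise the NAT-external address; that gap is what this issue asks to close.

{code}
import org.apache.spark.SparkConf

// Pin the driver-side ports so they can be forwarded through NAT (example port numbers).
// What is missing today is a distinct "advertise" address: the driver announces the
// address it binds to, which behind NAT is not reachable from the executors.
val conf = new SparkConf()
  .setAppName("driver-behind-nat")
  .set("spark.driver.port", "7001")
  .set("spark.fileserver.port", "7002")
  .set("spark.broadcast.port", "7003")
  .set("spark.blockManager.port", "7004")
{code}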



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4563) Allow spark driver to bind to different ip then advertise ip

2016-02-15 Thread Paulo Villegas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147434#comment-15147434
 ] 

Paulo Villegas commented on SPARK-4563:
---

Hi. I would have a use case for this functionality: when the driver is running 
behind a NAT barrier. In that case the driver can send messages to the cluster 
master & executors, but they cannot reach the driver, hence it fails (it 
launches executors, but since they cannot register back with the driver, the 
driver decides they are not working and terminates them).

Given that we can do NAT, we could fix the ports the driver is listening on 
(spark.driver.port, spark.fileserver.port, etc) and forward them through NAT, 
and the outside world could reach the driver easily. Except that it actually 
can't, because the driver advertises itself with the IP address of the internal 
network it binds to, not with the outside reachable IP. And both aspects (bind 
address & broadcast address) cannot be disentangled, currently.

Two real setups I've faced in which this would have been useful are:
 * machine in which the driver is running is in a home/private network, and the 
Spark processing cluster is reachable through the router, but not back due to 
the private addresses
 * driver is running in a virtual machine having a private address and trying 
to connect to a cluster in the host network. The VM cannot be put in bridge 
mode (which would solve the private address problem) due to corporate 
restrictions that preclude any non-authorized computer from obtaining a valid IP 
address in the host network.

I don't know if this is at all possible. I tried to look at the code, but it 
seems to be hardcoded in the akka stack. SPARK-11638 SPARK-4389 seem to be 
related (or the same). The latter suggests that it is indeed not possible.

> Allow spark driver to bind to different ip then advertise ip
> 
>
> Key: SPARK-4563
> URL: https://issues.apache.org/jira/browse/SPARK-4563
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Long Nguyen
>Priority: Minor
>
> Spark driver bind ip and advertise is not configurable. spark.driver.host is 
> only bind ip. SPARK_PUBLIC_DNS does not work for spark driver. Allow option 
> to set advertised ip/hostname



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13297) [SQL] Backticks cannot be escaped in column names

2016-02-15 Thread Grzegorz Chilkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15147371#comment-15147371
 ] 

Grzegorz Chilkiewicz commented on SPARK-13297:
--

I've verified it on: 
http://people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/spark-2.0.0-SNAPSHOT-bin-hadoop2.6.tgz
You are right, it looks like the problem is fixed there!

But still, it is not fixed in branch 1.6.
I've found that this commit:
https://github.com/apache/spark/commit/7cd7f2202547224593517b392f56e49e4c94cabc
fixed the issue in the master branch.

Shouldn't we cherry-pick that commit? (it's big - it could be hard...)

> [SQL] Backticks cannot be escaped in column names
> -
>
> Key: SPARK-13297
> URL: https://issues.apache.org/jira/browse/SPARK-13297
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Minor
>
> We want to use backticks to escape spaces & minus signs in column names.
> Are we unable to escape backticks when a column name is surrounded by 
> backticks?
> It is not documented in: 
> http://spark.apache.org/docs/latest/sql-programming-guide.html
> In MySQL there is a way: double the backticks, but this trick doesn't work in 
> Spark-SQL.
> Am I correct or just missing something? Is there a way to escape backticks 
> inside a column name when it is surrounded by backticks?
> Code to reproduce the problem:
> https://github.com/grzegorz-chilkiewicz/SparkSqlEscapeBacktick
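A reproduction sketch of the attempt being described, assuming a Spark 1.6 shell with {{sqlContext}} available. The doubled backtick is the MySQL-style escape the reporter tried; whether the select resolves is exactly what differs between branch 1.6 and master.

{code}
// A column whose name contains a literal backtick, then a select that tries to
// escape it MySQL-style by doubling the backtick inside the quoted identifier.
val df = sqlContext.createDataFrame(Seq((1, "a"))).toDF("id", "weird`name")
df.select("`weird``name`").show()
{code}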



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


