[jira] [Commented] (SPARK-12141) Use Jackson to serialize all events when writing event log

2015-12-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039776#comment-15039776
 ] 

Jean-Baptiste Onofré commented on SPARK-12141:
--

[~vanzin] Do you mind if I take a look at this one?

> Use Jackson to serialize all events when writing event log
> --
>
> Key: SPARK-12141
> URL: https://issues.apache.org/jira/browse/SPARK-12141
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>
> SPARK-11206 added infrastructure to serialize events using Jackson, so that 
> manual serialization code is not needed anymore.
> We should write all events using that support, and remove all the manual 
> serialization code in {{JsonProtocol}}.
> Since the event log format is a semi-public API, I'm targeting this at 2.0. 
> Also, we can't remove the manual deserialization code, since we need to be 
> able to read old event logs.
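For readers unfamiliar with the approach, here is a minimal, illustrative Scala sketch of serializing an event-like case class with Jackson via jackson-module-scala. The event type and object names are hypothetical; this is not the actual JsonProtocol replacement, just the style of serialization the issue refers to.

{code}
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical event type, used only to illustrate the Jackson-based approach.
case class SparkListenerAppStart(appName: String, time: Long)

object JacksonEventLogSketch {
  // A single mapper with the Scala module registered so case classes serialize cleanly.
  private val mapper = new ObjectMapper().registerModule(DefaultScalaModule)

  def toJson(event: AnyRef): String = mapper.writeValueAsString(event)

  def main(args: Array[String]): Unit = {
    // e.g. {"appName":"demo","time":1449100000000}
    println(toJson(SparkListenerAppStart("demo", 1449100000000L)))
  }
}
{code}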






[jira] [Commented] (SPARK-12140) Support Streaming UI in HistoryServer

2015-12-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039778#comment-15039778
 ] 

Jean-Baptiste Onofré commented on SPARK-12140:
--

[~vanzin] Do you mind if I try something there? ;)

> Support Streaming UI in HistoryServer
> -
>
> Key: SPARK-12140
> URL: https://issues.apache.org/jira/browse/SPARK-12140
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>
> SPARK-11206 added infrastructure that would allow the streaming UI to be 
> shown in the History Server. We should add the necessary code to make that 
> happen, although it requires some changes to how events and listeners are 
> used.






[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder

2015-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15040771#comment-15040771
 ] 

Apache Spark commented on SPARK-12089:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/10142

> java.lang.NegativeArraySizeException when growing BufferHolder
> --
>
> Key: SPARK-12089
> URL: https://issues.apache.org/jira/browse/SPARK-12089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Erik Selin
>Priority: Blocker
>
> When running a large spark sql query including multiple joins I see tasks 
> failing with the following trace:
> {code}
> java.lang.NegativeArraySizeException
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76)
> at 
> org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> From the Spark code it looks like this is due to an integer overflow when 
> growing a buffer length. The offending line, {{BufferHolder.java:36}}, is the 
> following in the version I'm running:
> {code}
> final byte[] tmp = new byte[length * 2];
> {code}
> This seems to indicate that this buffer will never be able to hold more 
> than 2 GB worth of data, and it will likely hold even less, since any length > 
> 1073741824 will cause an integer overflow and turn the new buffer size 
> negative.
> I hope I'm simply missing some critical config setting, but it still seems 
> weird that we have a (rather low) upper limit on these buffers.
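To make the overflow concrete, below is a small, self-contained Scala sketch (not the actual BufferHolder code) showing how doubling an Int length wraps negative once the length exceeds 2^30, plus one common overflow-aware growth calculation.

{code}
object BufferGrowthSketch {
  def main(args: Array[String]): Unit = {
    val length = 1200000000                         // ~1.2 billion, already greater than 2^30
    println(length * 2)                             // -1894967296: Int multiplication wrapped around
    // Overflow-aware alternative: compute in Long and cap at the maximum array size.
    val grown = math.min(length.toLong * 2, Int.MaxValue.toLong - 8).toInt
    println(grown)                                  // 2147483639
  }
}
{code}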






[jira] [Commented] (SPARK-12144) Implement DataFrameReader and DataFrameWriter API in SparkR

2015-12-03 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15040773#comment-15040773
 ] 

Shivaram Venkataraman commented on SPARK-12144:
---

Personally I don't think this API is a good fit for SparkR -- it introduces a 
lot of chaining-based methods, which are awkward to write in R without using 
something like magrittr. I think the existing `read.df` with `option = value` 
arguments is more user-friendly. Is there any functionality we would gain from 
this?
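For reference, this is the chained (builder-style) Scala API under discussion; the path and option below are placeholders, and the commented R line shows how SparkR expresses the same read today.

{code}
// Scala: options are chained on sqlContext.read before load()
val df = sqlContext.read
  .format("json")
  .option("samplingRatio", "0.8")
  .load("/tmp/people.json")

// SparkR today: a single call with named arguments instead of chaining, e.g.
//   df <- read.df(sqlContext, "/tmp/people.json", source = "json", samplingRatio = "0.8")
{code}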

> Implement DataFrameReader and DataFrameWriter API in SparkR
> ---
>
> Key: SPARK-12144
> URL: https://issues.apache.org/jira/browse/SPARK-12144
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> DataFrameReader API: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader
> DataFrameWriter API: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter






[jira] [Commented] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()

2015-12-03 Thread sachin aggarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039767#comment-15039767
 ] 

sachin aggarwal commented on SPARK-12117:
-

Hi,

Can you suggest a workaround to make this use case work in 1.5.1?

Thanks

> Column Aliases are Ignored in callUDF while using struct()
> --
>
> Key: SPARK-12117
> URL: https://issues.apache.org/jira/browse/SPARK-12117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: sachin aggarwal
>
> case where this works:
>   val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
> ("Rishabh", "2"))).toDF("myText", "id")
>   
> TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
> steps to reproduce error case:
> 1) create a file (a.json) containing the following text:
> { "myText": "Sachin Aggarwal","id": "1"}
> { "myText": "Rishabh","id": "2"}
> 2) define a simple UDF:
> def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}
> 3) register the UDF:
>  sqlContext.udf.register("mydef" ,mydef _)
> 4) read the input file:
> val TestDoc2=sqlContext.read.json("/tmp/a.json")
> 5) call the UDF:
> TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
> ERROR received:
> java.lang.IllegalArgumentException: Field "Text" does not exist.
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>  at scala.collection.AbstractMap.getOrElse(Map.scala:58)
>  at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
>  at 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
>  at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
>  at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
>  at 
> $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
>  at 
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at 
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
>  at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>  at org.apache.spark.scheduler.Task.run(Task.scala:88)
>  at 

[jira] [Commented] (SPARK-5264) Support `drop temporary table [if exists]` DDL command

2015-12-03 Thread niranda perera (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039713#comment-15039713
 ] 

niranda perera commented on SPARK-5264:
---

any update on this?

> Support `drop temporary table [if exists]` DDL command 
> ---
>
> Key: SPARK-5264
> URL: https://issues.apache.org/jira/browse/SPARK-5264
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Li Sheng
>Priority: Minor
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> Support the `drop table` DDL command, 
> i.e. DROP [TEMPORARY] TABLE [IF EXISTS] tbl_name






[jira] [Created] (SPARK-12144) Implement DataFrameReader and DataFrameWriter API in SparkR

2015-12-03 Thread Sun Rui (JIRA)
Sun Rui created SPARK-12144:
---

 Summary: Implement DataFrameReader and DataFrameWriter API in 
SparkR
 Key: SPARK-12144
 URL: https://issues.apache.org/jira/browse/SPARK-12144
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Affects Versions: 1.5.2
Reporter: Sun Rui


DataFrameReader API: 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader

DataFrameWriter API: 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter






[jira] [Updated] (SPARK-12139) REGEX Column Specification for Hive Queries

2015-12-03 Thread Derek Sabry (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Derek Sabry updated SPARK-12139:

Description: 
When executing a query of the form

Select `(a)?+.+` from A,

Hive would interprets this query as a regular expression, which can be 
supported in the hive parser for spark

  was:
When executing a query of the form

Select `(a)?+.+` from A,

Hive interprets this query as a regular expression, which can be supported in 
the hive parser for spark


> REGEX Column Specification for Hive Queries
> ---
>
> Key: SPARK-12139
> URL: https://issues.apache.org/jira/browse/SPARK-12139
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Derek Sabry
>Priority: Minor
>
> When executing a query of the form
> Select `(a)?+.+` from A,
> Hive would interprets this query as a regular expression, which can be 
> supported in the hive parser for spark






[jira] [Commented] (SPARK-12144) Implement DataFrameReader and DataFrameWriter API in SparkR

2015-12-03 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041177#comment-15041177
 ] 

Sun Rui commented on SPARK-12144:
-

@shivaram, your point is reasonable. It seems that read.df and write.df do 
not cover all of the functionality exposed by DataFrameReader and DataFrameWriter. 
Maybe we don't need to provide these two APIs, but instead provide more wrapper 
functions in SparkR.

> Implement DataFrameReader and DataFrameWriter API in SparkR
> ---
>
> Key: SPARK-12144
> URL: https://issues.apache.org/jira/browse/SPARK-12144
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> DataFrameReader API: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader
> DataFrameWriter API: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter






[jira] [Resolved] (SPARK-12104) collect() does not handle multiple columns with same name

2015-12-03 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12104.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10118
[https://github.com/apache/spark/pull/10118]

> collect() does not handle multiple columns with same name
> -
>
> Key: SPARK-12104
> URL: https://issues.apache.org/jira/browse/SPARK-12104
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Hossein Falaki
>Priority: Critical
> Fix For: 2.0.0, 1.6.1
>
>
> This is a regression from Spark 1.5
> Spark can produce DataFrames with identical names (e.g., after left outer 
> joins). In 1.5 when such a DataFrame was collected we ended up with an R 
> data.frame with modified column names:
> {code}
> > names(mySparkDF)
> [1] "date"   "name"   "name" 
> > names(collect(mySparkDF))
> [1] "date"   "name"   "name.1" 
> {code}
> But in 1.6 only the first column is included in the collected R data.frame. I 
> think SparkR should continue the old behavior.






[jira] [Updated] (SPARK-12139) REGEX Column Specification for Hive Queries

2015-12-03 Thread Derek Sabry (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Derek Sabry updated SPARK-12139:

Description: 
When executing a query of the form

Select `(a)?\+.\+` from A,

Hive would interprets this query as a regular expression, which can be 
supported in the hive parser for spark

  was:
When executing a query of the form

Select `(a)?+.+` from A,

Hive would interprets this query as a regular expression, which can be 
supported in the hive parser for spark


> REGEX Column Specification for Hive Queries
> ---
>
> Key: SPARK-12139
> URL: https://issues.apache.org/jira/browse/SPARK-12139
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Derek Sabry
>Priority: Minor
>
> When executing a query of the form
> Select `(a)?\+.\+` from A,
> Hive would interprets this query as a regular expression, which can be 
> supported in the hive parser for spark






[jira] [Updated] (SPARK-12139) REGEX Column Specification for Hive Queries

2015-12-03 Thread Derek Sabry (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Derek Sabry updated SPARK-12139:

Description: 
When executing a query of the form

Select `(a)?\+.\+` from A,

Hive would interpret this query as a regular expression, which can be supported 
in the hive parser for spark

  was:
When executing a query of the form

Select `(a)?\+.\+` from A,

Hive would interprets this query as a regular expression, which can be 
supported in the hive parser for spark


> REGEX Column Specification for Hive Queries
> ---
>
> Key: SPARK-12139
> URL: https://issues.apache.org/jira/browse/SPARK-12139
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Derek Sabry
>Priority: Minor
>
> When executing a query of the form
> Select `(a)?\+.\+` from A,
> Hive would interpret this query as a regular expression, which can be 
> supported in the hive parser for spark
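For context, Hive's regex column specification uses a backquoted pattern in the select list to match column names. A quick sketch of the requested behaviour follows; it is illustrative only, since Spark's Hive parser does not accept this syntax yet.

{code}
// In Hive, `(a)?+.+` in the select list matches every column name except "a",
// so this query would return all columns of A other than a. Spark's Hive parser
// would need to recognise the backquoted pattern for the same query to work there.
val query = "SELECT `(a)?+.+` FROM A"
{code}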






[jira] [Commented] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()

2015-12-03 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041232#comment-15041232
 ] 

Liang-Chi Hsieh commented on SPARK-12117:
-

Hmm, because the problem is that the field names are erased, I think you can 
just use another Row API instead of getAs(fieldName). getAs() has an 
alternative that accepts an integer field index, so you don't need to extract 
the field by name.
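A minimal sketch of that workaround against the reproduction above, assuming Spark 1.5.x: address the struct field by position with the typed getAs[T](index) overload instead of by the erased alias.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{callUDF, col, struct}

// The struct's field aliases are lost, so read field 0 (the aliased "Text" column) by position.
def mydef(r: Row): String = r.getAs[String](0)

sqlContext.udf.register("mydef", mydef _)
TestDoc2.select(callUDF("mydef", struct(col("myText"), col("id"))).as("col1")).show()
{code}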

> Column Aliases are Ignored in callUDF while using struct()
> --
>
> Key: SPARK-12117
> URL: https://issues.apache.org/jira/browse/SPARK-12117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: sachin aggarwal
>
> case where this works:
>   val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
> ("Rishabh", "2"))).toDF("myText", "id")
>   
> TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
> steps to reproduce error case:
> 1) create a file (a.json) containing the following text:
> { "myText": "Sachin Aggarwal","id": "1"}
> { "myText": "Rishabh","id": "2"}
> 2) define a simple UDF:
> def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}
> 3) register the UDF:
>  sqlContext.udf.register("mydef" ,mydef _)
> 4) read the input file:
> val TestDoc2=sqlContext.read.json("/tmp/a.json")
> 5) call the UDF:
> TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
> ERROR received:
> java.lang.IllegalArgumentException: Field "Text" does not exist.
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>  at scala.collection.AbstractMap.getOrElse(Map.scala:58)
>  at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
>  at 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
>  at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
>  at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
>  at 
> $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
>  at 
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at 
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
>  at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at 
> 

[jira] [Assigned] (SPARK-12122) Recovered streaming context can sometimes run a batch twice

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12122:


Assignee: Apache Spark  (was: Tathagata Das)

> Recovered streaming context can sometimes run a batch twice
> ---
>
> Key: SPARK-12122
> URL: https://issues.apache.org/jira/browse/SPARK-12122
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Blocker
>
> After recovering from checkpoint, the JobGenerator figures out which batches 
> to run again. That can sometimes lead to a batch being submitted twice. 






[jira] [Commented] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files

2015-12-03 Thread Ewan Higgs (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037593#comment-15037593
 ] 

Ewan Higgs commented on SPARK-5836:
---

[~tdas] 
{quote}
The only case there may be issues is when the external shuffle service is used.
{quote}

I see this problematic behaviour in ipython/pyspark notebooks. We can try to go 
through and unpersist and checkpoint and so on with the RDDs but the shuffle 
files don't seem to go away. We see this even though we are not using the 
external shuffle service.

> Highlight in Spark documentation that by default Spark does not delete its 
> temporary files
> --
>
> Key: SPARK-5836
> URL: https://issues.apache.org/jira/browse/SPARK-5836
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Tomasz Dudziak
>Assignee: Ilya Ganelin
>Priority: Minor
> Fix For: 1.3.1, 1.4.0
>
>
> We recently learnt the hard way (in a prod system) that Spark by default does 
> not delete its temporary files until it is stopped. Within a relatively short 
> time span of heavy Spark use the disk of our prod machine filled up 
> completely because of multiple shuffle files written to it. We think there 
> should be better documentation around the fact that after a job is finished 
> it leaves a lot of rubbish behind so that this does not come as a surprise.
> Probably a good place to highlight that fact would be the documentation of 
> {{spark.local.dir}} property, which controls where Spark temporary files are 
> written. 
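A minimal sketch of the point being documented, with an illustrative scratch path: spark.local.dir controls where these temporary/shuffle files land, and they are only reliably cleaned up when the context is stopped.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Put temporary/shuffle files on a volume with enough headroom (path is illustrative),
// and stop the context when done so the local directories get cleaned up.
val conf = new SparkConf()
  .setAppName("temp-file-demo")
  .set("spark.local.dir", "/mnt/scratch/spark-tmp")
val sc = new SparkContext(conf)
try {
  // ... run the job ...
} finally {
  sc.stop()  // temporary files under spark.local.dir are removed on shutdown
}
{code}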






[jira] [Created] (SPARK-12125) pull out nondeterministic expressions from Join

2015-12-03 Thread iward (JIRA)
iward created SPARK-12125:
-

 Summary: pull out nondeterministic expressions from Join
 Key: SPARK-12125
 URL: https://issues.apache.org/jira/browse/SPARK-12125
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.2, 1.5.1, 1.5.0
Reporter: iward


Currently, *nondeterministic expressions* are only allowed in *Project* or 
*Filter*, and they can only be pulled out when they are used in a *UnaryNode*.

But in many cases we use nondeterministic expressions to process *join keys* 
in order to avoid data skew. For example:
{noformat}
select * 
from tableA a 
join 
(select * from tableB) b 
on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( 
(-rand() * 1000 ) as string ) else a.brand_code end ))  = b.brand_code

{noformat}

This PR introduces a mechanism to pull out nondeterministic expressions from 
*Join*, so we can use nondeterministic expressions in *Join* appropriately.
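For illustration, the same skew-mitigation pattern expressed on DataFrames (tableA and tableB are placeholder DataFrames): the salted key below is exactly the kind of nondeterministic join-key expression this issue wants to allow.

{code}
import org.apache.spark.sql.functions._

// Replace null/empty brand codes with a random negative salt so they don't all
// land in the same join partition, then join on the salted key.
val salted = tableA.withColumn("join_key",
  upper(when(col("brand_code").isNull || col("brand_code") === "",
             (-rand() * 1000).cast("string"))
        .otherwise(col("brand_code"))))

val joined = salted.join(tableB, salted("join_key") === tableB("brand_code"))
{code}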






[jira] [Commented] (SPARK-7379) pickle.loads expects a string instead of bytes in Python 3.

2015-12-03 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037678#comment-15037678
 ] 

Xusen Yin commented on SPARK-7379:
--

OK, I fixed my code in that PR. In my case getSplits returns an Array[Double]; 
then in mllib/common.py the _java2py() method fails to match it as a JavaList.

{code}
if isinstance(r, (bytearray, bytes)):
    r = PickleSerializer().loads(bytes(r), encoding=encoding)
{code}

The above code catches the Array[Double] but fails to deserialize it with Python 3.

> pickle.loads expects a string instead of bytes in Python 3.
> ---
>
> Key: SPARK-7379
> URL: https://issues.apache.org/jira/browse/SPARK-7379
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Davies Liu
>
> In PickleSerializer, we call pickle.loads in Python 3. However, the input obj 
> could be bytes, which works in Python 2 but not 3.
> The error message is
> {code}
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/serializers.py",
>  line 418, in loads
> return pickle.loads(obj, encoding=encoding)
> TypeError: must be a unicode character, not bytes
> {code}






[jira] [Created] (SPARK-12127) Spark sql Support Cross join with Mongo DB

2015-12-03 Thread himanshu singhal (JIRA)
himanshu singhal created SPARK-12127:


 Summary: Spark sql Support Cross join with Mongo DB
 Key: SPARK-12127
 URL: https://issues.apache.org/jira/browse/SPARK-12127
 Project: Spark
  Issue Type: Question
  Components: Spark Core, SQL
Affects Versions: 1.5.2, 1.4.1
 Environment: Linex Ubuntu 14.4
Reporter: himanshu singhal


I am using Spark SQL to perform various operations, but MongoDB is not 
giving the correct result on version 1.4.0 and gives an error on 
version 1.5.2.






[jira] [Commented] (SPARK-12124) Spark Sql MongoDB Cross Join Not Working

2015-12-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037684#comment-15037684
 ] 

Hyukjin Kwon commented on SPARK-12124:
--

I think this might be better asked at https://github.com/Stratio/spark-mongodb, as 
the MongoDB datasource is not an internal datasource for Spark.

> Spark Sql MongoDB Cross Join Not Working
> 
>
> Key: SPARK-12124
> URL: https://issues.apache.org/jira/browse/SPARK-12124
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.1, 1.5.2
> Environment: Linux Ubuntu 14.4
>Reporter: himanshu singhal
>  Labels: test
>   Original Estimate: 1,680h
>  Remaining Estimate: 1,680h
>
> A cross join in Spark SQL where the source is MongoDB gives the wrong result:
> 15/12/03 15:55:05 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
> 15/12/03 15:55:05 INFO NewHadoopRDD: Input split: 
> MongoInputSplit{URI=mongodb://localhost:27017/watchtower2.Patient, 
> authURI=null, min={ }, max={ }, query={ }, sort={ }, fields={ }, 
> notimeout=false}
> 15/12/03 15:55:05 INFO NewHadoopRDD: Input split: 
> MongoInputSplit{URI=mongodb://localhost:27017/watchtower2.Patient, 
> authURI=null, min={ }, max={ }, query={ }, sort={ }, fields={ }, 
> notimeout=false}
> 15/12/03 15:55:05 INFO MongoRecordReader: Read 1.0 documents from:
> 15/12/03 15:55:05 INFO MongoRecordReader: 
> MongoInputSplit{URI=mongodb://localhost:27017/watchtower2.Patient, 
> authURI=null, min={ }, max={ }, query={ }, sort={ }, fields={ }, 
> notimeout=false}
> 15/12/03 15:55:05 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
> java.lang.IllegalStateException: Iterator has been closed
>   at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:119)
>   at com.mongodb.DBCursor._hasNext(DBCursor.java:551)
>   at com.mongodb.DBCursor.hasNext(DBCursor.java:571)
>   at 
> com.mongodb.hadoop.input.MongoRecordReader.nextKeyValue(MongoRecordReader.java:73)
>   at 
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:163)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> 15/12/03 15:55:05 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 
> localhost): java.lang.IllegalStateException: Iterator has been closed
>   at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:119)
>   at com.mongodb.DBCursor._hasNext(DBCursor.java:551)
>   at com.mongodb.DBCursor.hasNext(DBCursor.java:571)
>   at 

[jira] [Created] (SPARK-12126) JDBC datasource processes filters only commonly pushed down.

2015-12-03 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-12126:


 Summary: JDBC datasource processes filters only commonly pushed 
down.
 Key: SPARK-12126
 URL: https://issues.apache.org/jira/browse/SPARK-12126
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Hyukjin Kwon


As suggested 
[here](https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646),
 Currently JDBC datasource only processes the filters pushed down from 
{{DataSourceStrategy}}.

Unlike ORC or Parquet, it can handle a much wider range of filters (for example, 
a + b > 3), since pushing a filter down is just a matter of building a SQL string.

As noted 
[here](https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526),
 using the {{CatalystScan}} trait might be one possible solution.
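A tiny illustration of that point, using a hypothetical helper rather than Spark's actual code path: because a JDBC source only has to render filters as SQL text for the remote database to evaluate, it can accept predicates that file formats like ORC or Parquet cannot.

{code}
// Hypothetical helper: render already-stringified predicates into a WHERE clause.
def whereClause(filters: Seq[String]): String =
  if (filters.isEmpty) "" else filters.map(f => s"($f)").mkString(" WHERE ", " AND ", "")

val sql = "SELECT a, b FROM t" + whereClause(Seq("a + b > 3", "b IS NOT NULL"))
// SELECT a, b FROM t WHERE (a + b > 3) AND (b IS NOT NULL)
{code}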








[jira] [Updated] (SPARK-12125) pull out nondeterministic expressions from Join

2015-12-03 Thread iward (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

iward updated SPARK-12125:
--
Fix Version/s: 1.6.0

> pull out nondeterministic expressions from Join
> ---
>
> Key: SPARK-12125
> URL: https://issues.apache.org/jira/browse/SPARK-12125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
>Reporter: iward
> Fix For: 1.6.0
>
>
> Currently, *nondeterministic expressions* are only allowed in *Project* or 
> *Filter*, and they can only be pulled out when they are used in a *UnaryNode*.
> But in many cases we use nondeterministic expressions to process *join keys* 
> in order to avoid data skew. For example:
> {noformat}
> select * 
> from tableA a 
> join 
> (select * from tableB) b 
> on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( 
> (-rand() * 1000 ) as string ) else a.brand_code end ))  = b.brand_code
> {noformat}
> This PR introduces a mechanism to pull out nondeterministic expressions from 
> *Join*, so we can use nondeterministic expressions in *Join* appropriately.






[jira] [Commented] (SPARK-9182) filter and groupBy on DataFrames are not passed through to jdbc source

2015-12-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037670#comment-15037670
 ] 

Hyukjin Kwon commented on SPARK-9182:
-

Filed here https://issues.apache.org/jira/browse/SPARK-12126.

> filter and groupBy on DataFrames are not passed through to jdbc source
> --
>
> Key: SPARK-9182
> URL: https://issues.apache.org/jira/browse/SPARK-9182
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Greg Rahn
>Assignee: Yijie Shen
>Priority: Critical
>
> When running all of these API calls, the only one that passes the filter 
> through to the backend jdbc source is equality.  All filters in these 
> commands should be able to be passed through to the jdbc database source.
> {code}
> val url="jdbc:postgresql:grahn"
> val prop = new java.util.Properties
> val emp = sqlContext.read.jdbc(url, "emp", prop)
> emp.filter(emp("sal") === 5000).show()
> emp.filter(emp("sal") < 5000).show()
> emp.filter("sal = 3000").show()
> emp.filter("sal > 2500").show()
> emp.filter("sal >= 2500").show()
> emp.filter("sal < 2500").show()
> emp.filter("sal <= 2500").show()
> emp.filter("sal != 3000").show()
> emp.filter("sal between 3000 and 5000").show()
> emp.filter("ename in ('SCOTT','BLAKE')").show()
> {code}
> We see from the PostgreSQL query log the following is run, and see that only 
> equality predicates are passed through.
> {code}
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp WHERE 
> sal = 5000
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp WHERE 
> sal = 3000
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> LOG:  execute : SET extra_float_digits = 3
> LOG:  execute : SELECT 
> "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp
> {code}






[jira] [Comment Edited] (SPARK-7379) pickle.loads expects a string instead of bytes in Python 3.

2015-12-03 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037678#comment-15037678
 ] 

Xusen Yin edited comment on SPARK-7379 at 12/3/15 11:19 AM:


OK, I fixed my code in that PR. In my case getSplits returns an Array[Double]; 
then in mllib/common.py the _java2py() method fails to match it as a JavaList.

{code}
if isinstance(r, (bytearray, bytes)):
    r = PickleSerializer().loads(bytes(r), encoding=encoding)
{code}

The above code catches the Array[Double] but fails to deserialize it with Python 3.


was (Author: yinxusen):
OK, I fixed my code in that PR. In my case getSplits returns an Array[Double]; 
then in mllib/common.py the _java2py() method fails to match it as a JavaList.

{code}
if isinstance(r, (bytearray, bytes)):
    r = PickleSerializer().loads(bytes(r), encoding=encoding)
{code}

The above code catches the Array[Double] but fails to deserialize it with Python 3.

> pickle.loads expects a string instead of bytes in Python 3.
> ---
>
> Key: SPARK-7379
> URL: https://issues.apache.org/jira/browse/SPARK-7379
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Davies Liu
>
> In PickleSerializer, we call pickle.loads in Python 3. However, the input obj 
> could be bytes, which works in Python 2 but not 3.
> The error message is
> {code}
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/serializers.py",
>  line 418, in loads
> return pickle.loads(obj, encoding=encoding)
> TypeError: must be a unicode character, not bytes
> {code}






[jira] [Updated] (SPARK-12126) JDBC datasource processes filters only commonly pushed down.

2015-12-03 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-12126:
-
Description: 
As suggested 
[here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646],
 Currently JDBC datasource only processes the filters pushed down from 
{{DataSourceStrategy}}.

Unlike ORC or Parquet, it can handle a much wider range of filters (for example, 
a + b > 3), since pushing a filter down is just a matter of building a SQL string.

As noted 
[here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526],
 using the {{CatalystScan}} trait might be one possible solution.



  was:
As suggested 
[here](https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646),
 Currently JDBC datasource only processes the filters pushed down from 
{{DataSourceStrategy}}.

Unlike ORC or Parquet, this can process pretty a lot of filters (for example, a 
+ b > 3) since it is just about string parsing.

As 
[here](https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526),
 using {{CatalystScan} trait might be one of solutions.




> JDBC datasource processes filters only commonly pushed down.
> 
>
> Key: SPARK-12126
> URL: https://issues.apache.org/jira/browse/SPARK-12126
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Hyukjin Kwon
>
> As suggested 
> [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646],
>  Currently JDBC datasource only processes the filters pushed down from 
> {{DataSourceStrategy}}.
> Unlike ORC or Parquet, it can handle a much wider range of filters (for example, 
> a + b > 3), since pushing a filter down is just a matter of building a SQL string.
> As noted 
> [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526],
>  using the {{CatalystScan}} trait might be one possible solution.






[jira] [Commented] (SPARK-11238) SparkR: Documentation change for merge function

2015-12-03 Thread RaviShankar KS (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037661#comment-15037661
 ] 

RaviShankar KS commented on SPARK-11238:


Hi,

I would like to take this up.
Can you tell me where to update the release notes?

Thanks and Regards,
Ravi

> SparkR: Documentation change for merge function
> ---
>
> Key: SPARK-11238
> URL: https://issues.apache.org/jira/browse/SPARK-11238
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>  Labels: releasenotes
>
> As discussed in pull request https://github.com/apache/spark/pull/9012, the 
> signature of the merge function will be changed; therefore a documentation 
> change is required.






[jira] [Assigned] (SPARK-12125) pull out nondeterministic expressions from Join

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12125:


Assignee: (was: Apache Spark)

> pull out nondeterministic expressions from Join
> ---
>
> Key: SPARK-12125
> URL: https://issues.apache.org/jira/browse/SPARK-12125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
>Reporter: iward
>
> Currently, *nondeterministic expressions* are only allowed in *Project* or 
> *Filter*, and they can only be pulled out when they are used in a *UnaryNode*.
> But in many cases we use nondeterministic expressions to process *join keys* 
> in order to avoid data skew. For example:
> {noformat}
> select * 
> from tableA a 
> join 
> (select * from tableB) b 
> on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( 
> (-rand() * 1000 ) as string ) else a.brand_code end ))  = b.brand_code
> {noformat}
> This PR introduces a mechanism to pull out nondeterministic expressions from 
> *Join*, so we can use nondeterministic expressions in *Join* appropriately.






[jira] [Assigned] (SPARK-12125) pull out nondeterministic expressions from Join

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12125:


Assignee: Apache Spark

> pull out nondeterministic expressions from Join
> ---
>
> Key: SPARK-12125
> URL: https://issues.apache.org/jira/browse/SPARK-12125
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
>Reporter: iward
>Assignee: Apache Spark
>
> Currently, *nondeterministic expressions* are only allowed in *Project* or 
> *Filter*, and they can only be pulled out when they are used in a *UnaryNode*.
> But in many cases we use nondeterministic expressions to process *join keys* 
> in order to avoid data skew. For example:
> {noformat}
> select * 
> from tableA a 
> join 
> (select * from tableB) b 
> on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( 
> (-rand() * 1000 ) as string ) else a.brand_code end ))  = b.brand_code
> {noformat}
> This PR introduces a mechanism to pull out nondeterministic expressions from 
> *Join*, so we can use nondeterministic expressions in *Join* appropriately.






[jira] [Assigned] (SPARK-12120) Improve exception message when failing to initialize HiveContext in PySpark

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12120:


Assignee: (was: Apache Spark)

> Improve exception message when failing to initialize HiveContext in PySpark
> ---
>
> Key: SPARK-12120
> URL: https://issues.apache.org/jira/browse/SPARK-12120
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>
> I get the following exception message when failing to initialize HiveContext. 
> From this it is hard to figure out why HiveContext failed to initialize. I 
> actually built Spark with the hive profile enabled; the reason HiveContext 
> failed is that I didn't start the HDFS service. I can see the full stacktrace 
> in spark-shell, and I can also see the full stack trace in python2. The issue 
> only exists in python2.x.
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, 
> in createDataFrame
> jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
>   File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, 
> in _ssql_ctx
> "build/sbt assembly", e)
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError(u'An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34))
> {code}






[jira] [Commented] (SPARK-12120) Improve exception message when failing to initialize HiveContext in PySpark

2015-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037538#comment-15037538
 ] 

Apache Spark commented on SPARK-12120:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10126

> Improve exception message when failing to initialize HiveContext in PySpark
> ---
>
> Key: SPARK-12120
> URL: https://issues.apache.org/jira/browse/SPARK-12120
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>
> I get the following exception message when failing to initialize HiveContext. 
> From this it is hard to figure out why HiveContext failed to initialize. I 
> actually built Spark with the hive profile enabled; the reason HiveContext 
> failed is that I didn't start the HDFS service. I can see the full stacktrace 
> in spark-shell, and I can also see the full stack trace in python2. The issue 
> only exists in python2.x.
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, 
> in createDataFrame
> jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
>   File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, 
> in _ssql_ctx
> "build/sbt assembly", e)
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError(u'An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34))
> {code}






[jira] [Assigned] (SPARK-12120) Improve exception message when failing to initialize HiveContext in PySpark

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12120:


Assignee: Apache Spark

> Improve exception message when failing to initialize HiveContext in PySpark
> ---
>
> Key: SPARK-12120
> URL: https://issues.apache.org/jira/browse/SPARK-12120
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> I get the following exception message when failing to initialize HiveContext. 
> From this it is hard to figure out why HiveContext failed to initialize. I 
> actually built Spark with the hive profile enabled; the reason HiveContext 
> failed is that I didn't start the HDFS service. I can see the full stacktrace 
> in spark-shell, and I can also see the full stack trace in python2. The issue 
> only exists in python2.x.
> {code}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, 
> in createDataFrame
> jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
>   File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, 
> in _ssql_ctx
> "build/sbt assembly", e)
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError(u'An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34))
> {code}






[jira] [Commented] (SPARK-12122) Recovered streaming context can sometimes run a batch twice

2015-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037560#comment-15037560
 ] 

Apache Spark commented on SPARK-12122:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/10127

> Recovered streaming context can sometimes run a batch twice
> ---
>
> Key: SPARK-12122
> URL: https://issues.apache.org/jira/browse/SPARK-12122
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> After recovering from checkpoint, the JobGenerator figures out which batches 
> to run again. That can sometimes lead to a batch being submitted twice. 






[jira] [Assigned] (SPARK-12122) Recovered streaming context can sometimes run a batch twice

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12122:


Assignee: Tathagata Das  (was: Apache Spark)

> Recovered streaming context can sometimes run a batch twice
> ---
>
> Key: SPARK-12122
> URL: https://issues.apache.org/jira/browse/SPARK-12122
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> After recovering from checkpoint, the JobGenerator figures out which batches 
> to run again. That can sometimes lead to a batch being submitted twice. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12088) check connection.isClose before connection.getAutoCommit in JDBCRDD.close

2015-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12088.
---
   Resolution: Fixed
 Assignee: Huaxin Gao
Fix Version/s: 2.0.0
   1.6.1

Resolved by https://github.com/apache/spark/pull/10095

> check connection.isClose before connection.getAutoCommit in JDBCRDD.close
> -
>
> Key: SPARK-12088
> URL: https://issues.apache.org/jira/browse/SPARK-12088
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> in JDBCRDD, it has
> if (!conn.getAutoCommit && !conn.isClosed) {
> try {
>   conn.commit()
> } 
> . . . . . .
> In my test, the connection is already closed, so conn.getAutoCommit throws an 
> Exception. We should check !conn.isClosed before checking !conn.getAutoCommit 
> to avoid the Exception. 
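
For illustration, here is a minimal sketch of the reordered check described 
above, assuming a plain JDBC Connection; the helper name safeClose and the 
println logging are placeholders, not the actual patch:

{code}
import java.sql.Connection

// Sketch: test isClosed first, so getAutoCommit is never invoked on a
// connection that has already been closed.
def safeClose(conn: Connection): Unit = {
  if (conn != null) {
    try {
      if (!conn.isClosed && !conn.getAutoCommit) {
        conn.commit()
      }
    } catch {
      case e: Exception => println(s"Exception committing transaction: $e")
    } finally {
      try {
        conn.close()
      } catch {
        case e: Exception => println(s"Exception closing connection: $e")
      }
    }
  }
}
{code}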



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12118) SparkR: Documentation change for isNaN

2015-12-03 Thread RaviShankar KS (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037660#comment-15037660
 ] 

RaviShankar KS commented on SPARK-12118:


Hi,

I would like to take this up.
Can you tell me where to update the release notes?

Thanks and Regards,
Ravi

> SparkR: Documentation change for isNaN
> --
>
> Key: SPARK-12118
> URL: https://issues.apache.org/jira/browse/SPARK-12118
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Yanbo Liang
>Priority: Minor
>  Labels: releasenotes
>
> As discussed in pull request https://github.com/apache/spark/pull/10037, we 
> replaced DataFrame.isNaN with DataFrame.isnan on the SparkR side, because 
> DataFrame.isNaN has been deprecated and will be removed in Spark 2.0. We 
> should document the change in the release notes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9697) Project Tungsten (Spark 1.6)

2015-12-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037458#comment-15037458
 ] 

Sean Owen commented on SPARK-9697:
--

[~rxin] or should this at least be titled "... (Phase 2)"?

> Project Tungsten (Spark 1.6)
> 
>
> Key: SPARK-9697
> URL: https://issues.apache.org/jira/browse/SPARK-9697
> Project: Spark
>  Issue Type: Epic
>  Components: Block Manager, Shuffle, Spark Core, SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This epic tracks the 2nd phase of Project Tungsten, slotted for Spark 1.6 
> release.
> This epic tracks work items for Spark 1.6. More tickets can be found in:
> SPARK-7075: Tungsten-related work in Spark 1.5
> SPARK-9697: Tungsten-related work in Spark 1.6



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12119) Support compression in PySpark

2015-12-03 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-12119:
--

 Summary: Support compression in PySpark
 Key: SPARK-12119
 URL: https://issues.apache.org/jira/browse/SPARK-12119
 Project: Spark
  Issue Type: Sub-task
Reporter: Jeff Zhang
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4117) Spark on Yarn handle AM being told command from RM

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4117:
---

Assignee: Apache Spark

> Spark on Yarn handle AM being told command from RM
> --
>
> Key: SPARK-4117
> URL: https://issues.apache.org/jira/browse/SPARK-4117
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> In the allocateResponse from the RM it can send commands that the AM should 
> follow. for instance AM_RESYNC and AM_SHUTDOWN.  We should add support for 
> those.
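
For illustration only (not the actual Spark YARN code): the RM attaches an 
optional AMCommand to each allocate response, and an AM could react to it 
roughly as sketched below. The handler name and the println placeholders are 
assumptions.

{code}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse
import org.apache.hadoop.yarn.api.records.AMCommand

// Hypothetical handler invoked after each AM heartbeat.
def handleAMCommand(allocateResponse: AllocateResponse): Unit = {
  Option(allocateResponse.getAMCommand) match {
    case Some(AMCommand.AM_RESYNC) =>
      // The RM restarted or lost state: re-register and resend requests.
      println("RM requested AM_RESYNC: re-registering the application master")
    case Some(AMCommand.AM_SHUTDOWN) =>
      // The RM wants this attempt to stop: release containers and exit.
      println("RM requested AM_SHUTDOWN: shutting down the application master")
    case None =>
      // Normal heartbeat; nothing to do.
  }
}
{code}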



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12124) Spark Sql MongoDB Cross Join Not Working

2015-12-03 Thread himanshu singhal (JIRA)
himanshu singhal created SPARK-12124:


 Summary: Spark Sql MongoDB Cross Join Not Working
 Key: SPARK-12124
 URL: https://issues.apache.org/jira/browse/SPARK-12124
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2, 1.5.1, 1.4.1, 1.4.0
 Environment: Linux Ubuntu 14.4
Reporter: himanshu singhal


A cross join in Spark SQL where the source is MongoDB gives the wrong result:

15/12/03 15:55:05 INFO Executor: Running task 0.0 in stage 2.0 (TID 2)
15/12/03 15:55:05 INFO NewHadoopRDD: Input split: 
MongoInputSplit{URI=mongodb://localhost:27017/watchtower2.Patient, 
authURI=null, min={ }, max={ }, query={ }, sort={ }, fields={ }, 
notimeout=false}
15/12/03 15:55:05 INFO NewHadoopRDD: Input split: 
MongoInputSplit{URI=mongodb://localhost:27017/watchtower2.Patient, 
authURI=null, min={ }, max={ }, query={ }, sort={ }, fields={ }, 
notimeout=false}
15/12/03 15:55:05 INFO MongoRecordReader: Read 1.0 documents from:
15/12/03 15:55:05 INFO MongoRecordReader: 
MongoInputSplit{URI=mongodb://localhost:27017/watchtower2.Patient, 
authURI=null, min={ }, max={ }, query={ }, sort={ }, fields={ }, 
notimeout=false}
15/12/03 15:55:05 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.lang.IllegalStateException: Iterator has been closed
at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:119)
at com.mongodb.DBCursor._hasNext(DBCursor.java:551)
at com.mongodb.DBCursor.hasNext(DBCursor.java:571)
at 
com.mongodb.hadoop.input.MongoRecordReader.nextKeyValue(MongoRecordReader.java:73)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:163)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/12/03 15:55:05 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 
localhost): java.lang.IllegalStateException: Iterator has been closed
at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:119)
at com.mongodb.DBCursor._hasNext(DBCursor.java:551)
at com.mongodb.DBCursor.hasNext(DBCursor.java:571)
at 
com.mongodb.hadoop.input.MongoRecordReader.nextKeyValue(MongoRecordReader.java:73)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:163)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

[jira] [Commented] (SPARK-4117) Spark on Yarn handle AM being told command from RM

2015-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037617#comment-15037617
 ] 

Apache Spark commented on SPARK-4117:
-

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/10129

> Spark on Yarn handle AM being told command from RM
> --
>
> Key: SPARK-4117
> URL: https://issues.apache.org/jira/browse/SPARK-4117
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>
> In the allocateResponse from the RM it can send commands that the AM should 
> follow. for instance AM_RESYNC and AM_SHUTDOWN.  We should add support for 
> those.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12120) Improve exception message when failing to initialize HiveContext in PySpark

2015-12-03 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-12120:
--

 Summary: Improve exception message when failing to initialize 
HiveContext in PySpark
 Key: SPARK-12120
 URL: https://issues.apache.org/jira/browse/SPARK-12120
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Jeff Zhang
Priority: Minor


I get the following exception message when failing to initialize HiveContext. 
I actually built Spark with the hive profile enabled; the reason HiveContext 
failed to initialize is that I didn't start the HDFS service. I can see the 
full stack trace in spark-shell, and I can also see the full stack trace in 
python3. The issue only exists in python2.x.

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, in 
createDataFrame
jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, in 
_ssql_ctx
"build/sbt assembly", e)
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
build/sbt assembly", Py4JJavaError(u'An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34))

{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12121) Remote Spark init.sh called from spark_ec2.py points to incorrect prebuilt image URL

2015-12-03 Thread Andre Schumacher (JIRA)
Andre Schumacher created SPARK-12121:


 Summary: Remote Spark init.sh called from spark_ec2.py points to 
incorrect prebuilt image URL
 Key: SPARK-12121
 URL: https://issues.apache.org/jira/browse/SPARK-12121
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.5.2
Reporter: Andre Schumacher



The tarball for 1.5.2 with Hadoop 1 (the default) has the Scala version 
appended to the file name, which leads to spark_ec2.py failing to start up 
the cluster. Here is the record from the S3 contents:


<Contents>
  <Key>spark-1.5.2-bin-hadoop1-scala2.11.tgz</Key>
  <LastModified>2015-11-10T06:45:17.000Z</LastModified>
  <ETag>"056fc68e549db27d986da707f19e39c8-4"</ETag>
  <Size>234574403</Size>
  <StorageClass>STANDARD</StorageClass>
</Contents>


Maybe a tarball without the Scala suffix (the default naming) could be provided?

A workaround is to set the Hadoop version to a version different from 1 when 
calling spark-ec2.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12120) Improve exception message when failing to initialize HiveContext in PySpark

2015-12-03 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-12120:
---
Description: 
I get the following exception message when failing to initialize HiveContext. 
It is hard to figure out from this message why HiveContext failed to 
initialize. I actually built Spark with the hive profile enabled; the real 
reason HiveContext failed is that I didn't start the HDFS service. I can see 
the full stack trace in spark-shell, and I can also see the full stack trace 
in python3. The issue only exists in python2.x.

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, in 
createDataFrame
jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, in 
_ssql_ctx
"build/sbt assembly", e)
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
build/sbt assembly", Py4JJavaError(u'An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34))

{code}



  was:
I get the following exception message when failing to initialize HiveContext. 
I actually built Spark with the hive profile enabled; the reason HiveContext 
failed to initialize is that I didn't start the HDFS service. I can see the 
full stack trace in spark-shell, and I can also see the full stack trace in 
python3. The issue only exists in python2.x.

{code}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, in 
createDataFrame
jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, in 
_ssql_ctx
"build/sbt assembly", e)
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
build/sbt assembly", Py4JJavaError(u'An error occurred while calling 
None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34))

{code}


> Improve exception message when failing to initialize HiveContext in PySpark
> ---
>
> Key: SPARK-12120
> URL: https://issues.apache.org/jira/browse/SPARK-12120
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>
> I get the following exception message when failing to initialize HiveContext. 
> It is hard to figure out from this message why HiveContext failed to 
> initialize. I actually built Spark with the hive profile enabled; the real 
> reason HiveContext failed is that I didn't start the HDFS service. I can see 
> the full stack trace in spark-shell, and I can also see the full stack trace 
> in python3. The issue only exists in python2.x.
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, 
> in createDataFrame
> jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
>   File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, 
> in _ssql_ctx
> "build/sbt assembly", e)
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run 
> build/sbt assembly", Py4JJavaError(u'An error occurred while calling 
> None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12122) Recovered streaming context can sometimes run a batch twice

2015-12-03 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-12122:
-

 Summary: Recovered streaming context can sometimes run a batch 
twice
 Key: SPARK-12122
 URL: https://issues.apache.org/jira/browse/SPARK-12122
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker


After recovering from checkpoint, the JobGenerator figures out which batches to 
run again. That can sometimes lead to a batch being submitted twice. 
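
For context, this is the standard checkpoint-recovery pattern in which the 
JobGenerator re-submits pending batches; the checkpoint directory, host, and 
port below are placeholders, and this sketch is background only, not part of 
the fix:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/streaming-checkpoint"  // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-recovery-example")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // Placeholder pipeline; a real job defines its DStream operations here.
  ssc.socketTextStream("localhost", 9999).count().print()
  ssc
}

// Either restores the context (and any pending batches) from the checkpoint
// directory or builds a fresh one via createContext.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}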



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12123) Spark ava.lang.NullPointerException

2015-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12123.
---
Resolution: Invalid

[~michael_han] This should be a question on user@, not a JIRA. Have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark  You 
need to narrow down your question more anyway, not just post your code.

> Spark ava.lang.NullPointerException
> ---
>
> Key: SPARK-12123
> URL: https://issues.apache.org/jira/browse/SPARK-12123
> Project: Spark
>  Issue Type: Question
>Affects Versions: 1.5.2
>Reporter: Michael Han
>
> Hi,
> I'm new to Spark.
> I downloaded Spark 1.5.2 onto my Windows C drive.
> I downloaded the latest Eclipse and created a Java project with Maven.
> The only Java class is:
> package com.qad;
> import org.apache.spark.api.java.*;
> import java.io.BufferedWriter;
> import java.io.File;
> import java.io.FileWriter;
> import org.apache.spark.SparkConf;
> import org.apache.spark.api.java.function.Function;
> public class SparkTest1 {
>   public static void main(String[] args) {
>   String logFile = "README.md"; // Should be some file on your system
>   SparkConf conf = new 
> SparkConf().setMaster("spark://192.168.79.1:7077").setAppName("Simple 
> Application");   
>   JavaSparkContext sc = new JavaSparkContext(conf);
>   
>   // example 1
> JavaRDD<String> logData = sc.textFile(logFile); 
>   long numAs, numBs;
>   numAs = 0;
>   numBs = 0;
>   
>   JavaRDD<String> logData2 = logData.filter(new 
> Function<String, Boolean>() { 
>   
>   private static final long serialVersionUID = 1L;
>   public Boolean call(String s) { return 
> s.contains("Spark is a fast"); }
>   });
>   numAs = logData2.count();
>   
>   String content = "Lines with a: " + numAs + ", lines with b: " + 
> numBs;
>   System.out.println(content);
>   WriteText(content,"mh6log.txt");
>   
>   sc.close();
>   
> }
>   
>   private static void WriteText(String content,String fileName)
>   {
>   try {
>   
>   File logFile=new File(fileName);
>   BufferedWriter writer = new BufferedWriter(new 
> FileWriter(logFile));
>   writer.write (content);
>   //Close writer
>   writer.close();
>   } catch(Exception e) {
>   e.printStackTrace();
>   }
>   }
> }
> The pom is:
> <project xmlns="http://maven.apache.org/POM/4.0.0" 
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
> xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
> http://maven.apache.org/xsd/maven-4.0.0.xsd">
>   <modelVersion>4.0.0</modelVersion>
>   <groupId>Spark-Test</groupId>
>   <artifactId>Spark-Test</artifactId>
>   <version>1.0</version>
>   <build>
>     <sourceDirectory>src</sourceDirectory>
>     <plugins>
>       <plugin>
>         <artifactId>maven-compiler-plugin</artifactId>
>         <version>3.3</version>
>         <configuration>
>           <source>1.8</source>
>           <target>1.8</target>
>         </configuration>
>       </plugin>
>     </plugins>
>   </build>
>   <dependencies>
>     <dependency>
>       <groupId>org.apache.spark</groupId>
>       <artifactId>spark-core_2.10</artifactId>
>       <version>1.5.2</version>
>     </dependency>
>   </dependencies>
> </project>
> I can run this Java class correctly in Eclipse, but I get exceptions when I 
> use the following command to submit it:
> spark-submit --master local --class com.qad.SparkTest1 Spark-Test-1.0.jar
> Does anyone know which step I got wrong? Thank you.
> The exceptions are:
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to s
> tage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
> task
>  0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
> at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
> at org.apache.hadoop.util.Shell.run(Shell.java:455)
> at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
> 715)
> at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
> at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381)
> at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
> $Executor$$updateDependencies$5.apply(Executor.scala:405)
> at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
> $Executor$$updateDependencies$5.apply(Executor.scala:397)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(
> TraversableLike.scala:772)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
> la:98)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
> la:98)
> at 
> 

[jira] [Resolved] (SPARK-12111) need upgrade instruction

2015-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12111.
---
Resolution: Not A Problem

[~aedwip] I am someone who knows the details of how Spark gets built and installed. 
Please do not reopen this issue unless there's a significant change in the 
discussion. Here, you have a fine question, but one that is not a JIRA, as I 
said.

This is already documented on the Spark site. Here, you describe a situation 
where I think you set up a cluster by hand (not clear, you mentioned 
spark-ec2?). That process is already clear to you and documented on the project 
site. There's really nothing more to know. You replace the old binaries with 
the new ones, and start again. I'm not aware of any non-trivial changes in the 
behavior of the standalone scripts.

If you mean you want something to automatically manage and update clusters: I 
suppose EMR's Spark sort of does that, in that they'll eventually make a new 
release with a new Spark that you can deploy. In the same spirit, distros like 
CDH manage things like this for you. spark-ec2 won't do this automatically for 
you. 

> need upgrade instruction
> 
>
> Key: SPARK-12111
> URL: https://issues.apache.org/jira/browse/SPARK-12111
> Project: Spark
>  Issue Type: Documentation
>  Components: EC2
>Affects Versions: 1.5.1
>Reporter: Andrew Davidson
>  Labels: build, documentation
>
> I have looked all over the spark website and googled. I have not found 
> instructions for how to upgrade spark in general let alone a cluster created 
> by using spark-ec2 script
> thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12111) need upgrade instruction

2015-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-12111.
-

> need upgrade instruction
> 
>
> Key: SPARK-12111
> URL: https://issues.apache.org/jira/browse/SPARK-12111
> Project: Spark
>  Issue Type: Documentation
>  Components: EC2
>Affects Versions: 1.5.1
>Reporter: Andrew Davidson
>  Labels: build, documentation
>
> I have looked all over the spark website and googled. I have not found 
> instructions for how to upgrade spark in general let alone a cluster created 
> by using spark-ec2 script
> thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12123) Spark ava.lang.NullPointerException

2015-12-03 Thread Michael Han (JIRA)
Michael Han created SPARK-12123:
---

 Summary: Spark ava.lang.NullPointerException
 Key: SPARK-12123
 URL: https://issues.apache.org/jira/browse/SPARK-12123
 Project: Spark
  Issue Type: Question
Affects Versions: 1.5.2
Reporter: Michael Han


Hi,
I'm new to Spark.
I downloaded Spark 1.5.2 onto my Windows C drive.

I downloaded the latest Eclipse and created a Java project with Maven.
The only Java class is:

package com.qad;
import org.apache.spark.api.java.*;

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;


public class SparkTest1 {
public static void main(String[] args) {
String logFile = "README.md"; // Should be some file on your system
SparkConf conf = new 
SparkConf().setMaster("spark://192.168.79.1:7077").setAppName("Simple 
Application");   
JavaSparkContext sc = new JavaSparkContext(conf);

// example 1
JavaRDD<String> logData = sc.textFile(logFile); 
long numAs, numBs;
numAs = 0;
numBs = 0;

JavaRDD<String> logData2 = logData.filter(new 
Function<String, Boolean>() { 

private static final long serialVersionUID = 1L;

public Boolean call(String s) { return 
s.contains("Spark is a fast"); }
});
numAs = logData2.count();

String content = "Lines with a: " + numAs + ", lines with b: " + 
numBs;
System.out.println(content);
WriteText(content,"mh6log.txt");

sc.close();

  }

private static void WriteText(String content,String fileName)
{
try {

File logFile=new File(fileName);

BufferedWriter writer = new BufferedWriter(new 
FileWriter(logFile));
writer.write (content);

//Close writer
writer.close();
} catch(Exception e) {
e.printStackTrace();
}
}

}

The pom is:
<project xmlns="http://maven.apache.org/POM/4.0.0" 
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>Spark-Test</groupId>
  <artifactId>Spark-Test</artifactId>
  <version>1.0</version>
  <build>
    <sourceDirectory>src</sourceDirectory>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.3</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.5.2</version>
    </dependency>
  </dependencies>
</project>


I can run this Java class correctly in Eclipse, but I get exceptions when I 
use the following command to submit it:
spark-submit --master local --class com.qad.SparkTest1 Spark-Test-1.0.jar

Does anyone know which step I got wrong? Thank you.

The exceptions are:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to s
tage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task
 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
715)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:405)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor
$Executor$$updateDependencies$5.apply(Executor.scala:397)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(
TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca
la:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala
:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.s
cala:771)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor
$$updateDependencies(Executor.scala:397)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.

[jira] [Assigned] (SPARK-4117) Spark on Yarn handle AM being told command from RM

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4117:
---

Assignee: (was: Apache Spark)

> Spark on Yarn handle AM being told command from RM
> --
>
> Key: SPARK-4117
> URL: https://issues.apache.org/jira/browse/SPARK-4117
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>
> In the allocateResponse from the RM it can send commands that the AM should 
> follow. for instance AM_RESYNC and AM_SHUTDOWN.  We should add support for 
> those.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12117:


Assignee: Apache Spark

> Column Aliases are Ignored in callUDF while using struct()
> --
>
> Key: SPARK-12117
> URL: https://issues.apache.org/jira/browse/SPARK-12117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: sachin aggarwal
>Assignee: Apache Spark
>
> case where this works:
>   val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
> ("Rishabh", "2"))).toDF("myText", "id")
>   
> TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
> steps to reproduce error case:
> 1) create a file named a.json containing the following text:
> { "myText": "Sachin Aggarwal","id": "1"}
> { "myText": "Rishabh","id": "2"}
> 2)define a simple UDF
> def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}
> 3)register the udf 
>  sqlContext.udf.register("mydef" ,mydef _)
> 4)read the input file 
> val TestDoc2=sqlContext.read.json("/tmp/a.json")
> 5)make a call to UDF
> TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
> ERROR received:
> java.lang.IllegalArgumentException: Field "Text" does not exist.
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>  at scala.collection.AbstractMap.getOrElse(Map.scala:58)
>  at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
>  at 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
>  at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
>  at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
>  at 
> $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
>  at 
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at 
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
>  at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>  at org.apache.spark.scheduler.Task.run(Task.scala:88)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>  

[jira] [Updated] (SPARK-12104) collect() does not handle multiple columns with same name

2015-12-03 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12104:
--
Assignee: Sun Rui

> collect() does not handle multiple columns with same name
> -
>
> Key: SPARK-12104
> URL: https://issues.apache.org/jira/browse/SPARK-12104
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Hossein Falaki
>Assignee: Sun Rui
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> This is a regression from Spark 1.5
> Spark can produce DataFrames with identical names (e.g., after left outer 
> joins). In 1.5 when such a DataFrame was collected we ended up with an R 
> data.frame with modified column names:
> {code}
> > names(mySparkDF)
> [1] "date"   "name"   "name" 
> > names(collect(mySparkDF))
> [1] "date"   "name"   "name.1" 
> {code}
> But in 1.6 only the first column is included in the collected R data.frame. I 
> think SparkR should continue the old behavior.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12117:


Assignee: (was: Apache Spark)

> Column Aliases are Ignored in callUDF while using struct()
> --
>
> Key: SPARK-12117
> URL: https://issues.apache.org/jira/browse/SPARK-12117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: sachin aggarwal
>
> case where this works:
>   val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
> ("Rishabh", "2"))).toDF("myText", "id")
>   
> TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
> steps to reproduce error case:
> 1) create a file named a.json containing the following text:
> { "myText": "Sachin Aggarwal","id": "1"}
> { "myText": "Rishabh","id": "2"}
> 2)define a simple UDF
> def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}
> 3)register the udf 
>  sqlContext.udf.register("mydef" ,mydef _)
> 4)read the input file 
> val TestDoc2=sqlContext.read.json("/tmp/a.json")
> 5)make a call to UDF
> TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
> ERROR received:
> java.lang.IllegalArgumentException: Field "Text" does not exist.
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>  at scala.collection.AbstractMap.getOrElse(Map.scala:58)
>  at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
>  at 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
>  at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
>  at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
>  at 
> $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
>  at 
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at 
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
>  at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>  at org.apache.spark.scheduler.Task.run(Task.scala:88)
>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>  at 
> 

[jira] [Commented] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()

2015-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039761#comment-15039761
 ] 

Apache Spark commented on SPARK-12117:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/10140
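
Until a fix lands, one possible workaround is sketched below. It is untested 
against this exact report and assumes the same spark-shell session as the 
reproduction steps (sqlContext, TestDoc2, and the registered "mydef" UDF): the 
idea is to apply the aliases in a projection of their own, so "Text" and 
"label" are real column names by the time the struct is built.

{code}
import org.apache.spark.sql.functions.{callUDF, struct}
import sqlContext.implicits._

// Rename first, then build the struct from the renamed columns, so the Row
// passed to the UDF should carry the fields "Text" and "label".
val renamed = TestDoc2.select($"myText".as("Text"), $"id".as("label"))
renamed.select(callUDF("mydef", struct($"Text", $"label")).as("col1")).show()
{code}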

> Column Aliases are Ignored in callUDF while using struct()
> --
>
> Key: SPARK-12117
> URL: https://issues.apache.org/jira/browse/SPARK-12117
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: sachin aggarwal
>
> case where this works:
>   val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), 
> ("Rishabh", "2"))).toDF("myText", "id")
>   
> TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
> steps to reproduce error case:
> 1) create a file named a.json containing the following text:
> { "myText": "Sachin Aggarwal","id": "1"}
> { "myText": "Rishabh","id": "2"}
> 2)define a simple UDF
> def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}
> 3)register the udf 
>  sqlContext.udf.register("mydef" ,mydef _)
> 4)read the input file 
> val TestDoc2=sqlContext.read.json("/tmp/a.json")
> 5)make a call to UDF
> TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
> ERROR received:
> java.lang.IllegalArgumentException: Field "Text" does not exist.
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at 
> org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234)
>  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
>  at scala.collection.AbstractMap.getOrElse(Map.scala:58)
>  at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233)
>  at 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212)
>  at org.apache.spark.sql.Row$class.getAs(Row.scala:325)
>  at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
>  at 
> $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107)
>  at 
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at 
> $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74)
>  at 
> org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55)
>  at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>  at org.apache.spark.scheduler.Task.run(Task.scala:88)
>  at 

[jira] [Resolved] (SPARK-2688) Need a way to run multiple data pipeline concurrently

2015-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-2688.
--
Resolution: Won't Fix

> Need a way to run multiple data pipeline concurrently
> -
>
> Key: SPARK-2688
> URL: https://issues.apache.org/jira/browse/SPARK-2688
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.0.1
>Reporter: Xuefu Zhang
>
> Suppose we want to do the following data processing: 
> {code}
> rdd1 -> rdd2 -> rdd3
>            | -> rdd4
>            | -> rdd5
>            \ -> rdd6
> {code}
> where -> represents a transformation. rdd3 to rdd6 are all derived from an 
> intermediate rdd2. We use foreach(fn) with a dummy function to trigger the 
> execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> 
> rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be 
> recomputed. This is very inefficient. Ideally, we should be able to trigger 
> the execution of the whole graph and reuse rdd2, but there doesn't seem to be 
> a way of doing so. Tez already realized the importance of this (TEZ-391), so I 
> think Spark should provide this too.
> This is required for Hive to support multi-insert queries. HIVE-7292.
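
Within the current API, the usual mitigation is to persist the shared 
intermediate RDD and then submit the downstream actions as separate jobs; a 
minimal sketch with placeholder input and pipelines (not the multi-pipeline 
trigger requested here):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shared-intermediate-example"))

// Placeholder pipelines standing in for rdd1 -> rdd2 -> rdd3..rdd6.
val rdd1 = sc.textFile("/tmp/input.txt")
val rdd2 = rdd1.map(_.toLowerCase).cache()  // shared intermediate, cached

rdd2.count()  // materialize the cache once

// Each downstream action now reads rdd2 from the cache instead of recomputing
// the rdd1 -> rdd2 stage. These could also be submitted from separate threads
// to run as concurrent jobs.
val n3 = rdd2.filter(_.contains("a")).count()
val n4 = rdd2.filter(_.contains("b")).count()
val n5 = rdd2.filter(_.nonEmpty).count()
val n6 = rdd2.distinct().count()
{code}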



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs

2015-12-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037808#comment-15037808
 ] 

Sean Owen commented on SPARK-5312:
--

[~boyork] are you going to do more with this?

> Use sbt to detect new or changed public classes in PRs
> --
>
> Key: SPARK-5312
> URL: https://issues.apache.org/jira/browse/SPARK-5312
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Nicholas Chammas
>Priority: Minor
>
> We currently use an [unwieldy grep/sed 
> contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174]
>  to detect new public classes in PRs.
> -Apparently, sbt lets you get a list of public classes [much more 
> directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via 
> {{show compile:discoveredMainClasses}}. We should use that instead.-
> There is a tool called [ClassUtil|http://software.clapper.org/classutil/] 
> that seems to help give this kind of information much more directly. We 
> should look into using that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11801) Notify driver when OOM is thrown before executor JVM is killed

2015-12-03 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037869#comment-15037869
 ] 

Mridul Muralidharan commented on SPARK-11801:
-



I have a preference for (a) because I am not sure we can do (b) given the 
number of corner cases :-)
But if it can be pulled off decently, sure, why not - what I would not like is 
inconsistent behavior, where sometimes one behavior is exhibited and other 
times another, because of the inherent instability at OOM and VM exit.
Right now, I know that we will get a task failure at the driver and I 
investigate the cause of it at the executor - whether it is a JNI crash, an 
OOM, or memory growth that led to YARN killing the executor, etc.

> Notify driver when OOM is thrown before executor JVM is killed 
> ---
>
> Key: SPARK-11801
> URL: https://issues.apache.org/jira/browse/SPARK-11801
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Srinivasa Reddy Vundela
>Priority: Minor
>
> Here is some background for the issue.
> The customer got an OOM exception in one of the tasks and the executor got 
> killed with kill %p. It is unclear from the driver logs/Spark UI why the task 
> or the executor was lost. The customer has to look into the executor logs to 
> see that OOM is the cause of the task/executor loss. 
> It would be helpful if the driver logs/Spark UI showed the reason for task 
> failures by making sure that the task updates the driver when an OOM occurs. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12105:


Assignee: Apache Spark

> Add a DataFrame.show() with argument for output PrintStream
> ---
>
> Key: SPARK-12105
> URL: https://issues.apache.org/jira/browse/SPARK-12105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Dean Wampler
>Assignee: Apache Spark
>Priority: Minor
>
> It would be nice to send the output of DataFrame.show(...) to a different 
> output stream than stdout, including just capturing the string itself. This 
> is useful, e.g., for testing. Actually, it would be sufficient and perhaps 
> better to just make DataFrame.showString a public method, 
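
Until such an API exists, one way to capture the output is sketched below. It 
relies on show() printing through Console.out, which is not a guaranteed 
contract across versions, and the helper name showToString is an assumption:

{code}
import java.io.{ByteArrayOutputStream, PrintStream}
import org.apache.spark.sql.DataFrame

// Capture what df.show(numRows) would print by temporarily redirecting
// Console's standard output to an in-memory buffer.
def showToString(df: DataFrame, numRows: Int = 20): String = {
  val buffer = new ByteArrayOutputStream()
  Console.withOut(new PrintStream(buffer, true, "UTF-8")) {
    df.show(numRows)
  }
  buffer.toString("UTF-8")
}
{code}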



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12128) Multiplication on decimals in dataframe returns null

2015-12-03 Thread Philip Dodds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037839#comment-15037839
 ] 

Philip Dodds commented on SPARK-12128:
--

I tried to add a test case, but when I do it using a simplified approach 

{code}
test("SPARK-12128: Multiplication of decimals in dataframe returning null") {
withTempTable("t") {
  Seq((Decimal(2), Decimal(2)), (Decimal(3), Decimal(3))).toDF("a", 
"b").registerTempTable("t")
  checkAnswer(sql("SELECT a*b FROM t"),
Seq(Row(Decimal(4.0).toBigDecimal),
Row(Decimal(9.0).toBigDecimal)))
}
  }
{code}

It then appears to work, though I did need to convert the target result to a 
BigDecimal.
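
One more data point that may help narrow this down: the symptom is consistent 
with the product overflowing the default decimal precision, since Decimal 
fields inferred from a case class come in as (38,18) and a (38,18) x (38,18) 
product cannot be represented, which Spark reports as null. Under that 
assumption (the (20,2) precision below is an arbitrary choice), casting to a 
narrower type before multiplying avoids the null, reusing the trades DataFrame 
from the description:

{code}
import org.apache.spark.sql.types.DecimalType

// Narrow both operands so the product's precision still fits after Spark
// caps the result type at precision 38.
val narrowed = trades.select(
  trades("price").cast(DecimalType(20, 2)).as("price"),
  trades("quantity").cast(DecimalType(20, 2)).as("quantity"))

narrowed.select((narrowed("price") * narrowed("quantity")).as("notional")).show()
{code}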

> Multiplication on decimals in dataframe returns null
> 
>
> Key: SPARK-12128
> URL: https://issues.apache.org/jira/browse/SPARK-12128
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2
>Reporter: Philip Dodds
>
> I hit a weird issue when I tried to multiply two decimals in a select (either 
> in Scala or as SQL), and I'm assuming I must be missing the point.
> The issue is fairly easy to recreate with something like the following:
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
> case class Trade(quantity: Decimal,price: Decimal)
> val data = Seq.fill(100) {
>   val price = Decimal(20+scala.util.Random.nextInt(10))
> val quantity = Decimal(20+scala.util.Random.nextInt(10))
>   Trade(quantity, price)
> }
> val trades = sc.parallelize(data).toDF()
> trades.registerTempTable("trades")
> trades.select(trades("price")*trades("quantity")).show
> sqlContext.sql("select 
> price/quantity,price*quantity,price+quantity,price-quantity from trades").show
> {code}
> The odd part is that if you run it you will see that the addition, division, 
> and subtraction work, but the multiplication returns null.
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)
> ie. 
> {code}
> +------------------+
> |(price * quantity)|
> +------------------+
> |              null|
> |              null|
> |              null|
> |              null|
> |              null|
> +------------------+
> +--------------------+----+--------+--------+
> |                 _c0| _c1|     _c2|     _c3|
> +--------------------+----+--------+--------+
> |0.952380952380952381|null|41.00...|-1.00...|
> |1.380952380952380952|null|50.00...|    8.00|
> |1.272727272727272727|null|50.00...|    6.00|
> |                0.83|null|44.00...|-4.00...|
> |                1.00|null|58.00...|   0E-18|
> +--------------------+----+--------+--------+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream

2015-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037795#comment-15037795
 ] 

Apache Spark commented on SPARK-12105:
--

User 'jbonofre' has created a pull request for this issue:
https://github.com/apache/spark/pull/10130

> Add a DataFrame.show() with argument for output PrintStream
> ---
>
> Key: SPARK-12105
> URL: https://issues.apache.org/jira/browse/SPARK-12105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Dean Wampler
>Priority: Minor
>
> It would be nice to send the output of DataFrame.show(...) to a different 
> output stream than stdout, including just capturing the string itself. This 
> is useful, e.g., for testing. Actually, it would be sufficient and perhaps 
> better to just make DataFrame.showString a public method, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12105:


Assignee: (was: Apache Spark)

> Add a DataFrame.show() with argument for output PrintStream
> ---
>
> Key: SPARK-12105
> URL: https://issues.apache.org/jira/browse/SPARK-12105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Dean Wampler
>Priority: Minor
>
> It would be nice to send the output of DataFrame.show(...) to a different 
> output stream than stdout, including just capturing the string itself. This 
> is useful, e.g., for testing. Actually, it would be sufficient and perhaps 
> better to just make DataFrame.showString a public method, 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5261) In some cases ,The value of word's vector representation is too big

2015-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5261.
--
Resolution: Duplicate

Re-resolving as a duplicate of several subsequent issues

> In some cases ,The value of word's vector representation is too big
> ---
>
> Key: SPARK-5261
> URL: https://issues.apache.org/jira/browse/SPARK-5261
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Guoqiang Li
>
> Get data:
> {code:none}
> normalize_text() {
>   awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e 
> "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \
>   -e 's/"/ " /g' -e 's/\./ \. /g' -e 's/<br \/>/ /g' -e 's/, / , /g' -e 's/(/ 
> ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \
>   -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e 
> 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \
>   -e 's/«/ /g' | tr 0-9 " "
> }
> wget 
> http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
> gzip -d news.2013.en.shuffled.gz
> normalize_text < news.2013.en.shuffled > data.txt
> {code}
> {code:none}
> import org.apache.spark.mllib.feature.Word2Vec
> val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable }
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36).
>   setMinCount(5)
> val model = word2Vec.fit(text)
> model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
> model.getVectors.size
> => 
> res1: Float = 375059.84
> val word2Vec = new Word2Vec()
> word2Vec.
>   setVectorSize(100).
>   setSeed(42L).
>   setNumIterations(5).
>   setNumPartitions(36).
>   setMinCount(100)
> val model = word2Vec.fit(text)
> model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
> model.getVectors.size
> => 
> res3: Float = 1661285.2 
>  val word2Vec = new Word2Vec()
>  word2Vec.
> setVectorSize(100).
> setSeed(42L).
> setNumIterations(5).
> setNumPartitions(1)
> val model = word2Vec.fit(text)
> model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / 
> model.getVectors.size
> => 
>  0.13889
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12128) Multiplication on decimals in dataframe returns null

2015-12-03 Thread Philip Dodds (JIRA)
Philip Dodds created SPARK-12128:


 Summary: Multiplication on decimals in dataframe returns null
 Key: SPARK-12128
 URL: https://issues.apache.org/jira/browse/SPARK-12128
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2, 1.5.1, 1.5.0
 Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2
Reporter: Philip Dodds


I hit a weird issue when I tried to multiply two decimals in a select (either in 
Scala or as SQL), and I'm assuming I must be missing the point.

The issue is fairly easy to recreate with something like the following:

{code:java}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.types.Decimal

case class Trade(quantity: Decimal,price: Decimal)

val data = Seq.fill(100) {
  val price = Decimal(20+scala.util.Random.nextInt(10))
val quantity = Decimal(20+scala.util.Random.nextInt(10))

  Trade(quantity, price)
}

val trades = sc.parallelize(data).toDF()
trades.registerTempTable("trades")

trades.select(trades("price")*trades("quantity")).show

sqlContext.sql("select 
price/quantity,price*quantity,price+quantity,price-quantity from trades").show

{code}

The odd part is if you run it you will see that the addition/division and 
subtraction works but the multiplication returns a null.

Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)

ie. 

{code}
+--+

|(price * quantity)|

+--+

|  null|

|  null|

|  null|

|  null|

|  null|

+--+



+++++

| _c0| _c1| _c2| _c3|

+++++

|0.952380952380952381|null|41.00...|-1.00...|

|1.380952380952380952|null|50.00...|8.00|

|1.272727272727272727|null|50.00...|6.00|

|0.83|null|44.00...|-4.00...|

|1.00|null|58.00...|   0E-18|

+++++
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7893) Complex Operators between Graphs

2015-12-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037813#comment-15037813
 ] 

Sean Owen commented on SPARK-7893:
--

[~andyyehoo] this and its subtasks don't seem to have progressed; I'm going to 
close it if there's no intent to proceed.

> Complex Operators between Graphs
> 
>
> Key: SPARK-7893
> URL: https://issues.apache.org/jira/browse/SPARK-7893
> Project: Spark
>  Issue Type: Umbrella
>  Components: GraphX
>Reporter: Andy Huang
>  Labels: complex, graph, join, operators, union
>
> Currently there are 30+ operators in GraphX, but few of them operate between 
> graphs. The only one is _*mask*_, which takes another graph as a parameter and 
> returns a new graph.
> In many complex cases, such as _*streaming graphs or merging a small graph into 
> a huge graph*_, higher-level operators between graphs can help users focus and 
> think in terms of graphs. Performance optimization can be done internally and 
> remain transparent to them.
> The list of complex graph operators is 
> here: [complex_graph_operations|http://techieme.in/complex-graph-operations/]
> * Union of Graphs ( G ∪ H )
> * Intersection of Graphs( G ∩ H)
> * Graph Join
> * Difference of Graphs(G – H)
> * Graph Complement
> * Line Graph ( L(G) )
> This issue will serve as an index of all these operators (a rough illustrative 
> sketch of one of them follows below).
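
For illustration only, here is a rough sketch, not proposed API, of how one of these operators (graph union) could be composed from today's primitives. It assumes two graphs {{g}} and {{h}} whose shared vertex ids carry equal attributes:

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.Graph

// Editorial sketch, not part of this proposal: a naive union built from existing
// RDD/GraphX operations. Assumes shared vertex ids have identical attributes.
def unionGraphs[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED],
                                            h: Graph[VD, ED]): Graph[VD, ED] =
  Graph(
    g.vertices.union(h.vertices).distinct(), // de-duplicate vertices present in both
    g.edges.union(h.edges).distinct())       // de-duplicate edges present in both
{code}

A built-in operator could additionally resolve conflicting vertex attributes and optimize the shuffle, which is where first-class support would add value.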



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-03 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037773#comment-15037773
 ] 

Cody Koeninger commented on SPARK-12103:


A cursory review of the Kafka project documentation should reveal that messages 
have a key (used for distribution among partitions) and a value. Why would one 
reasonably expect that Spark documentation referring to a Kafka message key was 
instead supposed to be the message topic?

If you really want the topic name in each item of the rdd, create your separate 
streams, map over them to add the topic name, then union them together into a 
single stream.
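
For reference, a minimal sketch of that workaround, assuming the {{ssc}} and {{consumerProperties}} values from the description are already defined (illustrative only, not tested against any particular Kafka setup):

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

val topics = Seq("topicA", "topicB", "topicC")

// One receiver stream per topic; replace the (unused) message key with the topic
// name so downstream code can tell where each record came from.
val perTopicStreams: Seq[DStream[(String, String)]] = topics.map { topic =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, consumerProperties, Map(topic -> 1), StorageLevel.MEMORY_ONLY_SER)
    .map { case (_, value) => (topic, value) }
}

// A single DStream whose ._1 is now the topic name.
val tagged: DStream[(String, String)] = ssc.union(perTopicStreams)
{code}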

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receiver to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know what topic that data came from other than stashing your 
> key into the value.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12103.
---
Resolution: Not A Problem

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receiver to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know what topic that data came from other than stashing your 
> key into the value.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12129) spark.submit.deployMode not being picked up in SparkSubmitArguments

2015-12-03 Thread Doug Balog (JIRA)
Doug Balog created SPARK-12129:
--

 Summary: spark.submit.deployMode not being picked up in 
SparkSubmitArguments
 Key: SPARK-12129
 URL: https://issues.apache.org/jira/browse/SPARK-12129
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.5.2, 1.4.1, 1.3.1
 Environment: YARN, MESOS
Reporter: Doug Balog
Priority: Trivial


In sparkSubmitArguments.scala

{code}
 deployMode = Option(deployMode).orElse(env.get("DEPLOY_MODE")).orNull
{code}

This code should probably check to see if the env variable SPARK_SUBMIT is set 
and then look up `spark.submit.deployMode` in sparkProperties.

To see this, run something like {code}spark-submit -v --conf 
spark.submit.deployMode=cluster --class org.apache.spark.examples.SparkPi 
$SPARK_HOME/lib/spark-examples-1.6.0-hadoop.2.6.0.jar{code}
The last `spark.submit.deployMode` will be set to client.
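
A rough, self-contained sketch of the suggested lookup order (an illustration only, with hypothetical names; this is not the actual SparkSubmitArguments code):

{code}
// Prefer an explicit value, then spark.submit.deployMode from the loaded
// properties, then the legacy DEPLOY_MODE environment variable.
def resolveDeployMode(explicit: Option[String],
                      sparkProperties: Map[String, String],
                      env: Map[String, String]): String =
  explicit
    .orElse(sparkProperties.get("spark.submit.deployMode"))
    .orElse(env.get("DEPLOY_MODE"))
    .orNull

// resolveDeployMode(None, Map("spark.submit.deployMode" -> "cluster"), sys.env) == "cluster"
{code}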



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12130) Replace shuffleManagerClass with shortShuffleMgrNames in ExternalShuffleBlockResolver

2015-12-03 Thread Lianhui Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lianhui Wang updated SPARK-12130:
-
Component/s: YARN
 Shuffle

> Replace shuffleManagerClass with shortShuffleMgrNames in 
> ExternalShuffleBlockResolver
> -
>
> Key: SPARK-12130
> URL: https://issues.apache.org/jira/browse/SPARK-12130
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Reporter: Lianhui Wang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12130) Replace shuffleManagerClass with shortShuffleMgrNames in ExternalShuffleBlockResolver

2015-12-03 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-12130:


 Summary: Replace shuffleManagerClass with shortShuffleMgrNames in 
ExternalShuffleBlockResolver
 Key: SPARK-12130
 URL: https://issues.apache.org/jira/browse/SPARK-12130
 Project: Spark
  Issue Type: Bug
Reporter: Lianhui Wang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10854) MesosExecutorBackend: Received launchTask but executor was null

2015-12-03 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038058#comment-15038058
 ] 

Matei Zaharia commented on SPARK-10854:
---

Just a note, I saw a log where this happened, and the sequence of events is 
that the executor logs a launchTask callback before registered(). It could be a 
synchronization thing or a problem in the Mesos library.
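
Purely as an illustration of the kind of synchronization that would avoid that race (a hypothetical stand-in class, not the actual MesosExecutorBackend code):

{code}
import java.util.concurrent.CountDownLatch

// Hypothetical stand-in showing one way to tolerate launchTask arriving before
// registered(): block task launches on a latch that registration releases.
class SafeBackend {
  private val registeredLatch = new CountDownLatch(1)
  @volatile private var executorId: String = _

  def registered(id: String): Unit = {
    executorId = id
    registeredLatch.countDown()
  }

  def launchTask(taskId: Long): Unit = {
    registeredLatch.await() // wait until registered() has run
    println(s"executor $executorId launching task $taskId")
  }
}
{code}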

> MesosExecutorBackend: Received launchTask but executor was null
> ---
>
> Key: SPARK-10854
> URL: https://issues.apache.org/jira/browse/SPARK-10854
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.4.0
> Environment: Spark 1.4.0
> Mesos 0.23.0
> Docker 1.8.1
>Reporter: Kevin Matzen
>Priority: Minor
>
> Sometimes my tasks get stuck in staging.  Here's stdout from one such worker. 
>  I'm running mesos-slave inside a docker container with the host's docker 
> exposed and I'm using Spark's docker support to launch the worker inside its 
> own container.  Both containers are running.  I'm using pyspark.  I can see 
> mesos-slave and java running, but I do not see python running.
> {noformat}
> WARNING: Your kernel does not support swap limit capabilities, memory limited 
> without swap.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/09/28 15:02:09 INFO MesosExecutorBackend: Registered signal handlers for 
> [TERM, HUP, INT]
> I0928 15:02:09.65854138 exec.cpp:132] Version: 0.23.0
> 15/09/28 15:02:09 ERROR MesosExecutorBackend: Received launchTask but 
> executor was null
> I0928 15:02:09.70295554 exec.cpp:206] Executor registered on slave 
> 20150928-044200-1140850698-5050-8-S190
> 15/09/28 15:02:09 INFO MesosExecutorBackend: Registered with Mesos as 
> executor ID 20150928-044200-1140850698-5050-8-S190 with 1 cpus
> 15/09/28 15:02:09 INFO SecurityManager: Changing view acls to: root
> 15/09/28 15:02:09 INFO SecurityManager: Changing modify acls to: root
> 15/09/28 15:02:09 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); users 
> with modify permissions: Set(root)
> 15/09/28 15:02:10 INFO Slf4jLogger: Slf4jLogger started
> 15/09/28 15:02:10 INFO Remoting: Starting remoting
> 15/09/28 15:02:10 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkExecutor@:56458]
> 15/09/28 15:02:10 INFO Utils: Successfully started service 'sparkExecutor' on 
> port 56458.
> 15/09/28 15:02:10 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-28a21c2d-54cc-40b3-b0c2-cc3624f1a73c/blockmgr-f2336fec-e1ea-44f1-bd5c-9257049d5e7b
> 15/09/28 15:02:10 INFO MemoryStore: MemoryStore started with capacity 52.1 MB
> 15/09/28 15:02:11 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/09/28 15:02:11 INFO Executor: Starting executor ID 
> 20150928-044200-1140850698-5050-8-S190 on host 
> 15/09/28 15:02:11 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57431.
> 15/09/28 15:02:11 INFO NettyBlockTransferService: Server created on 57431
> 15/09/28 15:02:11 INFO BlockManagerMaster: Trying to register BlockManager
> 15/09/28 15:02:11 INFO BlockManagerMaster: Registered BlockManager
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-03 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038084#comment-15038084
 ] 

Cody Koeninger commented on SPARK-12103:


On the off chance you're acting in good faith, actually did read the Kafka 
documentation and Spark API documentation, and were still somehow unclear that 
K meant the type of the message Key and V meant the type of the message 
Value... 

I'll submit a PR to add @tparam docs for every type in the KafkaUtils API.

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receiver to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know what topic that data came from other than stashing your 
> key into the value.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12128) Multiplication on decimals in dataframe returns null

2015-12-03 Thread Philip Dodds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038104#comment-15038104
 ] 

Philip Dodds commented on SPARK-12128:
--

I took a look at the plan

{code}
Results do not match for query:
== Parsed Logical Plan ==
'Project [unresolvedalias(('a * 'b))]
+- 'UnresolvedRelation `t`, None

== Analyzed Logical Plan ==
_c0: decimal(38,36)
Project [CheckOverflow((promote_precision(cast(a#45 as decimal(38,18))) * 
promote_precision(cast(b#46 as decimal(38,18)))), DecimalType(38,36)) AS _c0#47]
+- Subquery t
   +- Project [_1#43 AS a#45,_2#44 AS b#46]
  +- LocalRelation [_1#43,_2#44], 
[[20.00,20.00],[20.00,20.00]]

== Optimized Logical Plan ==
LocalRelation [_c0#47], [[null],[null]]

== Physical Plan ==
LocalTableScan [_c0#47], [[null],[null]]
== Results ==
!== Correct Answer - 2 ==   == Spark Answer - 2 ==
![400.0]  [null]
![400.0]  [null]
  
{code}

The scale appears to have doubled, and thus if the result needs more than two 
digits to the left of the decimal point it ends up as an overflow. I'm not sure 
whether it's the way I'm using the decimal or the code that is wrong.
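
For what it's worth, a back-of-the-envelope check of that promotion, assuming the usual Hive-style rule for decimal multiplication (precision = p1 + p2 + 1 and scale = s1 + s2, both capped at 38) — an assumption about the optimizer's behaviour, not a quote from the Spark source:

{code}
val (p1, s1) = (38, 18)
val (p2, s2) = (38, 18)

val resultScale     = s1 + s2                       // 36
val resultPrecision = math.min(p1 + p2 + 1, 38)     // 77, capped to 38
val integerDigits   = resultPrecision - resultScale // 2

// decimal(38,36) can only hold values below 100 in magnitude, so 20 * 20 = 400
// overflows and is returned as null.
println(s"decimal($resultPrecision,$resultScale) leaves $integerDigits integer digits")
{code}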


> Multiplication on decimals in dataframe returns null
> 
>
> Key: SPARK-12128
> URL: https://issues.apache.org/jira/browse/SPARK-12128
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2
>Reporter: Philip Dodds
>
> I hit a weird issue when I tried to multiply two decimals in a select (either 
> in Scala or as SQL), and I'm assuming I must be missing the point.
> The issue is fairly easy to recreate with something like the following:
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
> case class Trade(quantity: Decimal,price: Decimal)
> val data = Seq.fill(100) {
>   val price = Decimal(20+scala.util.Random.nextInt(10))
> val quantity = Decimal(20+scala.util.Random.nextInt(10))
>   Trade(quantity, price)
> }
> val trades = sc.parallelize(data).toDF()
> trades.registerTempTable("trades")
> trades.select(trades("price")*trades("quantity")).show
> sqlContext.sql("select 
> price/quantity,price*quantity,price+quantity,price-quantity from trades").show
> {code}
> The odd part is if you run it you will see that the addition/division and 
> subtraction works but the multiplication returns a null.
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)
> ie. 
> {code}
> +--+
> |(price * quantity)|
> +--+
> |  null|
> |  null|
> |  null|
> |  null|
> |  null|
> +--+
> +++++
> | _c0| _c1| _c2| _c3|
> +++++
> |0.952380952380952381|null|41.00...|-1.00...|
> |1.380952380952380952|null|50.00...|8.00|
> |1.272727272727272727|null|50.00...|6.00|
> |0.83|null|44.00...|-4.00...|
> |1.00|null|58.00...|   0E-18|
> +++++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-03 Thread Dan Dutrow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038106#comment-15038106
 ] 

Dan Dutrow commented on SPARK-12103:


Thanks. I don't mean to be a pain. I did make an honest attempt to figure it 
out and I'm not completely numb, just inexperienced with Kafka. I was trying to 
find a better way to pool resources and not have to dedicate a core to each 
receiver. Having the key be the topic name was wishful thinking and nothing in 
the documentation corrected that assumption.


> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receiver to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know what topic that data came from other than stashing your 
> key into the value.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12131) Cannot create ExpressionEncoder for Array[T] where T is a nested class

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12131:


Assignee: Cheng Lian  (was: Apache Spark)

> Cannot create ExpressionEncoder for Array[T] where T is a nested class
> --
>
> Key: SPARK-12131
> URL: https://issues.apache.org/jira/browse/SPARK-12131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following snippet reproduces this issue under {{spark-shell}}:
> {noformat}
> import sqlContext._
> case class N(i: Int)
> case class T(a: Array[N])
> val ds = Seq(T(Array(N(1)))).toDS()
> ds.show()
> {noformat}
> Note that, classes defined in Scala REPL are inherently nested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs

2015-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038038#comment-15038038
 ] 

Apache Spark commented on SPARK-11827:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/10125

> Support java.math.BigInteger in Type-Inference utilities for POJOs
> --
>
> Key: SPARK-11827
> URL: https://issues.apache.org/jira/browse/SPARK-11827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Abhilash Srimat Tirumala Pallerlamudi
>Priority: Minor
>
> I get the below exception when creating DataFrame using RDD of JavaBean 
> having a property of type java.math.BigInteger
> scala.MatchError: class java.math.BigInteger (of class java.lang.Class)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1182)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1181)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1181)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:419)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:447)
> I don't see the support for java.math.BigInteger in 
> org.apache.spark.sql.catalyst.JavaTypeInference.scala 
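
A possible interim workaround, sketched here as an assumption on the reporter's behalf (not taken from the linked pull request): convert the unsupported java.math.BigInteger value to a decimal type that the type inference already understands before building the DataFrame. The names below ({{accounts}}, {{AccountRow}}) are hypothetical, and {{sqlContext}} plus an {{accounts: RDD[(String, BigInteger)]}} are assumed to exist:

{code}
import java.math.{BigDecimal => JBigDecimal, BigInteger}

// Hypothetical row shape: carry the value as a (Scala) BigDecimal, which Spark SQL
// already maps to DecimalType.
case class AccountRow(id: String, balance: BigDecimal)

val df = sqlContext.createDataFrame(
  accounts.map { case (id, bal) => AccountRow(id, BigDecimal(new JBigDecimal(bal))) })
{code}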



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11827:


Assignee: Apache Spark

> Support java.math.BigInteger in Type-Inference utilities for POJOs
> --
>
> Key: SPARK-11827
> URL: https://issues.apache.org/jira/browse/SPARK-11827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Abhilash Srimat Tirumala Pallerlamudi
>Assignee: Apache Spark
>Priority: Minor
>
> I get the below exception when creating DataFrame using RDD of JavaBean 
> having a property of type java.math.BigInteger
> scala.MatchError: class java.math.BigInteger (of class java.lang.Class)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1182)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1181)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1181)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:419)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:447)
> I don't see the support for java.math.BigInteger in 
> org.apache.spark.sql.catalyst.JavaTypeInference.scala 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038115#comment-15038115
 ] 

Apache Spark commented on SPARK-12103:
--

User 'koeninger' has created a pull request for this issue:
https://github.com/apache/spark/pull/10132

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receiver to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know what topic that data came from other than stashing your 
> key into the value.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-03 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger reopened SPARK-12103:


PR for doc change:

https://github.com/apache/spark/pull/10132

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receiver to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know what topic that data came from other than stashing your 
> key into the value.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12129) spark.submit.deployMode not being picked up in SparkSubmitArguments

2015-12-03 Thread Doug Balog (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038148#comment-15038148
 ] 

Doug Balog commented on SPARK-12129:


Ok, Thanks.

> spark.submit.deployMode not being picked up in SparkSubmitArguments
> ---
>
> Key: SPARK-12129
> URL: https://issues.apache.org/jira/browse/SPARK-12129
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.1, 1.4.1, 1.5.2
> Environment: YARN, MESOS
>Reporter: Doug Balog
>Priority: Trivial
>
> In sparkSubmitArguments.scala
> {code}
>  deployMode = Option(deployMode).orElse(env.get("DEPLOY_MODE")).orNull
> {code}
> This code should probably check to see if the env variable SPARK_SUBMIT is 
> set and then look up `spark.submit.deployMode` in sparkProperties.
> To see this, run something like {code}spark-submit -v --conf 
> spark.submit.deployMode=cluster --class org.apache.spark.examples.SparkPi 
> $SPARK_HOME/lib/spark-examples-1.6.0-hadoop.2.6.0.jar{code}
> The last `spark.submit.deployMode` will be set to client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9039) Jobs page shows nonsensical task-progress-bar numbers when speculation occurs

2015-12-03 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038027#comment-15038027
 ] 

Ryan Williams commented on SPARK-9039:
--

Just saw this in Spark 1.5.2

!http://f.cl.ly/items/0x2y2N292i3w1K0r2T2N/Screen%20Shot%202015-12-01%20at%204.29.19%20PM.png!

[Here|https://gist.github.com/ryan-williams/5d9902dd81c36db9608a] is the event 
log for that Spark app; if you can get a history server to load just the first 
47 lines of that, that should be the point in time where I saw the bug. I can't 
get the history server to read {{head -n 47}} of that file.

> Jobs page shows nonsensical task-progress-bar numbers when speculation occurs
> -
>
> Key: SPARK-9039
> URL: https://issues.apache.org/jira/browse/SPARK-9039
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.1
>Reporter: Ryan Williams
>Priority: Minor
>
> The stages page normalizes its task-progress-bars by task indexes, but the 
> job page does not:
> !http://f.cl.ly/items/043O173E453U3H3i010t/Screen%20Shot%202015-07-13%20at%209.46.48%20PM.png!
> Note the 11258/10302, and the light-blue "running task" bar DOM element that 
> is wrapped onto the next line and hidden.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12129) spark.submit.deployMode not being picked up in SparkSubmitArguments

2015-12-03 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-12129.

Resolution: Invalid

That's an internal property, it's not documented, and users are not supposed to 
set it.

I think what you're really looking for is SPARK-10123.

> spark.submit.deployMode not being picked up in SparkSubmitArguments
> ---
>
> Key: SPARK-12129
> URL: https://issues.apache.org/jira/browse/SPARK-12129
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.1, 1.4.1, 1.5.2
> Environment: YARN, MESOS
>Reporter: Doug Balog
>Priority: Trivial
>
> In sparkSubmitArguments.scala
> {code}
>  deployMode = Option(deployMode).orElse(env.get("DEPLOY_MODE")).orNull
> {code}
> This code should probably check to see if the env variable SPARK_SUBMIT is 
> set and then look up `spark.submit.deployMode` in sparkProperties.
> To see this, run something like {code}spark-submit -v --conf 
> spark.submit.deployMode=cluster --class org.apache.spark.examples.SparkPi 
> $SPARK_HOME/lib/spark-examples-1.6.0-hadoop.2.6.0.jar{code}
> The last `spark.submit.deployMode` will be set to client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12128) Multiplication on decimals in dataframe returns null

2015-12-03 Thread Philip Dodds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038068#comment-15038068
 ] 

Philip Dodds commented on SPARK-12128:
--

Actually, that was a little too simple. I went back to the use case and tried 
this, and again it fails.

A simple failing test case would be

{code}

test("SPARK-12128: Multiplication of decimals in dataframe returning null") {
withTempTable("t") {
  Seq((Decimal(20), Decimal(20)), (Decimal(20), Decimal(20))).toDF("a", 
"b").registerTempTable("t")
  checkAnswer(sql("SELECT a*b FROM t"),
Seq(Row(Decimal(4.0).toBigDecimal),
Row(Decimal(9.0).toBigDecimal)))
}
  }

{code}

> Multiplication on decimals in dataframe returns null
> 
>
> Key: SPARK-12128
> URL: https://issues.apache.org/jira/browse/SPARK-12128
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2
>Reporter: Philip Dodds
>
> I hit a weird issue when I tried to multiply two decimals in a select (either 
> in Scala or as SQL), and I'm assuming I must be missing the point.
> The issue is fairly easy to recreate with something like the following:
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
> case class Trade(quantity: Decimal,price: Decimal)
> val data = Seq.fill(100) {
>   val price = Decimal(20+scala.util.Random.nextInt(10))
> val quantity = Decimal(20+scala.util.Random.nextInt(10))
>   Trade(quantity, price)
> }
> val trades = sc.parallelize(data).toDF()
> trades.registerTempTable("trades")
> trades.select(trades("price")*trades("quantity")).show
> sqlContext.sql("select 
> price/quantity,price*quantity,price+quantity,price-quantity from trades").show
> {code}
> The odd part is if you run it you will see that the addition/division and 
> subtraction works but the multiplication returns a null.
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)
> ie. 
> {code}
> +--+
> |(price * quantity)|
> +--+
> |  null|
> |  null|
> |  null|
> |  null|
> |  null|
> +--+
> +++++
> | _c0| _c1| _c2| _c3|
> +++++
> |0.952380952380952381|null|41.00...|-1.00...|
> |1.380952380952380952|null|50.00...|8.00|
> |1.272727272727272727|null|50.00...|6.00|
> |0.83|null|44.00...|-4.00...|
> |1.00|null|58.00...|   0E-18|
> +++++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12128) Multiplication on decimals in dataframe returns null

2015-12-03 Thread Philip Dodds (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip Dodds closed SPARK-12128.

Resolution: Invalid

The issue is the promotion of the scale: since the default Decimal has a 
precision of 38 and a scale of 18, the multiplication promotes the result type 
using the defined logic.

This means that (since 38 is the maximum precision) the result scale is the two 
scales added together (36), so the result type is (38,36) and thus we get an 
overflow.

I assume the solution is to cast the two decimals going in, so the result type 
has enough integer digits, i.e.

SELECT cast(a as decimal(38,9))*cast(b as decimal(38,9)) FROM t
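
For completeness, a runnable form of that workaround, assuming {{sqlContext}} is in scope and the {{t}} table from the test case above is still registered. Casting the operands to scale 9 keeps the product at decimal(38,18), leaving plenty of room for the integer part:

{code}
sqlContext.sql(
  "SELECT CAST(a AS DECIMAL(38,9)) * CAST(b AS DECIMAL(38,9)) FROM t").show()
{code}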



> Multiplication on decimals in dataframe returns null
> 
>
> Key: SPARK-12128
> URL: https://issues.apache.org/jira/browse/SPARK-12128
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2
>Reporter: Philip Dodds
>
> I hit a weird issue when I tried to multiply two decimals in a select (either 
> in Scala or as SQL), and I'm assuming I must be missing the point.
> The issue is fairly easy to recreate with something like the following:
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
> case class Trade(quantity: Decimal,price: Decimal)
> val data = Seq.fill(100) {
>   val price = Decimal(20+scala.util.Random.nextInt(10))
> val quantity = Decimal(20+scala.util.Random.nextInt(10))
>   Trade(quantity, price)
> }
> val trades = sc.parallelize(data).toDF()
> trades.registerTempTable("trades")
> trades.select(trades("price")*trades("quantity")).show
> sqlContext.sql("select 
> price/quantity,price*quantity,price+quantity,price-quantity from trades").show
> {code}
> The odd part is if you run it you will see that the addition/division and 
> subtraction works but the multiplication returns a null.
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11)
> ie. 
> {code}
> +--+
> |(price * quantity)|
> +--+
> |  null|
> |  null|
> |  null|
> |  null|
> |  null|
> +--+
> +++++
> | _c0| _c1| _c2| _c3|
> +++++
> |0.952380952380952381|null|41.00...|-1.00...|
> |1.380952380952380952|null|50.00...|8.00|
> |1.272727272727272727|null|50.00...|6.00|
> |0.83|null|44.00...|-4.00...|
> |1.00|null|58.00...|   0E-18|
> +++++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11827:


Assignee: (was: Apache Spark)

> Support java.math.BigInteger in Type-Inference utilities for POJOs
> --
>
> Key: SPARK-11827
> URL: https://issues.apache.org/jira/browse/SPARK-11827
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Abhilash Srimat Tirumala Pallerlamudi
>Priority: Minor
>
> I get the below exception when creating DataFrame using RDD of JavaBean 
> having a property of type java.math.BigInteger
> scala.MatchError: class java.math.BigInteger (of class java.lang.Class)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1182)
> at 
> org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1181)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1181)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:419)
> at 
> org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:447)
> I don't see the support for java.math.BigInteger in 
> org.apache.spark.sql.catalyst.JavaTypeInference.scala 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12131) Cannot create ExpressionEncoder for Array[T] where T is a nested class

2015-12-03 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12131:
--

 Summary: Cannot create ExpressionEncoder for Array[T] where T is a 
nested class
 Key: SPARK-12131
 URL: https://issues.apache.org/jira/browse/SPARK-12131
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Cheng Lian
Assignee: Cheng Lian


The following snippet reproduces this issue under {{spark-shell}}:

{noformat}
import sqlContext._
case class N(i: Int)
case class T(a: Array[N])
val ds = Seq(T(Array(N(1)))).toDS()
ds.show()
{noformat}

Note that, classes defined in Scala REPL are inherently nested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12131) Cannot create ExpressionEncoder for Array[T] where T is a nested class

2015-12-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038121#comment-15038121
 ] 

Apache Spark commented on SPARK-12131:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/10133

> Cannot create ExpressionEncoder for Array[T] where T is a nested class
> --
>
> Key: SPARK-12131
> URL: https://issues.apache.org/jira/browse/SPARK-12131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following snippet reproduces this issue under {{spark-shell}}:
> {noformat}
> import sqlContext._
> case class N(i: Int)
> case class T(a: Array[N])
> val ds = Seq(T(Array(N(1)))).toDS()
> ds.show()
> {noformat}
> Note that, classes defined in Scala REPL are inherently nested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12131) Cannot create ExpressionEncoder for Array[T] where T is a nested class

2015-12-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12131:


Assignee: Apache Spark  (was: Cheng Lian)

> Cannot create ExpressionEncoder for Array[T] where T is a nested class
> --
>
> Key: SPARK-12131
> URL: https://issues.apache.org/jira/browse/SPARK-12131
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> The following snippet reproduces this issue under {{spark-shell}}:
> {noformat}
> import sqlContext._
> case class N(i: Int)
> case class T(a: Array[N])
> val ds = Seq(T(Array(N(1)))).toDS()
> ds.show()
> {noformat}
> Note that, classes defined in Scala REPL are inherently nested.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12116) Document workaround when method conflicts with another R package, like dplyr

2015-12-03 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12116.
---
   Resolution: Fixed
 Assignee: Felix Cheung
Fix Version/s: 1.6.0

Resolved by https://github.com/apache/spark/pull/10119

> Document workaround when method conflicts with another R package, like dplyr
> 
>
> Key: SPARK-12116
> URL: https://issues.apache.org/jira/browse/SPARK-12116
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 1.6.0
>
>
> See https://issues.apache.org/jira/browse/SPARK-11886



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-03 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037986#comment-15037986
 ] 

Cody Koeninger commented on SPARK-12103:


Knowing that Kafka messages have a key and value isn't a "Kafka expert" thing. 
Serious question: did you read http://kafka.apache.org/documentation.html 
before posting this?

The api documentation for createStream at 
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$

keyTypeClass Key type of DStream
valueTypeClass value type of Dstream
keyDecoderClass Type of kafka key decoder
valueDecoderClass Type of kafka value decoder

How is this not clear?

If you have a specific improvement to docs, send a PR

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receiver to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String,String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know what topic that data came from other than stashing your 
> key into the value.  Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( 
> i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, consumerProperties,
> topics,
> StorageLevel.MEMORY_ONLY_SER))
> val unioned :DStream[(String,String)] = ssc.union(streams)
> unioned.flatMap(x => {
>val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>   ..
>   }
> }
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3200) Class defined with reference to external variables crashes in REPL.

2015-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3200.
--
Resolution: Cannot Reproduce

It works as of 1.6, so it must have been fixed along the way somewhere, but I 
don't know where.

> Class defined with reference to external variables crashes in REPL.
> ---
>
> Key: SPARK-3200
> URL: https://issues.apache.org/jira/browse/SPARK-3200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>
> Reproducer:
> {noformat}
> val a = sc.textFile("README.md").count
> case class A(i: Int) { val j = a} 
> sc.parallelize(1 to 10).map(A(_)).collect()
> {noformat}
> This will happen, when one refers something that refers sc and not otherwise. 
> There are many ways to work around this, like directly assign a constant 
> value instead of referring the variable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12128) Multiplication on decimals in dataframe returns null

2015-12-03 Thread Philip Dodds (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037933#comment-15037933
 ] 

Philip Dodds commented on SPARK-12128:
--

I believe the problem is with the Decimal constructor; if you use

{code}
val data = Seq.fill(5) {
 Trade(Decimal(BigDecimal(5),38,20), Decimal(BigDecimal(5),38,20))
   }
{code}

Then you will get the correct result,   can probably just close the issue
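Put together, a minimal end-to-end sketch of that workaround (sc comes from the shell, and the precision/scale values simply follow the snippet above; this is illustrative, not the eventual fix):

{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
import org.apache.spark.sql.types.Decimal

case class Trade(quantity: Decimal, price: Decimal)

// Construct the decimals with an explicit precision and scale, which the
// comment above reports makes the multiplication return a value instead of null.
val data = Seq.fill(5) {
  Trade(Decimal(BigDecimal(5), 38, 20), Decimal(BigDecimal(5), 38, 20))
}
val trades = sc.parallelize(data).toDF()
trades.select(trades("price") * trades("quantity")).show()
{code}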

> Multiplication on decimals in dataframe returns null
> 
>
> Key: SPARK-12128
> URL: https://issues.apache.org/jira/browse/SPARK-12128
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
> Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2
>Reporter: Philip Dodds
>
> I hit a weird issue when I tried to multiply two decimals in a select (either 
> in Scala or in SQL), and I'm assuming I must be missing something.
> The issue is fairly easy to recreate with something like the following:
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.types.Decimal
> case class Trade(quantity: Decimal,price: Decimal)
> val data = Seq.fill(100) {
>   val price = Decimal(20 + scala.util.Random.nextInt(10))
>   val quantity = Decimal(20 + scala.util.Random.nextInt(10))
>   Trade(quantity, price)
> }
> val trades = sc.parallelize(data).toDF()
> trades.registerTempTable("trades")
> trades.select(trades("price")*trades("quantity")).show
> sqlContext.sql("select 
> price/quantity,price*quantity,price+quantity,price-quantity from trades").show
> {code}
> The odd part is that if you run it you will see that the addition, division, 
> and subtraction work, but the multiplication returns null.
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11).
> i.e.
> {code}
> +--+
> |(price * quantity)|
> +--+
> |  null|
> |  null|
> |  null|
> |  null|
> |  null|
> +--+
> +++++
> | _c0| _c1| _c2| _c3|
> +++++
> |0.952380952380952381|null|41.00...|-1.00...|
> |1.380952380952380952|null|50.00...|8.00|
> |1.272727272727272727|null|50.00...|6.00|
> |0.83|null|44.00...|-4.00...|
> |1.00|null|58.00...|   0E-18|
> +++++
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8796) make sure SparkPlan is only executed at driver side

2015-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8796.
--
Resolution: Won't Fix

> make sure SparkPlan is only executed at driver side
> ---
>
> Key: SPARK-8796
> URL: https://issues.apache.org/jira/browse/SPARK-8796
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-03 Thread Dan Dutrow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037955#comment-15037955
 ] 

Dan Dutrow edited comment on SPARK-12103 at 12/3/15 4:16 PM:
-

I'm sure all of this is obvious to Kafka experts. Nowhere in the Spark Scala 
function documentation does it say that the Left value is the Kafka key, except 
if you make a guess about what K is. You have to trace through Spark and Kafka 
code to figure it out. Please update the KafkaWordCount.scala and 
DirectKafkaWordCount.scala to include a comment about what is in the ._1 field.


was (Author: dutrow):
I'm sure all of this is obvious to Kafka experts. Nowhere in the Scala Spark 
documentation does it say that the Left value is the Kafka key, except if you 
make a guess about what K is. You have to trace through Spark and Kafka code to 
figure it out. Please update the KafkaWordCount.scala and 
DirectKafkaWordCount.scala to include a comment about what is in the ._1 field.

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receivers to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String, String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know which topic the data came from, other than stashing your 
> key into the value. Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams: IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map(i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
>     ssc, consumerProperties,
>     topics,
>     StorageLevel.MEMORY_ONLY_SER))
> val unioned: DStream[(String, String)] = ssc.union(streams)
> unioned.flatMap(x => {
>   val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>     ..
>   }
> })
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-03 Thread Dan Dutrow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037955#comment-15037955
 ] 

Dan Dutrow edited comment on SPARK-12103 at 12/3/15 4:15 PM:
-

I'm sure all of this is obvious to Kafka experts. Nowhere in the Scala Spark 
documentation does it say that the Left value is the Kafka key, except if you 
make a guess about what K is. You have to trace through Spark and Kafka code to 
figure it out. Please update the KafkaWordCount.scala and 
DirectKafkaWordCount.scala to include a comment about what is in the ._1 field.


was (Author: dutrow):
I'm sure all of this is obvious to Kafka experts. Nowhere in the spark 
documentation does it say that the Left value is the Kafka key, except if you 
make a guess about what K is. You have to trace through Spark and Kafka code to 
figure it out. Please update the KafkaWordCount.scala and 
DirectKafkaWordCount.scala to include a comment about what is in the ._1 field.

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receivers to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String, String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know which topic the data came from, other than stashing your 
> key into the value. Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams: IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map(i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
>     ssc, consumerProperties,
>     topics,
>     StorageLevel.MEMORY_ONLY_SER))
> val unioned: DStream[(String, String)] = ssc.union(streams)
> unioned.flatMap(x => {
>   val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>     ..
>   }
> })
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12129) spark.submit.deployMode not being picked up in SparkSubmitArguments

2015-12-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037915#comment-15037915
 ] 

Sean Owen commented on SPARK-12129:
---

I'm not sure you're intended to set {{spark.submit.deployMode}} directly; use 
{{--deploy-mode}}. At least, the former is undocumented and appears to be 
internal, and setting it gets ignored. On those grounds, I'd close this. 

> spark.submit.deployMode not being picked up in SparkSubmitArguments
> ---
>
> Key: SPARK-12129
> URL: https://issues.apache.org/jira/browse/SPARK-12129
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.1, 1.4.1, 1.5.2
> Environment: YARN, MESOS
>Reporter: Doug Balog
>Priority: Trivial
>
> In SparkSubmitArguments.scala:
> {code}
>  deployMode = Option(deployMode).orElse(env.get("DEPLOY_MODE")).orNull
> {code}
> This code should probably check to see if the env variable SPARK_SUBMIT is 
> set and then look up `spark.submit.deployMode` in sparkProperties.
> To see this, run something like {code}spark-submit -v --conf 
> spark.submit.deployMode=cluster --class org.apache.spark.examples.SparkPi 
> $SPARK_HOME/lib/spark-examples-1.6.0-hadoop.2.6.0.jar{code}
> In the verbose output, `spark.submit.deployMode` will still end up set to client.
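For illustration, a sketch of the check the description suggests, written against SparkSubmitArguments-style names (the ordering and the sparkProperties lookup are assumptions, not the actual patch):

{code}
// Sketch only: fall back to spark.submit.deployMode from the merged properties
// (e.g. spark-defaults.conf or --conf) before leaving deployMode null.
deployMode = Option(deployMode)
  .orElse(sparkProperties.get("spark.submit.deployMode"))
  .orElse(env.get("DEPLOY_MODE"))
  .orNull
{code}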



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7337) FPGrowth algo throwing OutOfMemoryError

2015-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-7337.


> FPGrowth algo throwing OutOfMemoryError
> ---
>
> Key: SPARK-7337
> URL: https://issues.apache.org/jira/browse/SPARK-7337
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
> Environment: Ubuntu
>Reporter: Amit Gupta
> Attachments: FPGrowthBug.png
>
>
> When running the FPGrowth algorithm on huge data (GBs) with numPartitions=500, 
> it throws an OutOfMemoryError after some time.
> The algorithm runs correctly up to "collect at FPGrowth.scala:131", where it 
> creates 500 tasks. It fails at the next stage, "flatMap at FPGrowth.scala:150", 
> where instead of 500 tasks it creates some internally calculated 17 tasks.
> Please refer to the attached screenshot.
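For context, a minimal sketch of the call being described (the data, minimum support, and partition count are illustrative; the API is MLlib's org.apache.spark.mllib.fpm.FPGrowth):

{code}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// Placeholder transactions: one Array of items per basket.
val transactions: RDD[Array[String]] = sc.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "c"),
  Array("b", "d")))

val model = new FPGrowth()
  .setMinSupport(0.3)      // illustrative threshold
  .setNumPartitions(500)   // the setting mentioned in the report
  .run(transactions)

model.freqItemsets.collect().foreach { itemset =>
  println(itemset.items.mkString("[", ",", "]") + ": " + itemset.freq)
}
{code}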



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7337) FPGrowth algo throwing OutOfMemoryError

2015-12-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7337.
--
Resolution: Invalid

> FPGrowth algo throwing OutOfMemoryError
> ---
>
> Key: SPARK-7337
> URL: https://issues.apache.org/jira/browse/SPARK-7337
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
> Environment: Ubuntu
>Reporter: Amit Gupta
> Attachments: FPGrowthBug.png
>
>
> When running the FPGrowth algorithm on huge data (GBs) with numPartitions=500, 
> it throws an OutOfMemoryError after some time.
> The algorithm runs correctly up to "collect at FPGrowth.scala:131", where it 
> creates 500 tasks. It fails at the next stage, "flatMap at FPGrowth.scala:150", 
> where instead of 500 tasks it creates some internally calculated 17 tasks.
> Please refer to the attached screenshot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-03 Thread Dan Dutrow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dan Dutrow reopened SPARK-12103:


Please update the Kafka word count examples to comment on what the key field 
is. The current examples ignore the key without explanation.

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receivers to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String, String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know which topic the data came from, other than stashing your 
> key into the value. Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams: IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map(i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
>     ssc, consumerProperties,
>     topics,
>     StorageLevel.MEMORY_ONLY_SER))
> val unioned: DStream[(String, String)] = ssc.union(streams)
> unioned.flatMap(x => {
>   val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>     ..
>   }
> })
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12129) spark.submit.deployMode not being picked up in SparkSubmitArguments

2015-12-03 Thread Doug Balog (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037969#comment-15037969
 ] 

Doug Balog commented on SPARK-12129:


 I thought using {{--conf spark.submit.deployMode=cluster}} with SparkPi was an 
easy way to demonstrate the problem.

I'm trying to set the default deployMode for {{spark-submit}} to be cluster 
mode. So I have {{master}} set to {{yarn}}, and {{spark.submit.deployMode}} set 
to {{cluster}} in {{spark-defaults.conf}}. If I set {{master}} to 
{{yarn-cluster}} instead, {{spark-shell}} complains that cluster deploy mode is 
not applicable.



> spark.submit.deployMode not being picked up in SparkSubmitArguments
> ---
>
> Key: SPARK-12129
> URL: https://issues.apache.org/jira/browse/SPARK-12129
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.1, 1.4.1, 1.5.2
> Environment: YARN, MESOS
>Reporter: Doug Balog
>Priority: Trivial
>
> In SparkSubmitArguments.scala:
> {code}
>  deployMode = Option(deployMode).orElse(env.get("DEPLOY_MODE")).orNull
> {code}
> This code should probably check to see if the env variable SPARK_SUBMIT is 
> set and then look up `spark.submit.deployMode` in sparkProperties.
> To see this, run something like {code}spark-submit -v --conf 
> spark.submit.deployMode=cluster --class org.apache.spark.examples.SparkPi 
> $SPARK_HOME/lib/spark-examples-1.6.0-hadoop.2.6.0.jar{code}
> In the verbose output, `spark.submit.deployMode` will still end up set to client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected

2015-12-03 Thread Dan Dutrow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037955#comment-15037955
 ] 

Dan Dutrow edited comment on SPARK-12103 at 12/3/15 3:53 PM:
-

I'm sure all of this is obvious to Kafka experts. Nowhere in the spark 
documentation does it say that the Left value is the Kafka key, except if you 
make a guess about what K is. You have to trace through Spark and Kafka code to 
figure it out. Please update the KafkaWordCount.scala and 
DirectKafkaWordCount.scala to include a comment about what is in the ._1 field.


was (Author: dutrow):
I'm sure all of this is obvious to Kafka experts. Nowhere in the spark 
documentation does it say that the Left value is the Kafka key, except if you 
make a guess about what K is. You have to trace through Spark and Kafka code to 
figure it out. Please update the DirectKafkaWordCount.scala to include a 
comment about what is in the ._1 field.

> KafkaUtils createStream with multiple topics -- does not work as expected
> -
>
> Key: SPARK-12103
> URL: https://issues.apache.org/jira/browse/SPARK-12103
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>Priority: Minor
> Fix For: 1.4.2
>
>
> (Note: yes, there is a Direct API that may be better, but it's not the 
> easiest thing to get started with. The Kafka Receiver API still needs to 
> work, especially for newcomers)
> When creating a receiver stream using KafkaUtils, there is a valid use case 
> where you would want to use one (or a few) Kafka Streaming Receivers to pool 
> resources. I have 10+ topics and don't want to dedicate 10 cores to 
> processing all of them. However, when reading the data produced by 
> KafkaUtils.createStream, the DStream[(String, String)] does not properly 
> insert the topic name into the tuple. The left key is always null, making it 
> impossible to know which topic the data came from, other than stashing your 
> key into the value. Is there a way around that problem?
>  CODE
> val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, 
> "topicE" -> 1, "topicF" -> 1, ...)
> val streams: IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map(i =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
>     ssc, consumerProperties,
>     topics,
>     StorageLevel.MEMORY_ONLY_SER))
> val unioned: DStream[(String, String)] = ssc.union(streams)
> unioned.flatMap(x => {
>   val (key, value) = x
>   // key is always null!
>   // value has data from any one of my topics
>   key match ... {
>     ..
>   }
> })
>  END CODE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12129) spark.submit.deployMode not being picked up in SparkSubmitArguments

2015-12-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037980#comment-15037980
 ] 

Sean Owen commented on SPARK-12129:
---

Hm I see. Let's see if others chime in about what is supposed to happen here.

> spark.submit.deployMode not being picked up in SparkSubmitArguments
> ---
>
> Key: SPARK-12129
> URL: https://issues.apache.org/jira/browse/SPARK-12129
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.3.1, 1.4.1, 1.5.2
> Environment: YARN, MESOS
>Reporter: Doug Balog
>Priority: Trivial
>
> In SparkSubmitArguments.scala:
> {code}
>  deployMode = Option(deployMode).orElse(env.get("DEPLOY_MODE")).orNull
> {code}
> This code should probably check to see if the env variable SPARK_SUBMIT is 
> set and then look up `spark.submit.deployMode` in sparkProperties.
> To see this, run something like {code}spark-submit -v --conf 
> spark.submit.deployMode=cluster --class org.apache.spark.examples.SparkPi 
> $SPARK_HOME/lib/spark-examples-1.6.0-hadoop.2.6.0.jar{code}
> In the verbose output, `spark.submit.deployMode` will still end up set to client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-12-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037932#comment-15037932
 ] 

Sean Owen commented on SPARK-5081:
--

Is this still an issue? I'm trying to figure out whether we believe there is 
still something to do here. Note there have been some shuffle and snappy 
changes in between.

> Shuffle write increases
> ---
>
> Key: SPARK-5081
> URL: https://issues.apache.org/jira/browse/SPARK-5081
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.2.0
>Reporter: Kevin Jung
>Priority: Critical
> Attachments: Spark_Debug.pdf, diff.txt
>
>
> The shuffle write size shown in the Spark web UI is very different when I 
> execute the same Spark job with the same input data on both Spark 1.1 and 
> Spark 1.2. At the sortBy stage, the shuffle write size is 98.1MB in Spark 1.1 
> but 146.9MB in Spark 1.2.
> I set the spark.shuffle.manager option to hash because its default value 
> changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
> This can increase disk I/O overhead considerably as the input file gets 
> bigger, and it causes the jobs to take more time to complete.
> For example, with about 100GB of input, the shuffle write size is 39.7GB in 
> Spark 1.1 but 91.0GB in Spark 1.2.
> Spark 1.1
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |9|saveAsTextFile| |1169.4KB| |
> |12|combineByKey| |1265.4KB|1275.0KB|
> |6|sortByKey| |1276.5KB| |
> |8|mapPartitions| |91.0MB|1383.1KB|
> |4|apply| |89.4MB| |
> |5|sortBy|155.6MB| |98.1MB|
> |3|sortBy|155.6MB| | |
> |1|collect| |2.1MB| |
> |2|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |
> Spark 1.2
> ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
> |12|saveAsTextFile| |1170.2KB| |
> |11|combineByKey| |1264.5KB|1275.0KB|
> |8|sortByKey| |1273.6KB| |
> |7|mapPartitions| |134.5MB|1383.1KB|
> |5|zipWithIndex| |132.5MB| |
> |4|sortBy|155.6MB| |146.9MB|
> |3|sortBy|155.6MB| | |
> |2|collect| |2.0MB| |
> |1|mapValues|155.6MB| |2.2MB|
> |0|first|184.4KB| | |
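For reference, a sketch of how the shuffle manager mentioned above is selected in Spark 1.x (the application name is a placeholder; "sort" became the default in 1.2, while "hash" matches the 1.1 behaviour):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Pin the pre-1.2 hash-based shuffle while comparing the two releases.
val conf = new SparkConf()
  .setAppName("shuffle-write-comparison")
  .set("spark.shuffle.manager", "hash")
val sc = new SparkContext(conf)
{code}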



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


