[jira] [Commented] (SPARK-12141) Use Jackson to serialize all events when writing event log
[ https://issues.apache.org/jira/browse/SPARK-12141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039776#comment-15039776 ] Jean-Baptiste Onofré commented on SPARK-12141: -- [~vanzin] Do you mind if I take a look at this one? > Use Jackson to serialize all events when writing event log > -- > > Key: SPARK-12141 > URL: https://issues.apache.org/jira/browse/SPARK-12141 > Project: Spark > Issue Type: Task > Components: Spark Core >Reporter: Marcelo Vanzin > > SPARK-11206 added infrastructure to serialize events using Jackson, so that > manual serialization code is not needed anymore. > We should write all events using that support, and remove all the manual > serialization code in {{JsonProtocol}}. > Since the event log format is a semi-public API, I'm targeting this at 2.0. > Also, we can't remove the manual deserialization code, since we need to be > able to read old event logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
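The task above replaces per-event, hand-written serializers with one generic serialization path. A minimal Python sketch of that idea (the event class and field names are illustrative, not Spark's actual listener events or its Jackson wiring):

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical event type standing in for a Spark listener event.
@dataclass
class StageCompleted:
    stage_id: int
    num_tasks: int

def to_json_manual(e):
    # The JsonProtocol style: one hand-written mapping per event type.
    return json.dumps({"stage_id": e.stage_id, "num_tasks": e.num_tasks})

def to_json_generic(e):
    # The Jackson-style approach: one generic path for every event type.
    return json.dumps(asdict(e))

event = StageCompleted(stage_id=3, num_tasks=200)
assert json.loads(to_json_manual(event)) == json.loads(to_json_generic(event))
```

The generic path removes one mapping function per event type, which is exactly the code the ticket proposes deleting from {{JsonProtocol}}.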
[jira] [Commented] (SPARK-12140) Support Streaming UI in HistoryServer
[ https://issues.apache.org/jira/browse/SPARK-12140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039778#comment-15039778 ] Jean-Baptiste Onofré commented on SPARK-12140: -- [~vanzin] Do you mind if I try something there? ;) > Support Streaming UI in HistoryServer > - > > Key: SPARK-12140 > URL: https://issues.apache.org/jira/browse/SPARK-12140 > Project: Spark > Issue Type: New Feature > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin > > SPARK-11206 added infrastructure that would allow the streaming UI to be > shown in the History Server. We should add the necessary code to make that > happen, although it requires some changes to how events and listeners are > used.
[jira] [Commented] (SPARK-12089) java.lang.NegativeArraySizeException when growing BufferHolder
[ https://issues.apache.org/jira/browse/SPARK-12089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15040771#comment-15040771 ] Apache Spark commented on SPARK-12089: -- User 'nongli' has created a pull request for this issue: https://github.com/apache/spark/pull/10142 > java.lang.NegativeArraySizeException when growing BufferHolder > -- > > Key: SPARK-12089 > URL: https://issues.apache.org/jira/browse/SPARK-12089 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Erik Selin >Priority: Blocker > > When running a large spark sql query including multiple joins I see tasks > failing with the following trace: > {code} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:36) > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:188) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.joins.OneSideOuterIterator.getRow(SortMergeOuterJoin.scala:288) > at > org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:76) > at > org.apache.spark.sql.execution.RowIteratorToScala.next(RowIterator.scala:62) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {code} > From the Spark code it looks like this is due to an integer overflow when > growing a buffer length. The offending line {{BufferHolder.java:36}} is the > following in the version I'm running: > {code} > final byte[] tmp = new byte[length * 2]; > {code} > This seems to indicate to me that this buffer will never be able to hold more > than 2 GB worth of data, and will likely hold even less, since any length of > 1073741824 or more will cause an integer overflow and turn the new buffer size > negative. > I hope I'm simply missing some critical config setting, but it still seems > weird that we have a (rather low) upper limit on these buffers.
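The arithmetic behind the report can be reproduced without Spark. A hedged sketch, using Python's ctypes to emulate Java's 32-bit int; `grow_safe` is one possible mitigation (cap at Integer.MAX_VALUE), not the fix from the linked pull request:

```python
import ctypes

def grow_unsafe(length):
    # Mirrors `new byte[length * 2]` under Java's 32-bit int arithmetic.
    return ctypes.c_int32(length * 2).value

def grow_safe(length, needed):
    # Possible mitigation: double, but never past needed's floor or INT_MAX.
    int_max = 2**31 - 1
    return min(max(length * 2, needed), int_max)

# At 1 GiB the doubled size wraps negative -> NegativeArraySizeException:
assert grow_unsafe(1_073_741_824) < 0
# The capped version stays a valid (positive) array size:
assert grow_safe(1_073_741_824, 1_200_000_000) == 2**31 - 1
```

This also shows why the effective limit is well under 2 GB: any buffer already at or above 1 GiB cannot be doubled.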
[jira] [Commented] (SPARK-12144) Implement DataFrameReader and DataFrameWriter API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15040773#comment-15040773 ] Shivaram Venkataraman commented on SPARK-12144: --- Personally I don't think this API is a good fit for SparkR -- it introduces a lot of chaining-based methods, which are awkward to write in R without using something like magrittr. I think the existing `read.df` with `option = value` arguments is more user-friendly. Is there any functionality we would gain from this? > Implement DataFrameReader and DataFrameWriter API in SparkR > --- > > Key: SPARK-12144 > URL: https://issues.apache.org/jira/browse/SPARK-12144 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Sun Rui > > DataFrameReader API: > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader > DataFrameWriter API: > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
[jira] [Commented] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()
[ https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039767#comment-15039767 ] sachin aggarwal commented on SPARK-12117: - Hi, can you suggest a workaround to make this use case work in 1.5.1? Thanks. > Column Aliases are Ignored in callUDF while using struct() > -- > > Key: SPARK-12117 > URL: https://issues.apache.org/jira/browse/SPARK-12117 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: sachin aggarwal > > case where this works: > val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), > ("Rishabh", "2"))).toDF("myText", "id") > > TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show > steps to reproduce error case: > 1) create a file with the following text--filename(a.json) > { "myText": "Sachin Aggarwal","id": "1"} > { "myText": "Rishabh","id": "2"} > 2) define a simple UDF > def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]} > 3) register the udf > sqlContext.udf.register("mydef" ,mydef _) > 4) read the input file > val TestDoc2=sqlContext.read.json("/tmp/a.json") > 5) make a call to the UDF > TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show > ERROR received: > java.lang.IllegalArgumentException: Field "Text" does not exist.
> at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234) > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:58) > at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233) > at > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212) > at org.apache.spark.sql.Row$class.getAs(Row.scala:325) > at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191) > at > $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107) > at > $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110) > at > $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55) > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at 
scala.collection.Iterator$$anon$10.next(Iterator.scala:312) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at
[jira] [Commented] (SPARK-5264) Support `drop temporary table [if exists]` DDL command
[ https://issues.apache.org/jira/browse/SPARK-5264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039713#comment-15039713 ] niranda perera commented on SPARK-5264: --- Any update on this? > Support `drop temporary table [if exists]` DDL command > --- > > Key: SPARK-5264 > URL: https://issues.apache.org/jira/browse/SPARK-5264 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.3.0 >Reporter: Li Sheng >Priority: Minor > Original Estimate: 72h > Remaining Estimate: 72h > > Support `drop table` DDL command, > i.e. DROP [TEMPORARY] TABLE [IF EXISTS] tbl_name
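The requested semantics already exist in other SQL engines. A small SQLite sketch of what `DROP TABLE [IF EXISTS]` on a temporary table buys you (table name illustrative, not from the ticket):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TEMPORARY TABLE tbl_name (id INTEGER)")

conn.execute("DROP TABLE IF EXISTS tbl_name")  # drops the temp table
conn.execute("DROP TABLE IF EXISTS tbl_name")  # second drop is a no-op, not an error
```

Without `IF EXISTS` the second statement would fail, which is the behaviour the DDL extension is meant to avoid in scripted workloads.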
[jira] [Created] (SPARK-12144) Implement DataFrameReader and DataFrameWriter API in SparkR
Sun Rui created SPARK-12144: --- Summary: Implement DataFrameReader and DataFrameWriter API in SparkR Key: SPARK-12144 URL: https://issues.apache.org/jira/browse/SPARK-12144 Project: Spark Issue Type: New Feature Components: SparkR Affects Versions: 1.5.2 Reporter: Sun Rui DataFrameReader API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader DataFrameWriter API: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
[jira] [Updated] (SPARK-12139) REGEX Column Specification for Hive Queries
[ https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Derek Sabry updated SPARK-12139: Description: When executing a query of the form Select `(a)?+.+` from A, Hive would interprets this query as a regular expression, which can be supported in the hive parser for spark was: When executing a query of the form Select `(a)?+.+` from A, Hive interprets this query as a regular expression, which can be supported in the hive parser for spark > REGEX Column Specification for Hive Queries > --- > > Key: SPARK-12139 > URL: https://issues.apache.org/jira/browse/SPARK-12139 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Derek Sabry >Priority: Minor > > When executing a query of the form > Select `(a)?+.+` from A, > Hive would interprets this query as a regular expression, which can be > supported in the hive parser for spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12144) Implement DataFrameReader and DataFrameWriter API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041177#comment-15041177 ] Sun Rui commented on SPARK-12144: - @shivaram, your point is reasonable. It seems that read.df and write.df do not cover all the functionality exposed by DataFrameReader and DataFrameWriter. Maybe we don't need to provide these two APIs, but instead provide more wrapper functions in SparkR. > Implement DataFrameReader and DataFrameWriter API in SparkR > --- > > Key: SPARK-12144 > URL: https://issues.apache.org/jira/browse/SPARK-12144 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Sun Rui > > DataFrameReader API: > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader > DataFrameWriter API: > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
[jira] [Resolved] (SPARK-12104) collect() does not handle multiple columns with same name
[ https://issues.apache.org/jira/browse/SPARK-12104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-12104. --- Resolution: Fixed Fix Version/s: 1.6.1 2.0.0 Issue resolved by pull request 10118 [https://github.com/apache/spark/pull/10118] > collect() does not handle multiple columns with same name > - > > Key: SPARK-12104 > URL: https://issues.apache.org/jira/browse/SPARK-12104 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Hossein Falaki >Priority: Critical > Fix For: 2.0.0, 1.6.1 > > > This is a regression from Spark 1.5. > Spark can produce DataFrames with identical column names (e.g., after left outer > joins). In 1.5, when such a DataFrame was collected, we ended up with an R > data.frame with modified column names: > {code} > > names(mySparkDF) > [1] "date" "name" "name" > > names(collect(mySparkDF)) > [1] "date" "name" "name.1" > {code} > But in 1.6 only the first column is included in the collected R data.frame. I > think SparkR should continue the old behavior.
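The 1.5 behaviour shown in the {code} block (name, name becoming name, name.1) matches R's make.unique renaming rule. A sketch of that rule (my own helper, not SparkR's implementation):

```python
def dedup_names(names):
    # Later duplicates get a ".N" suffix so no column is silently dropped,
    # mirroring R's make.unique / the SparkR 1.5 collect() behaviour.
    seen = {}
    out = []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append(f"{n}.{seen[n]}")
        else:
            seen[n] = 0
            out.append(n)
    return out

assert dedup_names(["date", "name", "name"]) == ["date", "name", "name.1"]
```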
[jira] [Updated] (SPARK-12139) REGEX Column Specification for Hive Queries
[ https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Derek Sabry updated SPARK-12139: Description: When executing a query of the form Select `(a)?\+.\+` from A, Hive would interprets this query as a regular expression, which can be supported in the hive parser for spark was: When executing a query of the form Select `(a)?+.+` from A, Hive would interprets this query as a regular expression, which can be supported in the hive parser for spark > REGEX Column Specification for Hive Queries > --- > > Key: SPARK-12139 > URL: https://issues.apache.org/jira/browse/SPARK-12139 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Derek Sabry >Priority: Minor > > When executing a query of the form > Select `(a)?\+.\+` from A, > Hive would interprets this query as a regular expression, which can be > supported in the hive parser for spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12139) REGEX Column Specification for Hive Queries
[ https://issues.apache.org/jira/browse/SPARK-12139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Derek Sabry updated SPARK-12139: Description: When executing a query of the form Select `(a)?\+.\+` from A, Hive would interpret this query as a regular expression, which can be supported in the hive parser for spark was: When executing a query of the form Select `(a)?\+.\+` from A, Hive would interprets this query as a regular expression, which can be supported in the hive parser for spark > REGEX Column Specification for Hive Queries > --- > > Key: SPARK-12139 > URL: https://issues.apache.org/jira/browse/SPARK-12139 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Derek Sabry >Priority: Minor > > When executing a query of the form > Select `(a)?\+.\+` from A, > Hive would interpret this query as a regular expression, which can be > supported in the hive parser for spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
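For context on the feature requested above: Hive's regex column specification selects every column whose full name matches a backquoted pattern, and `(a)?\+.\+` uses Java-regex possessive quantifiers to mean "every column except a". A Python sketch of the selection semantics; since possessive quantifiers are Java-specific, the test pattern here is an equivalent plain regex, not Hive's:

```python
import re

def select_columns(pattern, columns):
    # Keep every column whose full name matches the pattern,
    # in the spirit of Hive's SELECT `regex` FROM t.
    rx = re.compile(pattern)
    return [c for c in columns if rx.fullmatch(c)]

cols = ["a", "ab", "brand", "id"]
# "(?!a$).+" selects everything except the column literally named "a":
assert select_columns(r"(?!a$).+", cols) == ["ab", "brand", "id"]
```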
[jira] [Commented] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()
[ https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15041232#comment-15041232 ] Liang-Chi Hsieh commented on SPARK-12117: - Hmm, because the problem is due to the field names being erased, I think you can just use another Row API instead of getAs(fieldName). getAs() also has an alternative that accepts an integer field index, so you don't need to extract the field by name. > Column Aliases are Ignored in callUDF while using struct() > -- > > Key: SPARK-12117 > URL: https://issues.apache.org/jira/browse/SPARK-12117 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: sachin aggarwal > > case where this works: > val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), > ("Rishabh", "2"))).toDF("myText", "id") > > TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show > steps to reproduce error case: > 1) create a file with the following text--filename(a.json) > { "myText": "Sachin Aggarwal","id": "1"} > { "myText": "Rishabh","id": "2"} > 2) define a simple UDF > def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]} > 3) register the udf > sqlContext.udf.register("mydef" ,mydef _) > 4) read the input file > val TestDoc2=sqlContext.read.json("/tmp/a.json") > 5) make a call to the UDF > TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show > ERROR received: > java.lang.IllegalArgumentException: Field "Text" does not exist.
> at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234) > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:58) > at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233) > at > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212) > at org.apache.spark.sql.Row$class.getAs(Row.scala:325) > at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191) > at > $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107) > at > $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110) > at > $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55) > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at 
scala.collection.Iterator$$anon$10.next(Iterator.scala:312) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848) > at >
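Liang-Chi's suggestion above is to fall back to positional access. A toy Python model of why that works even when the alias names are lost (tuples and helper names here are illustrative, not Spark's Row API):

```python
# The struct's values survive into the UDF, but the alias names
# ("Text", "label") from the schema are erased.
row = ("Sachin Aggarwal", "1")
erased_schema = None  # nothing left to resolve "Text" against

def get_as_name(row, schema, name):
    # Name-based lookup fails without a schema -- the reported error.
    if schema is None or name not in schema:
        raise ValueError(f'Field "{name}" does not exist.')
    return row[schema.index(name)]

def get_as_index(row, i):
    # Positional lookup (the getAs(index) alternative) still works.
    return row[i]

assert get_as_index(row, 0) == "Sachin Aggarwal"
try:
    get_as_name(row, erased_schema, "Text")
except ValueError:
    pass  # same failure mode as the stack trace above
```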
[jira] [Assigned] (SPARK-12122) Recovered streaming context can sometimes run a batch twice
[ https://issues.apache.org/jira/browse/SPARK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12122: Assignee: Apache Spark (was: Tathagata Das) > Recovered streaming context can sometimes run a batch twice > --- > > Key: SPARK-12122 > URL: https://issues.apache.org/jira/browse/SPARK-12122 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Tathagata Das >Assignee: Apache Spark >Priority: Blocker > > After recovering from checkpoint, the JobGenerator figures out which batches > to run again. That can sometimes lead to a batch being submitted twice. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
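One generic way to make the re-submission described above harmless is to key batches by their batch time and skip repeats. A sketch of that idea only (this is not the actual JobGenerator fix, whose details are in the ticket's PR):

```python
submitted = set()
run_log = []

def submit_batch(batch_time):
    # After checkpoint recovery the generator may offer the same batch
    # time twice; deduplicating by time keeps the run idempotent.
    if batch_time in submitted:
        return False
    submitted.add(batch_time)
    run_log.append(batch_time)
    return True

# Recovery re-offers batch 2000:
for t in [1000, 2000, 2000, 3000]:
    submit_batch(t)
assert run_log == [1000, 2000, 3000]  # each batch runs exactly once
```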
[jira] [Commented] (SPARK-5836) Highlight in Spark documentation that by default Spark does not delete its temporary files
[ https://issues.apache.org/jira/browse/SPARK-5836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037593#comment-15037593 ] Ewan Higgs commented on SPARK-5836: --- [~tdas] {quote} The only case there may be issues is when the external shuffle service is used. {quote} I see this problematic behaviour in ipython/pyspark notebooks. We can try to go through and unpersist and checkpoint and so on with the RDDs, but the shuffle files don't seem to go away. We see this even though we are not using the external shuffle service. > Highlight in Spark documentation that by default Spark does not delete its > temporary files > -- > > Key: SPARK-5836 > URL: https://issues.apache.org/jira/browse/SPARK-5836 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Tomasz Dudziak >Assignee: Ilya Ganelin >Priority: Minor > Fix For: 1.3.1, 1.4.0 > > > We recently learnt the hard way (in a prod system) that Spark by default does > not delete its temporary files until it is stopped. Within a relatively short > time span of heavy Spark use, the disk of our prod machine filled up > completely because of multiple shuffle files written to it. We think there > should be better documentation around the fact that after a job is finished > it leaves a lot of rubbish behind, so that this does not come as a surprise. > Probably a good place to highlight that fact would be the documentation of > {{spark.local.dir}} property, which controls where Spark temporary files are > written.
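A small helper in the spirit of the comment: walk a spark.local.dir-style directory and list files old enough to be leftover shuffle debris. This is my own illustrative helper (names and the age threshold are assumptions), useful for spotting the build-up the ticket describes:

```python
import os
import tempfile
import time

def leftover_files(local_dir, older_than_s=0):
    # List every file under local_dir whose mtime is at least
    # older_than_s seconds in the past.
    now = time.time()
    hits = []
    for root, _, files in os.walk(local_dir):
        for f in files:
            p = os.path.join(root, f)
            if now - os.path.getmtime(p) >= older_than_s:
                hits.append(p)
    return hits

# Demo against a throwaway directory with one fake shuffle file:
d = tempfile.mkdtemp()
open(os.path.join(d, "shuffle_0_0_0.data"), "w").close()
assert len(leftover_files(d)) == 1
```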
[jira] [Created] (SPARK-12125) pull out nondeterministic expressions from Join
iward created SPARK-12125: - Summary: pull out nondeterministic expressions from Join Key: SPARK-12125 URL: https://issues.apache.org/jira/browse/SPARK-12125 Project: Spark Issue Type: Bug Affects Versions: 1.5.2, 1.5.1, 1.5.0 Reporter: iward Currently, *nondeterministic expressions* are only allowed in *Project* or *Filter*, and they can only be pulled out when used in a *UnaryNode*. But in many cases we want to use nondeterministic expressions on *join keys* to avoid data skew. For example: {noformat} select * from tableA a join (select * from tableB) b on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( (-rand() * 1000 ) as string ) else a.brand_code end )) = b.brand_code {noformat} This PR introduces a mechanism to pull out nondeterministic expressions from *Join*, so we can use nondeterministic expressions in *Join* appropriately.
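The query in the description salts null/empty join keys with distinct random negatives, so the skewed rows spread across partitions instead of all hashing together, and can never match a real brand_code. The trick in isolation (Python; names are illustrative, not Spark code):

```python
import random

def salt_key(brand_code, rng=random):
    # Null/empty keys would all land in one partition; replace each with
    # a unique random negative that cannot collide with a real key.
    if brand_code is None or brand_code == "":
        return str(-rng.random() * 1000)
    return brand_code.upper()

keys = [salt_key(k) for k in [None, "", "abc", None]]
assert keys[2] == "ABC"          # real keys are preserved (upper-cased)
assert len(set(keys)) == 4       # salted keys are distinct, so no skew
```

The JIRA's point is that such a `rand()` on a join key is nondeterministic, so the optimizer must pull it out of the Join into a Project for the plan to be legal.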
[jira] [Commented] (SPARK-7379) pickle.loads expects a string instead of bytes in Python 3.
[ https://issues.apache.org/jira/browse/SPARK-7379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037678#comment-15037678 ] Xusen Yin commented on SPARK-7379: -- OK, I fixed my code in that PR. In my case, getSplits returns an Array[Double]; the _java2py() method in mllib/common.py then fails to match it as a JavaList. {code} if isinstance(r, (bytearray, bytes)): r = PickleSerializer().loads(bytes(r), encoding=encoding) {code} The above code catches the Array[Double] but fails to deserialize it with Python 3. > pickle.loads expects a string instead of bytes in Python 3. > --- > > Key: SPARK-7379 > URL: https://issues.apache.org/jira/browse/SPARK-7379 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Davies Liu > > In PickleSerializer, we call pickle.loads in Python 3. However, the input obj > could be bytes, which works in Python 2 but not 3. > The error message is > {code} > File > "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/serializers.py", > line 418, in loads > return pickle.loads(obj, encoding=encoding) > TypeError: must be a unicode character, not bytes > {code}
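The bytes(r) conversion quoted above matters because in Python 3, pickle.loads requires a bytes-like object, and data from the JVM side often arrives as a bytearray. A self-contained illustration of the round trip (no Spark involved):

```python
import pickle

payload = pickle.dumps([1.0, 2.0, 3.0])  # stands in for pickled JVM output
buf = bytearray(payload)                 # how it may arrive across the bridge

# pickle.loads accepts the bytes-like object after an explicit bytes() copy,
# which is exactly what the _java2py snippet above does with bytes(r):
assert pickle.loads(bytes(buf)) == [1.0, 2.0, 3.0]
```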
[jira] [Created] (SPARK-12127) Spark sql Support Cross join with Mongo DB
himanshu singhal created SPARK-12127: Summary: Spark sql Support Cross join with Mongo DB Key: SPARK-12127 URL: https://issues.apache.org/jira/browse/SPARK-12127 Project: Spark Issue Type: Question Components: Spark Core, SQL Affects Versions: 1.5.2, 1.4.1 Environment: Linux Ubuntu 14.04 Reporter: himanshu singhal I am using Spark SQL to perform various operations, but MongoDB does not give the correct result on version 1.4.0 and raises an error on version 1.5.2.
[jira] [Commented] (SPARK-12124) Spark Sql MongoDB Cross Join Not Working
[ https://issues.apache.org/jira/browse/SPARK-12124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037684#comment-15037684 ] Hyukjin Kwon commented on SPARK-12124: -- I think this should be asked at https://github.com/Stratio/spark-mongodb, as the MongoDB datasource is not an internal datasource for Spark. > Spark Sql MongoDB Cross Join Not Working > > > Key: SPARK-12124 > URL: https://issues.apache.org/jira/browse/SPARK-12124 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.1, 1.5.2 > Environment: Linux Ubuntu 14.04 >Reporter: himanshu singhal > Labels: test > Original Estimate: 1,680h > Remaining Estimate: 1,680h > > Cross join in Spark SQL where the source is MongoDB gives the wrong result > 15/12/03 15:55:05 INFO Executor: Running task 0.0 in stage 2.0 (TID 2) > 15/12/03 15:55:05 INFO NewHadoopRDD: Input split: > MongoInputSplit{URI=mongodb://localhost:27017/watchtower2.Patient, > authURI=null, min={ }, max={ }, query={ }, sort={ }, fields={ }, > notimeout=false} > 15/12/03 15:55:05 INFO NewHadoopRDD: Input split: > MongoInputSplit{URI=mongodb://localhost:27017/watchtower2.Patient, > authURI=null, min={ }, max={ }, query={ }, sort={ }, fields={ }, > notimeout=false} > 15/12/03 15:55:05 INFO MongoRecordReader: Read 1.0 documents from: > 15/12/03 15:55:05 INFO MongoRecordReader: > MongoInputSplit{URI=mongodb://localhost:27017/watchtower2.Patient, > authURI=null, min={ }, max={ }, query={ }, sort={ }, fields={ }, > notimeout=false} > 15/12/03 15:55:05 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) > java.lang.IllegalStateException: Iterator has been closed > at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:119) > at com.mongodb.DBCursor._hasNext(DBCursor.java:551) > at com.mongodb.DBCursor.hasNext(DBCursor.java:571) > at > com.mongodb.hadoop.input.MongoRecordReader.nextKeyValue(MongoRecordReader.java:73) > at >
org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:163) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > 15/12/03 15:55:05 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, > localhost): java.lang.IllegalStateException: Iterator has been closed > at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:119) > at com.mongodb.DBCursor._hasNext(DBCursor.java:551) > at com.mongodb.DBCursor.hasNext(DBCursor.java:571) > at
[jira] [Created] (SPARK-12126) JDBC datasource processes filters only commonly pushed down.
Hyukjin Kwon created SPARK-12126: Summary: JDBC datasource processes filters only commonly pushed down. Key: SPARK-12126 URL: https://issues.apache.org/jira/browse/SPARK-12126 Project: Spark Issue Type: Improvement Components: SQL Reporter: Hyukjin Kwon As suggested [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646], currently the JDBC datasource only processes the filters pushed down from {{DataSourceStrategy}}. Unlike ORC or Parquet, it could process quite a lot of filters (for example, a + b > 3), since pushdown here is just a matter of generating a SQL string. As suggested [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526], using the {{CatalystScan}} trait might be one possible solution. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
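To see why a JDBC source can in principle push down far more filters than a columnar format: pushdown only requires rendering the expression as SQL text. A minimal Python sketch of the idea (the expression encoding and function names are invented for illustration, not Spark's actual API):

```python
# Hypothetical sketch: render a tiny expression tree as a SQL WHERE clause.
# Even a compound predicate like a + b > 3 is trivially expressible as a
# string, which is all a JDBC source needs for pushdown.

def compile_expr(expr):
    """Recursively render a (op, args...) tuple as a SQL snippet."""
    op, *args = expr
    if op == "col":
        return args[0]            # column reference
    if op == "lit":
        return str(args[0])       # literal value
    left, right = (compile_expr(a) for a in args)
    return f"({left} {op} {right})"

def to_where_clause(filters):
    """AND together every pushed-down filter."""
    return " AND ".join(compile_expr(f) for f in filters)

# a + b > 3, encoded as a tree
f = (">", ("+", ("col", "a"), ("col", "b")), ("lit", 3))
print(to_where_clause([f]))  # ((a + b) > 3)
```

A columnar reader, by contrast, must map each predicate onto its own evaluation machinery, which is why only a fixed set of simple filters is typically supported there.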
[jira] [Updated] (SPARK-12125) pull out nondeterministic expressions from Join
[ https://issues.apache.org/jira/browse/SPARK-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] iward updated SPARK-12125: -- Fix Version/s: 1.6.0 > pull out nondeterministic expressions from Join > --- > > Key: SPARK-12125 > URL: https://issues.apache.org/jira/browse/SPARK-12125 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0, 1.5.1, 1.5.2 >Reporter: iward > Fix For: 1.6.0 > > > Currently, *nondeterministic expressions* are only allowed in *Project* or > *Filter*, and they can be pulled out only when used in a *UnaryNode*. > But in many cases we use nondeterministic expressions to process > *join keys* in order to avoid data skew. For example: > {noformat} > select * > from tableA a > join > (select * from tableB) b > on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( > (-rand() * 1000 ) as string ) else a.brand_code end )) = b.brand_code > {noformat} > This PR introduces a mechanism to pull out nondeterministic expressions from > *Join*, so we can use nondeterministic expressions in *Join* appropriately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
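The query above salts null/empty join keys with a random negative value so skewed null rows spread across partitions instead of piling onto one. A plain-Python sketch of that idea, with the nondeterministic expression evaluated in a projection step *before* the join (which is what pulling it out of the Join condition guarantees); all names here are invented for illustration, not Spark internals:

```python
import random

def salt_key(brand_code, rng):
    # Null/empty keys become a random negative value that never matches
    # a real code -- the -rand() * 1000 trick from the SQL above.
    if brand_code is None or brand_code == "":
        return str(-rng.random() * 1000)
    return brand_code.upper()

def join_on_salted_key(table_a, table_b, rng=None):
    rng = rng or random.Random()
    # "Project" the key first: the nondeterministic expression runs
    # exactly once per row here ...
    projected = [(salt_key(row["brand_code"], rng), row) for row in table_a]
    codes = {row["brand_code"] for row in table_b}
    # ... then join on the precomputed, now-deterministic column.
    return [(row, key) for key, row in projected if key in codes]

a = [{"brand_code": "acme"}, {"brand_code": None}]
b = [{"brand_code": "ACME"}]
matches = join_on_salted_key(a, b)
# Only the non-null row matches; each salted null row drops out.
```

Evaluating rand() inside the join condition itself would risk re-evaluation (a different value per probe), which is why the expression has to be hoisted into a projection first.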
[jira] [Commented] (SPARK-9182) filter and groupBy on DataFrames are not passed through to jdbc source
[ https://issues.apache.org/jira/browse/SPARK-9182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037670#comment-15037670 ] Hyukjin Kwon commented on SPARK-9182: - Filed here https://issues.apache.org/jira/browse/SPARK-12126. > filter and groupBy on DataFrames are not passed through to jdbc source > -- > > Key: SPARK-9182 > URL: https://issues.apache.org/jira/browse/SPARK-9182 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Greg Rahn >Assignee: Yijie Shen >Priority: Critical > > When running all of these API calls, the only one that passes the filter > through to the backend jdbc source is equality. All filters in these > commands should be able to be passed through to the jdbc database source. > {code} > val url="jdbc:postgresql:grahn" > val prop = new java.util.Properties > val emp = sqlContext.read.jdbc(url, "emp", prop) > emp.filter(emp("sal") === 5000).show() > emp.filter(emp("sal") < 5000).show() > emp.filter("sal = 3000").show() > emp.filter("sal > 2500").show() > emp.filter("sal >= 2500").show() > emp.filter("sal < 2500").show() > emp.filter("sal <= 2500").show() > emp.filter("sal != 3000").show() > emp.filter("sal between 3000 and 5000").show() > emp.filter("ename in ('SCOTT','BLAKE')").show() > {code} > We see from the PostgreSQL query log the following is run, and see that only > equality predicates are passed through. 
> {code} > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp WHERE > sal = 5000 > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp WHERE > sal = 3000 > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > LOG: execute : SET extra_float_digits = 3 > LOG: execute : SELECT > "empno","ename","job","mgr","hiredate","sal","comm","deptno" FROM emp > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7379) pickle.loads expects a string instead of bytes in Python 3.
[ https://issues.apache.org/jira/browse/SPARK-7379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037678#comment-15037678 ] Xusen Yin edited comment on SPARK-7379 at 12/3/15 11:19 AM: OK, I fixed my code in that PR. In my case, getSplits returns an Array[Double]; then in the mllib/common.py code, the _java2py() method fails to match it as a JavaList. {code} if isinstance(r, (bytearray, bytes)): r = PickleSerializer().loads(bytes(r), encoding=encoding) {code} The above code catches the Array[Double] but fails to deserialize it under Python 3. was (Author: yinxusen): OK, I fixed my code in that PR. My case is getSplits returns an Array[Double], then in the mllib/common.py code, _java2py() method it fails to match JavaList. {code} if isinstance(r, (bytearray, bytes)): r = PickleSerializer().loads(bytes(r), encoding=encoding) {code} The above code cath the Array[Double] but fails to deserialize it with Python3. > pickle.loads expects a string instead of bytes in Python 3. > --- > > Key: SPARK-7379 > URL: https://issues.apache.org/jira/browse/SPARK-7379 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng >Assignee: Davies Liu > > In PickleSerializer, we call pickle.loads in Python 3. However, the input obj > could be bytes, which works in Python 2 but not 3. > The error message is > {code} > File > "/home/jenkins/workspace/SparkPullRequestBuilder@3/python/pyspark/serializers.py", > line 418, in loads > return pickle.loads(obj, encoding=encoding) > TypeError: must be a unicode character, not bytes > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
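The Python 2/3 mismatch behind this ticket: in Python 3, pickle.loads requires a bytes-like payload and its encoding keyword must be a str, while Python 2 code often passes the other types. A minimal defensive sketch (this is an illustrative wrapper, not Spark's actual fix):

```python
import pickle

def safe_loads(obj, encoding="bytes"):
    """Normalize argument types that differ across Python 2/3 before
    delegating to pickle.loads."""
    if isinstance(obj, bytearray):
        obj = bytes(obj)                     # one uniform payload type
    if isinstance(encoding, bytes):
        encoding = encoding.decode("ascii")  # encoding kwarg must be str
    return pickle.loads(obj, encoding=encoding)

payload = pickle.dumps([1.0, 2.5, 3.0])
assert safe_loads(bytearray(payload)) == [1.0, 2.5, 3.0]
assert safe_loads(payload, encoding=b"bytes") == [1.0, 2.5, 3.0]
```

The "must be a unicode character, not bytes" TypeError in the report is exactly what the second coercion guards against: a bytes value leaking into the encoding keyword.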
[jira] [Updated] (SPARK-12126) JDBC datasource processes filters only commonly pushed down.
[ https://issues.apache.org/jira/browse/SPARK-12126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-12126: - Description: As suggested [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646], Currently JDBC datasource only processes the filters pushed down from {{DataSourceStrategy}}. Unlike ORC or Parquet, this can process pretty a lot of filters (for example, a + b > 3) since it is just about string parsing. As [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526], using {{CatalystScan}} trait might be one of solutions. was: As suggested [here](https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646), Currently JDBC datasource only processes the filters pushed down from {{DataSourceStrategy}}. Unlike ORC or Parquet, this can process pretty a lot of filters (for example, a + b > 3) since it is just about string parsing. As [here](https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526), using {{CatalystScan} trait might be one of solutions. > JDBC datasource processes filters only commonly pushed down. > > > Key: SPARK-12126 > URL: https://issues.apache.org/jira/browse/SPARK-12126 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Hyukjin Kwon > > As suggested > [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=14955646=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14955646], > Currently JDBC datasource only processes the filters pushed down from > {{DataSourceStrategy}}. 
> Unlike ORC or Parquet, this can process pretty a lot of filters (for example, > a + b > 3) since it is just about string parsing. > As > [here|https://issues.apache.org/jira/browse/SPARK-9182?focusedCommentId=15031526=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15031526], > using {{CatalystScan}} trait might be one of solutions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11238) SparkR: Documentation change for merge function
[ https://issues.apache.org/jira/browse/SPARK-11238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037661#comment-15037661 ] RaviShankar KS commented on SPARK-11238: Hi, I would like to take this up. Can you tell me where to update the release notes? Thanks and regards, Ravi > SparkR: Documentation change for merge function > --- > > Key: SPARK-11238 > URL: https://issues.apache.org/jira/browse/SPARK-11238 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan > Labels: releasenotes > > As discussed in pull request: https://github.com/apache/spark/pull/9012, the > signature of the merge function will be changed; therefore a documentation > change is required. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12125) pull out nondeterministic expressions from Join
[ https://issues.apache.org/jira/browse/SPARK-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12125: Assignee: (was: Apache Spark) > pull out nondeterministic expressions from Join > --- > > Key: SPARK-12125 > URL: https://issues.apache.org/jira/browse/SPARK-12125 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0, 1.5.1, 1.5.2 >Reporter: iward > > Currently,*nondeterministic expressions* are only allowed in *Project* or > *Filter*,And only when we use nondeterministic expressions in *UnaryNode* can > be pulled out. > But,Sometime in many case,we will use nondeterministic expressions to process > *join keys* avoiding data skew.for example: > {noformat} > select * > from tableA a > join > (select * from tableB) b > on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( > (-rand() * 1000 ) as string ) else a.brand_code end )) = b.brand_code > {noformat} > This PR introduce a mechanism to pull out nondeterministic expressions from > *Join*,so we can use nondeterministic expression in *Join* appropriately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12125) pull out nondeterministic expressions from Join
[ https://issues.apache.org/jira/browse/SPARK-12125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12125: Assignee: Apache Spark > pull out nondeterministic expressions from Join > --- > > Key: SPARK-12125 > URL: https://issues.apache.org/jira/browse/SPARK-12125 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0, 1.5.1, 1.5.2 >Reporter: iward >Assignee: Apache Spark > > Currently,*nondeterministic expressions* are only allowed in *Project* or > *Filter*,And only when we use nondeterministic expressions in *UnaryNode* can > be pulled out. > But,Sometime in many case,we will use nondeterministic expressions to process > *join keys* avoiding data skew.for example: > {noformat} > select * > from tableA a > join > (select * from tableB) b > on upper((case when (a.brand_code is null or a.brand_code = '' ) then cast( > (-rand() * 1000 ) as string ) else a.brand_code end )) = b.brand_code > {noformat} > This PR introduce a mechanism to pull out nondeterministic expressions from > *Join*,so we can use nondeterministic expression in *Join* appropriately. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12120) Improve exception message when failing to initialize HiveContext in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12120: Assignee: (was: Apache Spark) > Improve exception message when failing to initialize HiveContext in PySpark > --- > > Key: SPARK-12120 > URL: https://issues.apache.org/jira/browse/SPARK-12120 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang >Priority: Minor > > I get the following exception message when failing to initialize HiveContext. > This is hard to figure out why HiveContext failed to initialize. Actually I > build spark with hive profile enabled. The reason the HiveContext failed is > due to I didn't start hdfs service. And actually I can see the full > stacktrace in spark-shell. And I also can see the full stack trace in > python2. The issue only exists in python2.x > {code} > Traceback (most recent call last): > File "", line 1, in > File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, > in createDataFrame > jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json()) > File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, > in _ssql_ctx > "build/sbt assembly", e) > Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run > build/sbt assembly", Py4JJavaError(u'An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12120) Improve exception message when failing to initialize HiveContext in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037538#comment-15037538 ] Apache Spark commented on SPARK-12120: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/10126 > Improve exception message when failing to initialize HiveContext in PySpark > --- > > Key: SPARK-12120 > URL: https://issues.apache.org/jira/browse/SPARK-12120 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang >Priority: Minor > > I get the following exception message when failing to initialize HiveContext. > This is hard to figure out why HiveContext failed to initialize. Actually I > build spark with hive profile enabled. The reason the HiveContext failed is > due to I didn't start hdfs service. And actually I can see the full > stacktrace in spark-shell. And I also can see the full stack trace in > python2. The issue only exists in python2.x > {code} > Traceback (most recent call last): > File "", line 1, in > File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, > in createDataFrame > jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json()) > File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, > in _ssql_ctx > "build/sbt assembly", e) > Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run > build/sbt assembly", Py4JJavaError(u'An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12120) Improve exception message when failing to initialize HiveContext in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12120: Assignee: Apache Spark > Improve exception message when failing to initialize HiveContext in PySpark > --- > > Key: SPARK-12120 > URL: https://issues.apache.org/jira/browse/SPARK-12120 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang >Assignee: Apache Spark >Priority: Minor > > I get the following exception message when failing to initialize HiveContext. > This is hard to figure out why HiveContext failed to initialize. Actually I > build spark with hive profile enabled. The reason the HiveContext failed is > due to I didn't start hdfs service. And actually I can see the full > stacktrace in spark-shell. And I also can see the full stack trace in > python2. The issue only exists in python2.x > {code} > Traceback (most recent call last): > File "", line 1, in > File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, > in createDataFrame > jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json()) > File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, > in _ssql_ctx > "build/sbt assembly", e) > Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run > build/sbt assembly", Py4JJavaError(u'An error occurred while calling > None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34)) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12122) Recovered streaming context can sometimes run a batch twice
[ https://issues.apache.org/jira/browse/SPARK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037560#comment-15037560 ] Apache Spark commented on SPARK-12122: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/10127 > Recovered streaming context can sometimes run a batch twice > --- > > Key: SPARK-12122 > URL: https://issues.apache.org/jira/browse/SPARK-12122 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > After recovering from checkpoint, the JobGenerator figures out which batches > to run again. That can sometimes lead to a batch being submitted twice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
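The double-submission described here comes from naively concatenating two overlapping lists on recovery: batches that were pending at checkpoint time, and batches whose scheduled time fell inside the downtime window. A small sketch of the deduplication idea (invented names and integer "times", not Spark's JobGenerator API):

```python
def batches_to_rerun(pending_at_checkpoint, checkpoint_time, restart_time, interval):
    """Merge pending batches with batches missed during downtime,
    keeping each batch time exactly once."""
    downed = list(range(checkpoint_time + interval, restart_time + 1, interval))
    # A plain concatenation would submit a batch that is both pending and
    # in the downtime window twice; a set union keeps it once.
    return sorted(set(pending_at_checkpoint) | set(downed))

# Batch at t=30 was pending at checkpoint *and* falls in the downtime
# window -- the overlap that triggers the bug when not deduplicated.
print(batches_to_rerun([20, 30], checkpoint_time=20, restart_time=50, interval=10))
# [20, 30, 40, 50]
```

The union makes resubmission idempotent with respect to the overlap, which is the property the fix needs to guarantee.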
[jira] [Assigned] (SPARK-12122) Recovered streaming context can sometimes run a batch twice
[ https://issues.apache.org/jira/browse/SPARK-12122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12122: Assignee: Tathagata Das (was: Apache Spark) > Recovered streaming context can sometimes run a batch twice > --- > > Key: SPARK-12122 > URL: https://issues.apache.org/jira/browse/SPARK-12122 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Blocker > > After recovering from checkpoint, the JobGenerator figures out which batches > to run again. That can sometimes lead to a batch being submitted twice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12088) check connection.isClose before connection.getAutoCommit in JDBCRDD.close
[ https://issues.apache.org/jira/browse/SPARK-12088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12088. --- Resolution: Fixed Assignee: Huaxin Gao Fix Version/s: 2.0.0 1.6.1 Resolved by https://github.com/apache/spark/pull/10095 > check connection.isClose before connection.getAutoCommit in JDBCRDD.close > - > > Key: SPARK-12088 > URL: https://issues.apache.org/jira/browse/SPARK-12088 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 1.6.1, 2.0.0 > > > in JDBCRDD, it has > if (!conn.getAutoCommit && !conn.isClosed) { > try { > conn.commit() > } > . . . . . . > In my test, the connection was already closed, so conn.getAutoCommit threw an > Exception. We should check !conn.isClosed before checking !conn.getAutoCommit > to avoid the Exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
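The fix is purely an ordering change: short-circuit evaluation must consult isClosed first, because getAutoCommit raises once the connection is gone. A Python sketch with a stand-in connection object (not the real JDBC API):

```python
class ClosedConnection:
    """Stand-in for a JDBC connection that has already been closed."""
    def isClosed(self):
        return True
    def getAutoCommit(self):
        raise RuntimeError("connection is closed")
    def commit(self):
        raise RuntimeError("connection is closed")

def close_rdd_connection(conn):
    # isClosed() is checked first, so the `and` short-circuits and
    # getAutoCommit() is never called on a dead connection.  The original
    # code tested getAutoCommit() first and blew up right there.
    if not conn.isClosed() and not conn.getAutoCommit():
        conn.commit()

close_rdd_connection(ClosedConnection())  # no exception raised
```

With the operands swapped back to the original order, the same call would raise immediately, which is exactly the reported failure.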
[jira] [Commented] (SPARK-12118) SparkR: Documentation change for isNaN
[ https://issues.apache.org/jira/browse/SPARK-12118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037660#comment-15037660 ] RaviShankar KS commented on SPARK-12118: Hi, I would like to take this up. Can you tell me where to update the release notes? Thanks and regards, Ravi > SparkR: Documentation change for isNaN > -- > > Key: SPARK-12118 > URL: https://issues.apache.org/jira/browse/SPARK-12118 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang >Priority: Minor > Labels: releasenotes > > As discussed in pull request: https://github.com/apache/spark/pull/10037, we > replace DataFrame.isNaN with DataFrame.isnan at SparkR side. Because > DataFrame.isNaN has been deprecated and will be removed in Spark 2.0, we > should document the change in the release notes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9697) Project Tungsten (Spark 1.6)
[ https://issues.apache.org/jira/browse/SPARK-9697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037458#comment-15037458 ] Sean Owen commented on SPARK-9697: -- [~rxin] or should this at least be titled "... (Phase 2)"? > Project Tungsten (Spark 1.6) > > > Key: SPARK-9697 > URL: https://issues.apache.org/jira/browse/SPARK-9697 > Project: Spark > Issue Type: Epic > Components: Block Manager, Shuffle, Spark Core, SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > This epic tracks the 2nd phase of Project Tungsten, slotted for Spark 1.6 > release. > This epic tracks work items for Spark 1.6. More tickets can be found in: > SPARK-7075: Tungsten-related work in Spark 1.5 > SPARK-9697: Tungsten-related work in Spark 1.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12119) Support compression in PySpark
Jeff Zhang created SPARK-12119: -- Summary: Support compression in PySpark Key: SPARK-12119 URL: https://issues.apache.org/jira/browse/SPARK-12119 Project: Spark Issue Type: Sub-task Reporter: Jeff Zhang Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4117) Spark on Yarn handle AM being told command from RM
[ https://issues.apache.org/jira/browse/SPARK-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4117: --- Assignee: Apache Spark > Spark on Yarn handle AM being told command from RM > -- > > Key: SPARK-4117 > URL: https://issues.apache.org/jira/browse/SPARK-4117 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves >Assignee: Apache Spark > > In the allocateResponse from the RM, it can send commands that the AM should > follow, for instance AM_RESYNC and AM_SHUTDOWN. We should add support for > those. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12124) Spark Sql MongoDB Cross Join Not Working
himanshu singhal created SPARK-12124: Summary: Spark Sql MongoDB Cross Join Not Working Key: SPARK-12124 URL: https://issues.apache.org/jira/browse/SPARK-12124 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.2, 1.5.1, 1.4.1, 1.4.0 Environment: Linux Ubuntu 14.4 Reporter: himanshu singhal Cross Join In Spark Sql where source is mongo db is giving th wrong result 15/12/03 15:55:05 INFO Executor: Running task 0.0 in stage 2.0 (TID 2) 15/12/03 15:55:05 INFO NewHadoopRDD: Input split: MongoInputSplit{URI=mongodb://localhost:27017/watchtower2.Patient, authURI=null, min={ }, max={ }, query={ }, sort={ }, fields={ }, notimeout=false} 15/12/03 15:55:05 INFO NewHadoopRDD: Input split: MongoInputSplit{URI=mongodb://localhost:27017/watchtower2.Patient, authURI=null, min={ }, max={ }, query={ }, sort={ }, fields={ }, notimeout=false} 15/12/03 15:55:05 INFO MongoRecordReader: Read 1.0 documents from: 15/12/03 15:55:05 INFO MongoRecordReader: MongoInputSplit{URI=mongodb://localhost:27017/watchtower2.Patient, authURI=null, min={ }, max={ }, query={ }, sort={ }, fields={ }, notimeout=false} 15/12/03 15:55:05 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.lang.IllegalStateException: Iterator has been closed at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:119) at com.mongodb.DBCursor._hasNext(DBCursor.java:551) at com.mongodb.DBCursor.hasNext(DBCursor.java:571) at com.mongodb.hadoop.input.MongoRecordReader.nextKeyValue(MongoRecordReader.java:73) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/12/03 15:55:05 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.lang.IllegalStateException: Iterator has been closed at com.mongodb.QueryResultIterator.hasNext(QueryResultIterator.java:119) at com.mongodb.DBCursor._hasNext(DBCursor.java:551) at 
com.mongodb.DBCursor.hasNext(DBCursor.java:571) at com.mongodb.hadoop.input.MongoRecordReader.nextKeyValue(MongoRecordReader.java:73) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:163) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
[jira] [Commented] (SPARK-4117) Spark on Yarn handle AM being told command from RM
[ https://issues.apache.org/jira/browse/SPARK-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037617#comment-15037617 ] Apache Spark commented on SPARK-4117: - User 'devaraj-kavali' has created a pull request for this issue: https://github.com/apache/spark/pull/10129 > Spark on Yarn handle AM being told command from RM > -- > > Key: SPARK-4117 > URL: https://issues.apache.org/jira/browse/SPARK-4117 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves > > In the allocateResponse from the RM it can send commands that the AM should > follow. for instance AM_RESYNC and AM_SHUTDOWN. We should add support for > those. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12120) Improve exception message when failing to initialize HiveContext in PySpark
Jeff Zhang created SPARK-12120: -- Summary: Improve exception message when failing to initialize HiveContext in PySpark Key: SPARK-12120 URL: https://issues.apache.org/jira/browse/SPARK-12120 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Jeff Zhang Priority: Minor I get the following exception message when failing to initialize HiveContext. Actually I built Spark with the hive profile enabled. The reason the HiveContext failed to initialize is that I didn't start the HDFS service. And actually I can see the full stacktrace in spark-shell. And I also can see the full stack trace in python2. The issue only exists in python2.x {code} Traceback (most recent call last): File "", line 1, in File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, in createDataFrame jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json()) File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, in _ssql_ctx "build/sbt assembly", e) Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34)) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
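The complaint is that the wrapped exception hides the real cause (HDFS not running) behind a generic build hint. One way to improve such a message, sketched hypothetically rather than as Spark's actual change, is to fold the underlying error into the new exception and chain it:

```python
def wrap_init_error(cause):
    """Re-raise an initialization failure without losing the root cause."""
    hint = ("You must build Spark with Hive. Export 'SPARK_HIVE=true' "
            "and run build/sbt assembly")
    # `raise ... from cause` preserves the original traceback as well,
    # and embedding str(cause) keeps the root cause visible in the message.
    raise RuntimeError(f"{hint}\nUnderlying error: {cause}") from cause

try:
    wrap_init_error(ConnectionError("Call to namenode failed"))
except RuntimeError as e:
    assert "namenode" in str(e)   # the root cause survives in the message
```

With the original pattern, only the opaque `Py4JJavaError(...)` repr reached the user, so "HDFS is down" was indistinguishable from "Spark built without Hive".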
[jira] [Created] (SPARK-12121) Remote Spark init.sh called from spark_ec2.py points to incorrect prebuilt image URL
Andre Schumacher created SPARK-12121:
Summary: Remote Spark init.sh called from spark_ec2.py points to incorrect prebuilt image URL
 Key: SPARK-12121
 URL: https://issues.apache.org/jira/browse/SPARK-12121
 Project: Spark
 Issue Type: Bug
 Components: EC2
 Affects Versions: 1.5.2
 Reporter: Andre Schumacher

The tarball for 1.5.2 with Hadoop 1 (the default) has the Scala version appended to the file name, which leads to spark_ec2.py failing to start up the cluster. Here is the record from the S3 contents:

spark-1.5.2-bin-hadoop1-scala2.11.tgz 2015-11-10T06:45:17.000Z "056fc68e549db27d986da707f19e39c8-4" 234574403 STANDARD

Maybe one could provide a tarball without the Scala suffix (the default naming)? A workaround is to set the Hadoop version to a version different from 1 when calling spark-ec2.
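A sketch of the filename mismatch behind this report. `expected_tarball()` mirrors the naming pattern the init script appears to request (an assumption inferred from the report, not the script's actual code); the S3 listing above carries an extra Scala suffix, so the request misses.

```python
# Hypothetical reconstruction of the tarball name the init script builds.
def expected_tarball(version, hadoop_version):
    return "spark-%s-bin-hadoop%s.tgz" % (version, hadoop_version)

# The object actually present in the S3 listing quoted above:
actual_on_s3 = "spark-1.5.2-bin-hadoop1-scala2.11.tgz"

requested = expected_tarball("1.5.2", "1")
print(requested)                  # spark-1.5.2-bin-hadoop1.tgz
print(requested == actual_on_s3)  # False: the download fails
```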
[jira] [Updated] (SPARK-12120) Improve exception message when failing to initialize HiveContext in PySpark
[ https://issues.apache.org/jira/browse/SPARK-12120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated SPARK-12120:
---
Description:
I get the following exception message when HiveContext fails to initialize, which makes it hard to figure out why. I actually built Spark with the hive profile enabled; the real reason HiveContext failed is that I had not started the HDFS service. I can see the full stack trace in spark-shell, and also in python3; the issue only exists in python2.x.
{code}
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, in createDataFrame
    jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, in _ssql_ctx
    "build/sbt assembly", e)
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34))
{code}

was:
I get the following exception message when HiveContext fails to initialize. I actually built Spark with the hive profile enabled; the real reason HiveContext failed to initialize is that I had not started the HDFS service. I can see the full stack trace in spark-shell, and also in python3; the issue only exists in python2.x.
{code}
Traceback (most recent call last):
  File "", line 1, in
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, in createDataFrame
    jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, in _ssql_ctx
    "build/sbt assembly", e)
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34))
{code}

> Improve exception message when failing to initialize HiveContext in PySpark
> ---
>
> Key: SPARK-12120
> URL: https://issues.apache.org/jira/browse/SPARK-12120
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Reporter: Jeff Zhang
> Priority: Minor
>
> I get the following exception message when HiveContext fails to initialize, which makes it hard to figure out why. I actually built Spark with the hive profile enabled; the real reason HiveContext failed is that I had not started the HDFS service. I can see the full stack trace in spark-shell, and also in python3; the issue only exists in python2.x.
> {code}
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 430, in createDataFrame
>     jdf = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), schema.json())
>   File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 691, in _ssql_ctx
>     "build/sbt assembly", e)
> Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o34))
> {code}
[jira] [Created] (SPARK-12122) Recovered streaming context can sometimes run a batch twice
Tathagata Das created SPARK-12122: - Summary: Recovered streaming context can sometimes run a batch twice Key: SPARK-12122 URL: https://issues.apache.org/jira/browse/SPARK-12122 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Blocker After recovering from checkpoint, the JobGenerator figures out which batches to run again. That can sometimes lead to a batch being submitted twice. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
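A Spark-free sketch of the recovery bookkeeping described above (the function and the batch times are illustrative, not the JobGenerator's actual code): batches that were pending at checkpoint time and batches whose scheduled slot passed while the driver was down can overlap, so taking the distinct union of the two lists is one way to avoid submitting the same batch twice.

```python
# Hedged sketch: deduplicate the batches to re-run after checkpoint recovery.
def batches_to_rerun(pending_at_checkpoint, missed_while_down):
    # The overlap (a batch both pending and "missed") must only run once.
    return sorted(set(pending_at_checkpoint) | set(missed_while_down))

# Illustrative batch times in milliseconds.
print(batches_to_rerun([1000, 2000], [2000, 3000]))  # [1000, 2000, 3000]
```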
[jira] [Resolved] (SPARK-12123) Spark ava.lang.NullPointerException
[ https://issues.apache.org/jira/browse/SPARK-12123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12123. --- Resolution: Invalid [~michael_han] This should be a question on user@, not a JIRA. Have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark You need to narrow down your question more anyway, not just post your code. > Spark ava.lang.NullPointerException > --- > > Key: SPARK-12123 > URL: https://issues.apache.org/jira/browse/SPARK-12123 > Project: Spark > Issue Type: Question >Affects Versions: 1.5.2 >Reporter: Michael Han > > Hi, > I'm fresh to study Spark. > I download Spark 1.5.2 into my windows C disk. > Download the latest Eclipse and create a Java project with maven, > The only java class is: > package com.qad; > import org.apache.spark.api.java.*; > import java.io.BufferedWriter; > import java.io.File; > import java.io.FileWriter; > import org.apache.spark.SparkConf; > import org.apache.spark.api.java.function.Function; > public class SparkTest1 { > public static void main(String[] args) { > String logFile = "README.md"; // Should be some file on your system > SparkConf conf = new > SparkConf().setMaster("spark://192.168.79.1:7077").setAppName("Simple > Application"); > JavaSparkContext sc = new JavaSparkContext(conf); > > // example 1 > JavaRDD logData = sc.textFile(logFile); > long numAs, numBs; > numAs = 0; > numBs = 0; > > JavaRDD logData2 = logData.filter(new > Function() { > > private static final long serialVersionUID = 1L; > public Boolean call(String s) { return > s.contains("Spark is a fast"); } > }); > numAs = logData2.count(); > > String content = "Lines with a: " + numAs + ", lines with b: " + > numBs; > System.out.println(content); > WriteText(content,"mh6log.txt"); > > sc.close(); > > } > > private static void WriteText(String content,String fileName) > { > try { > > File logFile=new File(fileName); > BufferedWriter writer = new BufferedWriter(new > 
FileWriter(logFile)); > writer.write (content); > //Close writer > writer.close(); > } catch(Exception e) { > e.printStackTrace(); > } > } > } > The pom are: > http://maven.apache.org/POM/4.0.0; > xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; > xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 > http://maven.apache.org/xsd/maven-4.0.0.xsd;> > 4.0.0 > Spark-Test > Spark-Test > 1.0 > > src > > > maven-compiler-plugin > 3.3 > > 1.8 > 1.8 > > > > > > > org.apache.spark > spark-core_2.10 > 1.5.2 > > > > I can run this java class correctly in Eclipse, but exceptions when I using > the following command to commit it: > spark-submit --master local --class com.qad.SparkTest1 Spark-Test-1.0.jar > Who knows which step I was wrong? Thank you. > The exceptions are: > Exception in thread "main" org.apache.spark.SparkException: Job aborted due > to s > tage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost > task > 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException > at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012) > at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) > at org.apache.hadoop.util.Shell.run(Shell.java:455) > at > org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java: > 715) > at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873) > at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor > $Executor$$updateDependencies$5.apply(Executor.scala:405) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor > $Executor$$updateDependencies$5.apply(Executor.scala:397) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply( > TraversableLike.scala:772) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca > la:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca 
> la:98) > at >
[jira] [Resolved] (SPARK-12111) need upgrade instruction
[ https://issues.apache.org/jira/browse/SPARK-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-12111.
---
Resolution: Not A Problem

[~aedwip] I am someone who knows the details of how Spark gets built and installed. Please do not reopen this issue unless there's a significant change in the discussion. Here, you have a fine question, but one that is not a JIRA, as I said. This is already documented on the Spark site.

Here, you describe a situation where I think you set up a cluster by hand (not clear; you mentioned spark-ec2?). That process is already clear to you and documented on the project site. There's really nothing more to know: you replace the old binaries with the new ones, and start again. I'm not aware of any non-trivial changes in the behavior of the standalone scripts.

If you mean you want something to automatically manage and update clusters: I suppose EMR's Spark sort of does that, in that they'll eventually make a new release with a new Spark that you can deploy. In the same spirit, distros like CDH manage things like this for you. spark-ec2 won't do this automatically for you.

> need upgrade instruction
>
> Key: SPARK-12111
> URL: https://issues.apache.org/jira/browse/SPARK-12111
> Project: Spark
> Issue Type: Documentation
> Components: EC2
> Affects Versions: 1.5.1
> Reporter: Andrew Davidson
> Labels: build, documentation
>
> I have looked all over the spark website and googled. I have not found
> instructions for how to upgrade spark in general, let alone a cluster created
> by using the spark-ec2 script.
> thanks.
[jira] [Closed] (SPARK-12111) need upgrade instruction
[ https://issues.apache.org/jira/browse/SPARK-12111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-12111. - > need upgrade instruction > > > Key: SPARK-12111 > URL: https://issues.apache.org/jira/browse/SPARK-12111 > Project: Spark > Issue Type: Documentation > Components: EC2 >Affects Versions: 1.5.1 >Reporter: Andrew Davidson > Labels: build, documentation > > I have looked all over the spark website and googled. I have not found > instructions for how to upgrade spark in general let alone a cluster created > by using spark-ec2 script > thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12123) Spark ava.lang.NullPointerException
Michael Han created SPARK-12123: --- Summary: Spark ava.lang.NullPointerException Key: SPARK-12123 URL: https://issues.apache.org/jira/browse/SPARK-12123 Project: Spark Issue Type: Question Affects Versions: 1.5.2 Reporter: Michael Han Hi, I'm fresh to study Spark. I download Spark 1.5.2 into my windows C disk. Download the latest Eclipse and create a Java project with maven, The only java class is: package com.qad; import org.apache.spark.api.java.*; import java.io.BufferedWriter; import java.io.File; import java.io.FileWriter; import org.apache.spark.SparkConf; import org.apache.spark.api.java.function.Function; public class SparkTest1 { public static void main(String[] args) { String logFile = "README.md"; // Should be some file on your system SparkConf conf = new SparkConf().setMaster("spark://192.168.79.1:7077").setAppName("Simple Application"); JavaSparkContext sc = new JavaSparkContext(conf); // example 1 JavaRDD logData = sc.textFile(logFile); long numAs, numBs; numAs = 0; numBs = 0; JavaRDD logData2 = logData.filter(new Function() { private static final long serialVersionUID = 1L; public Boolean call(String s) { return s.contains("Spark is a fast"); } }); numAs = logData2.count(); String content = "Lines with a: " + numAs + ", lines with b: " + numBs; System.out.println(content); WriteText(content,"mh6log.txt"); sc.close(); } private static void WriteText(String content,String fileName) { try { File logFile=new File(fileName); BufferedWriter writer = new BufferedWriter(new FileWriter(logFile)); writer.write (content); //Close writer writer.close(); } catch(Exception e) { e.printStackTrace(); } } } The pom are: http://maven.apache.org/POM/4.0.0; xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance; xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd;> 4.0.0 Spark-Test Spark-Test 1.0 src maven-compiler-plugin 3.3 1.8 1.8 org.apache.spark spark-core_2.10 1.5.2 I can run this java class correctly in Eclipse, but 
exceptions when I using the following command to commit it: spark-submit --master local --class com.qad.SparkTest1 Spark-Test-1.0.jar Who knows which step I was wrong? Thank you. The exceptions are: Exception in thread "main" org.apache.spark.SparkException: Job aborted due to s tage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NullPointerException at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012) at org.apache.hadoop.util.Shell.runCommand(Shell.java:482) at org.apache.hadoop.util.Shell.run(Shell.java:455) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java: 715) at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873) at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853) at org.apache.spark.util.Utils$.fetchFile(Utils.scala:381) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor $Executor$$updateDependencies$5.apply(Executor.scala:405) at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor $Executor$$updateDependencies$5.apply(Executor.scala:397) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply( TraversableLike.scala:772) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca la:98) at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.sca la:98) at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala :226) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.s cala:771) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor $$updateDependencies(Executor.scala:397) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
[jira] [Assigned] (SPARK-4117) Spark on Yarn handle AM being told command from RM
[ https://issues.apache.org/jira/browse/SPARK-4117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4117: --- Assignee: (was: Apache Spark) > Spark on Yarn handle AM being told command from RM > -- > > Key: SPARK-4117 > URL: https://issues.apache.org/jira/browse/SPARK-4117 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves > > In the allocateResponse from the RM it can send commands that the AM should > follow. for instance AM_RESYNC and AM_SHUTDOWN. We should add support for > those. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()
[ https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-12117:
Assignee: Apache Spark

> Column Aliases are Ignored in callUDF while using struct()
> --
>
> Key: SPARK-12117
> URL: https://issues.apache.org/jira/browse/SPARK-12117
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.1
> Reporter: sachin aggarwal
> Assignee: Apache Spark
>
> Case where this works:
> val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), ("Rishabh", "2"))).toDF("myText", "id")
> TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
>
> Steps to reproduce the error case:
> 1) Create a file (a.json) with the following contents:
> { "myText": "Sachin Aggarwal","id": "1"}
> { "myText": "Rishabh","id": "2"}
> 2) Define a simple UDF:
> def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]}
> 3) Register the UDF:
> sqlContext.udf.register("mydef" ,mydef _)
> 4) Read the input file:
> val TestDoc2=sqlContext.read.json("/tmp/a.json")
> 5) Make a call to the UDF:
> TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show
>
> ERROR received:
> java.lang.IllegalArgumentException: Field "Text" does not exist.
> at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234) > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:58) > at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233) > at > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212) > at org.apache.spark.sql.Row$class.getAs(Row.scala:325) > at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191) > at > $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107) > at > $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110) > at > $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55) > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at 
scala.collection.Iterator$$anon$10.next(Iterator.scala:312) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) >
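A Spark-free Python sketch of the failure mode reported above (the dict stands in for the Row the UDF receives; this illustrates the symptom, not Spark's internals): the struct handed to the UDF keeps the columns' original names, so a lookup by the alias given via `.as("Text")` fails.

```python
# The struct the UDF sees keeps the original column names ("myText", "id")
# rather than the aliases ("Text", "label") assigned with .as(...).
row_seen_by_udf = {"myText": "Sachin Aggarwal", "id": "1"}

def mydef(row):
    # Mirrors Row.getAs("Text") failing with IllegalArgumentException.
    if "Text" not in row:
        raise ValueError('Field "Text" does not exist.')
    return row["Text"]

try:
    mydef(row_seen_by_udf)
except ValueError as e:
    print(e)  # Field "Text" does not exist.
```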
[jira] [Updated] (SPARK-12104) collect() does not handle multiple columns with same name
[ https://issues.apache.org/jira/browse/SPARK-12104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-12104: -- Assignee: Sun Rui > collect() does not handle multiple columns with same name > - > > Key: SPARK-12104 > URL: https://issues.apache.org/jira/browse/SPARK-12104 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Hossein Falaki >Assignee: Sun Rui >Priority: Critical > Fix For: 1.6.1, 2.0.0 > > > This is a regression from Spark 1.5 > Spark can produce DataFrames with identical names (e.g., after left outer > joins). In 1.5 when such a DataFrame was collected we ended up with an R > data.frame with modified column names: > {code} > > names(mySparkDF) > [1] "date" "name" "name" > > names(collect(mySparkDF)) > [1] "date" "name" "name.1" > {code} > But in 1.6 only the first column is included in the collected R data.frame. I > think SparkR should continue the old behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
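A sketch of the 1.5 behavior the report wants restored, mimicking R's `make.unique`: duplicate column names get numeric suffixes instead of columns being dropped. The function name is illustrative, not SparkR's internal helper.

```python
# Disambiguate duplicate column names the way R's make.unique does:
# the first occurrence keeps its name, later ones get ".1", ".2", ...
def dedup_names(names):
    seen = {}
    out = []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append("%s.%d" % (n, seen[n]))
        else:
            seen[n] = 0
            out.append(n)
    return out

print(dedup_names(["date", "name", "name"]))  # ['date', 'name', 'name.1']
```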
[jira] [Assigned] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()
[ https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12117: Assignee: (was: Apache Spark) > Column Aliases are Ignored in callUDF while using struct() > -- > > Key: SPARK-12117 > URL: https://issues.apache.org/jira/browse/SPARK-12117 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: sachin aggarwal > > case where this works: > val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), > ("Rishabh", "2"))).toDF("myText", "id") > > TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show > steps to reproduce error case: > 1)create a file copy following text--filename(a.json) > { "myText": "Sachin Aggarwal","id": "1"} > { "myText": "Rishabh","id": "2"} > 2)define a simple UDF > def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]} > 3)register the udf > sqlContext.udf.register("mydef" ,mydef _) > 4)read the input file > val TestDoc2=sqlContext.read.json("/tmp/a.json") > 5)make a call to UDF > TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show > ERROR received: > java.lang.IllegalArgumentException: Field "Text" does not exist. 
> at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234) > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:58) > at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233) > at > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212) > at org.apache.spark.sql.Row$class.getAs(Row.scala:325) > at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191) > at > $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107) > at > $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110) > at > $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55) > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at 
scala.collection.Iterator$$anon$10.next(Iterator.scala:312) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at >
[jira] [Commented] (SPARK-12117) Column Aliases are Ignored in callUDF while using struct()
[ https://issues.apache.org/jira/browse/SPARK-12117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039761#comment-15039761 ] Apache Spark commented on SPARK-12117: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/10140 > Column Aliases are Ignored in callUDF while using struct() > -- > > Key: SPARK-12117 > URL: https://issues.apache.org/jira/browse/SPARK-12117 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: sachin aggarwal > > case where this works: > val TestDoc1 = sqlContext.createDataFrame(Seq(("sachin aggarwal", "1"), > ("Rishabh", "2"))).toDF("myText", "id") > > TestDoc1.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show > steps to reproduce error case: > 1)create a file copy following text--filename(a.json) > { "myText": "Sachin Aggarwal","id": "1"} > { "myText": "Rishabh","id": "2"} > 2)define a simple UDF > def mydef(r:Row)={println(r.schema); r.getAs("Text").asInstanceOf[String]} > 3)register the udf > sqlContext.udf.register("mydef" ,mydef _) > 4)read the input file > val TestDoc2=sqlContext.read.json("/tmp/a.json") > 5)make a call to UDF > TestDoc2.select(callUDF("mydef",struct($"myText".as("Text"),$"id".as("label"))).as("col1")).show > ERROR received: > java.lang.IllegalArgumentException: Field "Text" does not exist. 
> at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234) > at > org.apache.spark.sql.types.StructType$$anonfun$fieldIndex$1.apply(StructType.scala:234) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:58) > at org.apache.spark.sql.types.StructType.fieldIndex(StructType.scala:233) > at > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema.fieldIndex(rows.scala:212) > at org.apache.spark.sql.Row$class.getAs(Row.scala:325) > at org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191) > at > $line414.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.mydef(:107) > at > $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110) > at > $line419.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$c57ec8bf9b0d5f6161b97741d596ff0wC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:110) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:75) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:74) > at > org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:964) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:55) > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$2.apply(basicOperators.scala:53) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at 
scala.collection.Iterator$$anon$10.next(Iterator.scala:312) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at scala.collection.AbstractIterator.to(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at
[jira] [Resolved] (SPARK-2688) Need a way to run multiple data pipeline concurrently
[ https://issues.apache.org/jira/browse/SPARK-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2688. -- Resolution: Won't Fix > Need a way to run multiple data pipeline concurrently > - > > Key: SPARK-2688 > URL: https://issues.apache.org/jira/browse/SPARK-2688 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.0.1 >Reporter: Xuefu Zhang > > Suppose we want to do the following data processing: > {code} > rdd1 -> rdd2 -> rdd3 >| -> rdd4 >| -> rdd5 >\ -> rdd6 > {code} > where -> represents a transformation. rdd3 to rdd6 are all derived from an > intermediate rdd2. We use foreach(fn) with a dummy function to trigger the > execution. However, rdd.foreach(fn) only triggers the pipeline rdd1 -> rdd2 -> > rdd3. To make things worse, when we call rdd4.foreach(), rdd2 will be > recomputed. This is very inefficient. Ideally, we should be able to trigger > the execution of the whole graph and reuse rdd2, but there doesn't seem to be a > way of doing so. Tez already realized the importance of this (TEZ-391), so I > think Spark should provide this too. > This is required for Hive to support multi-insert queries. HIVE-7292. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
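The recomputation cost the issue describes can be sketched without Spark: each downstream action re-runs the shared upstream transformation unless its result is cached. A minimal Python sketch (illustrative names, not Spark code) counting how often the shared stage runs:

```python
# Sketch of the SPARK-2688 problem: rdd3..rdd6 each re-derive rdd2
# unless the shared intermediate result is cached. Names are illustrative.
compute_count = {"rdd2": 0}

def rdd2():
    compute_count["rdd2"] += 1           # expensive shared transformation
    return [x * 2 for x in range(5)]     # stand-in for rdd1 -> rdd2

def derive(offset):
    return [x + offset for x in rdd2()]  # rdd3..rdd6 each consume rdd2

# Without caching: four downstream pipelines -> rdd2 computed four times.
uncached = [derive(i) for i in range(4)]
assert compute_count["rdd2"] == 4

# With caching (roughly what rdd2.cache() buys in Spark): computed once.
compute_count["rdd2"] = 0
cached = rdd2()
derived = [[x + i for x in cached] for i in range(4)]
assert compute_count["rdd2"] == 1
```

In Spark itself, calling `cache()` (or `persist()`) on the shared RDD before triggering the downstream actions avoids the recomputation, though it still runs the jobs sequentially rather than as one multi-output job, which is what the issue asks for.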
[jira] [Commented] (SPARK-5312) Use sbt to detect new or changed public classes in PRs
[ https://issues.apache.org/jira/browse/SPARK-5312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037808#comment-15037808 ] Sean Owen commented on SPARK-5312: -- [~boyork] are you going to do more with this? > Use sbt to detect new or changed public classes in PRs > -- > > Key: SPARK-5312 > URL: https://issues.apache.org/jira/browse/SPARK-5312 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Reporter: Nicholas Chammas >Priority: Minor > > We currently use an [unwieldy grep/sed > contraption|https://github.com/apache/spark/blob/19556454881e05a6c2470d406d50f004b88088a2/dev/run-tests-jenkins#L152-L174] > to detect new public classes in PRs. > -Apparently, sbt lets you get a list of public classes [much more > directly|http://www.scala-sbt.org/0.13/docs/Howto-Inspect-the-Build.html] via > {{show compile:discoveredMainClasses}}. We should use that instead.- > There is a tool called [ClassUtil|http://software.clapper.org/classutil/] > that seems to help give this kind of information much more directly. We > should look into using that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11801) Notify driver when OOM is thrown before executor JVM is killed
[ https://issues.apache.org/jira/browse/SPARK-11801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037869#comment-15037869 ] Mridul Muralidharan commented on SPARK-11801: - I have a preference for (a) because I am not sure we can do (b) given the number of corner cases :-) But if it can be pulled off decently, sure, why not - what I would not like is inconsistent behavior, where sometimes one behavior is exhibited and other times another, because of the inherent instability at OOM and VM exit. Right now, I know that we will get a task failure at the driver and I investigate its cause at the executor - whether it is a JNI crash, an OOM, memory growth that led to YARN killing the executor, etc. > Notify driver when OOM is thrown before executor JVM is killed > --- > > Key: SPARK-11801 > URL: https://issues.apache.org/jira/browse/SPARK-11801 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Srinivasa Reddy Vundela >Priority: Minor > > Here is some background for the issue. > The customer got an OOM exception in one of the tasks and the executor got killed with > kill %p. It is unclear from the driver logs/Spark UI why the task or the > executor is lost. The customer has to look into the executor logs to see that OOM is > the cause of the task/executor loss. > It would be helpful if the driver logs/Spark UI showed the reason for task > failures by making sure that the task updates the driver with the OOM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream
[ https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12105: Assignee: Apache Spark > Add a DataFrame.show() with argument for output PrintStream > --- > > Key: SPARK-12105 > URL: https://issues.apache.org/jira/browse/SPARK-12105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.2 >Reporter: Dean Wampler >Assignee: Apache Spark >Priority: Minor > > It would be nice to send the output of DataFrame.show(...) to a different > output stream than stdout, including just capturing the string itself. This > is useful, e.g., for testing. Actually, it would be sufficient and perhaps > better to just make DataFrame.showString a public method, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
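The feature requested above — routing `show()` output to an arbitrary stream so it can be captured, e.g. in tests — amounts to parameterizing the output sink. A language-agnostic sketch in Python (the function and formatting are hypothetical, not Spark's actual `showString` output):

```python
import io
import sys

def show(rows, out=sys.stdout):
    """Render rows to an arbitrary text stream instead of hard-coding stdout.

    Sketch of what a DataFrame.show(printStream) overload would enable:
    passing an in-memory buffer captures the rendering as a string.
    """
    for row in rows:
        out.write(" | ".join(str(v) for v in row) + "\n")

# Capture the output instead of printing it -- useful for assertions in tests.
buf = io.StringIO()
show([(1, "a"), (2, "b")], out=buf)
captured = buf.getvalue()
assert captured == "1 | a\n2 | b\n"
```

As the report notes, simply making the existing string-building method (`showString` on the Scala side) public would achieve the same thing with less API surface.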
[jira] [Commented] (SPARK-12128) Multiplication on decimals in dataframe returns null
[ https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037839#comment-15037839 ] Philip Dodds commented on SPARK-12128: -- I tried to add a test case, but when I do it using a simplified approach {code} test("SPARK-12128: Multiplication of decimals in dataframe returning null") { withTempTable("t") { Seq((Decimal(2), Decimal(2)), (Decimal(3), Decimal(3))).toDF("a", "b").registerTempTable("t") checkAnswer(sql("SELECT a*b FROM t"), Seq(Row(Decimal(4.0).toBigDecimal), Row(Decimal(9.0).toBigDecimal))) } } {code} It then appears to work, though I did need to convert the target result to big decimal > Multiplication on decimals in dataframe returns null > > > Key: SPARK-12128 > URL: https://issues.apache.org/jira/browse/SPARK-12128 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2 >Reporter: Philip Dodds > > I hit a weird issue when I tried to multiply two decimals in a select (either > in Scala or as SQL), and I'm assuming I must be missing the point. > The issue is fairly easy to recreate with something like the following: > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.implicits._ > import org.apache.spark.sql.types.Decimal > case class Trade(quantity: Decimal,price: Decimal) > val data = Seq.fill(100) { > val price = Decimal(20+scala.util.Random.nextInt(10)) > val quantity = Decimal(20+scala.util.Random.nextInt(10)) > Trade(quantity, price) > } > val trades = sc.parallelize(data).toDF() > trades.registerTempTable("trades") > trades.select(trades("price")*trades("quantity")).show > sqlContext.sql("select > price/quantity,price*quantity,price+quantity,price-quantity from trades").show > {code} > The odd part is that, if you run it, you will see that addition, division, and > subtraction work, but the multiplication returns null. 
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11) > ie. > {code} > +--+ > |(price * quantity)| > +--+ > | null| > | null| > | null| > | null| > | null| > +--+ > +++++ > | _c0| _c1| _c2| _c3| > +++++ > |0.952380952380952381|null|41.00...|-1.00...| > |1.380952380952380952|null|50.00...|8.00| > |1.272727272727272727|null|50.00...|6.00| > |0.83|null|44.00...|-4.00...| > |1.00|null|58.00...| 0E-18| > +++++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
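The null comes from result-type inference. Spark follows the Hive-style rule for decimal arithmetic: multiplying decimal(p1,s1) by decimal(p2,s2) yields precision p1+p2+1 and scale s1+s2, capped at 38 digits. With two decimal(38,18) inputs that forces decimal(38,36), leaving only two digits before the point, so 20 * 20 = 400 cannot be represented. A sketch of this reading of the rule (my interpretation of the observed behavior, not Spark source):

```python
def multiply_result_type(p1, s1, p2, s2, max_precision=38):
    # Hive/Spark-style rule for decimal multiply: precision p1+p2+1,
    # scale s1+s2, with precision capped at max_precision.
    precision = min(p1 + p2 + 1, max_precision)
    scale = s1 + s2
    return precision, scale

# Two inputs Spark types as decimal(38,18), as in the analyzed plan:
p, s = multiply_result_type(38, 18, 38, 18)
assert (p, s) == (38, 36)

# decimal(38,36) keeps only 38 - 36 = 2 digits left of the point,
# so 400 is out of range and the CheckOverflow step yields null.
assert 400 >= 10 ** (p - s)
```

This is consistent with the `DecimalType(38,36)` seen in the plan posted later in the thread: the cap squeezes out the integer digits while the scale keeps growing.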
[jira] [Commented] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream
[ https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037795#comment-15037795 ] Apache Spark commented on SPARK-12105: -- User 'jbonofre' has created a pull request for this issue: https://github.com/apache/spark/pull/10130 > Add a DataFrame.show() with argument for output PrintStream > --- > > Key: SPARK-12105 > URL: https://issues.apache.org/jira/browse/SPARK-12105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.2 >Reporter: Dean Wampler >Priority: Minor > > It would be nice to send the output of DataFrame.show(...) to a different > output stream than stdout, including just capturing the string itself. This > is useful, e.g., for testing. Actually, it would be sufficient and perhaps > better to just make DataFrame.showString a public method, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12105) Add a DataFrame.show() with argument for output PrintStream
[ https://issues.apache.org/jira/browse/SPARK-12105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12105: Assignee: (was: Apache Spark) > Add a DataFrame.show() with argument for output PrintStream > --- > > Key: SPARK-12105 > URL: https://issues.apache.org/jira/browse/SPARK-12105 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.2 >Reporter: Dean Wampler >Priority: Minor > > It would be nice to send the output of DataFrame.show(...) to a different > output stream than stdout, including just capturing the string itself. This > is useful, e.g., for testing. Actually, it would be sufficient and perhaps > better to just make DataFrame.showString a public method, -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5261) In some cases ,The value of word's vector representation is too big
[ https://issues.apache.org/jira/browse/SPARK-5261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5261. -- Resolution: Duplicate Re-resolving as a duplicate of several subsequent issues > In some cases ,The value of word's vector representation is too big > --- > > Key: SPARK-5261 > URL: https://issues.apache.org/jira/browse/SPARK-5261 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Guoqiang Li > > Get data: > {code:none} > normalize_text() { > awk '{print tolower($0);}' | sed -e "s/’/'/g" -e "s/′/'/g" -e "s/''/ /g" -e > "s/'/ ' /g" -e "s/“/\"/g" -e "s/”/\"/g" \ > -e 's/"/ " /g' -e 's/\./ \. /g' -e 's// /g' -e 's/, / , /g' -e 's/(/ > ( /g' -e 's/)/ ) /g' -e 's/\!/ \! /g' \ > -e 's/\?/ \? /g' -e 's/\;/ /g' -e 's/\:/ /g' -e 's/-/ - /g' -e 's/=/ /g' -e > 's/=/ /g' -e 's/*/ /g' -e 's/|/ /g' \ > -e 's/«/ /g' | tr 0-9 " " > } > wget > http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz > gzip -d news.2013.en.shuffled.gz > normalize_text < news.2013.en.shuffled > data.txt > {code} > {code:none} > import org.apache.spark.mllib.feature.Word2Vec > val text = sc.textFile("dataPath").map { t => t.split(" ").toIterable } > val word2Vec = new Word2Vec() > word2Vec. > setVectorSize(100). > setSeed(42L). > setNumIterations(5). > setNumPartitions(36). > setMinCount(5) > val model = word2Vec.fit(text) > model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / > model.getVectors.size > => > res1: Float = 375059.84 > val word2Vec = new Word2Vec() > word2Vec. > setVectorSize(100). > setSeed(42L). > setNumIterations(5). > setNumPartitions(36). > setMinCount(100) > val model = word2Vec.fit(text) > model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / > model.getVectors.size > => > res3: Float = 1661285.2 > val word2Vec = new Word2Vec() > word2Vec. > setVectorSize(100). > setSeed(42L). > setNumIterations(5). 
> setNumPartitions(1) > val model = word2Vec.fit(text) > model.getVectors.map { t => t._2.map(_.abs).sum }.sum / 100 / > model.getVectors.size > => > 0.13889 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12128) Multiplication on decimals in dataframe returns null
Philip Dodds created SPARK-12128: Summary: Multiplication on decimals in dataframe returns null Key: SPARK-12128 URL: https://issues.apache.org/jira/browse/SPARK-12128 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.2, 1.5.1, 1.5.0 Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2 Reporter: Philip Dodds I hit a weird issue when I tried to multiply two decimals in a select (either in Scala or as SQL), and I'm assuming I must be missing the point. The issue is fairly easy to recreate with something like the following: {code:java} val sqlContext = new org.apache.spark.sql.SQLContext(sc) import sqlContext.implicits._ import org.apache.spark.sql.types.Decimal case class Trade(quantity: Decimal,price: Decimal) val data = Seq.fill(100) { val price = Decimal(20+scala.util.Random.nextInt(10)) val quantity = Decimal(20+scala.util.Random.nextInt(10)) Trade(quantity, price) } val trades = sc.parallelize(data).toDF() trades.registerTempTable("trades") trades.select(trades("price")*trades("quantity")).show sqlContext.sql("select price/quantity,price*quantity,price+quantity,price-quantity from trades").show {code} The odd part is that, if you run it, you will see that addition, division, and subtraction work, but the multiplication returns null. Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11) ie. {code} +--+ |(price * quantity)| +--+ | null| | null| | null| | null| | null| +--+ +++++ | _c0| _c1| _c2| _c3| +++++ |0.952380952380952381|null|41.00...|-1.00...| |1.380952380952380952|null|50.00...|8.00| |1.272727272727272727|null|50.00...|6.00| |0.83|null|44.00...|-4.00...| |1.00|null|58.00...| 0E-18| +++++ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7893) Complex Operators between Graphs
[ https://issues.apache.org/jira/browse/SPARK-7893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037813#comment-15037813 ] Sean Owen commented on SPARK-7893: -- [~andyyehoo] this and its subtasks don't seem to have progressed; I'm going to close it if there's no intent to proceed. > Complex Operators between Graphs > > > Key: SPARK-7893 > URL: https://issues.apache.org/jira/browse/SPARK-7893 > Project: Spark > Issue Type: Umbrella > Components: GraphX >Reporter: Andy Huang > Labels: complex, graph, join, operators, union > > Currently there are 30+ operators in GraphX, while few of them consider > operators between graphs. The only one is _*mask*_, which takes another graph > as a parameter and return a new graph. > In many complex case,such as _*streaming graph, small graph merge into huge > graph*_, higher level operators of graphs can help users to focus and think > in graph. Performance optimization can be done internally and be transparent > to them. > Complex graph operator list is > here:[complex_graph_operations|http://techieme.in/complex-graph-operations/] > * Union of Graphs ( G ∪ H ) > * Intersection of Graphs( G ∩ H) > * Graph Join > * Difference of Graphs(G – H) > * Graph Complement > * Line Graph ( L(G) ) > This issue will be index of all these operators -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
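The operators in the list above are standard set-theoretic graph operations. As a point of reference (plain Python over vertex/edge sets, not GraphX API), the first few can be sketched as:

```python
# Graphs modeled as (vertices, edges) with edges as ordered pairs.
# Illustrative semantics only -- a real GraphX operator would also
# have to reconcile vertex/edge attributes, not just structure.
def union(g, h):                  # G ∪ H
    return (g[0] | h[0], g[1] | h[1])

def intersection(g, h):           # G ∩ H
    return (g[0] & h[0], g[1] & h[1])

def difference(g, h):             # G - H: drop H's edges, keep vertices
    edges = g[1] - h[1]           # still touched by a remaining edge
    vertices = {v for e in edges for v in e}
    return (vertices, edges)

G = ({1, 2, 3}, {(1, 2), (2, 3)})
H = ({2, 3, 4}, {(2, 3), (3, 4)})

assert union(G, H) == ({1, 2, 3, 4}, {(1, 2), (2, 3), (3, 4)})
assert intersection(G, H) == ({2, 3}, {(2, 3)})
assert difference(G, H) == ({1, 2}, {(1, 2)})
```

The hard part the issue hints at is not these set operations but doing them efficiently over partitioned vertex and edge RDDs, which is where higher-level operators would add real value over users hand-rolling joins.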
[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037773#comment-15037773 ] Cody Koeninger commented on SPARK-12103: A cursory review of the Kafka project documentation should reveal that messages have a key (used for distribution among partitions) and a value. Why would one reasonably expect that Spark documentation referring to a Kafka message key was instead supposed to be the message topic? If you really want the topic name in each item of the RDD, create your separate streams, map over them to add the topic name, then union them together into a single stream. > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow >Priority: Minor > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receivers to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data produced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. The left key is always null, making it > impossible to know which topic the data came from other than stashing the > topic into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) 
> val streams : IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned : DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
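The workaround suggested in the comment — one stream per topic, each record tagged with its topic before the union — has this shape (simulated here with plain Python lists; in Spark it would be a `map` on each `DStream` followed by `StreamingContext.union`; topic names and records are made up):

```python
# Simulate: one receiver stream per topic, tag each record with its
# topic name, then union the tagged streams into one.
streams = {
    "topicA": ["a1", "a2"],  # stand-in for the topicA DStream's records
    "topicB": ["b1"],        # stand-in for the topicB DStream's records
}

# Equivalent of: stream.map(record => (topic, record)) per topic, then union.
tagged = [(topic, value)
          for topic, records in streams.items()
          for value in records]

# Downstream code can now recover the originating topic from each tuple.
by_topic = {}
for topic, value in tagged:
    by_topic.setdefault(topic, []).append(value)

assert by_topic == {"topicA": ["a1", "a2"], "topicB": ["b1"]}
```

This keeps the pooled-receiver design the reporter wants while making the topic explicit, instead of relying on the tuple's key slot, which carries the Kafka message key (and is null when producers don't set one).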
[jira] [Resolved] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12103. --- Resolution: Not A Problem > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow >Priority: Minor > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receiver to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data procuced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. The left-key always null, making it > impossible to know what topic that data came from other than stashing your > key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) > val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. 
> } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12129) spark.submit.deployMode not being picked up in SparkSubmitArguments
Doug Balog created SPARK-12129: -- Summary: spark.submit.deployMode not being picked up in SparkSubmitArguments Key: SPARK-12129 URL: https://issues.apache.org/jira/browse/SPARK-12129 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.5.2, 1.4.1, 1.3.1 Environment: YARN, MESOS Reporter: Doug Balog Priority: Trivial In SparkSubmitArguments.scala {code} deployMode = Option(deployMode).orElse(env.get("DEPLOY_MODE")).orNull {code} This code should probably check whether the env variable SPARK_SUBMIT is set and then look up `spark.submit.deployMode` in sparkProperties. To see this, run something like {code}spark-submit -v --conf spark.submit.deployMode=cluster --class org.apache.spark.examples.SparkPi $SPARK_HOME/lib/spark-examples-1.6.0-hadoop.2.6.0.jar{code} The resolved `spark.submit.deployMode` will end up as client, despite the explicit --conf. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
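The fix the reporter sketches amounts to a resolution order: the explicit deploy-mode flag, then the `spark.submit.deployMode` property (at least when launched via spark-submit), then the legacy DEPLOY_MODE environment variable, defaulting to client. A hypothetical sketch of that order (function and parameter names are illustrative, not the actual SparkSubmitArguments code):

```python
def resolve_deploy_mode(cli_value, props, env):
    # Order suggested in the report: explicit flag wins, then the
    # spark.submit.deployMode property when running under spark-submit,
    # then the legacy DEPLOY_MODE env var, else default to "client".
    if cli_value:
        return cli_value
    if env.get("SPARK_SUBMIT") and "spark.submit.deployMode" in props:
        return props["spark.submit.deployMode"]
    return env.get("DEPLOY_MODE", "client")

# The reported bug: the property is currently ignored, which would
# wrongly resolve this case to "client".
assert resolve_deploy_mode(None,
                           {"spark.submit.deployMode": "cluster"},
                           {"SPARK_SUBMIT": "true"}) == "cluster"
assert resolve_deploy_mode("client", {}, {}) == "client"
assert resolve_deploy_mode(None, {}, {"DEPLOY_MODE": "cluster"}) == "cluster"
```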
[jira] [Updated] (SPARK-12130) Replace shuffleManagerClass with shortShuffleMgrNames in ExternalShuffleBlockResolver
[ https://issues.apache.org/jira/browse/SPARK-12130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-12130: - Component/s: YARN Shuffle > Replace shuffleManagerClass with shortShuffleMgrNames in > ExternalShuffleBlockResolver > - > > Key: SPARK-12130 > URL: https://issues.apache.org/jira/browse/SPARK-12130 > Project: Spark > Issue Type: Bug > Components: Shuffle, YARN >Reporter: Lianhui Wang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
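The change replaces matching on fully-qualified shuffle-manager class names with the short-name aliases the rest of Spark already accepts for `spark.shuffle.manager`. A sketch of the lookup (the class names below are reproduced from memory of Spark 1.x and should be treated as an assumption):

```python
# Short shuffle-manager aliases as accepted by spark.shuffle.manager in
# Spark 1.x; the mapped class names are an assumption, not verified source.
SHORT_SHUFFLE_MGR_NAMES = {
    "sort": "org.apache.spark.shuffle.sort.SortShuffleManager",
    "hash": "org.apache.spark.shuffle.hash.HashShuffleManager",
}

def resolve_shuffle_manager(name):
    # Accept either a short alias or an already fully-qualified class name,
    # so the external shuffle service agrees with executors however the
    # user spelled the setting.
    return SHORT_SHUFFLE_MGR_NAMES.get(name.lower(), name)

assert resolve_shuffle_manager("sort").endswith("SortShuffleManager")
assert resolve_shuffle_manager("SORT").endswith("SortShuffleManager")
assert resolve_shuffle_manager("org.example.MyMgr") == "org.example.MyMgr"
```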
[jira] [Created] (SPARK-12130) Replace shuffleManagerClass with shortShuffleMgrNames in ExternalShuffleBlockResolver
Lianhui Wang created SPARK-12130: Summary: Replace shuffleManagerClass with shortShuffleMgrNames in ExternalShuffleBlockResolver Key: SPARK-12130 URL: https://issues.apache.org/jira/browse/SPARK-12130 Project: Spark Issue Type: Bug Reporter: Lianhui Wang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10854) MesosExecutorBackend: Received launchTask but executor was null
[ https://issues.apache.org/jira/browse/SPARK-10854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038058#comment-15038058 ] Matei Zaharia commented on SPARK-10854: --- Just a note, I saw a log where this happened, and the sequence of events is that the executor logs a launchTask callback before registered(). It could be a synchronization thing or a problem in the Mesos library. > MesosExecutorBackend: Received launchTask but executor was null > --- > > Key: SPARK-10854 > URL: https://issues.apache.org/jira/browse/SPARK-10854 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.4.0 > Environment: Spark 1.4.0 > Mesos 0.23.0 > Docker 1.8.1 >Reporter: Kevin Matzen >Priority: Minor > > Sometimes my tasks get stuck in staging. Here's stdout from one such worker. > I'm running mesos-slave inside a docker container with the host's docker > exposed and I'm using Spark's docker support to launch the worker inside its > own container. Both containers are running. I'm using pyspark. I can see > mesos-slave and java running, but I do not see python running. > {noformat} > WARNING: Your kernel does not support swap limit capabilities, memory limited > without swap. 
> Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 15/09/28 15:02:09 INFO MesosExecutorBackend: Registered signal handlers for > [TERM, HUP, INT] > I0928 15:02:09.65854138 exec.cpp:132] Version: 0.23.0 > 15/09/28 15:02:09 ERROR MesosExecutorBackend: Received launchTask but > executor was null > I0928 15:02:09.70295554 exec.cpp:206] Executor registered on slave > 20150928-044200-1140850698-5050-8-S190 > 15/09/28 15:02:09 INFO MesosExecutorBackend: Registered with Mesos as > executor ID 20150928-044200-1140850698-5050-8-S190 with 1 cpus > 15/09/28 15:02:09 INFO SecurityManager: Changing view acls to: root > 15/09/28 15:02:09 INFO SecurityManager: Changing modify acls to: root > 15/09/28 15:02:09 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(root); users > with modify permissions: Set(root) > 15/09/28 15:02:10 INFO Slf4jLogger: Slf4jLogger started > 15/09/28 15:02:10 INFO Remoting: Starting remoting > 15/09/28 15:02:10 INFO Remoting: Remoting started; listening on addresses > :[akka.tcp://sparkExecutor@:56458] > 15/09/28 15:02:10 INFO Utils: Successfully started service 'sparkExecutor' on > port 56458. > 15/09/28 15:02:10 INFO DiskBlockManager: Created local directory at > /tmp/spark-28a21c2d-54cc-40b3-b0c2-cc3624f1a73c/blockmgr-f2336fec-e1ea-44f1-bd5c-9257049d5e7b > 15/09/28 15:02:10 INFO MemoryStore: MemoryStore started with capacity 52.1 MB > 15/09/28 15:02:11 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 15/09/28 15:02:11 INFO Executor: Starting executor ID > 20150928-044200-1140850698-5050-8-S190 on host > 15/09/28 15:02:11 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 57431. 
> 15/09/28 15:02:11 INFO NettyBlockTransferService: Server created on 57431 > 15/09/28 15:02:11 INFO BlockManagerMaster: Trying to register BlockManager > 15/09/28 15:02:11 INFO BlockManagerMaster: Registered BlockManager > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
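The log sequence Matei describes — a launchTask callback arriving before registered() has constructed the executor — is a classic initialization race. One common remedy is to buffer early tasks and replay them after registration instead of dropping them; a sketch of that idea only (not the actual MesosExecutorBackend code):

```python
class ExecutorBackend:
    """Sketch: buffer launchTask callbacks that arrive before registered(),
    instead of logging an error and dropping them while executor is null."""

    def __init__(self):
        self.executor = None
        self.pending = []    # tasks that raced ahead of registration
        self.launched = []

    def registered(self):
        self.executor = object()      # real code constructs the Executor here
        for task in self.pending:     # replay anything that arrived early
            self.launch_task(task)
        self.pending.clear()

    def launch_task(self, task):
        if self.executor is None:
            self.pending.append(task)  # was: "Received launchTask but executor was null"
        else:
            self.launched.append(task)

backend = ExecutorBackend()
backend.launch_task("task-0")          # callback fires before registration
assert backend.launched == []          # buffered, not lost
backend.registered()
assert backend.launched == ["task-0"]  # replayed after registration
```

Whether the out-of-order callbacks come from the Mesos library or from missing synchronization in the backend, buffering makes the observed behavior consistent either way.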
[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038084#comment-15038084 ] Cody Koeninger commented on SPARK-12103: On the off chance you're acting in good faith, actually did read the Kafka documentation and Spark API documentation, and were still somehow unclear that K meant the type of the message Key and V meant the type of the message Value... I'll submit a pr to add @tparam docs for every type in the KafkaUtils api > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow >Priority: Minor > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receiver to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data procuced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. The left-key always null, making it > impossible to know what topic that data came from other than stashing your > key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) 
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12128) Multiplication on decimals in dataframe returns null
[ https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038104#comment-15038104 ] Philip Dodds commented on SPARK-12128: -- I took a look at the plan {code} Results do not match for query: == Parsed Logical Plan == 'Project [unresolvedalias(('a * 'b))] +- 'UnresolvedRelation `t`, None == Analyzed Logical Plan == _c0: decimal(38,36) Project [CheckOverflow((promote_precision(cast(a#45 as decimal(38,18))) * promote_precision(cast(b#46 as decimal(38,18, DecimalType(38,36)) AS _c0#47] +- Subquery t +- Project [_1#43 AS a#45,_2#44 AS b#46] +- LocalRelation [_1#43,_2#44], [[20.00,20.00],[20.00,20.00]] == Optimized Logical Plan == LocalRelation [_c0#47], [[null],[null]] == Physical Plan == LocalTableScan [_c0#47], [[null],[null]] == Results == !== Correct Answer - 2 == == Spark Answer - 2 == ![400.0] [null] ![400.0] [null] {code} The scale appears to have doubled and thus if the result has a more than two digits to the right of the point it'll end up as an overflow. I'm not sure if it is the way I'm using the decimal or the code that is wrong? > Multiplication on decimals in dataframe returns null > > > Key: SPARK-12128 > URL: https://issues.apache.org/jira/browse/SPARK-12128 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2 >Reporter: Philip Dodds > > I hit a weird issue when I tried to multiply to decimals in a select (either > in scala or as SQL), and Im assuming I must be missing the point. 
> The issue is fairly easy to recreate with something like the following: > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.implicits._ > import org.apache.spark.sql.types.Decimal > case class Trade(quantity: Decimal,price: Decimal) > val data = Seq.fill(100) { > val price = Decimal(20+scala.util.Random.nextInt(10)) > val quantity = Decimal(20+scala.util.Random.nextInt(10)) > Trade(quantity, price) > } > val trades = sc.parallelize(data).toDF() > trades.registerTempTable("trades") > trades.select(trades("price")*trades("quantity")).show > sqlContext.sql("select > price/quantity,price*quantity,price+quantity,price-quantity from trades").show > {code} > The odd part is if you run it you will see that the addition/division and > subtraction works but the multiplication returns a null. > Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11) > ie. > {code} > +--+ > |(price * quantity)| > +--+ > | null| > | null| > | null| > | null| > | null| > +--+ > +++++ > | _c0| _c1| _c2| _c3| > +++++ > |0.952380952380952381|null|41.00...|-1.00...| > |1.380952380952380952|null|50.00...|8.00| > |1.272727272727272727|null|50.00...|6.00| > |0.83|null|44.00...|-4.00...| > |1.00|null|58.00...| 0E-18| > +++++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
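The plan above shows why the scale doubling matters: with result type decimal(38,36), only 38 - 36 = 2 digits remain before the decimal point, so 20 * 20 = 400 fails the overflow check while the division result, which gets a wider integer part, survives. A sketch of that check (my reading of the CheckOverflow behavior in the plan, not Spark source):

```python
from decimal import Decimal

def check_overflow(value, precision, scale):
    # Null out any value whose integer part needs more than
    # precision - scale digits -- matching the nulls in the plan above.
    if abs(value) >= Decimal(10) ** (precision - scale):
        return None
    return value

# decimal(38,36) leaves 2 integer digits: 20 * 20 = 400 overflows to null.
assert check_overflow(Decimal(20) * Decimal(20), 38, 36) is None

# The division result fits comfortably in a type with a wider integer
# part (e.g. the (38,18)-style type of the inputs), so it is kept.
q = Decimal("0.952380952380952381")
assert check_overflow(q, 38, 18) == q
```

So it is neither the reporter's usage nor the values that are wrong: the multiply result type leaves too few integer digits, which is the typing bug.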
[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038106#comment-15038106 ] Dan Dutrow commented on SPARK-12103: Thanks. I don't mean to be a pain. I did make an honest attempt to figure it out and I'm not completely numb, just inexperienced with Kafka. I was trying to find a better way to pool resources and not have to dedicate a core to each receiver. Having the key be the topic name was wishful thinking, and nothing in the documentation corrected that assumption. > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow >Priority: Minor > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receivers to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data produced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. The left key is always null, making it > impossible to know which topic the data came from other than stashing your > key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) 
> val streams : IndexedSeq[ReceiverInputDStream[(String, String)]] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned : DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
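The tuple key in the receiver stream is the Kafka message key, not the topic name. A common workaround is to create one stream per topic and tag each record with its topic before unioning. The pattern can be sketched in plain Python, with lists standing in for DStreams (an illustration of the tagging idea only; KafkaUtils is not involved and the names are made up):

```python
def tag_with_topic(topic, records):
    # records are (key, value) pairs; replace the (often null) Kafka
    # message key with the topic name so downstream code can dispatch on it
    return [(topic, value) for _key, value in records]

# One "stream" per topic, each carrying null message keys
streams = {
    "topicA": [(None, "a1"), (None, "a2")],
    "topicB": [(None, "b1")],
}

# Tag each stream, then union them into one tagged stream
unioned = [pair for topic, recs in streams.items()
           for pair in tag_with_topic(topic, recs)]
print(unioned)
```

In Spark terms this corresponds to calling createStream once per topic, mapping each resulting DStream to (topicName, value), and then ssc.union over the tagged streams.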
[jira] [Assigned] (SPARK-12131) Cannot create ExpressionEncoder for Array[T] where T is a nested class
[ https://issues.apache.org/jira/browse/SPARK-12131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12131: Assignee: Cheng Lian (was: Apache Spark) > Cannot create ExpressionEncoder for Array[T] where T is a nested class > -- > > Key: SPARK-12131 > URL: https://issues.apache.org/jira/browse/SPARK-12131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The following snippet reproduces this issue under {{spark-shell}}: > {noformat} > import sqlContext._ > case class N(i: Int) > case class T(a: Array[N]) > val ds = Seq(T(Array(N(1.toDS > ds.show() > {noformat} > Note that, classes defined in Scala REPL are inherently nested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs
[ https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038038#comment-15038038 ] Apache Spark commented on SPARK-11827: -- User 'kevinyu98' has created a pull request for this issue: https://github.com/apache/spark/pull/10125 > Support java.math.BigInteger in Type-Inference utilities for POJOs > -- > > Key: SPARK-11827 > URL: https://issues.apache.org/jira/browse/SPARK-11827 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Abhilash Srimat Tirumala Pallerlamudi >Priority: Minor > > I get the below exception when creating DataFrame using RDD of JavaBean > having a property of type java.math.BigInteger > scala.MatchError: class java.math.BigInteger (of class java.lang.Class) > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1182) > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1181) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1181) > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:419) > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:447) > I don't see the support for java.math.BigInteger in > org.apache.spark.sql.catalyst.JavaTypeInference.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs
[ https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11827: Assignee: Apache Spark > Support java.math.BigInteger in Type-Inference utilities for POJOs > -- > > Key: SPARK-11827 > URL: https://issues.apache.org/jira/browse/SPARK-11827 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Abhilash Srimat Tirumala Pallerlamudi >Assignee: Apache Spark >Priority: Minor > > I get the below exception when creating DataFrame using RDD of JavaBean > having a property of type java.math.BigInteger > scala.MatchError: class java.math.BigInteger (of class java.lang.Class) > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1182) > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1181) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1181) > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:419) > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:447) > I don't see the support for java.math.BigInteger in > org.apache.spark.sql.catalyst.JavaTypeInference.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038115#comment-15038115 ] Apache Spark commented on SPARK-12103: -- User 'koeninger' has created a pull request for this issue: https://github.com/apache/spark/pull/10132 > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow >Priority: Minor > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receiver to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data procuced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. The left-key always null, making it > impossible to know what topic that data came from other than stashing your > key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) > val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. 
> } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger reopened SPARK-12103: PR for doc change: https://github.com/apache/spark/pull/10132 > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow >Priority: Minor > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receiver to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data procuced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. The left-key always null, making it > impossible to know what topic that data came from other than stashing your > key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) > val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. 
> } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12129) spark.submit.deployMode not being picked up in SparkSubmitArguments
[ https://issues.apache.org/jira/browse/SPARK-12129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038148#comment-15038148 ] Doug Balog commented on SPARK-12129: Ok, Thanks. > spark.submit.deployMode not being picked up in SparkSubmitArguments > --- > > Key: SPARK-12129 > URL: https://issues.apache.org/jira/browse/SPARK-12129 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.1, 1.4.1, 1.5.2 > Environment: YARN, MESOS >Reporter: Doug Balog >Priority: Trivial > > In sparkSubmitArguments.scala > {code} > deployMode = Option(deployMode).orElse(env.get("DEPLOY_MODE")).orNull > {code} > This code should probably check to see if the env variable SPARK_SUBMIT is > set and then look up `spark.submit.deployMode` in sparkProperties. > To see this, run something like {code}spark-submit -v --conf > spark.submit.deployMode=cluster --class org.apache.spark.examples.SparkPi > $SPARK_HOME/lib/spark-examples-1.6.0-hadoop.2.6.0.jar{code} > The last `spark.submit.deployMode` will be set to client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9039) Jobs page shows nonsensical task-progress-bar numbers when speculation occurs
[ https://issues.apache.org/jira/browse/SPARK-9039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038027#comment-15038027 ] Ryan Williams commented on SPARK-9039: -- Just saw this in Spark 1.5.2 !http://f.cl.ly/items/0x2y2N292i3w1K0r2T2N/Screen%20Shot%202015-12-01%20at%204.29.19%20PM.png! [Here|https://gist.github.com/ryan-williams/5d9902dd81c36db9608a] is the event log for that Spark app; if you can get a history server to load just the first 47 lines of that, that should be the point in time where I saw the bug. I can't get the history server to read {{head -n 47}} of that file. > Jobs page shows nonsensical task-progress-bar numbers when speculation occurs > - > > Key: SPARK-9039 > URL: https://issues.apache.org/jira/browse/SPARK-9039 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.1 >Reporter: Ryan Williams >Priority: Minor > > The stages page normalizes its task-progress-bars by task indexes, but the > job page does not: > !http://f.cl.ly/items/043O173E453U3H3i010t/Screen%20Shot%202015-07-13%20at%209.46.48%20PM.png! > Note the 11258/10302, and the light-blue "running task" bar DOM element that > is wrapped onto a next line and hidden. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12129) spark.submit.deployMode not being picked up in SparkSubmitArguments
[ https://issues.apache.org/jira/browse/SPARK-12129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-12129. Resolution: Invalid That's an internal property, it's not documented, and users are not supposed to set it. I think what you're really looking for is SPARK-10123. > spark.submit.deployMode not being picked up in SparkSubmitArguments > --- > > Key: SPARK-12129 > URL: https://issues.apache.org/jira/browse/SPARK-12129 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.1, 1.4.1, 1.5.2 > Environment: YARN, MESOS >Reporter: Doug Balog >Priority: Trivial > > In sparkSubmitArguments.scala > {code} > deployMode = Option(deployMode).orElse(env.get("DEPLOY_MODE")).orNull > {code} > This code should probably check to see if the env variable SPARK_SUBMIT is > set and then look up `spark.submit.deployMode` in sparkProperties. > To see this, run something like {code}spark-submit -v --conf > spark.submit.deployMode=cluster --class org.apache.spark.examples.SparkPi > $SPARK_HOME/lib/spark-examples-1.6.0-hadoop.2.6.0.jar{code} > The last `spark.submit.deployMode` will be set to client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
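The resolution order the ticket discusses can be sketched as follows. This is a hypothetical plain-Python illustration of the precedence being argued for (CLI flag, then spark.submit.deployMode when launched via spark-submit, then the DEPLOY_MODE environment variable, then a default), not SparkSubmitArguments' actual code; the function and parameter names are invented:

```python
import os

def resolve_deploy_mode(cli_value, spark_properties, env=os.environ):
    # 1. An explicit --deploy-mode flag wins
    if cli_value is not None:
        return cli_value
    # 2. When invoked through spark-submit (SPARK_SUBMIT set), honor the
    #    spark.submit.deployMode property from --conf or the properties file
    if env.get("SPARK_SUBMIT") and "spark.submit.deployMode" in spark_properties:
        return spark_properties["spark.submit.deployMode"]
    # 3. Fall back to the DEPLOY_MODE env var, then the default
    return env.get("DEPLOY_MODE", "client")

print(resolve_deploy_mode(None, {"spark.submit.deployMode": "cluster"},
                          env={"SPARK_SUBMIT": "true"}))  # cluster
```

The reported bug is that step 2 was skipped entirely, so a --conf spark.submit.deployMode=cluster silently fell through to the "client" default.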
[jira] [Commented] (SPARK-12128) Multiplication on decimals in dataframe returns null
[ https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038068#comment-15038068 ] Philip Dodds commented on SPARK-12128: -- Actually that was a little too simple, I went back to the use case and tried this and again it fails A simple failing test case would be {code} test("SPARK-12128: Multiplication of decimals in dataframe returning null") { withTempTable("t") { Seq((Decimal(20), Decimal(20)), (Decimal(20), Decimal(20))).toDF("a", "b").registerTempTable("t") checkAnswer(sql("SELECT a*b FROM t"), Seq(Row(Decimal(4.0).toBigDecimal), Row(Decimal(9.0).toBigDecimal))) } } {code} > Multiplication on decimals in dataframe returns null > > > Key: SPARK-12128 > URL: https://issues.apache.org/jira/browse/SPARK-12128 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2 >Reporter: Philip Dodds > > I hit a weird issue when I tried to multiply to decimals in a select (either > in scala or as SQL), and Im assuming I must be missing the point. > The issue is fairly easy to recreate with something like the following: > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.implicits._ > import org.apache.spark.sql.types.Decimal > case class Trade(quantity: Decimal,price: Decimal) > val data = Seq.fill(100) { > val price = Decimal(20+scala.util.Random.nextInt(10)) > val quantity = Decimal(20+scala.util.Random.nextInt(10)) > Trade(quantity, price) > } > val trades = sc.parallelize(data).toDF() > trades.registerTempTable("trades") > trades.select(trades("price")*trades("quantity")).show > sqlContext.sql("select > price/quantity,price*quantity,price+quantity,price-quantity from trades").show > {code} > The odd part is if you run it you will see that the addition/division and > subtraction works but the multiplication returns a null. > Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11) > ie. 
> {code} > +--+ > |(price * quantity)| > +--+ > | null| > | null| > | null| > | null| > | null| > +--+ > +++++ > | _c0| _c1| _c2| _c3| > +++++ > |0.952380952380952381|null|41.00...|-1.00...| > |1.380952380952380952|null|50.00...|8.00| > |1.272727272727272727|null|50.00...|6.00| > |0.83|null|44.00...|-4.00...| > |1.00|null|58.00...| 0E-18| > +++++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-12128) Multiplication on decimals in dataframe returns null
[ https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philip Dodds closed SPARK-12128. Resolution: Invalid The issue is the promotion of the scale: since the default Decimal has a precision of 38 and a scale of 18, the multiplication promotes the result type using the logic defined. Because 38 is the max precision, the result scale becomes the two input scales added together (36), meaning the result type is (38,36) and thus we get an overflow. I assume the solution is to cast the two decimals going in, to ensure the result has enough integer digits, i.e. SELECT cast(a as decimal(38,9))*cast(b as decimal(38,9)) FROM t > Multiplication on decimals in dataframe returns null > > > Key: SPARK-12128 > URL: https://issues.apache.org/jira/browse/SPARK-12128 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2 >Reporter: Philip Dodds > > I hit a weird issue when I tried to multiply two decimals in a select (either > in scala or as SQL), and I'm assuming I must be missing the point. > The issue is fairly easy to recreate with something like the following: > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.implicits._ > import org.apache.spark.sql.types.Decimal > case class Trade(quantity: Decimal,price: Decimal) > val data = Seq.fill(100) { > val price = Decimal(20+scala.util.Random.nextInt(10)) > val quantity = Decimal(20+scala.util.Random.nextInt(10)) > Trade(quantity, price) > } > val trades = sc.parallelize(data).toDF() > trades.registerTempTable("trades") > trades.select(trades("price")*trades("quantity")).show > sqlContext.sql("select > price/quantity,price*quantity,price+quantity,price-quantity from trades").show > {code} > The odd part is if you run it you will see that the addition/division and > subtraction work but the multiplication returns a null. 
> Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11) > ie. > {code} > +--+ > |(price * quantity)| > +--+ > | null| > | null| > | null| > | null| > | null| > +--+ > +++++ > | _c0| _c1| _c2| _c3| > +++++ > |0.952380952380952381|null|41.00...|-1.00...| > |1.380952380952380952|null|50.00...|8.00| > |1.272727272727272727|null|50.00...|6.00| > |0.83|null|44.00...|-4.00...| > |1.00|null|58.00...| 0E-18| > +++++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
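The cast workaround in the closing comment works because lowering the input scales leaves more room for integer digits in the product. A small Python check of the arithmetic (illustrating the rule under the assumption that Spark 1.5 caps precision at 38 while keeping scale = s1 + s2):

```python
MAX_PRECISION = 38  # Spark SQL's maximum decimal precision

def integer_digits_after_multiply(s1, s2):
    # Digits available left of the decimal point in the product type,
    # when precision is capped at 38 but the scales are summed
    return MAX_PRECISION - (s1 + s2)

print(integer_digits_after_multiply(18, 18))  # 2  -> 400 overflows to null
print(integer_digits_after_multiply(9, 9))    # 20 -> 400 fits comfortably
```

With the default scale of 18 on both sides only two integer digits survive, so 20 * 20 = 400 overflows; casting both inputs to decimal(38,9) leaves twenty.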
[jira] [Assigned] (SPARK-11827) Support java.math.BigInteger in Type-Inference utilities for POJOs
[ https://issues.apache.org/jira/browse/SPARK-11827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11827: Assignee: (was: Apache Spark) > Support java.math.BigInteger in Type-Inference utilities for POJOs > -- > > Key: SPARK-11827 > URL: https://issues.apache.org/jira/browse/SPARK-11827 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Abhilash Srimat Tirumala Pallerlamudi >Priority: Minor > > I get the below exception when creating DataFrame using RDD of JavaBean > having a property of type java.math.BigInteger > scala.MatchError: class java.math.BigInteger (of class java.lang.Class) > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1182) > at > org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1181) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108) > at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1181) > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:419) > at > org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:447) > I don't see the support for java.math.BigInteger in > org.apache.spark.sql.catalyst.JavaTypeInference.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12131) Cannot create ExpressionEncoder for Array[T] where T is a nested class
Cheng Lian created SPARK-12131: -- Summary: Cannot create ExpressionEncoder for Array[T] where T is a nested class Key: SPARK-12131 URL: https://issues.apache.org/jira/browse/SPARK-12131 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Cheng Lian Assignee: Cheng Lian The following snippet reproduces this issue under {{spark-shell}}: {noformat} import sqlContext._ case class N(i: Int) case class T(a: Array[N]) val ds = Seq(T(Array(N(1)))).toDS() ds.show() {noformat} Note that classes defined in Scala REPL are inherently nested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12131) Cannot create ExpressionEncoder for Array[T] where T is a nested class
[ https://issues.apache.org/jira/browse/SPARK-12131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15038121#comment-15038121 ] Apache Spark commented on SPARK-12131: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/10133 > Cannot create ExpressionEncoder for Array[T] where T is a nested class > -- > > Key: SPARK-12131 > URL: https://issues.apache.org/jira/browse/SPARK-12131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The following snippet reproduces this issue under {{spark-shell}}: > {noformat} > import sqlContext._ > case class N(i: Int) > case class T(a: Array[N]) > val ds = Seq(T(Array(N(1.toDS > ds.show() > {noformat} > Note that, classes defined in Scala REPL are inherently nested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12131) Cannot create ExpressionEncoder for Array[T] where T is a nested class
[ https://issues.apache.org/jira/browse/SPARK-12131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12131: Assignee: Apache Spark (was: Cheng Lian) > Cannot create ExpressionEncoder for Array[T] where T is a nested class > -- > > Key: SPARK-12131 > URL: https://issues.apache.org/jira/browse/SPARK-12131 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > The following snippet reproduces this issue under {{spark-shell}}: > {noformat} > import sqlContext._ > case class N(i: Int) > case class T(a: Array[N]) > val ds = Seq(T(Array(N(1.toDS > ds.show() > {noformat} > Note that, classes defined in Scala REPL are inherently nested. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12116) Document workaround when method conflicts with another R package, like dplyr
[ https://issues.apache.org/jira/browse/SPARK-12116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-12116. --- Resolution: Fixed Assignee: Felix Cheung Fix Version/s: 1.6.0 Resolved by https://github.com/apache/spark/pull/10119 > Document workaround when method conflicts with another R package, like dplyr > > > Key: SPARK-12116 > URL: https://issues.apache.org/jira/browse/SPARK-12116 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 1.6.0 > > > See https://issues.apache.org/jira/browse/SPARK-11886 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037986#comment-15037986 ] Cody Koeninger commented on SPARK-12103: Knowing that kafka messages have a key and value isn't an "Kafka expert" thing. Serious question - did you read http://kafka.apache.org/documentation.html before posting this? The api documentation for createStream at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$ keyTypeClass Key type of DStream valueTypeClass value type of Dstream keyDecoderClass Type of kafka key decoder valueDecoderClass Type of kafka value decoder How is this not clear? If you have a specific improvement to docs, send a PR > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow >Priority: Minor > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receiver to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data procuced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. The left-key always null, making it > impossible to know what topic that data came from other than stashing your > key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) 
> val streams : IndexedSeq[ReceiverInputDStream[(String,String] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned :DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3200) Class defined with reference to external variables crashes in REPL.
[ https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-3200. -- Resolution: Cannot Reproduce It works as of 1.6, so must have gotten fixed along the way somewhere, but I don't know where > Class defined with reference to external variables crashes in REPL. > --- > > Key: SPARK-3200 > URL: https://issues.apache.org/jira/browse/SPARK-3200 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 1.1.0 >Reporter: Prashant Sharma >Assignee: Prashant Sharma > > Reproducer: > {noformat} > val a = sc.textFile("README.md").count > case class A(i: Int) { val j = a} > sc.parallelize(1 to 10).map(A(_)).collect() > {noformat} > This will happen, when one refers something that refers sc and not otherwise. > There are many ways to work around this, like directly assign a constant > value instead of referring the variable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12128) Multiplication on decimals in dataframe returns null
[ https://issues.apache.org/jira/browse/SPARK-12128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037933#comment-15037933 ] Philip Dodds commented on SPARK-12128: -- The problem is with the Decimal constructor I believe, if you use {code} val data = Seq.fill(5) { Trade(Decimal(BigDecimal(5),38,20), Decimal(BigDecimal(5),38,20)) } {code} Then you will get the correct result, can probably just close the issue > Multiplication on decimals in dataframe returns null > > > Key: SPARK-12128 > URL: https://issues.apache.org/jira/browse/SPARK-12128 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: Scala 2.11/Spark 1.5.0/1.5.1/1.5.2 >Reporter: Philip Dodds > > I hit a weird issue when I tried to multiply to decimals in a select (either > in scala or as SQL), and Im assuming I must be missing the point. > The issue is fairly easy to recreate with something like the following: > {code:java} > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > import sqlContext.implicits._ > import org.apache.spark.sql.types.Decimal > case class Trade(quantity: Decimal,price: Decimal) > val data = Seq.fill(100) { > val price = Decimal(20+scala.util.Random.nextInt(10)) > val quantity = Decimal(20+scala.util.Random.nextInt(10)) > Trade(quantity, price) > } > val trades = sc.parallelize(data).toDF() > trades.registerTempTable("trades") > trades.select(trades("price")*trades("quantity")).show > sqlContext.sql("select > price/quantity,price*quantity,price+quantity,price-quantity from trades").show > {code} > The odd part is if you run it you will see that the addition/division and > subtraction works but the multiplication returns a null. > Tested on 1.5.1/1.5.2 (Scala 2.10 and 2.11) > ie. 
> {code} > +--+ > |(price * quantity)| > +--+ > | null| > | null| > | null| > | null| > | null| > +--+ > +++++ > | _c0| _c1| _c2| _c3| > +++++ > |0.952380952380952381|null|41.00...|-1.00...| > |1.380952380952380952|null|50.00...|8.00| > |1.272727272727272727|null|50.00...|6.00| > |0.83|null|44.00...|-4.00...| > |1.00|null|58.00...| 0E-18| > +++++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
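Putting the comment and the original reproducer together, a sketch of the reported workaround: construct each value with an explicit precision and scale instead of letting the Decimal constructor infer them. The (38, 20) pair is taken from the comment above; that the constructor is the root cause is the commenter's belief, not a confirmed diagnosis:

{code}
import org.apache.spark.sql.types.Decimal

case class Trade(quantity: Decimal, price: Decimal)

// Explicit precision and scale reportedly make the product come out non-null.
val data = Seq.fill(5) {
  Trade(Decimal(BigDecimal(5), 38, 20), Decimal(BigDecimal(5), 38, 20))
}
val trades = sc.parallelize(data).toDF()
trades.select(trades("price") * trades("quantity")).show()
{code}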
[jira] [Resolved] (SPARK-8796) make sure SparkPlan is only executed at driver side
[ https://issues.apache.org/jira/browse/SPARK-8796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8796. -- Resolution: Won't Fix > make sure SparkPlan is only executed at driver side > --- > > Key: SPARK-8796 > URL: https://issues.apache.org/jira/browse/SPARK-8796 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037955#comment-15037955 ] Dan Dutrow edited comment on SPARK-12103 at 12/3/15 4:16 PM: - I'm sure all of this is obvious to Kafka experts. Nowhere in the Spark Scala function documentation does it say that the Left value is the Kafka key, except if you make a guess about what K is. You have to trace through Spark and Kafka code to figure it out. Please update the KafkaWordCount.scala and DirectKafkaWordCount.scala to include a comment about what is in the ._1 field. was (Author: dutrow): I'm sure all of this is obvious to Kafka experts. Nowhere in the Scala Spark documentation does it say that the Left value is the Kafka key, except if you make a guess about what K is. You have to trace through Spark and Kafka code to figure it out. Please update the KafkaWordCount.scala and DirectKafkaWordCount.scala to include a comment about what is in the ._1 field. > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow >Priority: Minor > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receivers to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data produced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. 
The left key is always null, making it > impossible to know what topic the data came from, other than stashing your > key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) > val streams: IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned: DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. > } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
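One common way around the null key, sketched below: create one receiver per topic and tag each stream with its topic name before unioning, so the topic travels in the tuple instead of relying on the key. This assumes `ssc` and `consumerProperties` are defined as in the snippet above, and it does trade away receiver pooling (one receiver, hence one core, per topic), which is exactly the cost the reporter wanted to avoid; the Direct API sidesteps both problems:

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

val topicNames = Seq("topicA", "topicB", "topicC")

// One receiver per topic; each record is tagged with its topic name.
val tagged: Seq[DStream[(String, String)]] = topicNames.map { topic =>
  KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, consumerProperties, Map(topic -> 1), StorageLevel.MEMORY_ONLY_SER)
    .map { case (_, value) => (topic, value) }
}
val unioned: DStream[(String, String)] = ssc.union(tagged)
{code}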
[jira] [Commented] (SPARK-12129) spark.submit.deployMode not being picked up in SparkSubmitArguments
[ https://issues.apache.org/jira/browse/SPARK-12129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037915#comment-15037915 ] Sean Owen commented on SPARK-12129: --- I'm not sure you're intended to set {{spark.submit.deployMode}} directly; use {{--deploy-mode}}. At least, the former is undocumented and appears to be internal, and setting it gets ignored. On those grounds, I'd close this. > spark.submit.deployMode not being picked up in SparkSubmitArguments > --- > > Key: SPARK-12129 > URL: https://issues.apache.org/jira/browse/SPARK-12129 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.1, 1.4.1, 1.5.2 > Environment: YARN, MESOS >Reporter: Doug Balog >Priority: Trivial > > In sparkSubmitArguments.scala > {code} > deployMode = Option(deployMode).orElse(env.get("DEPLOY_MODE")).orNull > {code} > This code should probably check to see if the env variable SPARK_SUBMIT is > set and then look up `spark.submit.deployMode` in sparkProperties. > To see this, run something like {code}spark-submit -v --conf > spark.submit.deployMode=cluster --class org.apache.spark.examples.SparkPi > $SPARK_HOME/lib/spark-examples-1.6.0-hadoop.2.6.0.jar{code} > The last `spark.submit.deployMode` will be set to client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
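A sketch of what the description seems to propose for SparkSubmitArguments.scala: consult the merged Spark properties (populated from spark-defaults.conf and --conf) before falling back to the DEPLOY_MODE environment variable. This is an assumption about the intended fix, not the actual patch; the `deployMode`, `sparkProperties`, and `env` names mirror the snippet quoted in the issue:

{code}
// Hypothetical resolution order for deploy mode inside SparkSubmitArguments:
// 1. the explicit --deploy-mode flag (already bound to `deployMode`)
// 2. spark.submit.deployMode from the merged Spark properties
// 3. the DEPLOY_MODE environment variable
deployMode = Option(deployMode)
  .orElse(sparkProperties.get("spark.submit.deployMode"))
  .orElse(env.get("DEPLOY_MODE"))
  .orNull
{code}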
[jira] [Closed] (SPARK-7337) FPGrowth algo throwing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-7337. > FPGrowth algo throwing OutOfMemoryError > --- > > Key: SPARK-7337 > URL: https://issues.apache.org/jira/browse/SPARK-7337 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1 > Environment: Ubuntu >Reporter: Amit Gupta > Attachments: FPGrowthBug.png > > > When running the FPGrowth algo with huge data (GBs) and with numPartitions=500, > it throws an OutOfMemoryError after some time. > The algo runs correctly up to "collect at FPGrowth.scala:131", where it creates 500 > tasks. It fails at the next stage, "flatMap at FPGrowth.scala:150", where it fails > to create 500 tasks and instead creates some internally calculated 17 tasks. > Please refer to the attachment - print screen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7337) FPGrowth algo throwing OutOfMemoryError
[ https://issues.apache.org/jira/browse/SPARK-7337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7337. -- Resolution: Invalid > FPGrowth algo throwing OutOfMemoryError > --- > > Key: SPARK-7337 > URL: https://issues.apache.org/jira/browse/SPARK-7337 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.1 > Environment: Ubuntu >Reporter: Amit Gupta > Attachments: FPGrowthBug.png > > > When running the FPGrowth algo with huge data (GBs) and with numPartitions=500, > it throws an OutOfMemoryError after some time. > The algo runs correctly up to "collect at FPGrowth.scala:131", where it creates 500 > tasks. It fails at the next stage, "flatMap at FPGrowth.scala:150", where it fails > to create 500 tasks and instead creates some internally calculated 17 tasks. > Please refer to the attachment - print screen. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-12103) KafkaUtils createStream with multiple topics -- does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-12103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dan Dutrow reopened SPARK-12103: Please update the Kafka word count examples to comment on what the key field is. The current examples ignore the key without explanation. > KafkaUtils createStream with multiple topics -- does not work as expected > - > > Key: SPARK-12103 > URL: https://issues.apache.org/jira/browse/SPARK-12103 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 1.4.1 >Reporter: Dan Dutrow >Priority: Minor > Fix For: 1.4.2 > > > (Note: yes, there is a Direct API that may be better, but it's not the > easiest thing to get started with. The Kafka Receiver API still needs to > work, especially for newcomers) > When creating a receiver stream using KafkaUtils, there is a valid use case > where you would want to use one (or a few) Kafka Streaming Receivers to pool > resources. I have 10+ topics and don't want to dedicate 10 cores to > processing all of them. However, when reading the data produced by > KafkaUtils.createStream, the DStream[(String,String)] does not properly > insert the topic name into the tuple. The left key is always null, making it > impossible to know what topic the data came from, other than stashing your > key into the value. Is there a way around that problem? > CODE > val topics = Map("topicA" -> 1, "topicB" -> 1, "topicC" -> 1, "topicD" -> 1, > "topicE" -> 1, "topicF" -> 1, ...) > val streams: IndexedSeq[ReceiverInputDStream[(String,String)]] = (1 to 3).map( > i => > KafkaUtils.createStream[String, String, StringDecoder, StringDecoder]( > ssc, consumerProperties, > topics, > StorageLevel.MEMORY_ONLY_SER)) > val unioned: DStream[(String,String)] = ssc.union(streams) > unioned.flatMap(x => { >val (key, value) = x > // key is always null! > // value has data from any one of my topics > key match ... { > .. 
> } > } > END CODE -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12129) spark.submit.deployMode not being picked up in SparkSubmitArguments
[ https://issues.apache.org/jira/browse/SPARK-12129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037969#comment-15037969 ] Doug Balog commented on SPARK-12129: I thought using {{--conf spark.submit.deployMode=cluster}} with SparkPi was an easy way to demonstrate the problem. I'm trying to set the default deployMode for {{spark-submit}} to be cluster mode. So I have {{master}} set to {{yarn}}, and {{spark.submit.deployMode}} set to {{cluster}} in {{spark-defaults.conf}}. If I set {{master}} to {{yarn-cluster}}, {{spark-shell}} complains that cluster deploy mode is not applicable. > spark.submit.deployMode not being picked up in SparkSubmitArguments > --- > > Key: SPARK-12129 > URL: https://issues.apache.org/jira/browse/SPARK-12129 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.1, 1.4.1, 1.5.2 > Environment: YARN, MESOS >Reporter: Doug Balog >Priority: Trivial > > In SparkSubmitArguments.scala > {code} > deployMode = Option(deployMode).orElse(env.get("DEPLOY_MODE")).orNull > {code} > This code should probably check to see if the env variable SPARK_SUBMIT is > set and then look up `spark.submit.deployMode` in sparkProperties. > To see this, run something like {code}spark-submit -v --conf > spark.submit.deployMode=cluster --class org.apache.spark.examples.SparkPi > $SPARK_HOME/lib/spark-examples-1.6.0-hadoop.2.6.0.jar{code} > The last `spark.submit.deployMode` will be set to client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12129) spark.submit.deployMode not being picked up in SparkSubmitArguments
[ https://issues.apache.org/jira/browse/SPARK-12129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037980#comment-15037980 ] Sean Owen commented on SPARK-12129: --- Hm I see. Let's see if others chime in about what is supposed to happen here. > spark.submit.deployMode not being picked up in SparkSubmitArguments > --- > > Key: SPARK-12129 > URL: https://issues.apache.org/jira/browse/SPARK-12129 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.1, 1.4.1, 1.5.2 > Environment: YARN, MESOS >Reporter: Doug Balog >Priority: Trivial > > In sparkSubmitArguments.scala > {code} > deployMode = Option(deployMode).orElse(env.get("DEPLOY_MODE")).orNull > {code} > This code should probably check to see if the env variable SPARK_SUBMIT is > set and then look up `spark.submit.deployMode` in sparkProperties. > To see this, run something like {code}spark-submit -v --conf > spark.submit.deployMode=cluster --class org.apache.spark.examples.SparkPi > $SPARK_HOME/lib/spark-examples-1.6.0-hadoop.2.6.0.jar{code} > The last `spark.submit.deployMode` will be set to client. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15037932#comment-15037932 ] Sean Owen commented on SPARK-5081: -- Is this still an issue? I'm trying to figure out whether we believe there is still something to do here. Note there have been some shuffle and snappy changes in between. > Shuffle write increases > --- > > Key: SPARK-5081 > URL: https://issues.apache.org/jira/browse/SPARK-5081 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.2.0 >Reporter: Kevin Jung >Priority: Critical > Attachments: Spark_Debug.pdf, diff.txt > > > The size of shuffle write shown in the Spark web UI is much different when I > execute the same Spark job with the same input data in Spark 1.1 and Spark 1.2. > At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 146.9MB > in Spark 1.2. > I set the spark.shuffle.manager option to hash because its default value > changed, but Spark 1.2 still writes more shuffle output than Spark 1.1. > It can increase disk I/O overhead sharply as the input file gets bigger, > and it causes the jobs to take more time to complete. > In the case of about 100GB input, for example, the size of shuffle write is > 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2. 
> spark 1.1 > ||Stage Id||Description||Input||Shuffle Read||Shuffle Write|| > |9|saveAsTextFile| |1169.4KB| | > |12|combineByKey| |1265.4KB|1275.0KB| > |6|sortByKey| |1276.5KB| | > |8|mapPartitions| |91.0MB|1383.1KB| > |4|apply| |89.4MB| | > |5|sortBy|155.6MB| |98.1MB| > |3|sortBy|155.6MB| | | > |1|collect| |2.1MB| | > |2|mapValues|155.6MB| |2.2MB| > |0|first|184.4KB| | | > spark 1.2 > ||Stage Id||Description||Input||Shuffle Read||Shuffle Write|| > |12|saveAsTextFile| |1170.2KB| | > |11|combineByKey| |1264.5KB|1275.0KB| > |8|sortByKey| |1273.6KB| | > |7|mapPartitions| |134.5MB|1383.1KB| > |5|zipWithIndex| |132.5MB| | > |4|sortBy|155.6MB| |146.9MB| > |3|sortBy|155.6MB| | | > |2|collect| |2.0MB| | > |1|mapValues|155.6MB| |2.2MB| > |0|first|184.4KB| | | -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org