[jira] [Commented] (SPARK-17142) Complex query triggers binding error in HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489503#comment-15489503 ] Apache Spark commented on SPARK-17142: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/15092 > Complex query triggers binding error in HashAggregateExec > - > > Key: SPARK-17142 > URL: https://issues.apache.org/jira/browse/SPARK-17142 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Jiang Xingbo >Priority: Blocker > Fix For: 2.1.0 > > > The following example runs successfully on Spark 2.0.0 but fails in the > current master (as of b72bb62d421840f82d663c6b8e3922bd14383fbb, if not > earlier): > {code} > spark.sql("set spark.sql.crossJoin.enabled=true") > sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4") > sc.parallelize(Seq(0)).toDF("bigint_col_2").createOrReplaceTempView("table_2") > val query = """ > SELECT > ((t2.int_col) + (t1.bigint_col_2)) + ((t2.int_col) + (t1.bigint_col_2)) AS > int_col_1 > FROM table_2 t1 > INNER JOIN ( > SELECT > LEAST(IF(False, LAG(0) OVER (ORDER BY t2.int_col_1 DESC), -230), > -991) AS int_col, > (t2.int_col_1) + (t1.int_col_1) AS int_col_2, > (t1.int_col_1) + (t2.int_col_1) AS int_col_3, > t2.int_col_1 > FROM > table_4 t1, > table_4 t2 > GROUP BY > (t1.int_col_1) + (t2.int_col_1), > t2.int_col_1 > ) t2 > WHERE (t2.int_col_3) NOT IN (t2.int_col, t2.int_col_1) > GROUP BY (t2.int_col) + (t1.bigint_col_2) > """ > spark.sql(query).collect() > {code} > This fails with the following exception: > {code} > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: bigint_col_2#65 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) > at > 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:455) > at > org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:454) > at >
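Editorial note: the "Binding attribute" failure above comes from `BindReferences.bindReference`, which rewrites an attribute reference into the ordinal position of that attribute in the child operator's output. The sketch below is a hypothetical plain-Python analogy (not Spark's implementation) of why a missing attribute in the child's output produces exactly this kind of error:

```python
# Hypothetical sketch: "binding" replaces a named attribute with its ordinal
# in the child's output schema; a name absent from that schema fails, which
# is the analogue of the TreeNodeException on bigint_col_2#65 above.

class BindingError(Exception):
    pass

def bind_reference(attr, child_output):
    """Return the ordinal of `attr` in the child's output, or fail."""
    try:
        return child_output.index(attr)
    except ValueError:
        raise BindingError(f"Binding attribute, tree: {attr}")

child_output = ["int_col", "int_col_2", "int_col_3", "int_col_1"]
print(bind_reference("int_col_3", child_output))  # 2
# bind_reference("bigint_col_2", child_output) raises BindingError, because
# the aggregate's child no longer exposes that attribute.
```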
[jira] [Updated] (SPARK-17460) Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception
[ https://issues.apache.org/jira/browse/SPARK-17460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Perluss updated SPARK-17460: -- Summary: Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception (was: Dataset.joinWith causes OutOfMemory due to logicalPlan sizeInBytes being negative) > Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception > -- > > Key: SPARK-17460 > URL: https://issues.apache.org/jira/browse/SPARK-17460 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: Spark 2.0 in local mode as well as on GoogleDataproc >Reporter: Chris Perluss > > Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes > in size due to the dataset.logicalPlan.statistics.sizeInBytes < 0. > The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of > datatype Int. In my dataset, there is an Array column whose data size > exceeds the limits of an Int and so the data size becomes negative. > The issue can be repeated by running this code in REPL: > val ds = (0 to 1).map( i => (i, Seq((i, Seq((i, "This is really not that > long of a string")))))).toDS() > // You might have to remove private[sql] from Dataset.logicalPlan to get this > to work > val stats = ds.logicalPlan.statistics > yields > stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = > Statistics(-1890686892,false) > This causes joinWith to perform a broadcast join even though my > data is gigabytes in size, which of course causes the executors to run out of > memory. > Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because the > logicalPlan.statistics.sizeInBytes is a large negative number and thus it is > less than the join threshold of -1. > I've been able to work around this issue by setting > autoBroadcastJoinThreshold to a very large negative number.
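Editorial note: the wrap-around the reporter describes can be shown outside Spark. The sketch below (illustrative only, not Spark's size estimator) uses 32-bit Int semantics to show how an accumulated size goes negative, and why a negative size defeats `sizeInBytes <= autoBroadcastJoinThreshold` even with the threshold at -1, while the reporter's "very large negative number" workaround still holds:

```python
# Minimal sketch of the arithmetic in this report: per-element sizes summed
# with 32-bit Int wrap-around can yield a negative total, and any check of
# the form "sizeInBytes <= threshold" then wrongly picks a broadcast join.

def add_i32(a, b):
    """Addition with Scala/Java 32-bit signed Int overflow semantics."""
    s = (a + b) & 0xFFFFFFFF
    return s - 0x100000000 if s >= 0x80000000 else s

size = 0
for _ in range(2):
    size = add_i32(size, 1_500_000_000)   # ~1.5 GB per chunk

print(size)                               # -1294967296: wrapped negative
print(size <= -1)                         # True: broadcast still chosen
print(size <= -(1 << 62))                 # False: the reporter's workaround
```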
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17317) Add package vignette to SparkR
[ https://issues.apache.org/jira/browse/SPARK-17317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-17317: -- Assignee: Junyang Qian > Add package vignette to SparkR > -- > > Key: SPARK-17317 > URL: https://issues.apache.org/jira/browse/SPARK-17317 > Project: Spark > Issue Type: Improvement >Reporter: Junyang Qian >Assignee: Junyang Qian > Fix For: 2.1.0 > > > In publishing SparkR to CRAN, it would be nice to have a vignette as a user > guide that > * describes the big picture > * introduces the use of various methods > This is important for new users because they may not even know which method > to look up.
[jira] [Resolved] (SPARK-17317) Add package vignette to SparkR
[ https://issues.apache.org/jira/browse/SPARK-17317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-17317. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14980 [https://github.com/apache/spark/pull/14980] > Add package vignette to SparkR > -- > > Key: SPARK-17317 > URL: https://issues.apache.org/jira/browse/SPARK-17317 > Project: Spark > Issue Type: Improvement >Reporter: Junyang Qian > Fix For: 2.1.0 > > > In publishing SparkR to CRAN, it would be nice to have a vignette as a user > guide that > * describes the big picture > * introduces the use of various methods > This is important for new users because they may not even know which method > to look up.
[jira] [Commented] (SPARK-17073) generate basic stats for column
[ https://issues.apache.org/jira/browse/SPARK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489326#comment-15489326 ] Apache Spark commented on SPARK-17073: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/15090 > generate basic stats for column > --- > > Key: SPARK-17073 > URL: https://issues.apache.org/jira/browse/SPARK-17073 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.0.0 >Reporter: Ron Hu > > For a specified column, we need to generate basic stats including max, min, > number of nulls, number of distinct values, max column length, average column > length.
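Editorial note: the six statistics listed in this issue are straightforward to define. The following plain-Python sketch (illustrative only, not the optimizer's implementation) shows what is being computed per column:

```python
# Toy version of the per-column stats named in SPARK-17073: max, min,
# number of nulls, number of distinct values, max and average column length.

def column_stats(values):
    non_null = [v for v in values if v is not None]
    strs = [str(v) for v in non_null]
    return {
        "max": max(non_null) if non_null else None,
        "min": min(non_null) if non_null else None,
        "num_nulls": len(values) - len(non_null),
        "num_distinct": len(set(non_null)),
        "max_len": max((len(s) for s in strs), default=0),
        "avg_len": sum(len(s) for s in strs) / len(strs) if strs else 0.0,
    }

print(column_stats(["spark", None, "sql", "spark"]))
```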
[jira] [Assigned] (SPARK-17073) generate basic stats for column
[ https://issues.apache.org/jira/browse/SPARK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17073: Assignee: (was: Apache Spark) > generate basic stats for column > --- > > Key: SPARK-17073 > URL: https://issues.apache.org/jira/browse/SPARK-17073 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.0.0 >Reporter: Ron Hu > > For a specified column, we need to generate basic stats including max, min, > number of nulls, number of distinct values, max column length, average column > length.
[jira] [Assigned] (SPARK-17073) generate basic stats for column
[ https://issues.apache.org/jira/browse/SPARK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17073: Assignee: Apache Spark > generate basic stats for column > --- > > Key: SPARK-17073 > URL: https://issues.apache.org/jira/browse/SPARK-17073 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.0.0 >Reporter: Ron Hu >Assignee: Apache Spark > > For a specified column, we need to generate basic stats including max, min, > number of nulls, number of distinct values, max column length, average column > length.
[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489250#comment-15489250 ] Ganesh Krishnan commented on SPARK-2365: Is this only for RDD or can we use it with DataFrames? With Spark 2.0 especially the emphasis is on DataFrames and we would like to use only DataFrames in our project > Add IndexedRDD, an efficient updatable key-value store > -- > > Key: SPARK-2365 > URL: https://issues.apache.org/jira/browse/SPARK-2365 > Project: Spark > Issue Type: New Feature > Components: GraphX, Spark Core >Reporter: Ankur Dave >Assignee: Ankur Dave > Attachments: 2014-07-07-IndexedRDD-design-review.pdf > > > RDDs currently provide a bulk-updatable, iterator-based interface. This > imposes minimal requirements on the storage layer, which only needs to > support sequential access, enabling on-disk and serialized storage. > However, many applications would benefit from a richer interface. Efficient > support for point lookups would enable serving data out of RDDs, but it > currently requires iterating over an entire partition to find the desired > element. Point updates similarly require copying an entire iterator. Joins > are also expensive, requiring a shuffle and local hash joins. > To address these problems, we propose IndexedRDD, an efficient key-value > store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key > uniqueness and pre-indexing the entries for efficient joins and point > lookups, updates, and deletions. > It would be implemented by (1) hash-partitioning the entries by key, (2) > maintaining a hash index within each partition, and (3) using purely > functional (immutable and efficiently updatable) data structures to enable > efficient modifications and deletions. > GraphX would be the first user of IndexedRDD, since it currently implements a > limited form of this functionality in VertexRDD. 
We envision a variety of > other uses for IndexedRDD, including streaming updates to RDDs, direct > serving from RDDs, and as an execution strategy for Spark SQL.
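Editorial note: the three-part design above (hash-partition by key, per-partition hash index, efficient point operations) can be sketched in a few lines. This toy store is illustrative only and not the proposed IndexedRDD API:

```python
# Toy analogue of the IndexedRDD design: entries are hash-partitioned by key
# and each partition keeps a hash index, so lookups, updates, and deletions
# touch one partition in O(1) instead of scanning an iterator.

class ToyIndexedStore:
    def __init__(self, num_partitions=4):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _part(self, key):
        return hash(key) % len(self.partitions)

    def put(self, key, value):          # point update
        self.partitions[self._part(key)][key] = value

    def get(self, key):                 # point lookup, no partition scan
        return self.partitions[self._part(key)].get(key)

    def delete(self, key):              # point deletion
        self.partitions[self._part(key)].pop(key, None)

store = ToyIndexedStore()
store.put(42, "a")
print(store.get(42))   # a
store.delete(42)
print(store.get(42))   # None
```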
[jira] [Commented] (SPARK-16460) Spark 2.0 CSV ignores NULL value in Date format
[ https://issues.apache.org/jira/browse/SPARK-16460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489193#comment-15489193 ] Liwei Lin commented on SPARK-16460: --- [~marcelboldt] Oh cool! Thanks for the feedback! > Spark 2.0 CSV ignores NULL value in Date format > --- > > Key: SPARK-16460 > URL: https://issues.apache.org/jira/browse/SPARK-16460 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 > Environment: SparkR >Reporter: Marcel Boldt >Priority: Minor > > Trying to read a CSV file to Spark (using SparkR) containing just this data > row: > {code} > 1|1998-01-01|| > {code} > Using Spark 1.6.2 (Hadoop 2.6) gives me > {code} > > head(sdf) > id d dtwo > 1 1 1998-01-01 NA > {code} > Spark 2.0 preview (Hadoop 2.7, Rev. 14308) fails with error: > {panel} > > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): java.text.ParseException: Unparseable date: "" > at java.text.DateFormat.parse(DateFormat.java:357) > at > org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98) > at > org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124) > at > org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Itera... 
> {panel} > The problem seems indeed the NULL value here as with a valid date in the > third CSV column it works. > R code: > {code} > #Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6') > Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7') > .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths())) > library(SparkR) > > sc <- > sparkR.init( > master = "local", > sparkPackages = "com.databricks:spark-csv_2.11:1.4.0" > ) > sqlContext <- sparkRSQL.init(sc) > > > st <- structType(structField("id", "integer"), structField("d", "date"), > structField("dtwo", "date")) > > sdf <- read.df( > sqlContext, > path = "d:/date_test.csv", > source = "com.databricks.spark.csv", > schema = st, > inferSchema = "false", > delimiter = "|", > dateFormat = "-MM-dd", > nullValue = "", > mode = "PERMISSIVE" > ) > > head(sdf) > > sparkR.stop() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
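Editorial note: the failure above happens because the empty field reaches the date parser before the `nullValue` setting is consulted. The sketch below shows the intended behavior in plain Python; it assumes the (truncated) `dateFormat` in the R code was `yyyy-MM-dd`, and it is not the CSV reader's actual code:

```python
# Minimal sketch of null-aware date casting: the configured nullValue (here
# the empty string) must map to NULL before date parsing, instead of being
# handed to the parser and raising "Unparseable date".
from datetime import datetime

def cast_to_date(token, null_value="", fmt="%Y-%m-%d"):
    if token == null_value:
        return None                     # NULL, not a parse error
    return datetime.strptime(token, fmt).date()

fields = "1|1998-01-01||".split("|")
print([cast_to_date(t) for t in fields[1:3]])
```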
[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489169#comment-15489169 ] Reynold Xin edited comment on SPARK-15406 at 9/14/16 2:50 AM: -- Finally back from vacation. FWIW, I want to cut 2.0.1 rc in the next day or two (I sent an email almost 1 month ago about cutting 2.0.1) so I don't think the Kafka one will make it anyway. There are actual correctness bugs that we need to roll out 2.0.1. was (Author: rxin): Finally back from vacation. FWIW, I want to cut 2.0.1 rc in the next day or two so I don't think the Kafka one will make it anyway. There are actual correctness bugs that we need to roll out 2.0.1. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489169#comment-15489169 ] Reynold Xin commented on SPARK-15406: - Finally back from vacation. FWIW, I want to cut 2.0.1 rc in the next day or two so I don't think the Kafka one will make it anyway. There are actual correctness bugs that we need to roll out 2.0.1. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489102#comment-15489102 ] Cody Koeninger commented on SPARK-17510: This would require a constructor change and another overload, which I'm personally not necessarily against, but I'd want to understand why it was necessary. I know there was some list discussion, but I never heard back on your testing of the direct stream only with backpressure on. There had been some work done already on making sure backpressure didn't starve particular topicpartitions. With backpressure, max rate should only be necessary for making sure the initial rdd doesn't blow out any limits. > Set Streaming MaxRate Independently For Multiple Streams > > > Key: SPARK-17510 > URL: https://issues.apache.org/jira/browse/SPARK-17510 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Jeff Nadler > > We use multiple DStreams coming from different Kafka topics in a Streaming > application. > Some settings like maxrate and backpressure enabled/disabled would be better > passed as config to KafkaUtils.createStream and > KafkaUtils.createDirectStream, instead of setting them in SparkConf. > Being able to set a different maxrate for different streams is an important > requirement for us; we currently work-around the problem by using one > receiver-based stream and one direct stream. > We would like to be able to turn on backpressure for only one of the streams > as well.
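Editorial note: the "initial rdd" point above can be made concrete. Assuming the usual semantics that a per-partition max rate caps records at rate times batch interval (an assumption about the config's meaning, not a quote from the Spark source), the first-batch cap works out as:

```python
# Rough arithmetic for the comment above: before backpressure has a rate
# estimate, a per-partition max rate bounds the first batch to at most
# rate * batch_interval records per partition.

def max_records_first_batch(max_rate_per_partition, batch_interval_s, num_partitions):
    return max_rate_per_partition * batch_interval_s * num_partitions

# e.g. 1000 records/sec/partition, 5-second batches, 8 partitions
print(max_records_first_batch(1000, 5, 8))  # 40000
```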
[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files
[ https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1543#comment-1543 ] Charles Pritchard commented on SPARK-6593: -- Something appears to have changed between 2.0 and 1.5.1: in 2.0 I have files that will fail with "Unexpected end of input stream" whereas they read with 1.5.1 without error. Those files also trigger exceptions with command line zcat/gzip. > Provide option for HadoopRDD to skip corrupted files > > > Key: SPARK-6593 > URL: https://issues.apache.org/jira/browse/SPARK-6593 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Dale Richardson >Priority: Minor > > When reading a large amount of gzip files from HDFS eg. with > sc.textFile("hdfs:///user/cloudera/logs*.gz"), If the hadoop input libraries > report an exception then the entire job is canceled. As default behaviour > this is probably for the best, but it would be nice in some circumstances > where you know it will be ok to have the option to skip the corrupted file > and continue the job. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
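Editorial note: the option requested in this issue can be approximated outside Spark. This sketch (illustrative only, not HadoopRDD) skips files whose gzip stream fails to decompress, such as the "Unexpected end of input stream" cases mentioned in the comment, instead of failing the whole job:

```python
# Skip-corrupt-files sketch: decode each gzip payload, log and skip any
# stream that ends prematurely rather than aborting the entire read.
import gzip
import io

def read_lines_skipping_corrupt(blobs):
    """blobs: iterable of (name, gzip_bytes); yields decoded text lines."""
    for name, data in blobs:
        try:
            with gzip.open(io.BytesIO(data), "rt") as f:
                yield from f
        except (OSError, EOFError):
            # analogous to "Unexpected end of input stream": log and move on
            print(f"skipping corrupted file: {name}")

good = gzip.compress(b"hello\nworld\n")
bad = good[:10]                       # gzip header only: truncated stream
lines = list(read_lines_skipping_corrupt([("a.gz", good), ("b.gz", bad)]))
print(lines)                          # only the lines from the good file
```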
[jira] [Assigned] (SPARK-15621) BatchEvalPythonExec fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15621: Assignee: Apache Spark (was: Davies Liu) > BatchEvalPythonExec fails with OOM > -- > > Key: SPARK-15621 > URL: https://issues.apache.org/jira/browse/SPARK-15621 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Krisztian Szucs >Assignee: Apache Spark > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40 > No matter what, the queue grows unboundedly and fails with OOM, even with an > identity `lambda x: x` udf function.
[jira] [Assigned] (SPARK-15621) BatchEvalPythonExec fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15621: Assignee: Davies Liu (was: Apache Spark) > BatchEvalPythonExec fails with OOM > -- > > Key: SPARK-15621 > URL: https://issues.apache.org/jira/browse/SPARK-15621 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Krisztian Szucs >Assignee: Davies Liu > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40 > No matter what, the queue grows unboundedly and fails with OOM, even with an > identity `lambda x: x` udf function.
[jira] [Commented] (SPARK-15621) BatchEvalPythonExec fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488869#comment-15488869 ] Apache Spark commented on SPARK-15621: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/15089 > BatchEvalPythonExec fails with OOM > -- > > Key: SPARK-15621 > URL: https://issues.apache.org/jira/browse/SPARK-15621 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Krisztian Szucs >Assignee: Davies Liu > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40 > No matter what, the queue grows unboundedly and fails with OOM, even with an > identity `lambda x: x` udf function.
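Editorial note: the general remedy for the symptom in SPARK-15621 — rows buffered without bound while awaiting results from a worker — is a bounded queue, which makes the producer block instead of growing until OOM. This is a generic illustration, not the code from the linked PR:

```python
# Bounded producer/consumer sketch: queue.Queue(maxsize=...) gives natural
# backpressure, so at most `maxsize` rows are buffered at any moment.
import queue
import threading

q = queue.Queue(maxsize=64)            # put() blocks once 64 items are queued

def producer(n):
    for i in range(n):
        q.put(i)                       # blocks instead of growing unboundedly
    q.put(None)                        # end-of-stream sentinel

def consumer(results):
    while (item := q.get()) is not None:
        results.append(item * 2)       # stand-in for the Python UDF

results = []
t = threading.Thread(target=producer, args=(10_000,))
t.start()
consumer(results)
t.join()
print(len(results))  # 10000 rows processed with a 64-row buffer
```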
[jira] [Commented] (SPARK-14482) Change default compression codec for Parquet from gzip to snappy
[ https://issues.apache.org/jira/browse/SPARK-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488848#comment-15488848 ] Charles Pritchard commented on SPARK-14482: --- I don't think this fully made it into the manual; this impacts default compression for Parquet, but I think the manual still shows the default as gzip. > Change default compression codec for Parquet from gzip to snappy > > > Key: SPARK-14482 > URL: https://issues.apache.org/jira/browse/SPARK-14482 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > Based on our tests, gzip decompression is very slow (< 100MB/s), making > queries decompression bound. Snappy can decompress at ~ 500MB/s on a single > core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client
[ https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-17531: - Fix Version/s: 1.6.3 > Don't initialize Hive Listeners for the Execution Client > > > Key: SPARK-17531 > URL: https://issues.apache.org/jira/browse/SPARK-17531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 1.6.3, 2.0.1, 2.1.0 > > > If a user provides listeners inside the Hive Conf, the configuration for > these listeners are passed to the Hive Execution Client as well. This may > cause issues for two reasons: > 1. The Execution Client will actually generate garbage > 2. The listener class needs to be both in the Spark Classpath and Hive > Classpath > The configuration can be overwritten within > {code} > HiveUtils.newTemporaryConfiguration > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17530) Add Statistics into DESCRIBE FORMATTED
[ https://issues.apache.org/jira/browse/SPARK-17530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-17530. --- Resolution: Fixed Assignee: Xiao Li Fix Version/s: 2.1.0 > Add Statistics into DESCRIBE FORMATTED > -- > > Key: SPARK-17530 > URL: https://issues.apache.org/jira/browse/SPARK-17530 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > Statistics is missing in the output of `DESCRIBE FORMATTED`. We should add it > and also improve the output of statistics in `DESCRIBE EXTENDED`.
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488642#comment-15488642 ] Cody Koeninger commented on SPARK-15406: 1. How can we avoid duplicate work like this? There was months of radio silence, Reynold finally said on list no one had gotten to it so I said I'd be happy to work on it. Now you post a design doc with a presumptive already upcoming PR for your preferred implementation. You guys have gotten a lot of flack for not having open development, and this process isn't helping. Are you taking anything in my work into account? 2. Subscribe and assign still need structured arguments, not strings 3. Prefixing keys with kafka is still a hack that would be avoided with actual structured configuration > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
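Editorial note: point 3 in the comment above refers to flattening structured settings into string keys carrying a "kafka." prefix. The sketch below illustrates that convention neutrally (the option names are assumed for illustration, not taken from the source design): the prefix must be stripped back out at runtime, and structured values like topic lists travel as comma-separated strings rather than typed arguments.

```python
# Prefix-key convention sketch: structured Kafka settings flattened into
# string options and recovered by prefix-stripping.

def extract_kafka_params(options, prefix="kafka."):
    """Recover the Kafka-bound settings by stripping the prefix."""
    return {k[len(prefix):]: v for k, v in options.items() if k.startswith(prefix)}

options = {
    "kafka.bootstrap.servers": "host:9092",
    "subscribe": "topic-a,topic-b",   # a flat string, not a typed topic list
}
print(extract_kafka_params(options))  # {'bootstrap.servers': 'host:9092'}
```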
[jira] [Updated] (SPARK-17445) Reference an ASF page as the main place to find third-party packages
[ https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-17445: - Target Version/s: 2.0.1, 2.1.0 > Reference an ASF page as the main place to find third-party packages > > > Key: SPARK-17445 > URL: https://issues.apache.org/jira/browse/SPARK-17445 > Project: Spark > Issue Type: Improvement >Reporter: Matei Zaharia > > Some comments and docs like > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151 > say to go to spark-packages.org, but since this is a package index > maintained by a third party, it would be better to reference an ASF page that > we can keep updated and own the URL for. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17097) Pregel does not keep vertex state properly; fails to terminate
[ https://issues.apache.org/jira/browse/SPARK-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488612#comment-15488612 ] Kevin Rossi commented on SPARK-17097: - I am away and I will return on 16 September 2016. Thank you, Kevin Kevin Rossi | ross...@llnl.gov | Office: 925-423-6737 > Pregel does not keep vertex state properly; fails to terminate > --- > > Key: SPARK-17097 > URL: https://issues.apache.org/jira/browse/SPARK-17097 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.0 > Environment: Scala 2.10.5, Spark 1.6.0 with GraphX and Pregel >Reporter: Seth Bromberger > > Consider the following minimum example: > {code:title=PregelBug.scala|borderStyle=solid} > package testGraph > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.graphx.{Edge, EdgeTriplet, Graph, _} > object PregelBug { > def main(args: Array[String]) = { > //FIXME breaks if TestVertex is a case class; works if not case class > case class TestVertex(inId: VertexId, > inData: String, > inLabels: collection.mutable.HashSet[String]) extends > Serializable { > val id = inId > val value = inData > val labels = inLabels > } > class TestLink(inSrc: VertexId, inDst: VertexId, inData: String) extends > Serializable { > val src = inSrc > val dst = inDst > val data = inData > } > val startString = "XXXSTARTXXX" > val conf = new SparkConf().setAppName("pregeltest").setMaster("local[*]") > val sc = new SparkContext(conf) > val vertexes = Vector( > new TestVertex(0, "label0", collection.mutable.HashSet[String]()), > new TestVertex(1, "label1", collection.mutable.HashSet[String]()) > ) > val links = Vector( > new TestLink(0, 1, "linkData01") > ) > val vertexes_packaged = vertexes.map(v => (v.id, v)) > val links_packaged = links.map(e => Edge(e.src, e.dst, e)) > val graph = Graph[TestVertex, > TestLink](sc.parallelize(vertexes_packaged), sc.parallelize(links_packaged)) > def vertexProgram (vertexId: VertexId, vdata: TestVertex, 
message: > Vector[String]): TestVertex = { > message.foreach { > case `startString` => > if (vdata.id == 0L) > vdata.labels.add(vdata.value) > case m => > if (!vdata.labels.contains(m)) > vdata.labels.add(m) > } > new TestVertex(vdata.id, vdata.value, vdata.labels) > } > def sendMessage (triplet: EdgeTriplet[TestVertex, TestLink]): > Iterator[(VertexId, Vector[String])] = { > val srcLabels = triplet.srcAttr.labels > val dstLabels = triplet.dstAttr.labels > val msgsSrcDst = srcLabels.diff(dstLabels) > .map(label => (triplet.dstAttr.id, Vector[String](label))) > val msgsDstSrc = dstLabels.diff(dstLabels) > .map(label => (triplet.srcAttr.id, Vector[String](label))) > msgsSrcDst.toIterator ++ msgsDstSrc.toIterator > } > def mergeMessage (m1: Vector[String], m2: Vector[String]): Vector[String] > = m1.union(m2).distinct > val g = graph.pregel(Vector[String](startString))(vertexProgram, > sendMessage, mergeMessage) > println("---pregel done---") > println("vertex info:") > g.vertices.foreach( > v => { > val labels = v._2.labels > println( > "vertex " + v._1 + > ": name = " + v._2.id + > ", labels = " + labels) > } > ) > } > } > {code} > This code never terminates even though we expect it to. To fix, we simply > remove the "case" designation for the TestVertex class (see FIXME comment), > and then it behaves as expected. > (Apologies if this has been fixed in later versions; we're unfortunately > pegged to 2.10.5 / 1.6.0 for now.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17097) Pregel does not keep vertex state properly; fails to terminate
[ https://issues.apache.org/jira/browse/SPARK-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488609#comment-15488609 ] ding commented on SPARK-17097: -- I am afraid the attached sample code's failure to terminate when TestVertex is a case class is not caused by a Pregel or GraphX bug. In the sample code, the vertex state (inLabels here) is updated in place, which is not supposed to happen. Because of the in-place update, after the transform operations the original vertex holds exactly the same value as the newly generated vertex whenever VD is a case class, since case class equality is structural. GraphX therefore sees no difference between the old and new vertices and fails to refresh the EdgeTriplet. As a result the triplet's dstLabels is always empty while srcLabels contains a value, so there is always an active message, which prevents Pregel from terminating. One way to fix the problem is to remove the in-place update in vertexProgram by cloning the labels and making the update in the cloned set. I have tried the code below and it works:
{code}
def vertexProgram(vertexId: VertexId, vdata: TestVertex, message: Vector[String]): TestVertex = {
  val labels = vdata.labels.clone()
  message.foreach {
    case `startString` =>
      if (vdata.id == 0L)
        labels.add(vdata.value)
    case m =>
      if (!vdata.labels.contains(m))
        labels.add(m)
  }
  new TestVertex(vdata.id, vdata.value, labels)
}
{code}
Hope this information is helpful to you. 
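The in-place update pitfall described in the comment above can be reproduced without Spark at all. A minimal sketch in plain Scala (the V class below is illustrative, standing in for TestVertex; it is not code from any patch):
{code}
import scala.collection.mutable

// Mirrors TestVertex: a case class carrying a mutable field.
case class V(id: Long, labels: mutable.HashSet[String])

object CaseClassEqualityDemo {
  def main(args: Array[String]): Unit = {
    val old = V(0L, mutable.HashSet("a"))
    val next = V(old.id, old.labels) // shares the same HashSet instance

    // In-place update: both the "old" and the "new" vertex see it.
    next.labels.add("b")

    // Case classes compare structurally, so change detection based on
    // equality concludes nothing happened and the triplet never refreshes.
    println(old == next)   // true

    // Cloning first keeps the old value intact, so the update is visible.
    val orig = V(0L, mutable.HashSet("a"))
    val fixed = V(orig.id, orig.labels.clone())
    fixed.labels.add("b")
    println(orig == fixed) // false
  }
}
{code}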
> Pregel does not keep vertex state properly; fails to terminate > --- > > Key: SPARK-17097 > URL: https://issues.apache.org/jira/browse/SPARK-17097 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.6.0 > Environment: Scala 2.10.5, Spark 1.6.0 with GraphX and Pregel >Reporter: Seth Bromberger > > Consider the following minimum example: > {code:title=PregelBug.scala|borderStyle=solid} > package testGraph > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.graphx.{Edge, EdgeTriplet, Graph, _} > object PregelBug { > def main(args: Array[String]) = { > //FIXME breaks if TestVertex is a case class; works if not case class > case class TestVertex(inId: VertexId, > inData: String, > inLabels: collection.mutable.HashSet[String]) extends > Serializable { > val id = inId > val value = inData > val labels = inLabels > } > class TestLink(inSrc: VertexId, inDst: VertexId, inData: String) extends > Serializable { > val src = inSrc > val dst = inDst > val data = inData > } > val startString = "XXXSTARTXXX" > val conf = new SparkConf().setAppName("pregeltest").setMaster("local[*]") > val sc = new SparkContext(conf) > val vertexes = Vector( > new TestVertex(0, "label0", collection.mutable.HashSet[String]()), > new TestVertex(1, "label1", collection.mutable.HashSet[String]()) > ) > val links = Vector( > new TestLink(0, 1, "linkData01") > ) > val vertexes_packaged = vertexes.map(v => (v.id, v)) > val links_packaged = links.map(e => Edge(e.src, e.dst, e)) > val graph = Graph[TestVertex, > TestLink](sc.parallelize(vertexes_packaged), sc.parallelize(links_packaged)) > def vertexProgram (vertexId: VertexId, vdata: TestVertex, message: > Vector[String]): TestVertex = { > message.foreach { > case `startString` => > if (vdata.id == 0L) > vdata.labels.add(vdata.value) > case m => > if (!vdata.labels.contains(m)) > vdata.labels.add(m) > } > new TestVertex(vdata.id, vdata.value, vdata.labels) > } > def sendMessage (triplet: 
EdgeTriplet[TestVertex, TestLink]): > Iterator[(VertexId, Vector[String])] = { > val srcLabels = triplet.srcAttr.labels > val dstLabels = triplet.dstAttr.labels > val msgsSrcDst = srcLabels.diff(dstLabels) > .map(label => (triplet.dstAttr.id, Vector[String](label))) > val msgsDstSrc = dstLabels.diff(dstLabels) > .map(label => (triplet.srcAttr.id, Vector[String](label))) > msgsSrcDst.toIterator ++ msgsDstSrc.toIterator > } > def mergeMessage (m1: Vector[String], m2: Vector[String]): Vector[String] > = m1.union(m2).distinct > val g = graph.pregel(Vector[String](startString))(vertexProgram, > sendMessage, mergeMessage) > println("---pregel done---") > println("vertex info:") > g.vertices.foreach( > v => { > val labels = v._2.labels > println( > "vertex " + v._1 + > ": name = " + v._2.id + > ", labels = " + labels) > } > ) > } > } > {code} > This code never terminates even though we expect it to. To fix, we simply > remove the "case" designation for the TestVertex class (see FIXME comment),
[jira] [Resolved] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client
[ https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-17531. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 15086 [https://github.com/apache/spark/pull/15086] > Don't initialize Hive Listeners for the Execution Client > > > Key: SPARK-17531 > URL: https://issues.apache.org/jira/browse/SPARK-17531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Burak Yavuz > Fix For: 2.0.1, 2.1.0 > > > If a user provides listeners inside the Hive Conf, the configuration for > these listeners are passed to the Hive Execution Client as well. This may > cause issues for two reasons: > 1. The Execution Client will actually generate garbage > 2. The listener class needs to be both in the Spark Classpath and Hive > Classpath > The configuration can be overwritten within > {code} > HiveUtils.newTemporaryConfiguration > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client
[ https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-17531: - Assignee: Burak Yavuz > Don't initialize Hive Listeners for the Execution Client > > > Key: SPARK-17531 > URL: https://issues.apache.org/jira/browse/SPARK-17531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz > Fix For: 2.0.1, 2.1.0 > > > If a user provides listeners inside the Hive Conf, the configuration for > these listeners are passed to the Hive Execution Client as well. This may > cause issues for two reasons: > 1. The Execution Client will actually generate garbage > 2. The listener class needs to be both in the Spark Classpath and Hive > Classpath > The configuration can be overwritten within > {code} > HiveUtils.newTemporaryConfiguration > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17532) Add thread lock information from JMX to thread dump UI
[ https://issues.apache.org/jira/browse/SPARK-17532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17532: Assignee: (was: Apache Spark) > Add thread lock information from JMX to thread dump UI > -- > > Key: SPARK-17532 > URL: https://issues.apache.org/jira/browse/SPARK-17532 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Ryan Blue > > It would be much easier to track down issues like SPARK-15725 if the thread > dump page in the web UI displayed debugging info about what locks are > currently held by threads. This is already available through the > [ThreadMXBean|https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html] > that is used to get stack traces, thead ids, and names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17532) Add thread lock information from JMX to thread dump UI
[ https://issues.apache.org/jira/browse/SPARK-17532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488601#comment-15488601 ] Apache Spark commented on SPARK-17532: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/15088 > Add thread lock information from JMX to thread dump UI > -- > > Key: SPARK-17532 > URL: https://issues.apache.org/jira/browse/SPARK-17532 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Ryan Blue > > It would be much easier to track down issues like SPARK-15725 if the thread > dump page in the web UI displayed debugging info about what locks are > currently held by threads. This is already available through the > [ThreadMXBean|https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html] > that is used to get stack traces, thead ids, and names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17532) Add thread lock information from JMX to thread dump UI
[ https://issues.apache.org/jira/browse/SPARK-17532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17532: Assignee: Apache Spark > Add thread lock information from JMX to thread dump UI > -- > > Key: SPARK-17532 > URL: https://issues.apache.org/jira/browse/SPARK-17532 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Ryan Blue >Assignee: Apache Spark > > It would be much easier to track down issues like SPARK-15725 if the thread > dump page in the web UI displayed debugging info about what locks are > currently held by threads. This is already available through the > [ThreadMXBean|https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html] > that is used to get stack traces, thead ids, and names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488595#comment-15488595 ] Tathagata Das commented on SPARK-15406: --- 1. We will have a PR very soon so that you can start playing with it. In fact, it would be awesome if you could play with it and give feedback. 2. Version 1 will have just the Subscribe, SubscribePattern and Assign strategies, without the ability to specify starting offsets. Maybe this wasn't super clear in the design, and it will be clear when you look at the actual code. 3. Json stuff - Version 1 will not have the ability to specify starting offsets, so those hacks are not there. That's precisely what I wanted to do with version 1: not have anything that cannot be supported in the future. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488558#comment-15488558 ] Cody Koeninger commented on SPARK-15406: So you're saying the type of K and V will always be bytearray, custom kafka deserializers won't be used, and message schema will be handled by mapping after the fact? That's potentially OK, but I'd feel a lot better with a working example. I can play around with that a bit. Regarding the statefulness of kafka consumers, I just mean you can't construct a kafka consumer that is already subscribed and has correct positions; you need to instantiate it and then call side-effecting operations on it in order to get it into the right state. Regarding Json convenience constructors, that's not purely additive. If we know that's the goal, we shouldn't use hacks like comma-separated strings or prefixed keys now and get people relying on them. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17532) Add thread lock information from JMX to thread dump UI
Ryan Blue created SPARK-17532: - Summary: Add thread lock information from JMX to thread dump UI Key: SPARK-17532 URL: https://issues.apache.org/jira/browse/SPARK-17532 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Ryan Blue It would be much easier to track down issues like SPARK-15725 if the thread dump page in the web UI displayed debugging info about what locks are currently held by threads. This is already available through the [ThreadMXBean|https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html] that is used to get stack traces, thread ids, and names. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
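For reference, the lock information the issue asks to surface is exposed directly by the JDK's ThreadMXBean. A minimal sketch of reading it in plain Scala (illustrative only, not the actual patch):
{code}
import java.lang.management.ManagementFactory

object ThreadLockInfoDemo {
  def main(args: Array[String]): Unit = {
    val mx = ManagementFactory.getThreadMXBean
    // (lockedMonitors = true, lockedSynchronizers = true) asks the JVM to
    // report held locks in addition to the plain stack traces.
    for (info <- mx.dumpAllThreads(true, true)) {
      val waitingOn = Option(info.getLockName).getOrElse("-")
      println(s"${info.getThreadId} ${info.getThreadName} " +
        s"state=${info.getThreadState} waitingOn=$waitingOn")
      info.getLockedMonitors.foreach(m => println(s"  holds monitor $m"))
      info.getLockedSynchronizers.foreach(l => println(s"  holds synchronizer $l"))
    }
  }
}
{code}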
[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488528#comment-15488528 ] Dhruve Ashar commented on SPARK-16441: -- All of these are related in one way or another. I will share my observations so far: I have taken multiple thread dumps and analyzed what is causing the SparkListenerBus to block. The ExecutorAllocationManager is heavily synchronized, and it is a hotspot for contention, especially because schedule is invoked every 100ms to balance the executors. So for a Spark job running thousands of executors with dynamic allocation enabled, any change in the status of these executors leads to frequent firing of events. This causes contention and leads to blocking, specifically when the calls are remote RPCs. Also, the current design of the listener bus is such that it waits for every listener to process an event before it can proceed to deliver the next event from the queue. Any wait to acquire the locks causes the event queue to fill up fast and leads to events being dropped. Logging the individual execution times does not conclusively show that updateAndSync consumes the majority of the time, so we are working on reducing the lock contention and minimizing the RPC calls. Another parameter to look at is the heartbeat interval: having a small interval with a very large number of executors will aggravate the problem, as you would be getting frequent ExecutorMetricsUpdate events from the running executors. > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > > spark application are waiting for rpc response all the time and spark > listener are blocked by dynamic allocation. Executors can not connect to > driver and lost. 
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > 
org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at >
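The delivery model described in the comments above (one bus thread, every listener processed in turn, a bounded event queue) can be sketched as a toy model. Everything here is illustrative, not Spark's actual LiveListenerBus code:
{code}
import java.util.concurrent.ArrayBlockingQueue

object ListenerBusToyModel {
  def main(args: Array[String]): Unit = {
    val queue = new ArrayBlockingQueue[Int](10) // bounded, like the real bus
    val lock = new Object
    var dropped = 0

    // Single delivery thread: the next event is not taken until every
    // listener has finished with the current one.
    val delivery = new Thread(new Runnable {
      def run(): Unit = while (true) {
        queue.take()
        lock.synchronized { Thread.sleep(50) } // listener stuck on a hot lock
      }
    })
    delivery.setDaemon(true)
    delivery.start()

    // Events arrive faster than the blocked listener drains them; once the
    // bounded queue is full they are dropped, matching the reported behavior.
    for (e <- 1 to 100) if (!queue.offer(e)) dropped += 1
    println(s"dropped $dropped events")
  }
}
{code}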
[jira] [Commented] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client
[ https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488498#comment-15488498 ] Apache Spark commented on SPARK-17531: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/15087 > Don't initialize Hive Listeners for the Execution Client > > > Key: SPARK-17531 > URL: https://issues.apache.org/jira/browse/SPARK-17531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Burak Yavuz > > If a user provides listeners inside the Hive Conf, the configuration for > these listeners are passed to the Hive Execution Client as well. This may > cause issues for two reasons: > 1. The Execution Client will actually generate garbage > 2. The listener class needs to be both in the Spark Classpath and Hive > Classpath > The configuration can be overwritten within > {code} > HiveUtils.newTemporaryConfiguration > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488496#comment-15488496 ] Michael Armbrust commented on SPARK-15406: -- For the types that are coming out, the SQL way would be to do conversions using either expressions, udfs, or as (in Datasets with encoders). This way the schema would default to binary, but you can convert to the actual format after the DataFrame is returned. I'm not sure I understand the reasons why a consumer would be stateful or how that is compatible with the model proposed in structured streaming. Can you explain this use case further? For configuration, I think its totally reasonable to have some way to pass in configuration that does not involve constructing json strings manually. Thats an example of something that is purely additive. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
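To make the "convert after the DataFrame is returned" idea above concrete, here is a hypothetical sketch assuming a source that exposes key and value as binary columns. The "kafka" format name and the "subscribe" option key are assumptions for illustration, not the settled API:
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object BinaryValueConversionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Hypothetical source: key and value come back as BinaryType columns.
    val raw = spark.readStream
      .format("kafka")               // assumed source name
      .option("subscribe", "events") // assumed option key
      .load()

    // No custom Kafka deserializer: the conversion is an ordinary
    // expression applied after the DataFrame is returned.
    val decoded = raw.select(
      col("key").cast("string").as("key"),
      col("value").cast("string").as("value"))
  }
}
{code}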
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488477#comment-15488477 ] Michael Armbrust commented on SPARK-15406: -- Streaming is labeled experimental, we can continue to change it during the 2.x.x line of Spark. In fact we are even adding features during 2.0.x. At the same time I would hope that we don't have to throw everything out after we release something to users. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488487#comment-15488487 ] Ofir Manor commented on SPARK-15406: Cody, I think you are right. Now is the right time to spend a several days iterating over the design. Commiting a new, half-baked API that will need to be maintained for years is exactly what we all try to avoid. Of course, v1 doesn't have to implement a lot of features, but it should be future-compatible with the rest of the planned improvements. (as a side note - I haven't seen a timeline for 2.0.1 being discussed, so I'm not sure what are the expectations regarding "v1" delivery) > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488465#comment-15488465 ] Ashwin Shankar commented on SPARK-16441: hey, Thanks! I don't think either of the patches would help unfortunately. The first patch is related to preemption, and I don't see the WARN message mentioned in that jira description in my logs. The second patch which you're working on is related to fixing ExecutorAllocationManager#removeExecutor. However the problem I see and what [~cenyuhai] has mentioned in this ticket is related to ExecutorAllocationManager#updateAndSyncNumExecutorsTarget not releasing the lock, and blocking SparkListenerBus thread. I'm debugging this as well on my side but do you know why ExecutorAllocationManager#updateAndSyncNumExecutorsTarget is stuck talking to cluster manager? > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > > spark application are waiting for rpc response all the time and spark > listener are blocked by dynamic allocation. Executors can not connect to > driver and lost. 
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > 
org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at >
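The two stack traces above show the pattern concretely: `updateAndSyncNumExecutorsTarget` makes a blocking `askWithRetry` RPC while synchronized on the `ExecutorAllocationManager` monitor, and the listener's `onTaskEnd` blocks waiting for that same monitor. A minimal sketch of the hazard and the usual remedy follows; the class and method names are illustrative stand-ins, not Spark's actual code.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch, not Spark's code: holding a monitor across a
// blocking RPC stalls every other thread that needs the same monitor.
class AllocationManagerSketch {
    interface Rpc { void requestTotalExecutors(int n) throws InterruptedException; }

    private int target = 0;

    // Mirrors the hang: the whole method runs under the manager's monitor,
    // so a slow cluster-manager RPC blocks listeners for the RPC timeout.
    synchronized void scheduleHoldingLock(Rpc rpc) throws InterruptedException {
        target += 1;
        rpc.requestTotalExecutors(target); // may park here for minutes
    }

    // Safer shape: mutate state under the lock, perform the RPC outside it.
    void scheduleOutsideLock(Rpc rpc) throws InterruptedException {
        int snapshot;
        synchronized (this) {
            target += 1;
            snapshot = target;
        }
        rpc.requestTotalExecutors(snapshot);
    }

    // Stand-in for ExecutorAllocationListener.onTaskEnd, which also
    // synchronizes on the manager and is what blocks SparkListenerBus.
    synchronized int onTaskEnd() {
        return target;
    }
}
```

With `scheduleOutsideLock`, a listener callback completes promptly even while the RPC is in flight, because the monitor is released before the blocking call is made.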
[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488452#comment-15488452 ] Cody Koeninger edited comment on SPARK-15406 at 9/13/16 9:18 PM: - So I asked this twice with no answer, so I'll ask it a third time. Are you willing to later throw out whatever ships with 2.0.1 in order to do things right? If not, that would certainly be a concern that could prevent us from doing stuff in 2. was (Author: c...@koeninger.org): So if I asked this twice with no answer, so I'll ask it a third time. Are you willing to later throw out whatever ships with 2.0.1 in order to do things right? If not, that would certainly be a concern that could prevent us from doing stuff in 2. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488420#comment-15488420 ] Tathagata Das edited comment on SPARK-15406 at 9/13/16 9:04 PM: [Combining the comments in the doc and on the JIRA] [~c...@koeninger.org] Thank you very much for highlighting the issues in more detail. I agree that string-string is NOT SUFFICIENT, and configurations may need to be passed in non-string form. In fact, kafka configurations allow [string-to-object|http://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#KafkaConsumer(java.util.Map)] configs to be passed, and some configurations do need to be passed as non-string objects (integers, etc.). So eventually, it's very likely that we will have to come up with a solution that accommodates such cases. All I am suggesting now is: 1. Let's put in something simple that works generically across all languages, but that may not be complete. 2. After that, let's discuss and design something that makes it complete, and this may include new APIs, language-specific stuff, etc. Rather than trying to come up with a complete solution and delaying the release, I feel that this approach allows us to give something to the community covering the most common use cases as soon as possible. Even if it is incomplete, the community can start testing ASAP and provide more concrete feedback on what it would take to make this feature complete. Do you think this plan is okay? If so, do you think there is anything in the design for 1 that prevents us from doing stuff in 2? was (Author: tdas): [Combining the comments in the doc and on the JIRA] Thank you very much for highlighting the issues in more details. And I agree that string-string is NOT SUFFICIENT, and configurations may need to be passed in non-string form.
In fact, kafka configurations allow [string-to-object|http://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#KafkaConsumer(java.util.Map)] configs to be passed, and some configuration do need to be passed as non-string objects (integers, etc.). So eventually, its very likely that we have to come up with a solution that accommodates such stuff. All I am suggesting now is 1. Let's put in something simple that works generically across all languages, but that may not be complete. 2. After that, let's discuss and design something that makes it complete, and this may include new APIs, language-specific stuff, etc. Rather than trying come up with a complete solutions and delay the release, I feel that this approach allows us to give something to the community with most common usecases, as soon as possible. Even if it is incomplete, the community can start testing ASAP and providing more concrete feedback on what it would take to make this feature complete. Do you think this plan is okay? If so, do you think there is anything in the design for 1, that prevents us from doing stuff in 2? > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. 
I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
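The string-to-object point conceded above can be seen in miniature. The config keys below are real Kafka consumer keys, but the map is a plain illustration and is not wired to an actual consumer: some values are integers or booleans, so a string-string options map cannot carry them without inventing a lossy encode/decode convention per source.

```java
import java.util.HashMap;
import java.util.Map;

class TypedConfigDemo {
    // KafkaConsumer's Map<String, Object> constructor accepts typed values;
    // a string-string options map would force a lossy encode/decode step.
    static Map<String, Object> consumerConfig() {
        Map<String, Object> cfg = new HashMap<>();
        cfg.put("bootstrap.servers", "host:9092"); // naturally a string
        cfg.put("max.poll.records", 500);          // Integer, not "500"
        cfg.put("enable.auto.commit", false);      // Boolean, not "false"
        return cfg;
    }
}
```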
[jira] [Assigned] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client
[ https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17531: Assignee: Apache Spark > Don't initialize Hive Listeners for the Execution Client > > > Key: SPARK-17531 > URL: https://issues.apache.org/jira/browse/SPARK-17531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Burak Yavuz >Assignee: Apache Spark > > If a user provides listeners inside the Hive Conf, the configuration for > these listeners are passed to the Hive Execution Client as well. This may > cause issues for two reasons: > 1. The Execution Client will actually generate garbage > 2. The listener class needs to be both in the Spark Classpath and Hive > Classpath > The configuration can be overwritten within > {code} > HiveUtils.newTemporaryConfiguration > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
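The overwrite the description points at can be sketched generically. The metastore listener keys below are real Hive configuration names to the best of my knowledge, but the helper itself is an illustration of the idea, not Spark's actual `HiveUtils.newTemporaryConfiguration` code: the temporary execution-client configuration blanks out listener classes so the execution client never instantiates them.

```java
import java.util.HashMap;
import java.util.Map;

class ExecutionConfSketch {
    // Listener keys that should not reach the execution client; the real
    // fix would live in HiveUtils.newTemporaryConfiguration.
    static final String[] LISTENER_KEYS = {
        "hive.metastore.event.listeners",
        "hive.metastore.pre.event.listeners",
        "hive.metastore.end.function.listeners",
    };

    // Returns a copy of the user's conf with listeners blanked, so the
    // execution client generates no listener garbage and needs no extra
    // classes on its classpath.
    static Map<String, String> withoutListeners(Map<String, String> userConf) {
        Map<String, String> conf = new HashMap<>(userConf);
        for (String key : LISTENER_KEYS) {
            conf.put(key, ""); // empty value: nothing gets loaded
        }
        return conf;
    }
}
```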
[jira] [Commented] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client
[ https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488337#comment-15488337 ] Apache Spark commented on SPARK-17531: -- User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/15086
[jira] [Assigned] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client
[ https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17531: Assignee: (was: Apache Spark)
[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488312#comment-15488312 ] Cody Koeninger edited comment on SPARK-15406 at 9/13/16 8:26 PM: - Unless I'm misunderstanding, you answered regarding a source, not a source provider. How does an end user pass in a parameterized type through this interface to indicate that their keys are e.g. longs, or MyType? If this is easy, can you show a code example? Using reflection off strings for ConsumerStrategy means it must have a zero arg constructor, which is doable but again awkward compared to a serialized instance. If you agree json isn't going to cut it, are you actually going to agree to make changes, or is this going to be another 'well we're stuck with it because binary compatibility'? was (Author: c...@koeninger.org): Unless I'm misunderstanding, you answered regarding a source, not a source provider. How does an end user pass in a parameterized type through this interface to indicate that their keys are e.g. longs, or MyType? If this is easy, can you show a code example? Using reflection off strings for ConsumerStrategy means it must have a zero arg constructor, which is doable but again awkward compared to a serialized instance. If you agree neon isn't going to cut it, are you actually going to agree to make changes, or is this going to be another 'well we're stuck with it because binary compatibility'? > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. 
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
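The reflection constraint raised above can be made concrete. The classes below are a hypothetical illustration, not Spark's actual `ConsumerStrategy` API: instantiating a strategy from its class name via reflection can only reach a zero-arg constructor, whereas handing over a constructed instance lets the caller pass parameters (typed offsets, a custom deserializer, etc.) directly.

```java
// Hypothetical strategy with both a no-arg and a parameterized constructor.
class SubscribeStrategy {
    final String topic;
    SubscribeStrategy() { this.topic = "default-topic"; }   // reachable via reflection
    SubscribeStrategy(String topic) { this.topic = topic; } // instance-passing only
}

class StrategyFactory {
    // Reflection from a class reference: works only when a zero-arg
    // constructor exists, and cannot forward any caller parameters.
    static SubscribeStrategy fromClass(Class<? extends SubscribeStrategy> c) throws Exception {
        return c.getDeclaredConstructor().newInstance();
    }
}
```

The zero-arg path is workable but, as noted, awkward compared to accepting a serialized or directly constructed instance.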
[jira] [Assigned] (SPARK-17484) Race condition when cancelling a job during a cache write can lead to block fetch failures
[ https://issues.apache.org/jira/browse/SPARK-17484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17484: Assignee: Josh Rosen (was: Apache Spark) > Race condition when cancelling a job during a cache write can lead to block > fetch failures > -- > > Key: SPARK-17484 > URL: https://issues.apache.org/jira/browse/SPARK-17484 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > On a production cluster, I observed the following weird behavior where a > block manager cached a block, the store failed due to a task being killed / > cancelled, and then a subsequent task incorrectly attempted to read the > cached block from the machine where the write failed, eventually leading to a > complete job failure. > Here's the executor log snippet from the machine performing the failed cache: > {code} > 16/09/06 16:10:31 INFO MemoryStore: Block rdd_25_1 stored as values in memory > (estimated size 976.8 MB, free 9.8 GB) > 16/09/06 16:10:31 WARN BlockManager: Putting block rdd_25_1 failed > 16/09/06 16:10:31 INFO Executor: Executor killed task 0.0 in stage 3.0 (TID > 127) > {code} > Here's the exception from the reader in the failed job: > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in > stage 46.0 failed 4 times, most recent failure: Lost task 4.3 in stage 46.0 > (TID 1484, 10.69.255.197): org.apache.spark.storage.BlockFetchException: > Failed to fetch block after 1 fetch failures. 
Most recent failure cause: > at > org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:565) > at > org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522) > at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609) > at > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661) > {code} > I believe that there's a race condition in how we handle cleanup after failed > cache stores. Here's an excerpt from {{BlockManager.doPut()}} > {code} > var blockWasSuccessfullyStored: Boolean = false > val result: Option[T] = try { > val res = putBody(putBlockInfo) > blockWasSuccessfullyStored = res.isEmpty > res > } finally { > if (blockWasSuccessfullyStored) { > if (keepReadLock) { > blockInfoManager.downgradeLock(blockId) > } else { > blockInfoManager.unlock(blockId) > } > } else { > blockInfoManager.removeBlock(blockId) > logWarning(s"Putting block $blockId failed") > } > } > {code} > The only way that I think this "successfully stored followed by immediately > failed" case could appear in our logs is if the local memory store write > succeeds and then an exception (perhaps InterruptedException) causes us to > enter the {{finally}} block's error-cleanup path. The problem is that the > {{finally}} block only cleans up the block's metadata rather than performing > the full cleanup path which would also notify the master that the block is no > longer available at this host. > The fact that the Spark task was not resilient in the face of remote block > fetches is a separate issue which I'll report and fix separately. The scope > of this JIRA, however, is the fact that Spark still attempted reads from a > machine which was missing the block. > In order to fix this problem, I think that the {{finally}} block should > perform more thorough cleanup and should send a "block removed" status update > to the master following any error during the write. 
This is necessary because > the body of {{doPut()}} may have already notified the master of block > availability. In addition, we can extend the block serving code path to > automatically update the master with "block deleted" statuses whenever the > block manager receives invalid requests for blocks that it doesn't have.
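The fix direction described above can be sketched as follows. The names are illustrative stand-ins, not Spark's `BlockManager` API: on a failed put, the error path both drops the local metadata and records a "block removed" status update for the master, since the put body may already have announced the block's availability.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch of doPut()'s error cleanup with the proposed
// master notification added; not Spark's actual implementation.
class DoPutSketch {
    final Set<String> localBlocks = new HashSet<>();
    final List<String> masterUpdates = new ArrayList<>();

    <T> T doPut(String blockId, Supplier<T> putBody) {
        boolean stored = false;
        localBlocks.add(blockId);
        try {
            T res = putBody.get();   // may notify the master before failing
            stored = true;
            return res;
        } finally {
            if (!stored) {
                localBlocks.remove(blockId);              // existing metadata cleanup
                masterUpdates.add("removed:" + blockId);  // proposed "block removed" update
            }
        }
    }
}
```

With the extra status update, a reader that consults the master after a killed put no longer sees the failed host as holding the block.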
[jira] [Assigned] (SPARK-17484) Race condition when cancelling a job during a cache write can lead to block fetch failures
[ https://issues.apache.org/jira/browse/SPARK-17484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17484: Assignee: Apache Spark (was: Josh Rosen)
[jira] [Commented] (SPARK-17484) Race condition when cancelling a job during a cache write can lead to block fetch failures
[ https://issues.apache.org/jira/browse/SPARK-17484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488273#comment-15488273 ] Apache Spark commented on SPARK-17484: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15085
[jira] [Updated] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client
[ https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz updated SPARK-17531: Description: If a user provides listeners inside the Hive Conf, the configuration for these listeners are passed to the Hive Execution Client as well. This may cause issues for two reasons: 1. The Execution Client will actually generate garbage 2. The listener class needs to be both in the Spark Classpath and Hive Classpath The configuration can be overwritten within {code} HiveUtils.newTemporaryConfiguration {code}
[jira] [Created] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client
Burak Yavuz created SPARK-17531: --- Summary: Don't initialize Hive Listeners for the Execution Client Key: SPARK-17531 URL: https://issues.apache.org/jira/browse/SPARK-17531 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Burak Yavuz
[jira] [Commented] (SPARK-17529) On highly skewed data, outer join merges are slow
[ https://issues.apache.org/jira/browse/SPARK-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488106#comment-15488106 ] Apache Spark commented on SPARK-17529: -- User 'davidnavas' has created a pull request for this issue: https://github.com/apache/spark/pull/15084 > On highly skewed data, outer join merges are slow > - > > Key: SPARK-17529 > URL: https://issues.apache.org/jira/browse/SPARK-17529 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: David C Navas > > All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the > same performance problem. > My co-worker Yewei Zhang was investigating a performance problem with a > highly skewed dataset. > "Part of this query performs a full outer join over [an ID] on highly skewed > data. On the left view, there is one record for id = 0 out of 2,272,486 > records; On the right view there are 8,353,097 records for id = 0 out of > 12,685,073 records" > The sub-query was taking 5.2 minutes. We discovered that snappy was > responsible for some measure of this problem and incorporated the new snappy > release. This brought the sub-query down to 2.4 minutes. A large percentage > of the remaining time was spent in the merge code which I tracked down to a > BitSet clearing issue. We have noted that you already have the snappy fix, > this issue describes the problem with the BitSet. > The BitSet grows to handle the largest matching set of keys and is used to > track joins. The BitSet is re-used in all subsequent joins (unless it is too > small) > The skewing of our data caused a very large BitSet to be allocated on the > very first row joined. Unfortunately, the entire BitSet is cleared on each > re-use. For each of the remaining rows which likely match only a few rows on > the other side, the entire 1MB of the BitSet is cleared. If unpartitioned, > this would happen roughly 6 million times. 
The fix I developed for this is > to clear only the portion of the BitSet that is needed. After applying it, > the sub-query dropped from 2.4 minutes to 29 seconds. > Small (0 or negative) IDs are often used as place-holders for null, empty, or > unknown data, so I expect this fix to be generally useful, if rarely > encountered to this particular degree.
[jira] [Assigned] (SPARK-17529) On highly skewed data, outer join merges are slow
[ https://issues.apache.org/jira/browse/SPARK-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17529: Assignee: Apache Spark > On highly skewed data, outer join merges are slow > - > > Key: SPARK-17529 > URL: https://issues.apache.org/jira/browse/SPARK-17529 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: David C Navas >Assignee: Apache Spark > > All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the > same performance problem. > My co-worker Yewei Zhang was investigating a performance problem with a > highly skewed dataset. > "Part of this query performs a full outer join over [an ID] on highly skewed > data. On the left view, there is one record for id = 0 out of 2,272,486 > records; On the right view there are 8,353,097 records for id = 0 out of > 12,685,073 records" > The sub-query was taking 5.2 minutes. We discovered that snappy was > responsible for some measure of this problem and incorporated the new snappy > release. This brought the sub-query down to 2.4 minutes. A large percentage > of the remaining time was spent in the merge code which I tracked down to a > BitSet clearing issue. We have noted that you already have the snappy fix, > this issue describes the problem with the BitSet. > The BitSet grows to handle the largest matching set of keys and is used to > track joins. The BitSet is re-used in all subsequent joins (unless it is too > small) > The skewing of our data caused a very large BitSet to be allocated on the > very first row joined. Unfortunately, the entire BitSet is cleared on each > re-use. For each of the remaining rows which likely match only a few rows on > the other side, the entire 1MB of the BitSet is cleared. If unpartitioned, > this would happen roughly 6 million times. The fix I developed for this is > to clear only the portion of the BitSet that is needed. 
After applying it, > the sub-query dropped from 2.4 minutes to 29 seconds. > Small (0 or negative) IDs are often used as place-holders for null, empty, or > unknown data, so I expect this fix to be generally useful, if rarely > encountered to this particular degree.
[jira] [Assigned] (SPARK-17529) On highly skewed data, outer join merges are slow
[ https://issues.apache.org/jira/browse/SPARK-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17529: Assignee: (was: Apache Spark) > On highly skewed data, outer join merges are slow > - > > Key: SPARK-17529 > URL: https://issues.apache.org/jira/browse/SPARK-17529 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: David C Navas > > All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the > same performance problem. > My co-worker Yewei Zhang was investigating a performance problem with a > highly skewed dataset. > "Part of this query performs a full outer join over [an ID] on highly skewed > data. On the left view, there is one record for id = 0 out of 2,272,486 > records; On the right view there are 8,353,097 records for id = 0 out of > 12,685,073 records" > The sub-query was taking 5.2 minutes. We discovered that snappy was > responsible for some measure of this problem and incorporated the new snappy > release. This brought the sub-query down to 2.4 minutes. A large percentage > of the remaining time was spent in the merge code which I tracked down to a > BitSet clearing issue. We have noted that you already have the snappy fix, > this issue describes the problem with the BitSet. > The BitSet grows to handle the largest matching set of keys and is used to > track joins. The BitSet is re-used in all subsequent joins (unless it is too > small) > The skewing of our data caused a very large BitSet to be allocated on the > very first row joined. Unfortunately, the entire BitSet is cleared on each > re-use. For each of the remaining rows which likely match only a few rows on > the other side, the entire 1MB of the BitSet is cleared. If unpartitioned, > this would happen roughly 6 million times. The fix I developed for this is > to clear only the portion of the BitSet that is needed. 
After applying it, > the sub-query dropped from 2.4 minutes to 29 seconds. > Small (0 or negative) IDs are often used as place-holders for null, empty, or > unknown data, so I expect this fix to be generally useful, if rarely > encountered to this particular degree.
[jira] [Updated] (SPARK-17529) On highly skewed data, outer join merges are slow
[ https://issues.apache.org/jira/browse/SPARK-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra updated SPARK-17529: - Priority: Major (was: Trivial) > On highly skewed data, outer join merges are slow > - > > Key: SPARK-17529 > URL: https://issues.apache.org/jira/browse/SPARK-17529 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 >Reporter: David C Navas > > All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the > same performance problem. > My co-worker Yewei Zhang was investigating a performance problem with a > highly skewed dataset. > "Part of this query performs a full outer join over [an ID] on highly skewed > data. On the left view, there is one record for id = 0 out of 2,272,486 > records; On the right view there are 8,353,097 records for id = 0 out of > 12,685,073 records" > The sub-query was taking 5.2 minutes. We discovered that snappy was > responsible for some measure of this problem and incorporated the new snappy > release. This brought the sub-query down to 2.4 minutes. A large percentage > of the remaining time was spent in the merge code which I tracked down to a > BitSet clearing issue. We have noted that you already have the snappy fix, > this issue describes the problem with the BitSet. > The BitSet grows to handle the largest matching set of keys and is used to > track joins. The BitSet is re-used in all subsequent joins (unless it is too > small) > The skewing of our data caused a very large BitSet to be allocated on the > very first row joined. Unfortunately, the entire BitSet is cleared on each > re-use. For each of the remaining rows which likely match only a few rows on > the other side, the entire 1MB of the BitSet is cleared. If unpartitioned, > this would happen roughly 6 million times. The fix I developed for this is > to clear only the portion of the BitSet that is needed. 
After applying it, > the sub-query dropped from 2.4 minutes to 29 seconds. > Small (0 or negative) IDs are often used as place-holders for null, empty, or > unknown data, so I expect this fix to be generally useful, if rarely > encountered to this particular degree.
[jira] [Assigned] (SPARK-17530) Add Statistics into DESCRIBE FORMATTED
[ https://issues.apache.org/jira/browse/SPARK-17530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17530: Assignee: (was: Apache Spark) > Add Statistics into DESCRIBE FORMATTED > -- > > Key: SPARK-17530 > URL: https://issues.apache.org/jira/browse/SPARK-17530 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > Statistics is missing in the output of `DESCRIBE FORMATTED`. We should add it > and also improve the output of statistics in `DESCRIBE EXTENDED`.
[jira] [Assigned] (SPARK-17530) Add Statistics into DESCRIBE FORMATTED
[ https://issues.apache.org/jira/browse/SPARK-17530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17530: Assignee: Apache Spark > Add Statistics into DESCRIBE FORMATTED > -- > > Key: SPARK-17530 > URL: https://issues.apache.org/jira/browse/SPARK-17530 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Statistics is missing in the output of `DESCRIBE FORMATTED`. We should add it > and also improve the output of statistics in `DESCRIBE EXTENDED`.
[jira] [Commented] (SPARK-17530) Add Statistics into DESCRIBE FORMATTED
[ https://issues.apache.org/jira/browse/SPARK-17530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488090#comment-15488090 ] Apache Spark commented on SPARK-17530: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/15083 > Add Statistics into DESCRIBE FORMATTED > -- > > Key: SPARK-17530 > URL: https://issues.apache.org/jira/browse/SPARK-17530 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li > > Statistics is missing in the output of `DESCRIBE FORMATTED`. We should add it > and also improve the output of statistics in `DESCRIBE EXTENDED`.
[jira] [Created] (SPARK-17530) Add Statistics into DESCRIBE FORMATTED
Xiao Li created SPARK-17530: --- Summary: Add Statistics into DESCRIBE FORMATTED Key: SPARK-17530 URL: https://issues.apache.org/jira/browse/SPARK-17530 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Xiao Li Statistics is missing in the output of `DESCRIBE FORMATTED`. We should add it and also improve the output of statistics in `DESCRIBE EXTENDED`.
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488079#comment-15488079 ] Tathagata Das commented on SPARK-15406: --- Here are some thoughts. - The key-value types should be communicated as constructor parameters of the Kafka Source object. This is how we do it in the FileStreamSource: we pass the schema through the constructor, which is used to create the DF in getBatch - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L39 - A custom consumer strategy implementation can be loaded by specifying the class name as a string and loading it by reflection, the same way Kafka loads serializers/deserializers, Parquet loads compression libraries, etc. This may not be the only solution, and we should definitely discuss this in the future. - I completely agree that asking users to generate JSON strings themselves is not gonna cut it. That's why this needs more focused discussion on this specific topic, so that we can bounce around ideas ranging from providing a simple map-to-JSON helper to adding extra methods to the DataStreamReader. These are very good questions which do not have clear answers, and we should file specific follow-up JIRAs to discuss them with the community and experienced Kafka users like you. I am hoping this initial design doc puts the basic framework in place without painting ourselves into a corner. And I think with the current design, we don't block any improvements in the future. Please let us know if you think we are doing so in any specific use case. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming.
Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
[jira] [Created] (SPARK-17529) On highly skewed data, outer join merges are slow
David C Navas created SPARK-17529: - Summary: On highly skewed data, outer join merges are slow Key: SPARK-17529 URL: https://issues.apache.org/jira/browse/SPARK-17529 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.0.0, 1.6.2 Reporter: David C Navas Priority: Trivial All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the same performance problem. My co-worker Yewei Zhang was investigating a performance problem with a highly skewed dataset. "Part of this query performs a full outer join over [an ID] on highly skewed data. On the left view, there is one record for id = 0 out of 2,272,486 records; On the right view there are 8,353,097 records for id = 0 out of 12,685,073 records" The sub-query was taking 5.2 minutes. We discovered that snappy was responsible for some measure of this problem and incorporated the new snappy release. This brought the sub-query down to 2.4 minutes. A large percentage of the remaining time was spent in the merge code which I tracked down to a BitSet clearing issue. We have noted that you already have the snappy fix, this issue describes the problem with the BitSet. The BitSet grows to handle the largest matching set of keys and is used to track joins. The BitSet is re-used in all subsequent joins (unless it is too small) The skewing of our data caused a very large BitSet to be allocated on the very first row joined. Unfortunately, the entire BitSet is cleared on each re-use. For each of the remaining rows which likely match only a few rows on the other side, the entire 1MB of the BitSet is cleared. If unpartitioned, this would happen roughly 6 million times. The fix I developed for this is to clear only the portion of the BitSet that is needed. After applying it, the sub-query dropped from 2.4 minutes to 29 seconds. 
Small (0 or negative) IDs are often used as place-holders for null, empty, or unknown data, so I expect this fix to be generally useful, if rarely encountered to this particular degree.
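The partial-clear idea described above can be sketched with java.util.BitSet, which exposes a ranged clear(from, to). The sketch tracks the highest bit touched since the last reset and clears only that prefix, so a few-match row costs a few words of clearing instead of the full ~1MB. Class and method names are illustrative; this is not Spark's actual sort-merge-join code, which uses its own BitSet implementation.

```java
import java.util.BitSet;

/** Illustrative sketch: reuse one large BitSet across merge-join keys,
 *  clearing only the range the previous key actually touched. */
class ReusableMatchTracker {
    // Sized once for the worst (skewed) key: 2^23 bits = 1 MB of backing longs.
    private final BitSet matched = new BitSet(1 << 23);
    private int highWater = 0; // one past the highest index set since last reset

    void markMatched(int i) {
        matched.set(i);
        if (i >= highWater) highWater = i + 1;
    }

    boolean isMatched(int i) {
        return matched.get(i);
    }

    /** Instead of matched.clear() — which walks the entire backing array —
     *  clear only [0, highWater). Cheap when the current key matched few rows. */
    void reset() {
        matched.clear(0, highWater);
        highWater = 0;
    }
}
```

With the skewed dataset in the report, the expensive full clear would otherwise run once per remaining row (roughly 6 million times); the ranged clear makes each of those resets proportional to the small number of bits actually set.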
[jira] [Commented] (SPARK-10816) API design: window and session specification
[ https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488035#comment-15488035 ] Maciej Bryński commented on SPARK-10816: Hi, any updates on Session Window? > API design: window and session specification > > > Key: SPARK-10816 > URL: https://issues.apache.org/jira/browse/SPARK-10816 > Project: Spark > Issue Type: Sub-task > Components: SQL, Streaming >Reporter: Reynold Xin >
[jira] [Commented] (SPARK-17450) spark sql rownumber OOM
[ https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488005#comment-15488005 ] Herman van Hovell commented on SPARK-17450: --- https://github.com/apache/spark/pull/10605 > spark sql rownumber OOM > --- > > Key: SPARK-17450 > URL: https://issues.apache.org/jira/browse/SPARK-17450 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: cen yuhai > > spark sql will be OOM when using row_number() over too much sorted records... > There will be only 1 task to handle all records > This sql group by passenger_id, we have 100 million passengers. > {code} > SELECT > passenger_id, > total_order, > (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND > 670800 THEN 'V3' END) AS order_rank > FROM > ( > SELECT > passenger_id, > 1 as total_order > FROM table > GROUP BY passenger_id > ) dd1 > {code} > {code} > java.lang.OutOfMemoryError: Java heap space > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536) > at > org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304) > at > org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:104) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > physical plan: > {code} > == Physical Plan == > Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 > <= 670800)) THEN V3 AS order_rank#1] > +- Window [passenger_id#7L,total_order#0], > [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS > _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber() > windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND > UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC] >+- Sort [total_order#0 DESC], false, 0 > +- TungstenExchange SinglePartition, None > +- Project [passenger_id#7L,total_order#0] > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L,total_order#0]) >+- TungstenExchange 
hashpartitioning(passenger_id#7L,1000), > None > +- TungstenAggregate(key=[passenger_id#7L], functions=[], > output=[passenger_id#7L]) > +- Project [passenger_id#7L] > +- Filter product#9 IN (kuai,gulf) >+- HiveTableScan [passenger_id#7L,product#9], > MetastoreRelation pbs_dw, dwv_order_whole_day, None, > {code}
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487930#comment-15487930 ] Cody Koeninger commented on SPARK-15406: Specific examples: Kafka has a type for a key, and a type for a value, with deserializers corresponding to those types. In other words, I need to construct a KafkaRDD[K, V] in the getBatch method. How do I communicate a parameterized type for K and V through this interface? ConsumerStrategy allows for user-defined implementations. This is necessary because getting kafka consumers set up correctly is stateful, and one size doesn't fit all. How do I communicate a user-defined ConsumerStrategy through this interface? More prosaically, telling Scala end users that the way they need to communicate a mapping from topicpartition objects to starting offsets is to pass in a json string... if I was evaluating a library new from an outside perspective and saw that, I'd say nope and walk away. Any language that can handle json can handle nested maps, so at the very least that seems like a better lowest common denominator than string. Again, I'm not trying to be obstructionist here, I am generally in favor of doing the simplest thing that works. But I really have very little confidence that doing the expedient thing now isn't going to prevent us from doing the right thing later. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. 
I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
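Cody's point about key/value types can be made concrete with a small generic sketch: a source whose deserializers travel through the constructor as type parameters, which is exactly the information a stringly-typed Map<String,String> options interface cannot carry. All names here are hypothetical illustrations, not a proposed Spark API.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Hypothetical sketch of a source parameterized by key/value types K and V.
 *  The deserializers are constructor parameters, so getBatch can produce
 *  correctly typed records — something a flat string-options map cannot express. */
class TypedKafkaSource<K, V> {
    private final Function<byte[], K> keyDeserializer;
    private final Function<byte[], V> valueDeserializer;

    TypedKafkaSource(Function<byte[], K> keyDes, Function<byte[], V> valueDes) {
        this.keyDeserializer = keyDes;
        this.valueDeserializer = valueDes;
    }

    K key(byte[] raw)   { return keyDeserializer.apply(raw); }
    V value(byte[] raw) { return valueDeserializer.apply(raw); }
}
```

The reflection-based alternative TDas mentions (passing a class name as a string and instantiating it) recovers user-defined behavior but not compile-time types; that trade-off is the crux of this thread.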
[jira] [Updated] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-15406: -- Description: This is the parent JIRA to track all the work for the building a Kafka source for Structured Streaming. Here is the design doc for an initial version of the Kafka Source. https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing == Old description = Structured streaming doesn't have support for kafka yet. I personally feel like time based indexing would make for a much better interface, but it's been pushed back to kafka 0.10.1 https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index was: Structured streaming doesn't have support for kafka yet. I personally feel like time based indexing would make for a much better interface, but it's been pushed back to kafka 0.10.1 https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > This is the parent JIRA to track all the work for the building a Kafka source > for Structured Streaming. Here is the design doc for an initial version of > the Kafka Source. > https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing > == Old description = > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
[jira] [Commented] (SPARK-17528) MutableProjection should not cache content from the input row
[ https://issues.apache.org/jira/browse/SPARK-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487907#comment-15487907 ] Apache Spark commented on SPARK-17528: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/15082 > MutableProjection should not cache content from the input row > - > > Key: SPARK-17528 > URL: https://issues.apache.org/jira/browse/SPARK-17528 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Assigned] (SPARK-17528) MutableProjection should not cache content from the input row
[ https://issues.apache.org/jira/browse/SPARK-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17528: Assignee: Apache Spark (was: Wenchen Fan) > MutableProjection should not cache content from the input row > - > > Key: SPARK-17528 > URL: https://issues.apache.org/jira/browse/SPARK-17528 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark >
[jira] [Assigned] (SPARK-17528) MutableProjection should not cache content from the input row
[ https://issues.apache.org/jira/browse/SPARK-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17528: Assignee: Wenchen Fan (was: Apache Spark) > MutableProjection should not cache content from the input row > - > > Key: SPARK-17528 > URL: https://issues.apache.org/jira/browse/SPARK-17528 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487875#comment-15487875 ] Michael Armbrust commented on SPARK-15406: -- Hey Cody, thanks for the input and for sharing your prototype! While I think it's fair to say that we don't have the resources to bring Python DStreams up to par with Scala at this time, that is not the case for the Spark SQL APIs. We already have pretty good parity between Scala and Python here, and I'd really like to maintain/improve that. Can you help me understand better what specifically is broken with the stringly typed interface? Ideally, we'd find ways to work around this, but that said, I'm not opposed to also having a typed interface in addition to the DataStreamReader/Writer. We do this in a few exceptional cases already (i.e. we take an RDD[String] into the JSON data source). > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
[jira] [Created] (SPARK-17528) MutableProjection should not cache content from the input row
Wenchen Fan created SPARK-17528: --- Summary: MutableProjection should not cache content from the input row Key: SPARK-17528 URL: https://issues.apache.org/jira/browse/SPARK-17528 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Created] (SPARK-17527) mergeSchema with `_OPTIONAL_` metadata fails
Gaurav Shah created SPARK-17527: --- Summary: mergeSchema with `_OPTIONAL_` metadata fails Key: SPARK-17527 URL: https://issues.apache.org/jira/browse/SPARK-17527 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Environment: mac osx 10.11.6, ubuntu 14, ubuntu 16. spark 2.0.0, spark-catalyst 2.0.0 Reporter: Gaurav Shah
Spark added `_OPTIONAL_` metadata in 2.0.0 in the following commit: https://github.com/apache/spark/commit/4637fc08a3733ec313218fb7e4d05064d9a6262d but merging metadata for data created from spark 1.6.x and 2.0 fails with the following:
{code}
Exception in thread "main" java.lang.RuntimeException: could not merge metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values:
{code}
The only difference between those values is that the newer metadata has an extra "_OPTIONAL_" field:
{code:javascript}
{
  "name": "catalog",
  "type": {
    "type": "struct",
    "fields": [
      {
        "name": "category",
        "type": "string",
        "nullable": true,
        "metadata": {}
      },
      {
        "name": "department",
        "type": "string",
        "nullable": true,
        "metadata": {}
      }
    ]
  },
  "nullable": true,
  "metadata": {
    "_OPTIONAL_": true
  }
}
{code}
vs
{code:javascript}
{
  "name": "catalog",
  "type": {
    "type": "struct",
    "fields": [
      {
        "name": "category",
        "type": "string",
        "nullable": true,
        "metadata": {}
      },
      {
        "name": "department",
        "type": "string",
        "nullable": true,
        "metadata": {}
      }
    ]
  },
  "nullable": true,
  "metadata": {}
}
{code}
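The failure mode — a merge that requires byte-for-byte equal values for the same metadata key — can be reproduced with a few lines of plain Python. This is an illustrative model of the conflict check, not Spark's actual Parquet code: two schema JSON strings that differ only by an extra `_OPTIONAL_` entry still compare unequal, so a strict key/value merge raises.

```python
import json

def merge_key_value_metadata(a: dict, b: dict) -> dict:
    # Strict merge, modeled on Parquet's key/value metadata merge:
    # the same key must map to an identical value in both files.
    merged = dict(a)
    for key, value in b.items():
        if key in merged and merged[key] != value:
            raise RuntimeError(
                f"could not merge metadata: key {key} has conflicting values")
        merged[key] = value
    return merged

# Simplified stand-ins for the 1.6.x and 2.0 schema metadata.
field_16 = {"name": "catalog", "nullable": True, "metadata": {}}
field_20 = {"name": "catalog", "nullable": True,
            "metadata": {"_OPTIONAL_": True}}

old_meta = {"org.apache.spark.sql.parquet.row.metadata": json.dumps(field_16)}
new_meta = {"org.apache.spark.sql.parquet.row.metadata": json.dumps(field_20)}

try:
    merge_key_value_metadata(old_meta, new_meta)
    conflict = False
except RuntimeError:
    conflict = True
```

Because the merge compares the serialized values as opaque strings, even a semantically harmless extra key like `_OPTIONAL_` is treated as a hard conflict.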
[jira] [Commented] (SPARK-17525) SparkContext.clearFiles() still present in the PySpark bindings though the underlying Scala method was removed in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-17525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487789#comment-15487789 ] Apache Spark commented on SPARK-17525: -- User 'sjakthol' has created a pull request for this issue: https://github.com/apache/spark/pull/15081 > SparkContext.clearFiles() still present in the PySpark bindings though the > underlying Scala method was removed in Spark 2.0 > --- > > Key: SPARK-17525 > URL: https://issues.apache.org/jira/browse/SPARK-17525 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Sami Jaktholm >Priority: Trivial > > STR: In PySpark shell, run sc.clearFiles() > What happens: > {noformat} > py4j.protocol.Py4JError: An error occurred while calling o74.clearFiles. > Trace: > py4j.Py4JException: Method clearFiles([]) does not exist > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) > at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) > at py4j.Gateway.invoke(Gateway.java:272) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > {noformat} > Apparently the old and deprecated SparkContext.clearFiles() was removed from > Spark 2.0 but it's still present in the PySpark API. It should be removed > from there too.
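The Py4JException arises because the Python wrapper forwards method calls by name to the JVM object, so a Python-side method can keep existing after the Java side dropped it. A minimal model of that reflective dispatch (hypothetical class names, not py4j internals):

```python
class JavaBackend:
    # Stand-in for the Spark 2.0 JVM SparkContext: clearFiles() is gone.
    def addFile(self, path):
        return f"added {path}"

class PyWrapper:
    # Mimics a reflective gateway: the method is resolved on the backend
    # only at call time, so stale Python-side callers fail at runtime.
    def __init__(self, backend):
        self._backend = backend

    def __getattr__(self, name):
        def call(*args):
            method = getattr(self._backend, name, None)
            if method is None:
                raise RuntimeError(f"Method {name}([]) does not exist")
            return method(*args)
        return call

sc = PyWrapper(JavaBackend())
try:
    sc.clearFiles()  # stale Python API; removed on the "JVM" side
    stale_call_failed = False
except RuntimeError:
    stale_call_failed = True
```

No static check catches the mismatch, which is why the removal on the Scala side had to be mirrored by removing the PySpark wrapper by hand.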
[jira] [Assigned] (SPARK-17525) SparkContext.clearFiles() still present in the PySpark bindings though the underlying Scala method was removed in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-17525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17525: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-17525) SparkContext.clearFiles() still present in the PySpark bindings though the underlying Scala method was removed in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-17525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17525: Assignee: Apache Spark
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487780#comment-15487780 ] Frederick Reiss commented on SPARK-15406: - +1 for taking the simple route in the short term. I'm seeing a lot of people at my work who are just kicking the tires on Structured Streaming at this point. Right now, we just need a Kafka connector that is correct and easy to use; performance and advanced functionality are of secondary importance.
[jira] [Commented] (SPARK-17252) Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors during query parsing
[ https://issues.apache.org/jira/browse/SPARK-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487749#comment-15487749 ] Apache Spark commented on SPARK-17252: -- User 'sjakthol' has created a pull request for this issue: https://github.com/apache/spark/pull/15081 > Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors > during query parsing > - > > Key: SPARK-17252 > URL: https://issues.apache.org/jira/browse/SPARK-17252 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen > Fix For: 2.0.1 > > > The following example fails with a ClassCastException: > {code} > create table t(d double); > insert into t VALUES (1 * 1.0); > {code} > Here's the error: > {code} > java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be > cast to java.lang.Integer > at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106) > at scala.math.Numeric$IntIsIntegral$.times(Numeric.scala:57) > at > org.apache.spark.sql.catalyst.expressions.Multiply.nullSafeEval(arithmetic.scala:207) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:416) > at > org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198) > at > org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypeCreator.scala:198) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:320) > at > 
org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:677) > at > org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:674) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:674) > at > org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:658) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:658) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:43) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableContext.accept(SqlBaseParser.java:9358) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitInlineTableDefault1(SqlBaseBaseVisitor.java:608) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableDefault1Context.accept(SqlBaseParser.java:7073) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57) > at > org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitQueryTermDefault(SqlBaseBaseVisitor.java:580) > at > org.apache.spark.sql.catalyst.parser.SqlBaseParser$QueryTermDefaultContext.accept(SqlBaseParser.java:6895) > at > 
org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:47) > at > org.apache.spark.sql.catalyst.parser.AstBuilder.plan(AstBuilder.scala:83) > at > org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:158) > at > org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:162) > at > org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96) > at >
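The stack trace shows `Multiply.nullSafeEval` unboxing an operand according to the expression's declared integer type while the actual runtime value is a Decimal — that is, the inline-table path in `AstBuilder.visitInlineTable` evaluates the expression before type coercion has reconciled the operand types. A toy evaluator (not Catalyst code; names are illustrative) showing why folding before coercion fails and why coercing first avoids it:

```python
def unbox_int(v):
    # Models unboxToInt: the runtime value must really be an int.
    if not isinstance(v, int):
        raise TypeError(f"{type(v).__name__} cannot be cast to int")
    return v

def eval_multiply(declared_type, left, right):
    # Models Multiply.nullSafeEval: it trusts the declared type and
    # unboxes operands accordingly.
    if declared_type is int:
        return unbox_int(left) * unbox_int(right)
    return float(left) * float(right)

# Folding `1 * 1.0` before type coercion: the declared type is still int,
# but one operand is already a decimal value — the Spark analogue of
# "Decimal cannot be cast to java.lang.Integer".
try:
    eval_multiply(int, 1, 1.0)
    early_fold_failed = False
except TypeError:
    early_fold_failed = True

def coerce_then_eval(left, right):
    # Run type coercion first: promote both operands to the wider type,
    # then evaluate with a declared type that matches the runtime values.
    common = float if isinstance(left, float) or isinstance(right, float) else int
    return eval_multiply(common, common(left), common(right))
```

The fix direction this suggests is to run the analyzer's coercion rules (or defer evaluation) before constant-folding expressions inside VALUES.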
[jira] [Assigned] (SPARK-17526) Display the executor log links with the job failure message on Spark UI and Console
[ https://issues.apache.org/jira/browse/SPARK-17526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17526: Assignee: (was: Apache Spark) > Display the executor log links with the job failure message on Spark UI and > Console > --- > > Key: SPARK-17526 > URL: https://issues.apache.org/jira/browse/SPARK-17526 > Project: Spark > Issue Type: Improvement > Components: Web UI >Reporter: Zhan Zhang >Priority: Minor > > Display the executor log links with the job failure message on Spark UI and > Console > "Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most > recent failure: Lost task 0.3 in stage 0.0 (TID 3, HostName): > java.lang.Exception: foo" > To make this failure message more helpful, we should have the executor log > link in the driver log and web ui as well on which the task failed.
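The improvement amounts to carrying the executor's log URLs alongside the task failure and appending them to the message. A sketch of the enrichment (function and field names are illustrative, not Spark's TaskEndReason fields):

```python
def format_task_failure(stage, task_id, tid, host, exception, executor_logs=None):
    # Mirrors the "Lost task ... in stage ..." message, extended with the
    # executor log links when the scheduler knows them.
    msg = f"Lost task {task_id} in stage {stage} (TID {tid}, {host}): {exception}"
    if executor_logs:
        links = ", ".join(f"{name}: {url}"
                          for name, url in sorted(executor_logs.items()))
        msg += f" [executor logs: {links}]"
    return msg

msg = format_task_failure(
    "0.3", "0.0", 3, "HostName", "java.lang.Exception: foo",
    executor_logs={"stdout": "http://host:8042/logs/stdout"})
```

When the log URLs are absent (e.g. local mode), the message degrades to the current form, so the change is backward compatible.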
[jira] [Assigned] (SPARK-17526) Display the executor log links with the job failure message on Spark UI and Console
[ https://issues.apache.org/jira/browse/SPARK-17526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17526: Assignee: Apache Spark
[jira] [Commented] (SPARK-17526) Display the executor log links with the job failure message on Spark UI and Console
[ https://issues.apache.org/jira/browse/SPARK-17526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487705#comment-15487705 ] Apache Spark commented on SPARK-17526: -- User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/15080
[jira] [Comment Edited] (SPARK-17397) Show example of what to do when awaitTermination() throws an Exception
[ https://issues.apache.org/jira/browse/SPARK-17397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487691#comment-15487691 ] Spiro Michaylov edited comment on SPARK-17397 at 9/13/16 4:37 PM: -- In the form of a PR? Sure, I can do that. If I don't get to it in the next couple of days it'll take a few weeks, but I'll eventually get to it. Feel free to assign it to me if that's the process. (Looks like I can't do it myself.) was (Author: spirom): In the form of a PR? Sure, I can do that. If I don't get to it in the next couple of days it'll take a few weeks, but I'll eventually get to it. I'll assign it to myself but if someone wants to yank it back that's fine. > Show example of what to do when awaitTermination() throws an Exception > -- > > Key: SPARK-17397 > URL: https://issues.apache.org/jira/browse/SPARK-17397 > Project: Spark > Issue Type: Improvement > Components: Documentation, Streaming >Affects Versions: 2.0.0 > Environment: Linux, Scala but probably general >Reporter: Spiro Michaylov >Priority: Minor > > When awaitTermination propagates an exception that was thrown in processing a > batch, the StreamingContext keeps running. Perhaps this is by design, but I > don't see any mention of it in the API docs or the streaming programming > guide. It's not clear what idiom should be used to block the thread until the > context HAS been stopped in a situation where stream processing is throwing > lots of exceptions. > For example, in the following, streaming takes the full 30 seconds to > terminate. My hope in asking this is to improve my own understanding and > perhaps inspire documentation improvements. I'm not filing a bug because it's > not clear to me whether this is working as intended. 
> {code}
> val conf = new SparkConf().setAppName("ExceptionPropagation").setMaster("local[4]")
> val sc = new SparkContext(conf)
> // streams will produce data every second
> val ssc = new StreamingContext(sc, Seconds(1))
> val qm = new QueueMaker(sc, ssc)
> // create the stream
> val stream = // create some stream
> // register for data
> stream
>   .map(x => { throw new SomeException("something"); x} )
>   .foreachRDD(r => println("*** count = " + r.count()))
> // start streaming
> ssc.start()
> new Thread("Delayed Termination") {
>   override def run() {
>     Thread.sleep(30000)
>     ssc.stop()
>   }
> }.start()
> println("*** producing data")
> // start producing data
> qm.populateQueue()
> try {
>   ssc.awaitTermination()
>   println("*** streaming terminated")
> } catch {
>   case e: Exception => {
>     println("*** streaming exception caught in monitor thread")
>   }
> }
> // if the above goes down the exception path, there seems no
> // good way to block here until the streaming context is stopped
> println("*** done")
> {code}
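The idiom the example is missing — after catching the exception, keep blocking until the context has really stopped rather than falling through — can be modeled with an event. This is an illustrative Python sketch of the monitor pattern, not the StreamingContext API: `await_termination` raises the batch failure immediately, and the caller then waits on a separate stopped signal.

```python
import threading

class ToyStreamingContext:
    # Minimal stand-in for a streaming engine: one "batch" fails, but
    # the context keeps running until stop() is called.
    def __init__(self):
        self._stopped = threading.Event()
        self._failure = RuntimeError("SomeException: something")

    def await_termination(self):
        # Propagates the batch failure, like awaitTermination.
        raise self._failure

    def stop(self):
        self._stopped.set()

    def await_stopped(self, timeout=None):
        # The missing piece: block until the context is actually stopped.
        return self._stopped.wait(timeout)

ssc = ToyStreamingContext()
threading.Timer(0.05, ssc.stop).start()  # delayed-termination thread

caught = False
try:
    ssc.await_termination()
except RuntimeError:
    caught = True
# Instead of printing "done" and exiting, wait for the real stop.
stopped = ssc.await_stopped(timeout=5)
```

In Spark terms, one workaround with the same shape is to loop on `awaitTerminationOrTimeout` after catching the exception until `stop()` has taken effect; whether that is the intended idiom is exactly the documentation question this issue raises.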
[jira] [Assigned] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10408: Assignee: Apache Spark (was: Alexander Ulanov) > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Assignee: Apache Spark > > Goal: Implement various types of autoencoders > Requirements: > 1) Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1], real in [-inf, +inf] > 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3) Denoising autoencoder > 4) Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers > References: > 1. Vincent, Pascal, et al. "Extracting and composing robust features with > denoising autoencoders." Proceedings of the 25th international conference on > Machine learning. ACM, 2008. > http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf > 2. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf > 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. > (2010). Stacked denoising autoencoders: Learning useful representations in a > deep network with a local denoising criterion. Journal of Machine Learning > Research, 11, 3371–3408. > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf > 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep > networks." Advances in neural information processing systems 19 (2007): 153. > http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf
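Requirement 1, a basic autoencoder for real-valued input, reduces in its smallest form to an encoder/decoder pair trained to reconstruct its own input through a narrower code. The dependency-free sketch below trains a linear 2→1→2 autoencoder by gradient descent on rank-1 data (x1 = 2·x0), which a 1-dimensional code can represent exactly; it is illustrative only and unrelated to the MLP-based implementation in the linked PR.

```python
def train_linear_autoencoder(data, lr=0.02, epochs=300):
    # 2-D input, 1-D code: encode c = w.x, decode x_hat_i = v_i * c.
    # A denoising variant (requirement 3) would corrupt x before
    # encoding while keeping the clean x as the reconstruction target.
    w, v = [0.5, 0.1], [0.5, 0.1]
    for _ in range(epochs):
        for x in data:
            c = w[0] * x[0] + w[1] * x[1]
            err = [v[0] * c - x[0], v[1] * c - x[1]]
            # Gradients of the squared reconstruction error.
            gc = err[0] * v[0] + err[1] * v[1]
            gv = [err[0] * c, err[1] * c]
            gw = [gc * x[0], gc * x[1]]
            for i in range(2):
                v[i] -= lr * gv[i]
                w[i] -= lr * gw[i]
    return w, v

def reconstruction_loss(data, w, v):
    total = 0.0
    for x in data:
        c = w[0] * x[0] + w[1] * x[1]
        total += (v[0] * c - x[0]) ** 2 + (v[1] * c - x[1]) ** 2
    return total / len(data)

# Rank-1 data (x1 = 2 * x0) is exactly representable by a 1-D code.
data = [(0.1 * k, 0.2 * k) for k in range(1, 11)]
before = reconstruction_loss(data, [0.5, 0.1], [0.5, 0.1])
w, v = train_linear_autoencoder(data)
after = reconstruction_loss(data, w, v)
```

Stacking (requirement 4) repeats this pattern layer by layer, using each trained code as the input to the next autoencoder.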
[jira] [Commented] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487696#comment-15487696 ] Apache Spark commented on SPARK-10408: -- User 'avulanov' has created a pull request for this issue: https://github.com/apache/spark/pull/13621
[jira] [Assigned] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-10408: Assignee: Alexander Ulanov (was: Apache Spark)
[jira] [Commented] (SPARK-17397) Show example of what to do when awaitTermination() throws an Exception
[ https://issues.apache.org/jira/browse/SPARK-17397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487691#comment-15487691 ] Spiro Michaylov commented on SPARK-17397: - In the form of a PR? Sure, I can do that. If I don't get to it in the next couple of days it'll take a few weeks, but I'll eventually get to it. I'll assign it to myself but if someone wants to yank it back that's fine.
[jira] [Created] (SPARK-17526) Display the executor log links with the job failure message on Spark UI and Console
Zhan Zhang created SPARK-17526: -- Summary: Display the executor log links with the job failure message on Spark UI and Console Key: SPARK-17526 URL: https://issues.apache.org/jira/browse/SPARK-17526 Project: Spark Issue Type: Improvement Components: Web UI Reporter: Zhan Zhang Priority: Minor Display the executor log links with the job failure message on Spark UI and Console: "Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, HostName): java.lang.Exception: foo" To make this failure message more helpful, we should include, in the driver log and the web UI, the executor log link for the executor on which the task failed.
[jira] [Resolved] (SPARK-17142) Complex query triggers binding error in HashAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell resolved SPARK-17142.
---------------------------------------
Resolution: Fixed
Assignee: Jiang Xingbo
Fix Version/s: 2.1.0

> Complex query triggers binding error in HashAggregateExec
> ---------------------------------------------------------
>
> Key: SPARK-17142
> URL: https://issues.apache.org/jira/browse/SPARK-17142
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Josh Rosen
> Assignee: Jiang Xingbo
> Priority: Blocker
> Fix For: 2.1.0
>
> The following example runs successfully on Spark 2.0.0 but fails in the
> current master (as of b72bb62d421840f82d663c6b8e3922bd14383fbb, if not earlier):
> {code}
> spark.sql("set spark.sql.crossJoin.enabled=true")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> sc.parallelize(Seq(0)).toDF("bigint_col_2").createOrReplaceTempView("table_2")
> val query = """
> SELECT
> ((t2.int_col) + (t1.bigint_col_2)) + ((t2.int_col) + (t1.bigint_col_2)) AS int_col_1
> FROM table_2 t1
> INNER JOIN (
> SELECT
> LEAST(IF(False, LAG(0) OVER (ORDER BY t2.int_col_1 DESC), -230), -991) AS int_col,
> (t2.int_col_1) + (t1.int_col_1) AS int_col_2,
> (t1.int_col_1) + (t2.int_col_1) AS int_col_3,
> t2.int_col_1
> FROM
> table_4 t1,
> table_4 t2
> GROUP BY
> (t1.int_col_1) + (t2.int_col_1),
> t2.int_col_1
> ) t2
> WHERE (t2.int_col_3) NOT IN (t2.int_col, t2.int_col_1)
> GROUP BY (t2.int_col) + (t1.bigint_col_2)
> """
> spark.sql(query).collect()
> {code}
> This fails with the following exception:
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: bigint_col_2#65
> at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
> at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
> at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:455)
> at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:454)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487310#comment-15487310 ]

Cody Koeninger commented on SPARK-15406:
----------------------------------------

So we can do the easiest thing possible for 2.0.1, which in this case is probably taking my patch and replacing the hardcoded String, String for the message key/value with a byte array. But if we do that, are you actually willing to allow the changes necessary for people to get functionality comparable to the existing DStream / RDD interface in 2.x.x? Or is this going to be another situation where you say "well, now people are using this interface, so we're stuck with it"?

I'm worried about shoehorning part of the existing Kafka functionality into the SourceProvider interface without considering changes to that interface, especially if Kafka or Kafka-like sources are the only thing intended to work with it. For instance, why are we sticking ourselves with a stringly-typed interface and all the problems that causes, when Reynold has already admitted that Python is always going to be a second-class citizen with regard to streaming (e.g. SPARK-16534)?

> Structured streaming support for consuming from Kafka
> -----------------------------------------------------
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
> Issue Type: New Feature
> Reporter: Cody Koeninger
>
> Structured streaming doesn't have support for Kafka yet. I personally feel
> like time-based indexing would make for a much better interface, but it's
> been pushed back to Kafka 0.10.1:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487224#comment-15487224 ]

Dhruve Ashar commented on SPARK-16441:
--------------------------------------

A recently contributed patch reduces the deadlocking. You can check the details here: https://github.com/apache/spark/commit/a0aac4b775bc8c275f96ad0fbf85c9d8a3690588

I am also working on a patch that improves this further and should be able to complete it soon. It's a WIP and I have a PR up for it: https://github.com/apache/spark/pull/14926. Let us know if your issue is resolved or reduced with the patch.

> Spark application hang when dynamic allocation is enabled
> ---------------------------------------------------------
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.6.2, 2.0.0
> Environment: hadoop 2.7.2 spark1.6.2
> Reporter: cen yuhai
>
> The Spark application waits for an RPC response indefinitely, and the Spark
> listener is blocked by dynamic allocation. Executors cannot connect to the
> driver and are lost.
>
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>    java.lang.Thread.State: TIMED_WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00070fdb94f8> (a scala.concurrent.impl.Promise$CompletionLatch)
>         at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>         at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>         at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>         at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>         at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>         at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>         at scala.concurrent.Await$.result(package.scala:107)
>         at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>         at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>         at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>         at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>         at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>         - locked <0x828a8960> (a org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>         at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>         at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>         at org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>         - locked <0x880e6308> (a org.apache.spark.ExecutorAllocationManager)
>         at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>         - locked <0x880e6308> (a org.apache.spark.ExecutorAllocationManager)
>         at org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
>
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>         - waiting to lock <0x880e6308> (a org.apache.spark.ExecutorAllocationManager)
>         at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>         at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>         at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>         at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>         at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>         at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>         at
[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487203#comment-15487203 ]

Thomas Graves commented on SPARK-17321:
---------------------------------------

Yes, that makes sense, and as I stated, I think the fix for this should be that the Spark shuffle service doesn't use the backup database at all if NM recovery (and the Spark config) aren't enabled. Then you wouldn't have any disk errors. If NM recovery isn't enabled, the Spark DB isn't going to do you any good, because the NM is going to shoot any running containers on restart. If you are up for making those changes, please go ahead and put up a patch.

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --------------------------------------------------------------------------
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.6.2, 2.0.0
> Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed that
> some Spark applications failed randomly due to the YarnShuffleService.
> From the log I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline
> java.lang.NullPointerException
> at org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
> at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use the other good
> disks if the first one is broken?
[jira] [Comment Edited] (SPARK-16938) Cannot resolve column name after a join
[ https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487185#comment-15487185 ]

Dongjoon Hyun edited comment on SPARK-16938 at 9/13/16 1:23 PM:
----------------------------------------------------------------

Hi, [~cloud_fan]. Could you review this issue and PR?

was (Author: dongjoon):
Hi, @cloud-fan. Could you review this issue and PR?

> Cannot resolve column name after a join
> ---------------------------------------
>
> Key: SPARK-16938
> URL: https://issues.apache.org/jira/browse/SPARK-16938
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Mathieu D
> Priority: Minor
>
> Found a change of behavior in spark-2.0.0, which breaks a query in our code base.
> The following works on previous Spark versions, 1.6.1 up to 2.0.0-preview:
> {code}
> val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa")
> val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb")
> dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", "dfb.id"))
> {code}
> but fails with spark-2.0.0 with the exception:
> {code}
> Cannot resolve column name "dfa.id" among (id, a, id, b);
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" among (id, a, id, b);
> at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
> at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818)
> at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817)
> at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814)
> at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
> at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814)
> at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840)
> ...
> {code}
[jira] [Commented] (SPARK-16938) Cannot resolve column name after a join
[ https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487185#comment-15487185 ]

Dongjoon Hyun commented on SPARK-16938:
---------------------------------------

Hi, @cloud-fan. Could you review this issue and PR?

> Cannot resolve column name after a join
> ---------------------------------------
>
> Key: SPARK-16938
> URL: https://issues.apache.org/jira/browse/SPARK-16938
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Mathieu D
> Priority: Minor
>
> Found a change of behavior in spark-2.0.0, which breaks a query in our code base.
> The following works on previous Spark versions, 1.6.1 up to 2.0.0-preview:
> {code}
> val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa")
> val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb")
> dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", "dfb.id"))
> {code}
> but fails with spark-2.0.0 with the exception:
> {code}
> Cannot resolve column name "dfa.id" among (id, a, id, b);
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" among (id, a, id, b);
> at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
> at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818)
> at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817)
> at org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814)
> at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
> at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814)
> at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840)
> ...
> {code}
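The failure mode above can be illustrated with a toy resolver (the `resolve` helper below is a hypothetical illustration, not Spark's actual implementation): after the join, the Dataset's output schema carries only the bare column names, so a qualifier-prefixed name like "dfa.id" no longer matches anything. A minimal Python sketch:

```python
def resolve(col_name, columns):
    """Toy name resolver mirroring dropDuplicates' lookup: the requested
    name is compared against the flat output schema of the join."""
    matches = [c for c in columns if c == col_name]
    if not matches:
        raise ValueError(
            'Cannot resolve column name "%s" among (%s)'
            % (col_name, ", ".join(columns)))
    return matches[0]

# After joining dfa and dfb, the schema is just (id, a, id, b) --
# the "dfa."/"dfb." qualifiers from the aliases are gone.
schema_after_join = ["id", "a", "id", "b"]
print(resolve("id", schema_after_join))  # a bare name still resolves
# resolve("dfa.id", schema_after_join)   # raises: qualified name not found
```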
[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487166#comment-15487166 ]

Alexander Kasper commented on SPARK-17321:
------------------------------------------

No, we're not using NM recovery. What we observed is the following:
- The NM runs with a list of local dirs on various disks
- The shuffle service puts its data into one of these local dirs
- One disk fails, making that local dir unusable; incidentally, it is the one where the shuffle service put its data
- The NM recognizes the disk failure but keeps running happily (it did not have any critical data there? No idea...)
- The shuffle service can't access its data and no longer works on this node

So in essence, the NM has some way of recognizing that a local dir is no longer usable and can keep operating. The shuffle service lacks this functionality and becomes unusable. Does that make sense?

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --------------------------------------------------------------------------
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 1.6.2, 2.0.0
> Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed that
> some Spark applications failed randomly due to the YarnShuffleService.
> From the log I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: Error while initializing Netty pipeline
> java.lang.NullPointerException
> at org.apache.spark.network.server.TransportRequestHandler.<init>(TransportRequestHandler.java:77)
> at org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> This was caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use the other good
> disks if the first one is broken?
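The behaviour being requested can be sketched outside Spark. The hypothetical `first_usable_dir` helper below is an illustration of the idea (probe the configured local dirs rather than always trusting the first entry), not the actual YarnShuffleService code:

```python
import os

def first_usable_dir(local_dirs):
    """Return the first configured directory that exists and is writable,
    skipping directories that live on failed disks; None if all are bad."""
    for d in local_dirs:
        if os.path.isdir(d) and os.access(d, os.W_OK):
            return d
    return None
```

With a list like yarn.nodemanager.local-dirs, a broken first disk would then simply be skipped instead of taking the shuffle service down with it.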
[jira] [Commented] (SPARK-17525) SparkContext.clearFiles() still present in the PySpark bindings though the underlying Scala method was removed in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-17525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487117#comment-15487117 ]

Sean Owen commented on SPARK-17525:
-----------------------------------

Oops, yeah, this was missed in https://github.com/apache/spark/commit/8ce645d4eeda203cf5e100c4bdba2d71edd44e6a#diff-364713d7776956cb8b0a771e9b62f82d

I think you can open a PR for this for 2.0.1+.

> SparkContext.clearFiles() still present in the PySpark bindings though the
> underlying Scala method was removed in Spark 2.0
> ---------------------------------------------------------------------------
>
> Key: SPARK-17525
> URL: https://issues.apache.org/jira/browse/SPARK-17525
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0
> Reporter: Sami Jaktholm
> Priority: Trivial
>
> STR: In PySpark shell, run sc.clearFiles()
> What happens:
> {noformat}
> py4j.protocol.Py4JError: An error occurred while calling o74.clearFiles. Trace:
> py4j.Py4JException: Method clearFiles([]) does not exist
> at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
> at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
> at py4j.Gateway.invoke(Gateway.java:272)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Apparently the old and deprecated SparkContext.clearFiles() was removed from
> Spark 2.0, but it's still present in the PySpark API. It should be removed
> from there too.
[jira] [Assigned] (SPARK-17524) RowBasedKeyValueBatchSuite always uses 64 mb page size
[ https://issues.apache.org/jira/browse/SPARK-17524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17524:
------------------------------------

Assignee: Apache Spark

> RowBasedKeyValueBatchSuite always uses 64 mb page size
> ------------------------------------------------------
>
> Key: SPARK-17524
> URL: https://issues.apache.org/jira/browse/SPARK-17524
> Project: Spark
> Issue Type: Improvement
> Components: Tests
> Affects Versions: 2.1.0
> Reporter: Adam Roberts
> Assignee: Apache Spark
> Priority: Minor
>
> The appendRowUntilExceedingPageSize test at
> sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
> always uses the default page size, 64 MB, when running the test.
> Users with less powerful machines (e.g. those with two cores) may opt to
> choose a smaller spark.buffer.pageSize value in order to prevent problems
> acquiring memory.
> If this size is reduced, say to 1 MB, this test will fail. Here is the
> problem scenario:
> We run with a 1,048,576-byte page size (1 MB); the default is 67,108,864
> bytes (64 MB).
> The test fails with: java.lang.AssertionError: expected:<14563> but was:<932067>
> 932,067 is 64x bigger than 14,563, and the default page size is 64x bigger
> than 1 MB (which is when we see the failure).
> The failure is at:
> Assert.assertEquals(batch.numRows(), numRows);
> This minor improvement has the test use whatever the user has specified
> (it looks for spark.buffer.pageSize) to prevent this problem from occurring
> for anyone testing Apache Spark on a box with a reduced page size.
[jira] [Commented] (SPARK-17524) RowBasedKeyValueBatchSuite always uses 64 mb page size
[ https://issues.apache.org/jira/browse/SPARK-17524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487093#comment-15487093 ]

Apache Spark commented on SPARK-17524:
--------------------------------------

User 'a-roberts' has created a pull request for this issue:
https://github.com/apache/spark/pull/15079

> RowBasedKeyValueBatchSuite always uses 64 mb page size
> ------------------------------------------------------
>
> Key: SPARK-17524
> URL: https://issues.apache.org/jira/browse/SPARK-17524
> Project: Spark
> Issue Type: Improvement
> Components: Tests
> Affects Versions: 2.1.0
> Reporter: Adam Roberts
> Priority: Minor
>
> The appendRowUntilExceedingPageSize test at
> sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
> always uses the default page size, 64 MB, when running the test.
> Users with less powerful machines (e.g. those with two cores) may opt to
> choose a smaller spark.buffer.pageSize value in order to prevent problems
> acquiring memory.
> If this size is reduced, say to 1 MB, this test will fail. Here is the
> problem scenario:
> We run with a 1,048,576-byte page size (1 MB); the default is 67,108,864
> bytes (64 MB).
> The test fails with: java.lang.AssertionError: expected:<14563> but was:<932067>
> 932,067 is 64x bigger than 14,563, and the default page size is 64x bigger
> than 1 MB (which is when we see the failure).
> The failure is at:
> Assert.assertEquals(batch.numRows(), numRows);
> This minor improvement has the test use whatever the user has specified
> (it looks for spark.buffer.pageSize) to prevent this problem from occurring
> for anyone testing Apache Spark on a box with a reduced page size.
[jira] [Assigned] (SPARK-17524) RowBasedKeyValueBatchSuite always uses 64 mb page size
[ https://issues.apache.org/jira/browse/SPARK-17524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17524:
------------------------------------

Assignee: (was: Apache Spark)

> RowBasedKeyValueBatchSuite always uses 64 mb page size
> ------------------------------------------------------
>
> Key: SPARK-17524
> URL: https://issues.apache.org/jira/browse/SPARK-17524
> Project: Spark
> Issue Type: Improvement
> Components: Tests
> Affects Versions: 2.1.0
> Reporter: Adam Roberts
> Priority: Minor
>
> The appendRowUntilExceedingPageSize test at
> sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
> always uses the default page size, 64 MB, when running the test.
> Users with less powerful machines (e.g. those with two cores) may opt to
> choose a smaller spark.buffer.pageSize value in order to prevent problems
> acquiring memory.
> If this size is reduced, say to 1 MB, this test will fail. Here is the
> problem scenario:
> We run with a 1,048,576-byte page size (1 MB); the default is 67,108,864
> bytes (64 MB).
> The test fails with: java.lang.AssertionError: expected:<14563> but was:<932067>
> 932,067 is 64x bigger than 14,563, and the default page size is 64x bigger
> than 1 MB (which is when we see the failure).
> The failure is at:
> Assert.assertEquals(batch.numRows(), numRows);
> This minor improvement has the test use whatever the user has specified
> (it looks for spark.buffer.pageSize) to prevent this problem from occurring
> for anyone testing Apache Spark on a box with a reduced page size.
[jira] [Created] (SPARK-17525) SparkContext.clearFiles() still present in the PySpark bindings though the underlying Scala method was removed in Spark 2.0
Sami Jaktholm created SPARK-17525:
-------------------------------------

Summary: SparkContext.clearFiles() still present in the PySpark bindings though the underlying Scala method was removed in Spark 2.0
Key: SPARK-17525
URL: https://issues.apache.org/jira/browse/SPARK-17525
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 2.0.0
Reporter: Sami Jaktholm
Priority: Trivial

STR: In PySpark shell, run sc.clearFiles()

What happens:
{noformat}
py4j.protocol.Py4JError: An error occurred while calling o74.clearFiles. Trace:
py4j.Py4JException: Method clearFiles([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
{noformat}

Apparently the old and deprecated SparkContext.clearFiles() was removed from Spark 2.0, but it's still present in the PySpark API. It should be removed from there too.
[jira] [Commented] (SPARK-17521) Error when I use sparkContext.makeRDD(Seq())
[ https://issues.apache.org/jira/browse/SPARK-17521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487089#comment-15487089 ]

Sean Owen commented on SPARK-17521:
-----------------------------------

Reaching way back to page [~pwendell] -- do you have any view on whether it's probably right to deprecate makeRDD() at this point in favor of parallelize()?

> Error when I use sparkContext.makeRDD(Seq())
> --------------------------------------------
>
> Key: SPARK-17521
> URL: https://issues.apache.org/jira/browse/SPARK-17521
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.0
> Reporter: WangJianfei
> Priority: Trivial
> Labels: easyfix
>
> When I use sc.makeRDD as below:
> ```
> val data3 = sc.makeRDD(Seq())
> println(data3.partitions.length)
> ```
> I got an error:
> Exception in thread "main" java.lang.IllegalArgumentException: Positive number of slices required
> We can fix this bug by modifying the last line to do a check of seq.size:
> ```
> def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
>   assertNotStopped()
>   val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
>   new ParallelCollectionRDD[T](this, seq.map(_._1), seq.size, indexToPrefs)
> }
> ```
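The proposed fix amounts to clamping the slice count to at least 1. A minimal Python sketch of that logic (the `slice_collection` and `make_rdd_partitions` helpers are hypothetical stand-ins, not Spark's ParallelCollectionRDD), assuming the same "positive number of slices" contract:

```python
def slice_collection(seq, num_slices):
    """Split seq into num_slices contiguous chunks, enforcing the same
    precondition that trips up makeRDD(Seq())."""
    if num_slices < 1:
        raise ValueError("Positive number of slices required")
    return [seq[len(seq) * i // num_slices: len(seq) * (i + 1) // num_slices]
            for i in range(num_slices)]

def make_rdd_partitions(seq):
    # The fix under discussion: never ask for zero slices, so an empty
    # input yields a single empty partition instead of failing.
    return slice_collection(seq, max(len(seq), 1))

print(len(make_rdd_partitions([])))         # 1: one empty partition, no error
print(len(make_rdd_partitions([1, 2, 3])))  # 3
```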
[jira] [Created] (SPARK-17524) RowBasedKeyValueBatchSuite always uses 64 mb page size
Adam Roberts created SPARK-17524:
------------------------------------

Summary: RowBasedKeyValueBatchSuite always uses 64 mb page size
Key: SPARK-17524
URL: https://issues.apache.org/jira/browse/SPARK-17524
Project: Spark
Issue Type: Improvement
Components: Tests
Affects Versions: 2.1.0
Reporter: Adam Roberts
Priority: Minor

The appendRowUntilExceedingPageSize test at sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java always uses the default page size, 64 MB, when running the test.

Users with less powerful machines (e.g. those with two cores) may opt to choose a smaller spark.buffer.pageSize value in order to prevent problems acquiring memory.

If this size is reduced, say to 1 MB, this test will fail. Here is the problem scenario:

We run with a 1,048,576-byte page size (1 MB); the default is 67,108,864 bytes (64 MB).

The test fails with: java.lang.AssertionError: expected:<14563> but was:<932067>

932,067 is 64x bigger than 14,563, and the default page size is 64x bigger than 1 MB (which is when we see the failure).

The failure is at:
Assert.assertEquals(batch.numRows(), numRows);

This minor improvement has the test use whatever the user has specified (it looks for spark.buffer.pageSize) to prevent this problem from occurring for anyone testing Apache Spark on a box with a reduced page size.
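The 64x discrepancy in the report is just the ratio of the two page sizes, and the proposed change is to read the configured value rather than hardcode it. A small Python sketch of both points (the dict-based `conf` lookup is an illustrative assumption, not Spark's configuration API):

```python
DEFAULT_PAGE_SIZE = 64 * 1024 * 1024  # 67,108,864 bytes (64 MB)
REDUCED_PAGE_SIZE = 1 * 1024 * 1024   # 1,048,576 bytes (1 MB)

# The row counts in the assertion failure differ by the same factor
# as the page sizes: 932,067 vs ~14,563 rows is roughly this ratio.
ratio = DEFAULT_PAGE_SIZE // REDUCED_PAGE_SIZE
print(ratio)  # 64

def page_size_for_test(conf):
    # The improvement under discussion: honour spark.buffer.pageSize
    # when the user set it, falling back to the 64 MB default otherwise.
    return conf.get("spark.buffer.pageSize", DEFAULT_PAGE_SIZE)
```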
[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class
[ https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487035#comment-15487035 ]

Yanbo Liang commented on SPARK-17471:
-------------------------------------

[~sethah] I'm sorry, I have some urgent matters to deal with these days, so please feel free to take over this task. Thanks!

> Add compressed method for Matrix class
> --------------------------------------
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Seth Hendrickson
>
> Vectors in Spark have a {{compressed}} method which selects either a sparse or
> dense representation by minimizing storage requirements. Matrices should also
> have this method, which is now explicitly needed in {{LogisticRegression}}
> since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or
> column major, and if nothing is specified should select the lower-storage
> representation (for sparse).
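The selection rule described in the issue (pick whichever representation needs less storage) can be sketched with a rough cost model. The byte sizes and the CSC-style estimate below are illustrative assumptions, not MLlib's actual storage accounting:

```python
def compressed_format(n_rows, n_cols, nnz, value_bytes=8, index_bytes=4):
    """Choose dense vs. sparse by estimated storage cost, in the spirit
    of Vector.compressed: a dense array stores every entry, while a
    CSC-style sparse layout stores values, row indices, and column
    pointers for the nonzeros only."""
    dense_cost = n_rows * n_cols * value_bytes
    sparse_cost = nnz * (value_bytes + index_bytes) + (n_cols + 1) * index_bytes
    return "sparse" if sparse_cost < dense_cost else "dense"

print(compressed_format(10, 10, 5))    # mostly zeros -> sparse
print(compressed_format(10, 10, 100))  # fully populated -> dense
```

A row-major variant would swap the roles of rows and columns in the pointer term, which is why the issue asks for the orientation to be selectable.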