[jira] [Commented] (SPARK-17142) Complex query triggers binding error in HashAggregateExec

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489503#comment-15489503
 ] 

Apache Spark commented on SPARK-17142:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/15092

> Complex query triggers binding error in HashAggregateExec
> -
>
> Key: SPARK-17142
> URL: https://issues.apache.org/jira/browse/SPARK-17142
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Jiang Xingbo
>Priority: Blocker
> Fix For: 2.1.0
>
>
> The following example runs successfully on Spark 2.0.0 but fails in the 
> current master (as of b72bb62d421840f82d663c6b8e3922bd14383fbb, if not 
> earlier):
> {code}
> spark.sql("set spark.sql.crossJoin.enabled=true")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> sc.parallelize(Seq(0)).toDF("bigint_col_2").createOrReplaceTempView("table_2")
> val query = """
> SELECT
> ((t2.int_col) + (t1.bigint_col_2)) + ((t2.int_col) + (t1.bigint_col_2)) AS 
> int_col_1
> FROM table_2 t1
> INNER JOIN (
> SELECT
> LEAST(IF(False, LAG(0) OVER (ORDER BY t2.int_col_1 DESC), -230), 
> -991) AS int_col,
> (t2.int_col_1) + (t1.int_col_1) AS int_col_2,
> (t1.int_col_1) + (t2.int_col_1) AS int_col_3,
> t2.int_col_1
> FROM
> table_4 t1,
> table_4 t2
> GROUP BY
> (t1.int_col_1) + (t2.int_col_1),
> t2.int_col_1
> ) t2
> WHERE (t2.int_col_3) NOT IN (t2.int_col, t2.int_col_1)
> GROUP BY (t2.int_col) + (t1.bigint_col_2)
> """
> spark.sql(query).collect()
> {code}
> This fails with the following exception:
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: bigint_col_2#65
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:455)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:454)
>   at 
> 

[jira] [Updated] (SPARK-17460) Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception

2016-09-13 Thread Chris Perluss (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Perluss updated SPARK-17460:
--
Summary: Dataset.joinWith broadcasts gigabyte sized table, causes OOM 
Exception  (was: Dataset.joinWith causes OutOfMemory due to logicalPlan 
sizeInBytes being negative)

> Dataset.joinWith broadcasts gigabyte sized table, causes OOM Exception
> --
>
> Key: SPARK-17460
> URL: https://issues.apache.org/jira/browse/SPARK-17460
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Spark 2.0 in local mode as well as on GoogleDataproc
>Reporter: Chris Perluss
>
> Dataset.joinWith is performing a BroadcastJoin on a table that is gigabytes 
> in size because dataset.logicalPlan.statistics.sizeInBytes < 0.
> The issue is that org.apache.spark.sql.types.ArrayType.defaultSize is of 
> datatype Int. In my dataset, there is an Array column whose data size 
> exceeds the limits of an Int, so the size estimate becomes negative.
> The issue can be repeated by running this code in the REPL:
> val ds = (0 to 1).map( i => (i, Seq((i, Seq((i, "This is really not that 
> long of a string")))))).toDS()
> // You might have to remove private[sql] from Dataset.logicalPlan to get this 
> to work
> val stats = ds.logicalPlan.statistics
> which yields
> stats: org.apache.spark.sql.catalyst.plans.logical.Statistics = 
> Statistics(-1890686892,false)
> This causes joinWith to perform a broadcast join even though my data is 
> gigabytes in size, which of course causes the executors to run out of 
> memory.
> Setting spark.sql.autoBroadcastJoinThreshold=-1 does not help because 
> logicalPlan.statistics.sizeInBytes is a large negative number and is thus 
> less than the join threshold of -1.
> I've been able to work around this issue by setting 
> autoBroadcastJoinThreshold to a very large negative number.
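
The overflow described above can be illustrated without Spark at all. Here is a
minimal Scala sketch (the byte and row counts are made up, and this is not
Spark's actual size estimator) of how an Int-typed estimate goes negative and
how widening to Long avoids it:

{code}
// Hypothetical numbers: once rows * bytesPerRow exceeds Int.MaxValue, an
// Int-based size estimate wraps around to a negative value, which is the
// failure mode described in this issue.
val bytesPerRow: Int = 2048
val rows: Int = 2000000

val intEstimate: Int   = bytesPerRow * rows          // wraps: -198967296
val longEstimate: Long = bytesPerRow.toLong * rows   // widened first: 4096000000L

println(s"Int estimate:  $intEstimate")
println(s"Long estimate: $longEstimate")
{code}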



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17317) Add package vignette to SparkR

2016-09-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-17317:
--
Assignee: Junyang Qian

> Add package vignette to SparkR
> --
>
> Key: SPARK-17317
> URL: https://issues.apache.org/jira/browse/SPARK-17317
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>Assignee: Junyang Qian
> Fix For: 2.1.0
>
>
> In publishing SparkR to CRAN, it would be nice to have a vignette as a user 
> guide that
> * describes the big picture
> * introduces the use of various methods
> This is important for new users because they may not even know which method 
> to look up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17317) Add package vignette to SparkR

2016-09-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-17317.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14980
[https://github.com/apache/spark/pull/14980]

> Add package vignette to SparkR
> --
>
> Key: SPARK-17317
> URL: https://issues.apache.org/jira/browse/SPARK-17317
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
> Fix For: 2.1.0
>
>
> In publishing SparkR to CRAN, it would be nice to have a vignette as a user 
> guide that
> * describes the big picture
> * introduces the use of various methods
> This is important for new users because they may not even know which method 
> to look up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17073) generate basic stats for column

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489326#comment-15489326
 ] 

Apache Spark commented on SPARK-17073:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/15090

> generate basic stats for column
> ---
>
> Key: SPARK-17073
> URL: https://issues.apache.org/jira/browse/SPARK-17073
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> For a specified column, we need to generate basic stats including max, min, 
> number of nulls, number of distinct values, max column length, average column 
> length.
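
For a concrete picture of the stats listed above, here is a rough Scala sketch
that computes them with plain DataFrame aggregates. This is not the
optimizer-integrated implementation this sub-task proposes; the SparkSession
`spark` and the string column name `c` are assumptions for illustration.

{code}
import org.apache.spark.sql.functions._

// toy data with one string column named "c"
val df = spark.range(0, 1000).selectExpr("cast(id as string) as c")

df.agg(
  max(col("c")).as("max"),
  min(col("c")).as("min"),
  sum(when(col("c").isNull, 1).otherwise(0)).as("num_nulls"),
  countDistinct(col("c")).as("num_distinct"),
  max(length(col("c"))).as("max_col_len"),
  avg(length(col("c"))).as("avg_col_len")
).show()
{code}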



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17073) generate basic stats for column

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17073:


Assignee: (was: Apache Spark)

> generate basic stats for column
> ---
>
> Key: SPARK-17073
> URL: https://issues.apache.org/jira/browse/SPARK-17073
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> For a specified column, we need to generate basic stats including max, min, 
> number of nulls, number of distinct values, max column length, average column 
> length.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17073) generate basic stats for column

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17073:


Assignee: Apache Spark

> generate basic stats for column
> ---
>
> Key: SPARK-17073
> URL: https://issues.apache.org/jira/browse/SPARK-17073
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>Assignee: Apache Spark
>
> For a specified column, we need to generate basic stats including max, min, 
> number of nulls, number of distinct values, max column length, average column 
> length.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2016-09-13 Thread Ganesh Krishnan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489250#comment-15489250
 ] 

Ganesh Krishnan commented on SPARK-2365:


Is this only for RDDs, or can we use it with DataFrames? With Spark 2.0 
especially, the emphasis is on DataFrames, and we would like to use only 
DataFrames in our project.

> Add IndexedRDD, an efficient updatable key-value store
> --
>
> Key: SPARK-2365
> URL: https://issues.apache.org/jira/browse/SPARK-2365
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, Spark Core
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Attachments: 2014-07-07-IndexedRDD-design-review.pdf
>
>
> RDDs currently provide a bulk-updatable, iterator-based interface. This 
> imposes minimal requirements on the storage layer, which only needs to 
> support sequential access, enabling on-disk and serialized storage.
> However, many applications would benefit from a richer interface. Efficient 
> support for point lookups would enable serving data out of RDDs, but it 
> currently requires iterating over an entire partition to find the desired 
> element. Point updates similarly require copying an entire iterator. Joins 
> are also expensive, requiring a shuffle and local hash joins.
> To address these problems, we propose IndexedRDD, an efficient key-value 
> store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
> uniqueness and pre-indexing the entries for efficient joins and point 
> lookups, updates, and deletions.
> It would be implemented by (1) hash-partitioning the entries by key, (2) 
> maintaining a hash index within each partition, and (3) using purely 
> functional (immutable and efficiently updatable) data structures to enable 
> efficient modifications and deletions.
> GraphX would be the first user of IndexedRDD, since it currently implements a 
> limited form of this functionality in VertexRDD. We envision a variety of 
> other uses for IndexedRDD, including streaming updates to RDDs, direct 
> serving from RDDs, and as an execution strategy for Spark SQL.
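
As a rough illustration of the access pattern IndexedRDD targets (this is not
the proposed API, and `sc` is an assumed SparkContext): a point lookup on a
hash-partitioned pair RDD today already narrows the work to a single partition,
but still scans that partition linearly; IndexedRDD would replace the scan with
a per-partition index.

{code}
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize((0L until 100000L).map(k => (k, s"value-$k")))
  .partitionBy(new HashPartitioner(16))
  .cache()

// lookup() uses the partitioner to visit only the owning partition,
// then scans it sequentially for the key
val hit: Seq[String] = pairs.lookup(42424L)
println(hit)   // Seq("value-42424")
{code}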



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16460) Spark 2.0 CSV ignores NULL value in Date format

2016-09-13 Thread Liwei Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489193#comment-15489193
 ] 

Liwei Lin commented on SPARK-16460:
---

[~marcelboldt] Oh cool! Thanks for the feedback!

> Spark 2.0 CSV ignores NULL value in Date format
> ---
>
> Key: SPARK-16460
> URL: https://issues.apache.org/jira/browse/SPARK-16460
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: SparkR
>Reporter: Marcel Boldt
>Priority: Minor
>
> Trying to read into Spark (using SparkR) a CSV file containing just this data 
> row:
> {code}
> 1|1998-01-01||
> {code}
> Using Spark 1.6.2 (Hadoop 2.6) gives me 
> {code}
> > head(sdf)
>   id  d dtwo
> 1  1 1998-01-01   NA
> {code}
> Spark 2.0 preview (Hadoop 2.7, Rev. 14308) fails with error: 
> {panel}
> > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.text.ParseException: Unparseable date: ""
>   at java.text.DateFormat.parse(DateFormat.java:357)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:98)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVRelation$$anonfun$csvParser$3.apply(CSVRelation.scala:74)
>   at 
> org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
>   at 
> org.apache.spark.sql.execution.datasources.csv.DefaultSource$$anonfun$buildReader$1$$anonfun$apply$1.apply(DefaultSource.scala:124)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Itera...
> {panel}
> The problem does indeed seem to be the NULL value here, since it works with a 
> valid date in the third CSV column.
> R code:
> {code}
> #Sys.setenv(SPARK_HOME = 'c:/spark/spark-1.6.2-bin-hadoop2.6') 
> Sys.setenv(SPARK_HOME = 'C:/spark/spark-2.0.0-preview-bin-hadoop2.7')
> .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
> library(SparkR)
> 
> sc <-
> sparkR.init(
> master = "local",
> sparkPackages = "com.databricks:spark-csv_2.11:1.4.0"
> )
> sqlContext <- sparkRSQL.init(sc)
> 
> 
> st <- structType(structField("id", "integer"), structField("d", "date"), 
> structField("dtwo", "date"))
> 
> sdf <- read.df(
> sqlContext,
> path = "d:/date_test.csv",
> source = "com.databricks.spark.csv",
> schema = st,
> inferSchema = "false",
> delimiter = "|",
> dateFormat = "yyyy-MM-dd",
> nullValue = "",
> mode = "PERMISSIVE"
> )
> 
> head(sdf)
> 
> sparkR.stop()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489169#comment-15489169
 ] 

Reynold Xin edited comment on SPARK-15406 at 9/14/16 2:50 AM:
--

Finally back from vacation. FWIW, I want to cut the 2.0.1 RC in the next day or 
two (I sent an email almost a month ago about cutting 2.0.1), so I don't think 
the Kafka one will make it anyway. There are actual correctness bugs for which 
we need to roll out 2.0.1.



was (Author: rxin):
Finally back from vacation. FWIW, I want to cut 2.0.1 rc in the next day or two 
so I don't think the Kafka one will make it anyway. There are actual 
correctness bugs that we need to roll out 2.0.1.


> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489169#comment-15489169
 ] 

Reynold Xin commented on SPARK-15406:
-

Finally back from vacation. FWIW, I want to cut the 2.0.1 RC in the next day or 
two, so I don't think the Kafka one will make it anyway. There are actual 
correctness bugs for which we need to roll out 2.0.1.


> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489102#comment-15489102
 ] 

Cody Koeninger commented on SPARK-17510:


This would require a constructor change and another overload, which I'm 
personally not necessarily against, but I'd want to understand why it was 
necessary.

I know there was some list discussion, but I never heard back on your testing 
of the direct stream only with backpressure on. There had been some work done 
already on making sure backpressure didn't starve particular topic-partitions. 
With backpressure, max rate should only be necessary for making sure the 
initial RDD doesn't blow out any limits.

> Set Streaming MaxRate Independently For Multiple Streams
> 
>
> Key: SPARK-17510
> URL: https://issues.apache.org/jira/browse/SPARK-17510
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Jeff Nadler
>
> We use multiple DStreams coming from different Kafka topics in a Streaming 
> application.
> Some settings like maxrate and backpressure enabled/disabled would be better 
> passed as config to KafkaUtils.createStream and 
> KafkaUtils.createDirectStream, instead of setting them in SparkConf.
> Being able to set a different maxrate for different streams is an important 
> requirement for us; we currently work around the problem by using one 
> receiver-based stream and one direct stream.   
> We would like to be able to turn on backpressure for only one of the streams 
> as well.   
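
For context, a sketch of how these knobs are set today: application-wide on
SparkConf, which is exactly why per-stream settings are being requested (the
numeric values here are made up):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("streaming-rate-sketch")
  // applies to every receiver-based stream in the application
  .set("spark.streaming.receiver.maxRate", "1000")
  // applies to every partition of every direct Kafka stream
  .set("spark.streaming.kafka.maxRatePerPartition", "500")
  // backpressure is likewise an all-or-nothing, application-wide switch
  .set("spark.streaming.backpressure.enabled", "true")
{code}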



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6593) Provide option for HadoopRDD to skip corrupted files

2016-09-13 Thread Charles Pritchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1543#comment-1543
 ] 

Charles Pritchard commented on SPARK-6593:
--

Something appears to have changed between 2.0 and 1.5.1: in 2.0 I have files 
that fail with "Unexpected end of input stream", whereas 1.5.1 reads them 
without error. Those files also trigger exceptions with command-line 
zcat/gzip.

> Provide option for HadoopRDD to skip corrupted files
> 
>
> Key: SPARK-6593
> URL: https://issues.apache.org/jira/browse/SPARK-6593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.3.0
>Reporter: Dale Richardson
>Priority: Minor
>
> When reading a large number of gzip files from HDFS, e.g. with 
> sc.textFile("hdfs:///user/cloudera/logs*.gz"), if the Hadoop input libraries 
> report an exception then the entire job is canceled. As default behaviour 
> this is probably for the best, but in circumstances where you know it will be 
> OK, it would be nice to have the option to skip the corrupted file and 
> continue the job. 
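
Until such an option exists, one workaround is to validate each file separately
and only union the readable ones. A hedged sketch (not a built-in feature; `sc`
is the usual SparkContext, and it reads every good file twice, once to validate
and once for real, so it is only practical for modest file counts):

{code}
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

val fs = FileSystem.get(sc.hadoopConfiguration)
val paths = fs.listStatus(new Path("hdfs:///user/cloudera/"))
  .map(_.getPath.toString)
  .filter(_.endsWith(".gz"))

// force a full decompression of each file; drop the ones that throw
val readable = paths.filter(p => Try(sc.textFile(p).count()).isSuccess)

val logs = sc.union(readable.map(p => sc.textFile(p)))
{code}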



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15621) BatchEvalPythonExec fails with OOM

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15621:


Assignee: Apache Spark  (was: Davies Liu)

> BatchEvalPythonExec fails with OOM
> --
>
> Key: SPARK-15621
> URL: https://issues.apache.org/jira/browse/SPARK-15621
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Krisztian Szucs
>Assignee: Apache Spark
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40
> No matter what, the queue grows unboundedly and fails with OOM, even with an 
> identity `lambda x: x` UDF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15621) BatchEvalPythonExec fails with OOM

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15621:


Assignee: Davies Liu  (was: Apache Spark)

> BatchEvalPythonExec fails with OOM
> --
>
> Key: SPARK-15621
> URL: https://issues.apache.org/jira/browse/SPARK-15621
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Krisztian Szucs
>Assignee: Davies Liu
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40
> No matter what, the queue grows unboundedly and fails with OOM, even with an 
> identity `lambda x: x` UDF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15621) BatchEvalPythonExec fails with OOM

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488869#comment-15488869
 ] 

Apache Spark commented on SPARK-15621:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/15089

> BatchEvalPythonExec fails with OOM
> --
>
> Key: SPARK-15621
> URL: https://issues.apache.org/jira/browse/SPARK-15621
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Krisztian Szucs
>Assignee: Davies Liu
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40
> No matter what, the queue grows unboundedly and fails with OOM, even with an 
> identity `lambda x: x` UDF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14482) Change default compression codec for Parquet from gzip to snappy

2016-09-13 Thread Charles Pritchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488848#comment-15488848
 ] 

Charles Pritchard commented on SPARK-14482:
---

I don't think this fully made it into the manual; this impacts default 
compression for Parquet, but I think the manual still shows the default as gzip.

> Change default compression codec for Parquet from gzip to snappy
> 
>
> Key: SPARK-14482
> URL: https://issues.apache.org/jira/browse/SPARK-14482
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> Based on our tests, gzip decompression is very slow (< 100MB/s), making 
> queries decompression bound. Snappy can decompress at ~ 500MB/s on a single 
> core.
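
Whatever the default ends up being in a given release, the codec can always be
pinned explicitly; a small usage sketch (`spark` and `df` are assumed names,
not part of this change):

{code}
// session-wide default for Parquet writes
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

// or per write
df.write.option("compression", "snappy").parquet("/tmp/output-snappy")
{code}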



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client

2016-09-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17531:
-
Fix Version/s: 1.6.3

> Don't initialize Hive Listeners for the Execution Client
> 
>
> Key: SPARK-17531
> URL: https://issues.apache.org/jira/browse/SPARK-17531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> If a user provides listeners inside the Hive Conf, the configuration for 
> these listeners is passed to the Hive Execution Client as well. This may 
> cause issues for two reasons:
>  1. The Execution Client will actually generate garbage
>  2. The listener class needs to be both in the Spark Classpath and Hive 
> Classpath
> The configuration can be overwritten within 
> {code}
> HiveUtils.newTemporaryConfiguration
> {code}
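
A hedged sketch of the general idea (not the actual patch; the property names
below are the usual Hive listener keys and are listed here as assumptions):
when building the configuration handed to the execution client, drop the
listener entries so the execution-side client never tries to instantiate them.

{code}
// hypothetical helper, not HiveUtils itself
val listenerKeys = Set(
  "hive.metastore.event.listeners",
  "hive.metastore.pre.event.listeners",
  "hive.metastore.end.function.listeners"
)

def withoutHiveListeners(conf: Map[String, String]): Map[String, String] =
  conf.filterKeys(k => !listenerKeys.contains(k)).toMap
{code}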



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17530) Add Statistics into DESCRIBE FORMATTED

2016-09-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17530.
---
   Resolution: Fixed
 Assignee: Xiao Li
Fix Version/s: 2.1.0

> Add Statistics into DESCRIBE FORMATTED
> --
>
> Key: SPARK-17530
> URL: https://issues.apache.org/jira/browse/SPARK-17530
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> Statistics is missing in the output of `DESCRIBE FORMATTED`. We should add it 
> and also improve the output of statistics in `DESCRIBE EXTENDED`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488642#comment-15488642
 ] 

Cody Koeninger commented on SPARK-15406:


1. How can we avoid duplicate work like this? There were months of radio 
silence, then Reynold finally said on list that no one had gotten to it, so I 
said I'd be happy to work on it. Now you post a design doc with a presumptive, 
already-upcoming PR for your preferred implementation. You guys have gotten a 
lot of flak for not having open development, and this process isn't helping. 
Are you taking anything in my work into account?

2. Subscribe and assign still need structured arguments, not strings

3. Prefixing keys with kafka is still a hack that would be avoided with actual 
structured configuration

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17445) Reference an ASF page as the main place to find third-party packages

2016-09-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-17445:
-
Target Version/s: 2.0.1, 2.1.0

> Reference an ASF page as the main place to find third-party packages
> 
>
> Key: SPARK-17445
> URL: https://issues.apache.org/jira/browse/SPARK-17445
> Project: Spark
>  Issue Type: Improvement
>Reporter: Matei Zaharia
>
> Some comments and docs like 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151
>  say to go to spark-packages.org, but since this is a package index 
> maintained by a third party, it would be better to reference an ASF page that 
> we can keep updated and own the URL for.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17097) Pregel does not keep vertex state properly; fails to terminate

2016-09-13 Thread Kevin Rossi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488612#comment-15488612
 ] 

Kevin Rossi commented on SPARK-17097:
-

I am away and I will return on 16 September 2016.

Thank you,
Kevin
Kevin Rossi | ross...@llnl.gov | Office: 925-423-6737


> Pregel does not keep vertex state properly; fails to terminate 
> ---
>
> Key: SPARK-17097
> URL: https://issues.apache.org/jira/browse/SPARK-17097
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
> Environment: Scala 2.10.5, Spark 1.6.0 with GraphX and Pregel
>Reporter: Seth Bromberger
>
> Consider the following minimum example:
> {code:title=PregelBug.scala|borderStyle=solid}
> package testGraph
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.graphx.{Edge, EdgeTriplet, Graph, _}
> object PregelBug {
>   def main(args: Array[String]) = {
> //FIXME breaks if TestVertex is a case class; works if not case class
> case class TestVertex(inId: VertexId,
>  inData: String,
>  inLabels: collection.mutable.HashSet[String]) extends 
> Serializable {
>   val id = inId
>   val value = inData
>   val labels = inLabels
> }
> class TestLink(inSrc: VertexId, inDst: VertexId, inData: String) extends 
> Serializable  {
>   val src = inSrc
>   val dst = inDst
>   val data = inData
> }
> val startString = "XXXSTARTXXX"
> val conf = new SparkConf().setAppName("pregeltest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val vertexes = Vector(
>   new TestVertex(0, "label0", collection.mutable.HashSet[String]()),
>   new TestVertex(1, "label1", collection.mutable.HashSet[String]())
> )
> val links = Vector(
>   new TestLink(0, 1, "linkData01")
> )
> val vertexes_packaged = vertexes.map(v => (v.id, v))
> val links_packaged = links.map(e => Edge(e.src, e.dst, e))
> val graph = Graph[TestVertex, 
> TestLink](sc.parallelize(vertexes_packaged), sc.parallelize(links_packaged))
> def vertexProgram (vertexId: VertexId, vdata: TestVertex, message: 
> Vector[String]): TestVertex = {
>   message.foreach {
> case `startString` =>
>   if (vdata.id == 0L)
> vdata.labels.add(vdata.value)
> case m =>
>   if (!vdata.labels.contains(m))
> vdata.labels.add(m)
>   }
>   new TestVertex(vdata.id, vdata.value, vdata.labels)
> }
> def sendMessage (triplet: EdgeTriplet[TestVertex, TestLink]): 
> Iterator[(VertexId, Vector[String])] = {
>   val srcLabels = triplet.srcAttr.labels
>   val dstLabels = triplet.dstAttr.labels
>   val msgsSrcDst = srcLabels.diff(dstLabels)
> .map(label => (triplet.dstAttr.id, Vector[String](label)))
>   val msgsDstSrc = dstLabels.diff(dstLabels)
> .map(label => (triplet.srcAttr.id, Vector[String](label)))
>   msgsSrcDst.toIterator ++ msgsDstSrc.toIterator
> }
> def mergeMessage (m1: Vector[String], m2: Vector[String]): Vector[String] 
> = m1.union(m2).distinct
> val g = graph.pregel(Vector[String](startString))(vertexProgram, 
> sendMessage, mergeMessage)
> println("---pregel done---")
> println("vertex info:")
> g.vertices.foreach(
>   v => {
> val labels = v._2.labels
> println(
>   "vertex " + v._1 +
> ": name = " + v._2.id +
> ", labels = " + labels)
>   }
> )
>   }
> }
> {code}
> This code never terminates even though we expect it to. To fix, we simply 
> remove the "case" designation for the TestVertex class (see FIXME comment), 
> and then it behaves as expected.
> (Apologies if this has been fixed in later versions; we're unfortunately 
> pegged to 2.10.5 / 1.6.0 for now.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17097) Pregel does not keep vertex state properly; fails to terminate

2016-09-13 Thread ding (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488609#comment-15488609
 ] 

ding commented on SPARK-17097:
--

I am afraid that the attached sample code failing to terminate with a case 
class is not caused by a Pregel or GraphX bug. In the sample code the vertex 
attribute (inLabels here) is updated in place, which is not supposed to happen. 
Because of that, after the transform operations the original vertices are also 
updated, and when VD is a case class they compare equal to the newly generated 
vertices. The EdgeTriplet is therefore never updated, so dstLabels is always 
empty while srcLabels contains a value, and there is always an active message, 
which keeps Pregel from terminating.

One way to fix the problem is to remove the in-place update in vertexProgram by 
cloning the labels and applying the update to the clone. I have tried the code 
below and it works:

def vertexProgram(vertexId: VertexId, vdata: TestVertex, message: Vector[String]): TestVertex = {
  // work on a copy so the original vertex attribute is never mutated in place
  val labels = vdata.labels.clone()
  message.foreach {
    case `startString` =>
      if (vdata.id == 0L)
        labels.add(vdata.value)
    case m =>
      if (!vdata.labels.contains(m))
        labels.add(m)
  }
  new TestVertex(vdata.id, vdata.value, labels)
}

Hope this information is helpful to you.

> Pregel does not keep vertex state properly; fails to terminate 
> ---
>
> Key: SPARK-17097
> URL: https://issues.apache.org/jira/browse/SPARK-17097
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.0
> Environment: Scala 2.10.5, Spark 1.6.0 with GraphX and Pregel
>Reporter: Seth Bromberger
>
> Consider the following minimum example:
> {code:title=PregelBug.scala|borderStyle=solid}
> package testGraph
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.graphx.{Edge, EdgeTriplet, Graph, _}
> object PregelBug {
>   def main(args: Array[String]) = {
> //FIXME breaks if TestVertex is a case class; works if not case class
> case class TestVertex(inId: VertexId,
>  inData: String,
>  inLabels: collection.mutable.HashSet[String]) extends 
> Serializable {
>   val id = inId
>   val value = inData
>   val labels = inLabels
> }
> class TestLink(inSrc: VertexId, inDst: VertexId, inData: String) extends 
> Serializable  {
>   val src = inSrc
>   val dst = inDst
>   val data = inData
> }
> val startString = "XXXSTARTXXX"
> val conf = new SparkConf().setAppName("pregeltest").setMaster("local[*]")
> val sc = new SparkContext(conf)
> val vertexes = Vector(
>   new TestVertex(0, "label0", collection.mutable.HashSet[String]()),
>   new TestVertex(1, "label1", collection.mutable.HashSet[String]())
> )
> val links = Vector(
>   new TestLink(0, 1, "linkData01")
> )
> val vertexes_packaged = vertexes.map(v => (v.id, v))
> val links_packaged = links.map(e => Edge(e.src, e.dst, e))
> val graph = Graph[TestVertex, 
> TestLink](sc.parallelize(vertexes_packaged), sc.parallelize(links_packaged))
> def vertexProgram (vertexId: VertexId, vdata: TestVertex, message: 
> Vector[String]): TestVertex = {
>   message.foreach {
> case `startString` =>
>   if (vdata.id == 0L)
> vdata.labels.add(vdata.value)
> case m =>
>   if (!vdata.labels.contains(m))
> vdata.labels.add(m)
>   }
>   new TestVertex(vdata.id, vdata.value, vdata.labels)
> }
> def sendMessage (triplet: EdgeTriplet[TestVertex, TestLink]): 
> Iterator[(VertexId, Vector[String])] = {
>   val srcLabels = triplet.srcAttr.labels
>   val dstLabels = triplet.dstAttr.labels
>   val msgsSrcDst = srcLabels.diff(dstLabels)
> .map(label => (triplet.dstAttr.id, Vector[String](label)))
>   val msgsDstSrc = dstLabels.diff(dstLabels)
> .map(label => (triplet.srcAttr.id, Vector[String](label)))
>   msgsSrcDst.toIterator ++ msgsDstSrc.toIterator
> }
> def mergeMessage (m1: Vector[String], m2: Vector[String]): Vector[String] 
> = m1.union(m2).distinct
> val g = graph.pregel(Vector[String](startString))(vertexProgram, 
> sendMessage, mergeMessage)
> println("---pregel done---")
> println("vertex info:")
> g.vertices.foreach(
>   v => {
> val labels = v._2.labels
> println(
>   "vertex " + v._1 +
> ": name = " + v._2.id +
> ", labels = " + labels)
>   }
> )
>   }
> }
> {code}
> This code never terminates even though we expect it to. To fix, we simply 
> remove the "case" designation for the TestVertex class (see FIXME comment), 

[jira] [Resolved] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client

2016-09-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-17531.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 15086
[https://github.com/apache/spark/pull/15086]

> Don't initialize Hive Listeners for the Execution Client
> 
>
> Key: SPARK-17531
> URL: https://issues.apache.org/jira/browse/SPARK-17531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
> Fix For: 2.0.1, 2.1.0
>
>
> If a user provides listeners inside the Hive Conf, the configuration for 
> these listeners is passed to the Hive Execution Client as well. This may 
> cause issues for two reasons:
>  1. The Execution Client will actually generate garbage
>  2. The listener class needs to be both in the Spark Classpath and Hive 
> Classpath
> The configuration can be overwritten within 
> {code}
> HiveUtils.newTemporaryConfiguration
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client

2016-09-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-17531:
-
Assignee: Burak Yavuz

> Don't initialize Hive Listeners for the Execution Client
> 
>
> Key: SPARK-17531
> URL: https://issues.apache.org/jira/browse/SPARK-17531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 2.0.1, 2.1.0
>
>
> If a user provides listeners inside the Hive Conf, the configuration for 
> these listeners is passed to the Hive Execution Client as well. This may 
> cause issues for two reasons:
>  1. The Execution Client will actually generate garbage
>  2. The listener class needs to be both in the Spark Classpath and Hive 
> Classpath
> The configuration can be overwritten within 
> {code}
> HiveUtils.newTemporaryConfiguration
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17532) Add thread lock information from JMX to thread dump UI

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17532:


Assignee: (was: Apache Spark)

> Add thread lock information from JMX to thread dump UI
> --
>
> Key: SPARK-17532
> URL: https://issues.apache.org/jira/browse/SPARK-17532
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Ryan Blue
>
> It would be much easier to track down issues like SPARK-15725 if the thread 
> dump page in the web UI displayed debugging info about what locks are 
> currently held by threads. This is already available through the 
> [ThreadMXBean|https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html]
>  that is used to get stack traces, thread ids, and names.
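
A minimal sketch of pulling lock information from that same bean (illustrative
only, not the UI patch itself):

{code}
import java.lang.management.ManagementFactory

val bean = ManagementFactory.getThreadMXBean
// the two flags request locked monitors and locked ownable synchronizers
val infos = bean.dumpAllThreads(true, true)

infos.foreach { ti =>
  val held = ti.getLockedMonitors.map(_.toString).mkString(", ")
  val waitingOn = Option(ti.getLockName).getOrElse("-")
  println(s"${ti.getThreadName} (id=${ti.getThreadId}) waiting on $waitingOn; holds [$held]")
}
{code}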



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17532) Add thread lock information from JMX to thread dump UI

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488601#comment-15488601
 ] 

Apache Spark commented on SPARK-17532:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/15088

> Add thread lock information from JMX to thread dump UI
> --
>
> Key: SPARK-17532
> URL: https://issues.apache.org/jira/browse/SPARK-17532
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Ryan Blue
>
> It would be much easier to track down issues like SPARK-15725 if the thread 
> dump page in the web UI displayed debugging info about what locks are 
> currently held by threads. This is already available through the 
> [ThreadMXBean|https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html]
>  that is used to get stack traces, thread ids, and names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17532) Add thread lock information from JMX to thread dump UI

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17532:


Assignee: Apache Spark

> Add thread lock information from JMX to thread dump UI
> --
>
> Key: SPARK-17532
> URL: https://issues.apache.org/jira/browse/SPARK-17532
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Ryan Blue
>Assignee: Apache Spark
>
> It would be much easier to track down issues like SPARK-15725 if the thread 
> dump page in the web UI displayed debugging info about what locks are 
> currently held by threads. This is already available through the 
> [ThreadMXBean|https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html]
>  that is used to get stack traces, thread ids, and names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488595#comment-15488595
 ] 

Tathagata Das commented on SPARK-15406:
---

1. We will have a PR very soon so that you can start playing with it. In fact, 
it would be awesome if you can play with it and give feedback. 

2. Version 1 will have just the Subscribe, SubscribePattern and Assign 
strategies, without the ability to specify starting offsets. Maybe this wasn't 
super clear in the design, and it will be clear when you look at the actual 
code.

3. JSON stuff - Version 1 will not have the ability to specify starting 
offsets, so those hacks are not there. That's precisely what I wanted to do 
with version 1: not have anything that cannot be supported in the future.

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488558#comment-15488558
 ] 

Cody Koeninger commented on SPARK-15406:


So you're saying the types of K and V will always be byte arrays, custom Kafka 
deserializers won't be used, and the message schema will be handled by mapping 
after the fact? That's potentially OK, but I'd feel a lot better with a working 
example. I can play around with that a bit.

Regarding the statefulness of Kafka consumers, I just mean you can't construct 
a Kafka consumer that is already subscribed and has correct positions; you need 
to instantiate it and then call side-effecting operations on it in order to 
get it into the right state.

Regarding JSON convenience constructors, that's not purely additive. If we know 
that's the goal, we shouldn't use hacks like comma-separated strings or 
prefixed keys now and get people relying on them.

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17532) Add thread lock information from JMX to thread dump UI

2016-09-13 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-17532:
-

 Summary: Add thread lock information from JMX to thread dump UI
 Key: SPARK-17532
 URL: https://issues.apache.org/jira/browse/SPARK-17532
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Ryan Blue


It would be much easier to track down issues like SPARK-15725 if the thread 
dump page in the web UI displayed debugging info about what locks are currently 
held by threads. This is already available through the 
[ThreadMXBean|https://docs.oracle.com/javase/7/docs/api/java/lang/management/ThreadMXBean.html]
 that is used to get stack traces, thread ids, and names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2016-09-13 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488528#comment-15488528
 ] 

Dhruve Ashar commented on SPARK-16441:
--

So all of these are related in one way or another. I will share my 
observations so far:

I have taken multiple thread dumps and analyzed what is causing the 
SparkListenerBus to block. The ExecutorAllocationManager is heavily 
synchronized and is a hotspot for contention, especially when schedule is 
invoked every 100ms to balance the executors. So for a Spark job running 
thousands of executors, any change in the status of these executors with 
dynamic allocation enabled leads to frequent firing of events. This causes 
contention and leads to blocking, specifically when the calls are remote RPCs.

Also, the current design of the listener bus is such that it waits for every 
listener to process an event before it can deliver the next event from the 
queue. Any wait to acquire the locks causes the event queue to fill up fast 
and leads to events being dropped.

Logging individual execution times does not necessarily show updateAndSync 
consuming the majority of the time, so we are working on reducing the lock 
contention and minimizing the RPC calls. Another parameter to look at is the 
heartbeat interval: a small interval with a very large number of executors 
will aggravate the problem, as you would be getting frequent 
ExecutorMetricsUpdate events from the running executors.



> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> spark application are waiting for rpc response all the time and spark 
> listener are blocked by dynamic allocation. Executors can not connect to 
> driver and lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> 

[jira] [Commented] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488498#comment-15488498
 ] 

Apache Spark commented on SPARK-17531:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15087

> Don't initialize Hive Listeners for the Execution Client
> 
>
> Key: SPARK-17531
> URL: https://issues.apache.org/jira/browse/SPARK-17531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>
> If a user provides listeners inside the Hive Conf, the configuration for 
> these listeners are passed to the Hive Execution Client as well. This may 
> cause issues for two reasons:
>  1. The Execution Client will actually generate garbage
>  2. The listener class needs to be both in the Spark Classpath and Hive 
> Classpath
> The configuration can be overwritten within 
> {code}
> HiveUtils.newTemporaryConfiguration
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488496#comment-15488496
 ] 

Michael Armbrust commented on SPARK-15406:
--

For the types that are coming out, the SQL way would be to do conversions using 
either expressions, UDFs, or {{as}} (on Datasets with encoders). This way the 
schema would default to binary, but you can convert to the actual format after 
the DataFrame is returned.
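
A minimal sketch of that conversion approach, assuming a streaming DataFrame 
named {{kafkaDF}} with binary {{key}} and {{value}} columns and a SparkSession 
named {{spark}} (the names are illustrative, not from this thread):

{code}
import org.apache.spark.sql.functions.col
import spark.implicits._   // assumes a SparkSession named `spark`

// Expressions: cast the binary columns to the actual types.
val asStrings = kafkaDF.select(
  col("key").cast("string").as("key"),
  col("value").cast("string").as("value"))

// Or Datasets with encoders, once the columns have usable types.
val records = asStrings.as[(String, String)]
{code}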

I'm not sure I understand the reasons why a consumer would be stateful or how 
that is compatible with the model proposed in structured streaming.  Can you 
explain this use case further?

For configuration, I think it's totally reasonable to have some way to pass in 
configuration that does not involve constructing JSON strings manually. That's 
an example of something that is purely additive.

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488477#comment-15488477
 ] 

Michael Armbrust commented on SPARK-15406:
--

Streaming is labeled experimental, so we can continue to change it during the 
2.x.x line of Spark. In fact, we are even adding features during 2.0.x. At the 
same time, I would hope that we don't have to throw everything out after we 
release something to users.

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Ofir Manor (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488487#comment-15488487
 ] 

Ofir Manor commented on SPARK-15406:


Cody, I think you are right. Now is the right time to spend several days 
iterating over the design. Committing a new, half-baked API that will need to 
be maintained for years is exactly what we all try to avoid.
Of course, v1 doesn't have to implement a lot of features, but it should be 
future-compatible with the rest of the planned improvements.
(As a side note - I haven't seen a timeline for 2.0.1 being discussed, so I'm 
not sure what the expectations are regarding "v1" delivery.)

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2016-09-13 Thread Ashwin Shankar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488465#comment-15488465
 ] 

Ashwin Shankar commented on SPARK-16441:


Hey, thanks! I don't think either of the patches would help, unfortunately. 
The first patch is related to preemption, and I don't see the WARN message 
mentioned in that JIRA's description in my logs. The second patch, which 
you're working on, is related to fixing ExecutorAllocationManager#removeExecutor. 
However, the problem I see, and what [~cenyuhai] has described in this ticket, 
is that ExecutorAllocationManager#updateAndSyncNumExecutorsTarget does not 
release the lock and blocks the SparkListenerBus thread. 
I'm debugging this on my side as well, but do you know why 
ExecutorAllocationManager#updateAndSyncNumExecutorsTarget gets stuck talking to 
the cluster manager? 

> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> The Spark application waits for an RPC response all the time and the Spark 
> listener is blocked by dynamic allocation. Executors cannot connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> 

[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488452#comment-15488452
 ] 

Cody Koeninger edited comment on SPARK-15406 at 9/13/16 9:18 PM:
-

So I asked this twice with no answer, so I'll ask it a third time.

Are you willing to later throw out whatever ships with 2.0.1 in order to do 
things right?

If not, that would certainly be a concern that could prevent us from doing 
stuff in 2.


was (Author: c...@koeninger.org):
So if I asked this twice with no answer, so I'll ask it a third time.

Are you willing to later throw out whatever ships with 2.0.1 in order to do 
things right?

If not, that would certainly be a concern that could prevent us from doing 
stuff in 2.

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488452#comment-15488452
 ] 

Cody Koeninger commented on SPARK-15406:


So I asked this twice with no answer, so I'll ask it a third time.

Are you willing to later throw out whatever ships with 2.0.1 in order to do 
things right?

If not, that would certainly be a concern that could prevent us from doing 
stuff in 2.

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488420#comment-15488420
 ] 

Tathagata Das edited comment on SPARK-15406 at 9/13/16 9:04 PM:


[Combining the comments in the doc and on the JIRA]
[~c...@koeninger.org]
Thank you very much for highlighting the issues in more detail. I agree that 
string-to-string is NOT SUFFICIENT, and configurations may need to be passed 
in non-string form. In fact, Kafka allows 
[string-to-object|http://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#KafkaConsumer(java.util.Map)]
 configs to be passed, and some configurations do need to be passed as 
non-string objects (integers, etc.). So eventually, it's very likely that we 
will have to come up with a solution that accommodates this. All I am 
suggesting now is:

1. Let's put in something simple that works generically across all languages, 
even if it is not complete. 

2. After that, let's discuss and design something that makes it complete, and 
this may include new APIs, language-specific features, etc.

Rather than trying to come up with a complete solution and delaying the 
release, I feel this approach lets us give the community something that covers 
the most common use cases as soon as possible. Even if it is incomplete, the 
community can start testing ASAP and provide more concrete feedback on what it 
would take to make this feature complete.

Do you think this plan is okay? If so, do you think there is anything in the 
design for 1 that prevents us from doing stuff in 2?



was (Author: tdas):
[Combining the comments in the doc and on the JIRA]

Thank you very much for highlighting the issues in more details. And I agree 
that string-string is NOT SUFFICIENT, and configurations may need to be passed 
in non-string form. In fact, kafka configurations allow 
[string-to-object|http://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#KafkaConsumer(java.util.Map)]
 configs to be passed, and some configuration do need to be passed as 
non-string objects (integers, etc.). So eventually, its very likely that we 
have to come up with a solution that accommodates such stuff. All I am 
suggesting now is 

1. Let's put in something simple that works generically across all languages, 
but that may not be complete. 

2. After that, let's discuss and design something that makes it complete, and 
this may include new APIs, language-specific stuff, etc.

Rather than trying come up with a complete solutions and delay the release, I 
feel that this approach allows us to give something to the community with most 
common usecases, as soon as possible. Even if it is incomplete, the community 
can start testing ASAP and providing more concrete feedback on what it would 
take to make this feature complete.

Do you think this plan is okay? If so, do you think there is anything in the 
design for 1, that prevents us from doing stuff in 2?


> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488420#comment-15488420
 ] 

Tathagata Das commented on SPARK-15406:
---

[Combining the comments in the doc and on the JIRA]

Thank you very much for highlighting the issues in more detail. I agree that 
string-to-string is NOT SUFFICIENT, and configurations may need to be passed 
in non-string form. In fact, Kafka allows 
[string-to-object|http://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#KafkaConsumer(java.util.Map)]
 configs to be passed, and some configurations do need to be passed as 
non-string objects (integers, etc.). So eventually, it's very likely that we 
will have to come up with a solution that accommodates this. All I am 
suggesting now is:

1. Let's put in something simple that works generically across all languages, 
even if it is not complete. 

2. After that, let's discuss and design something that makes it complete, and 
this may include new APIs, language-specific features, etc.

Rather than trying to come up with a complete solution and delaying the 
release, I feel this approach lets us give the community something that covers 
the most common use cases as soon as possible. Even if it is incomplete, the 
community can start testing ASAP and provide more concrete feedback on what it 
would take to make this feature complete.

Do you think this plan is okay? If so, do you think there is anything in the 
design for 1 that prevents us from doing stuff in 2?
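
For reference, a small sketch of the string-to-object consumer configs linked 
above. This is the plain Kafka 0.10 consumer API, not a Spark API proposed in 
this thread; the parameter values are made up:

{code}
import java.util.{HashMap => JHashMap}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.ByteArrayDeserializer

// Values are Objects, not Strings: integers, booleans, and so on.
val params = new JHashMap[String, Object]()
params.put("bootstrap.servers", "broker1:9092")
params.put("max.poll.records", Integer.valueOf(500))
params.put("enable.auto.commit", java.lang.Boolean.FALSE)

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](
  params, new ByteArrayDeserializer, new ByteArrayDeserializer)
{code}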


> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17531:


Assignee: Apache Spark

> Don't initialize Hive Listeners for the Execution Client
> 
>
> Key: SPARK-17531
> URL: https://issues.apache.org/jira/browse/SPARK-17531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> If a user provides listeners inside the Hive Conf, the configuration for 
> these listeners are passed to the Hive Execution Client as well. This may 
> cause issues for two reasons:
>  1. The Execution Client will actually generate garbage
>  2. The listener class needs to be both in the Spark Classpath and Hive 
> Classpath
> The configuration can be overwritten within 
> {code}
> HiveUtils.newTemporaryConfiguration
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488337#comment-15488337
 ] 

Apache Spark commented on SPARK-17531:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/15086

> Don't initialize Hive Listeners for the Execution Client
> 
>
> Key: SPARK-17531
> URL: https://issues.apache.org/jira/browse/SPARK-17531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>
> If a user provides listeners inside the Hive Conf, the configuration for 
> these listeners are passed to the Hive Execution Client as well. This may 
> cause issues for two reasons:
>  1. The Execution Client will actually generate garbage
>  2. The listener class needs to be both in the Spark Classpath and Hive 
> Classpath
> The configuration can be overwritten within 
> {code}
> HiveUtils.newTemporaryConfiguration
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17531:


Assignee: (was: Apache Spark)

> Don't initialize Hive Listeners for the Execution Client
> 
>
> Key: SPARK-17531
> URL: https://issues.apache.org/jira/browse/SPARK-17531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>
> If a user provides listeners inside the Hive Conf, the configuration for 
> these listeners are passed to the Hive Execution Client as well. This may 
> cause issues for two reasons:
>  1. The Execution Client will actually generate garbage
>  2. The listener class needs to be both in the Spark Classpath and Hive 
> Classpath
> The configuration can be overwritten within 
> {code}
> HiveUtils.newTemporaryConfiguration
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488312#comment-15488312
 ] 

Cody Koeninger edited comment on SPARK-15406 at 9/13/16 8:26 PM:
-

Unless I'm misunderstanding, you answered regarding a source, not a source 
provider.  How does an end user pass in a parameterized type through this 
interface to indicate that their keys are e.g. longs, or MyType?  If this is 
easy, can you show a code example?

Using reflection off strings for ConsumerStrategy means it must have a zero arg 
constructor, which is doable but again awkward compared to a serialized 
instance.

If you agree json isn't going to cut it, are you actually going to agree to 
make changes, or is this going to be another 'well we're stuck with it because 
binary compatibility'? 


was (Author: c...@koeninger.org):
Unless I'm misunderstanding, you answered regarding a source, not a source 
provider.  How does an end user pass in a parameterized type through this 
interface to indicate that their keys are e.g. longs, or MyType?  If this is 
easy, can you show a code example?

Using reflection off strings for ConsumerStrategy means it must have a zero arg 
constructor, which is doable but again awkward compared to a serialized 
instance.

If you agree neon isn't going to cut it, are you actually going to agree to 
make changes, or is this going to be another 'well we're stuck with it because 
binary compatibility'? 

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488312#comment-15488312
 ] 

Cody Koeninger commented on SPARK-15406:


Unless I'm misunderstanding, you answered regarding a source, not a source 
provider.  How does an end user pass in a parameterized type through this 
interface to indicate that their keys are e.g. longs, or MyType?  If this is 
easy, can you show a code example?

Using reflection off strings for ConsumerStrategy means it must have a zero arg 
constructor, which is doable but again awkward compared to a serialized 
instance.

If you agree json isn't going to cut it, are you actually going to agree to 
make changes, or is this going to be another 'well we're stuck with it because 
binary compatibility'? 
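
To make the constraint concrete: loading a strategy by class name via 
reflection forces a zero-arg constructor, while passing a constructed instance 
does not. The trait and class names below are made up for illustration and are 
not a proposed API:

{code}
trait ConsumerStrategy

// Reflection-friendly: needs a zero-arg constructor.
class SubscribeFixedTopics extends ConsumerStrategy

// Loading from a config string (here derived from the class for brevity):
val className = classOf[SubscribeFixedTopics].getName
val byName = Class.forName(className)
  .getDeclaredConstructor()
  .newInstance()
  .asInstanceOf[ConsumerStrategy]

// Passing a (possibly parameterized) instance has no such restriction:
class SubscribePattern(val pattern: String) extends ConsumerStrategy
def buildSource(strategy: ConsumerStrategy): Unit = ()   // placeholder
buildSource(new SubscribePattern("topic-.*"))
{code}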

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17484) Race condition when cancelling a job during a cache write can lead to block fetch failures

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17484:


Assignee: Josh Rosen  (was: Apache Spark)

> Race condition when cancelling a job during a cache write can lead to block 
> fetch failures
> --
>
> Key: SPARK-17484
> URL: https://issues.apache.org/jira/browse/SPARK-17484
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> On a production cluster, I observed the following weird behavior where a 
> block manager cached a block, the store failed due to a task being killed / 
> cancelled, and then a subsequent task incorrectly attempted to read the 
> cached block from the machine where the write failed, eventually leading to a 
> complete job failure.
> Here's the executor log snippet from the machine performing the failed cache:
> {code}
> 16/09/06 16:10:31 INFO MemoryStore: Block rdd_25_1 stored as values in memory 
> (estimated size 976.8 MB, free 9.8 GB)
> 16/09/06 16:10:31 WARN BlockManager: Putting block rdd_25_1 failed
> 16/09/06 16:10:31 INFO Executor: Executor killed task 0.0 in stage 3.0 (TID 
> 127)
> {code}
> Here's the exception from the reader in the failed job:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
> stage 46.0 failed 4 times, most recent failure: Lost task 4.3 in stage 46.0 
> (TID 1484, 10.69.255.197): org.apache.spark.storage.BlockFetchException: 
> Failed to fetch block after 1 fetch failures. Most recent failure cause:
>   at 
> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:565)
>   at 
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522)
>   at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
> {code}
> I believe that there's a race condition in how we handle cleanup after failed 
> cache stores. Here's an excerpt from {{BlockManager.doPut()}}
> {code}
> var blockWasSuccessfullyStored: Boolean = false
> val result: Option[T] = try {
>   val res = putBody(putBlockInfo)
>   blockWasSuccessfullyStored = res.isEmpty
>   res
> } finally {
>   if (blockWasSuccessfullyStored) {
> if (keepReadLock) {
>   blockInfoManager.downgradeLock(blockId)
> } else {
>   blockInfoManager.unlock(blockId)
> }
>   } else {
> blockInfoManager.removeBlock(blockId)
> logWarning(s"Putting block $blockId failed")
>   }
>   }
> {code}
> The only way that I think this "successfully stored followed by immediately 
> failed" case could appear in our logs is if the local memory store write 
> succeeds and then an exception (perhaps InterruptedException) causes us to 
> enter the {{finally}} block's error-cleanup path. The problem is that the 
> {{finally}} block only cleans up the block's metadata rather than performing 
> the full cleanup path which would also notify the master that the block is no 
> longer available at this host.
> The fact that the Spark task was not resilient in the face of remote block 
> fetches is a separate issue which I'll report and fix separately. The scope 
> of this JIRA, however, is the fact that Spark still attempted reads from a 
> machine which was missing the block.
> In order to fix this problem, I think that the {{finally}} block should 
> perform more thorough cleanup and should send a "block removed" status update 
> to the master following any error during the write. This is necessary because 
> the body of {{doPut()}} may have already notified the master of block 
> availability. In addition, we can extend the block serving code path to 
> automatically update the master with "block deleted" statuses whenever the 
> block manager receives invalid requests for blocks that it doesn't have.
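
A hedged sketch of the direction proposed above (mirroring the excerpt, inside 
{{doPut()}}), not the actual patch: on failure, perform a removal that also 
notifies the master, assuming {{removeBlock(blockId, tellMaster = true)}} sends 
the "block removed" status update.

{code}
val result: Option[T] = try {
  val res = putBody(putBlockInfo)
  blockWasSuccessfullyStored = res.isEmpty
  res
} finally {
  if (blockWasSuccessfullyStored) {
    if (keepReadLock) {
      blockInfoManager.downgradeLock(blockId)
    } else {
      blockInfoManager.unlock(blockId)
    }
  } else {
    // Full cleanup: drop the block and tell the master it is gone, instead of
    // only removing the local metadata entry.
    removeBlock(blockId, tellMaster = true)
    logWarning(s"Putting block $blockId failed")
  }
}
{code}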



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17484) Race condition when cancelling a job during a cache write can lead to block fetch failures

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17484:


Assignee: Apache Spark  (was: Josh Rosen)

> Race condition when cancelling a job during a cache write can lead to block 
> fetch failures
> --
>
> Key: SPARK-17484
> URL: https://issues.apache.org/jira/browse/SPARK-17484
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> On a production cluster, I observed the following weird behavior where a 
> block manager cached a block, the store failed due to a task being killed / 
> cancelled, and then a subsequent task incorrectly attempted to read the 
> cached block from the machine where the write failed, eventually leading to a 
> complete job failure.
> Here's the executor log snippet from the machine performing the failed cache:
> {code}
> 16/09/06 16:10:31 INFO MemoryStore: Block rdd_25_1 stored as values in memory 
> (estimated size 976.8 MB, free 9.8 GB)
> 16/09/06 16:10:31 WARN BlockManager: Putting block rdd_25_1 failed
> 16/09/06 16:10:31 INFO Executor: Executor killed task 0.0 in stage 3.0 (TID 
> 127)
> {code}
> Here's the exception from the reader in the failed job:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
> stage 46.0 failed 4 times, most recent failure: Lost task 4.3 in stage 46.0 
> (TID 1484, 10.69.255.197): org.apache.spark.storage.BlockFetchException: 
> Failed to fetch block after 1 fetch failures. Most recent failure cause:
>   at 
> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:565)
>   at 
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522)
>   at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
> {code}
> I believe that there's a race condition in how we handle cleanup after failed 
> cache stores. Here's an excerpt from {{BlockManager.doPut()}}
> {code}
> var blockWasSuccessfullyStored: Boolean = false
> val result: Option[T] = try {
>   val res = putBody(putBlockInfo)
>   blockWasSuccessfullyStored = res.isEmpty
>   res
> } finally {
>   if (blockWasSuccessfullyStored) {
> if (keepReadLock) {
>   blockInfoManager.downgradeLock(blockId)
> } else {
>   blockInfoManager.unlock(blockId)
> }
>   } else {
> blockInfoManager.removeBlock(blockId)
> logWarning(s"Putting block $blockId failed")
>   }
>   }
> {code}
> The only way that I think this "successfully stored followed by immediately 
> failed" case could appear in our logs is if the local memory store write 
> succeeds and then an exception (perhaps InterruptedException) causes us to 
> enter the {{finally}} block's error-cleanup path. The problem is that the 
> {{finally}} block only cleans up the block's metadata rather than performing 
> the full cleanup path which would also notify the master that the block is no 
> longer available at this host.
> The fact that the Spark task was not resilient in the face of remote block 
> fetches is a separate issue which I'll report and fix separately. The scope 
> of this JIRA, however, is the fact that Spark still attempted reads from a 
> machine which was missing the block.
> In order to fix this problem, I think that the {{finally}} block should 
> perform more thorough cleanup and should send a "block removed" status update 
> to the master following any error during the write. This is necessary because 
> the body of {{doPut()}} may have already notified the master of block 
> availability. In addition, we can extend the block serving code path to 
> automatically update the master with "block deleted" statuses whenever the 
> block manager receives invalid requests for blocks that it doesn't have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17484) Race condition when cancelling a job during a cache write can lead to block fetch failures

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488273#comment-15488273
 ] 

Apache Spark commented on SPARK-17484:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15085

> Race condition when cancelling a job during a cache write can lead to block 
> fetch failures
> --
>
> Key: SPARK-17484
> URL: https://issues.apache.org/jira/browse/SPARK-17484
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> On a production cluster, I observed the following weird behavior where a 
> block manager cached a block, the store failed due to a task being killed / 
> cancelled, and then a subsequent task incorrectly attempted to read the 
> cached block from the machine where the write failed, eventually leading to a 
> complete job failure.
> Here's the executor log snippet from the machine performing the failed cache:
> {code}
> 16/09/06 16:10:31 INFO MemoryStore: Block rdd_25_1 stored as values in memory 
> (estimated size 976.8 MB, free 9.8 GB)
> 16/09/06 16:10:31 WARN BlockManager: Putting block rdd_25_1 failed
> 16/09/06 16:10:31 INFO Executor: Executor killed task 0.0 in stage 3.0 (TID 
> 127)
> {code}
> Here's the exception from the reader in the failed job:
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in 
> stage 46.0 failed 4 times, most recent failure: Lost task 4.3 in stage 46.0 
> (TID 1484, 10.69.255.197): org.apache.spark.storage.BlockFetchException: 
> Failed to fetch block after 1 fetch failures. Most recent failure cause:
>   at 
> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:565)
>   at 
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:522)
>   at org.apache.spark.storage.BlockManager.get(BlockManager.scala:609)
>   at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:661)
> {code}
> I believe that there's a race condition in how we handle cleanup after failed 
> cache stores. Here's an excerpt from {{BlockManager.doPut()}}
> {code}
> var blockWasSuccessfullyStored: Boolean = false
> val result: Option[T] = try {
>   val res = putBody(putBlockInfo)
>   blockWasSuccessfullyStored = res.isEmpty
>   res
> } finally {
>   if (blockWasSuccessfullyStored) {
> if (keepReadLock) {
>   blockInfoManager.downgradeLock(blockId)
> } else {
>   blockInfoManager.unlock(blockId)
> }
>   } else {
> blockInfoManager.removeBlock(blockId)
> logWarning(s"Putting block $blockId failed")
>   }
>   }
> {code}
> The only way that I think this "successfully stored followed by immediately 
> failed" case could appear in our logs is if the local memory store write 
> succeeds and then an exception (perhaps InterruptedException) causes us to 
> enter the {{finally}} block's error-cleanup path. The problem is that the 
> {{finally}} block only cleans up the block's metadata rather than performing 
> the full cleanup path which would also notify the master that the block is no 
> longer available at this host.
> The fact that the Spark task was not resilient in the face of remote block 
> fetches is a separate issue which I'll report and fix separately. The scope 
> of this JIRA, however, is the fact that Spark still attempted reads from a 
> machine which was missing the block.
> In order to fix this problem, I think that the {{finally}} block should 
> perform more thorough cleanup and should send a "block removed" status update 
> to the master following any error during the write. This is necessary because 
> the body of {{doPut()}} may have already notified the master of block 
> availability. In addition, we can extend the block serving code path to 
> automatically update the master with "block deleted" statuses whenever the 
> block manager receives invalid requests for blocks that it doesn't have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client

2016-09-13 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-17531:

Description: 
If a user provides listeners inside the Hive Conf, the configuration for these 
listeners is passed to the Hive Execution Client as well. This may cause 
issues for two reasons:
 1. The Execution Client will actually generate garbage
 2. The listener class needs to be on both the Spark classpath and the Hive 
classpath

The configuration can be overwritten within 
{code}
HiveUtils.newTemporaryConfiguration
{code}
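
As a hedged sketch of the kind of override meant here (the exact Hive keys and 
where they are blanked out are assumptions, not the contents of the eventual 
patch):

{code}
// Keep user-provided metastore listeners out of the execution client by
// blanking the listener keys in its configuration.
val listenerKeysToClear = Seq(
  "hive.metastore.pre.event.listeners",
  "hive.metastore.event.listeners",
  "hive.metastore.end.function.listeners")

val executionClientOverrides: Map[String, String] =
  listenerKeysToClear.map(_ -> "").toMap
{code}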

> Don't initialize Hive Listeners for the Execution Client
> 
>
> Key: SPARK-17531
> URL: https://issues.apache.org/jira/browse/SPARK-17531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Burak Yavuz
>
> If a user provides listeners inside the Hive Conf, the configuration for 
> these listeners are passed to the Hive Execution Client as well. This may 
> cause issues for two reasons:
>  1. The Execution Client will actually generate garbage
>  2. The listener class needs to be both in the Spark Classpath and Hive 
> Classpath
> The configuration can be overwritten within 
> {code}
> HiveUtils.newTemporaryConfiguration
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17531) Don't initialize Hive Listeners for the Execution Client

2016-09-13 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-17531:
---

 Summary: Don't initialize Hive Listeners for the Execution Client
 Key: SPARK-17531
 URL: https://issues.apache.org/jira/browse/SPARK-17531
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Burak Yavuz






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17529) On highly skewed data, outer join merges are slow

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488106#comment-15488106
 ] 

Apache Spark commented on SPARK-17529:
--

User 'davidnavas' has created a pull request for this issue:
https://github.com/apache/spark/pull/15084

> On highly skewed data, outer join merges are slow
> -
>
> Key: SPARK-17529
> URL: https://issues.apache.org/jira/browse/SPARK-17529
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: David C Navas
>
> All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the 
> same performance problem.
> My co-worker Yewei Zhang was investigating a performance problem with a 
> highly skewed dataset.
> "Part of this query performs a full outer join over [an ID] on highly skewed 
> data. On the left view, there is one record for id = 0 out of 2,272,486 
> records; On the right view there are 8,353,097 records for id = 0 out of 
> 12,685,073 records"
> The sub-query was taking 5.2 minutes.  We discovered that snappy was 
> responsible for some measure of this problem and incorporated the new snappy 
> release.  This brought the sub-query down to 2.4 minutes.  A large percentage 
> of the remaining time was spent in the merge code which I tracked down to a 
> BitSet clearing issue.  We have noted that you already have the snappy fix, 
> this issue describes the problem with the BitSet.
> The BitSet grows to handle the largest matching set of keys and is used to 
> track joins.  The BitSet is re-used in all subsequent joins (unless it is too 
> small)
> The skewing of our data caused a very large BitSet to be allocated on the 
> very first row joined.  Unfortunately, the entire BitSet is cleared on each 
> re-use.  For each of the remaining rows which likely match only a few rows on 
> the other side, the entire 1MB of the BitSet is cleared.  If unpartitioned, 
> this would happen roughly 6 million times.  The fix I developed for this is 
> to clear only the portion of the BitSet that is needed.  After applying it, 
> the sub-query dropped from 2.4 minutes to 29 seconds.
> Small (0 or negative) IDs are often used as place-holders for null, empty, or 
> unknown data, so I expect this fix to be generally useful, if rarely 
> encountered to this particular degree.
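
A small sketch of the "clear only what was used" idea described above, written 
against java.util.BitSet for illustration; Spark's own BitSet class and the 
actual patch may differ:

{code}
import java.util.BitSet

// Grown once for the largest group of matching keys (8M bits is ~1MB),
// then reused for every subsequent row.
val matched = new BitSet(8 * 1024 * 1024)
var highestSetBit = -1

def markMatch(i: Int): Unit = {
  matched.set(i)
  if (i > highestSetBit) highestSetBit = i
}

// Instead of matched.clear(), which walks the whole bitmap on every row,
// clear only the prefix that was actually touched.
def resetForNextRow(): Unit = {
  if (highestSetBit >= 0) matched.clear(0, highestSetBit + 1)
  highestSetBit = -1
}
{code}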



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17529) On highly skewed data, outer join merges are slow

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17529:


Assignee: Apache Spark

> On highly skewed data, outer join merges are slow
> -
>
> Key: SPARK-17529
> URL: https://issues.apache.org/jira/browse/SPARK-17529
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: David C Navas
>Assignee: Apache Spark
>
> All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the 
> same performance problem.
> My co-worker Yewei Zhang was investigating a performance problem with a 
> highly skewed dataset.
> "Part of this query performs a full outer join over [an ID] on highly skewed 
> data. On the left view, there is one record for id = 0 out of 2,272,486 
> records; On the right view there are 8,353,097 records for id = 0 out of 
> 12,685,073 records"
> The sub-query was taking 5.2 minutes.  We discovered that snappy was 
> responsible for some measure of this problem and incorporated the new snappy 
> release.  This brought the sub-query down to 2.4 minutes.  A large percentage 
> of the remaining time was spent in the merge code which I tracked down to a 
> BitSet clearing issue.  We have noted that you already have the snappy fix, 
> this issue describes the problem with the BitSet.
> The BitSet grows to handle the largest matching set of keys and is used to 
> track joins.  The BitSet is re-used in all subsequent joins (unless it is too 
> small)
> The skewing of our data caused a very large BitSet to be allocated on the 
> very first row joined.  Unfortunately, the entire BitSet is cleared on each 
> re-use.  For each of the remaining rows which likely match only a few rows on 
> the other side, the entire 1MB of the BitSet is cleared.  If unpartitioned, 
> this would happen roughly 6 million times.  The fix I developed for this is 
> to clear only the portion of the BitSet that is needed.  After applying it, 
> the sub-query dropped from 2.4 minutes to 29 seconds.
> Small (0 or negative) IDs are often used as place-holders for null, empty, or 
> unknown data, so I expect this fix to be generally useful, if rarely 
> encountered to this particular degree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17529) On highly skewed data, outer join merges are slow

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17529:


Assignee: (was: Apache Spark)

> On highly skewed data, outer join merges are slow
> -
>
> Key: SPARK-17529
> URL: https://issues.apache.org/jira/browse/SPARK-17529
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: David C Navas
>
> All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the 
> same performance problem.
> My co-worker Yewei Zhang was investigating a performance problem with a 
> highly skewed dataset.
> "Part of this query performs a full outer join over [an ID] on highly skewed 
> data. On the left view, there is one record for id = 0 out of 2,272,486 
> records; On the right view there are 8,353,097 records for id = 0 out of 
> 12,685,073 records"
> The sub-query was taking 5.2 minutes.  We discovered that snappy was 
> responsible for some measure of this problem and incorporated the new snappy 
> release.  This brought the sub-query down to 2.4 minutes.  A large percentage 
> of the remaining time was spent in the merge code which I tracked down to a 
> BitSet clearing issue.  We have noted that you already have the snappy fix, 
> this issue describes the problem with the BitSet.
> The BitSet grows to handle the largest matching set of keys and is used to 
> track joins.  The BitSet is re-used in all subsequent joins (unless it is too 
> small)
> The skewing of our data caused a very large BitSet to be allocated on the 
> very first row joined.  Unfortunately, the entire BitSet is cleared on each 
> re-use.  For each of the remaining rows which likely match only a few rows on 
> the other side, the entire 1MB of the BitSet is cleared.  If unpartitioned, 
> this would happen roughly 6 million times.  The fix I developed for this is 
> to clear only the portion of the BitSet that is needed.  After applying it, 
> the sub-query dropped from 2.4 minutes to 29 seconds.
> Small (0 or negative) IDs are often used as place-holders for null, empty, or 
> unknown data, so I expect this fix to be generally useful, if rarely 
> encountered to this particular degree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17529) On highly skewed data, outer join merges are slow

2016-09-13 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-17529:
-
Priority: Major  (was: Trivial)

> On highly skewed data, outer join merges are slow
> -
>
> Key: SPARK-17529
> URL: https://issues.apache.org/jira/browse/SPARK-17529
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: David C Navas
>
> All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the 
> same performance problem.
> My co-worker Yewei Zhang was investigating a performance problem with a 
> highly skewed dataset.
> "Part of this query performs a full outer join over [an ID] on highly skewed 
> data. On the left view, there is one record for id = 0 out of 2,272,486 
> records; On the right view there are 8,353,097 records for id = 0 out of 
> 12,685,073 records"
> The sub-query was taking 5.2 minutes.  We discovered that snappy was 
> responsible for some measure of this problem and incorporated the new snappy 
> release.  This brought the sub-query down to 2.4 minutes.  A large percentage 
> of the remaining time was spent in the merge code which I tracked down to a 
> BitSet clearing issue.  We have noted that you already have the snappy fix, 
> this issue describes the problem with the BitSet.
> The BitSet grows to handle the largest matching set of keys and is used to 
> track joins.  The BitSet is re-used in all subsequent joins (unless it is too 
> small)
> The skewing of our data caused a very large BitSet to be allocated on the 
> very first row joined.  Unfortunately, the entire BitSet is cleared on each 
> re-use.  For each of the remaining rows which likely match only a few rows on 
> the other side, the entire 1MB of the BitSet is cleared.  If unpartitioned, 
> this would happen roughly 6 million times.  The fix I developed for this is 
> to clear only the portion of the BitSet that is needed.  After applying it, 
> the sub-query dropped from 2.4 minutes to 29 seconds.
> Small (0 or negative) IDs are often used as place-holders for null, empty, or 
> unknown data, so I expect this fix to be generally useful, if rarely 
> encountered to this particular degree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17530) Add Statistics into DESCRIBE FORMATTED

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17530:


Assignee: (was: Apache Spark)

> Add Statistics into DESCRIBE FORMATTED
> --
>
> Key: SPARK-17530
> URL: https://issues.apache.org/jira/browse/SPARK-17530
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> Statistics is missing in the output of `DESCRIBE FORMATTED`. We should add it 
> and also improve the output of statistics in `DESCRIBE EXTENDED`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17530) Add Statistics into DESCRIBE FORMATTED

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17530:


Assignee: Apache Spark

> Add Statistics into DESCRIBE FORMATTED
> --
>
> Key: SPARK-17530
> URL: https://issues.apache.org/jira/browse/SPARK-17530
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Statistics is missing in the output of `DESCRIBE FORMATTED`. We should add it 
> and also improve the output of statistics in `DESCRIBE EXTENDED`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17530) Add Statistics into DESCRIBE FORMATTED

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488090#comment-15488090
 ] 

Apache Spark commented on SPARK-17530:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/15083

> Add Statistics into DESCRIBE FORMATTED
> --
>
> Key: SPARK-17530
> URL: https://issues.apache.org/jira/browse/SPARK-17530
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>
> Statistics is missing in the output of `DESCRIBE FORMATTED`. We should add it 
> and also improve the output of statistics in `DESCRIBE EXTENDED`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17530) Add Statistics into DESCRIBE FORMATTED

2016-09-13 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17530:
---

 Summary: Add Statistics into DESCRIBE FORMATTED
 Key: SPARK-17530
 URL: https://issues.apache.org/jira/browse/SPARK-17530
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xiao Li


Statistics is missing in the output of `DESCRIBE FORMATTED`. We should add it 
and also improve the output of statistics in `DESCRIBE EXTENDED`.
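
A hypothetical sketch of how the requested output might be exercised once
statistics are surfaced (table name and commands are for illustration only; the
exact output format is not defined by this issue; assumes a SparkSession `spark`):

{code}
spark.sql("CREATE TABLE t USING parquet AS SELECT id FROM range(1000)")
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS")      // collects table-level stats
spark.sql("DESCRIBE FORMATTED t").show(100, false)   // statistics should appear here
{code}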




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488079#comment-15488079
 ] 

Tathagata Das commented on SPARK-15406:
---

Here are some thoughts.

- The key-value types should be communicated as constructor parameters of the 
Kafka Source object. This is how we do it in the FileStreamSource: we pass the 
schema through the constructor, and it is used to create the DF in getBatch - 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L39

- A custom consumer strategy implementation can be loaded by specifying the 
class name as a string and loading it by reflection (see the sketch below) - the 
same way Kafka loads serializers/deserializers, Parquet loads compression 
libraries, etc. This may not be the only solution, and we should definitely 
discuss it further.

- I completely agree that asking users to generate JSON strings themselves is 
not going to cut it. This needs more focused discussion on this specific topic 
so that we can bounce around ideas, ranging from providing a simple map-to-json 
helper to adding extra methods to the DataStreamReader.

These are very good questions that do not have clear answers, and we should 
open specific follow-up JIRAs to discuss them with the community and 
experienced Kafka users like you. I am hoping this initial design doc puts the 
basic framework in place without painting ourselves into a corner. I think 
that, with the current design, we don't block any improvements in the future. 
Please let us know if you think we are doing so in any specific use case.
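
A minimal sketch of the reflection-based loading mentioned above (illustrative 
only, not the actual Spark code; the option name in the usage line is 
hypothetical):

{code}
// Resolve a user-supplied strategy class from a plain string, the same way
// serializer class names are resolved. Assumes a public no-arg constructor.
def loadStrategy(className: String): AnyRef = {
  Class.forName(className, true, Thread.currentThread().getContextClassLoader)
    .getDeclaredConstructor()
    .newInstance()
    .asInstanceOf[AnyRef]
}

// e.g. loadStrategy(options("consumerStrategyClass"))
{code}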


> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17529) On highly skewed data, outer join merges are slow

2016-09-13 Thread David C Navas (JIRA)
David C Navas created SPARK-17529:
-

 Summary: On highly skewed data, outer join merges are slow
 Key: SPARK-17529
 URL: https://issues.apache.org/jira/browse/SPARK-17529
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.0, 1.6.2
Reporter: David C Navas
Priority: Trivial


All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the 
same performance problem.

My co-worker Yewei Zhang was investigating a performance problem with a highly 
skewed dataset.
"Part of this query performs a full outer join over [an ID] on highly skewed 
data. On the left view, there is one record for id = 0 out of 2,272,486 
records; On the right view there are 8,353,097 records for id = 0 out of 
12,685,073 records"

The sub-query was taking 5.2 minutes.  We discovered that snappy was 
responsible for some measure of this problem and incorporated the new snappy 
release.  This brought the sub-query down to 2.4 minutes.  A large percentage 
of the remaining time was spent in the merge code which I tracked down to a 
BitSet clearing issue.  We have noted that you already have the snappy fix, 
this issue describes the problem with the BitSet.

The BitSet grows to handle the largest matching set of keys and is used to 
track joins.  The BitSet is re-used in all subsequent joins (unless it is too 
small).

The skewing of our data caused a very large BitSet to be allocated on the very 
first row joined.  Unfortunately, the entire BitSet is cleared on each re-use.  
For each of the remaining rows which likely match only a few rows on the other 
side, the entire 1MB of the BitSet is cleared.  If unpartitioned, this would 
happen roughly 6 million times.  The fix I developed for this is to clear only 
the portion of the BitSet that is needed.  After applying it, the sub-query 
dropped from 2.4 minutes to 29 seconds.

Small (0 or negative) IDs are often used as place-holders for null, empty, or 
unknown data, so I expect this fix to be generally useful, if rarely 
encountered to this particular degree.
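
To illustrate the kind of change described above (a sketch only, not the actual 
Spark patch), a reusable bit set can remember how much of itself was touched and 
clear just that prefix on re-use instead of its full (possibly 1MB) capacity:

{code}
// Illustrative sketch: clear only the words that were actually written.
class ReusableBitSet(capacityBits: Int) {
  private val words = new Array[Long]((capacityBits + 63) / 64)
  private var maxWordUsed = 0                 // highest word index written so far

  def set(bit: Int): Unit = {
    val w = bit >> 6
    words(w) |= 1L << (bit & 63)
    if (w > maxWordUsed) maxWordUsed = w
  }

  def get(bit: Int): Boolean = (words(bit >> 6) & (1L << (bit & 63))) != 0

  // Re-use: wipe only the prefix [0, maxWordUsed] rather than the whole array.
  def reset(): Unit = {
    java.util.Arrays.fill(words, 0, maxWordUsed + 1, 0L)
    maxWordUsed = 0
  }
}
{code}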



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10816) API design: window and session specification

2016-09-13 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488035#comment-15488035
 ] 

Maciej Bryński commented on SPARK-10816:


Hi,
Any updates on Session Window?

> API design: window and session specification
> 
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17450) spark sql rownumber OOM

2016-09-13 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15488005#comment-15488005
 ] 

Herman van Hovell commented on SPARK-17450:
---

https://github.com/apache/spark/pull/10605

> spark sql rownumber OOM
> ---
>
> Key: SPARK-17450
> URL: https://issues.apache.org/jira/browse/SPARK-17450
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: cen yuhai
>
> Spark SQL will OOM when using row_number() over too many sorted records... 
> there will be only 1 task to handle all the records.
> This SQL groups by passenger_id; we have 100 million passengers.
> {code} 
>  SELECT
> passenger_id,
> total_order,
> (CASE WHEN row_number() over (ORDER BY total_order DESC) BETWEEN 0 AND 
> 670800 THEN 'V3' END) AS order_rank
>   FROM
>   (
> SELECT
>   passenger_id,
>   1 as total_order
> FROM table
> GROUP BY passenger_id
>   ) dd1
> {code}
> {code}
> java.lang.OutOfMemoryError: Java heap space
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:536)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:93)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.fetchNextPartition(Window.scala:278)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:304)
> at 
> org.apache.spark.sql.execution.Window$$anonfun$8$$anon$1.next(Window.scala:246)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:512)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:74)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:104)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> physical plan:
> {code}
> == Physical Plan ==
> Project [passenger_id#7L,total_order#0,CASE WHEN ((_we0#20 >= 0) && (_we1#21 
> <= 670800)) THEN V3 AS order_rank#1]
> +- Window [passenger_id#7L,total_order#0], 
> [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS 
> _we0#20,HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFRowNumber()
>  windowspecdefinition(total_order#0 DESC,ROWS BETWEEN UNBOUNDED PRECEDING AND 
> UNBOUNDED FOLLOWING) AS _we1#21], [total_order#0 DESC]
>+- Sort [total_order#0 DESC], false, 0
>   +- TungstenExchange SinglePartition, None
>  +- Project [passenger_id#7L,total_order#0]
> +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L,total_order#0])
>+- TungstenExchange hashpartitioning(passenger_id#7L,1000), 
> None
>   +- TungstenAggregate(key=[passenger_id#7L], functions=[], 
> output=[passenger_id#7L])
>  +- Project [passenger_id#7L]
> +- Filter product#9 IN (kuai,gulf)
>+- HiveTableScan [passenger_id#7L,product#9], 
> MetastoreRelation pbs_dw, dwv_order_whole_day, None,
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487930#comment-15487930
 ] 

Cody Koeninger commented on SPARK-15406:


Specific examples:

Kafka has a type for a key, and a type for a value, with deserializers 
corresponding to those types.  In other words, I need to construct a 
KafkaRDD[K, V] in the getBatch method.  How do I communicate a parameterized 
type for K and V through this interface?

ConsumerStrategy allows for user-defined implementations.  This is necessary 
because getting kafka consumers set up correctly is stateful, and one size 
doesn't fit all.  How do I communicate a user-defined ConsumerStrategy through 
this interface?

More prosaically, telling Scala end users that the way they need to communicate 
a mapping from topicpartition objects to starting offsets is to pass in a json 
string... if I were evaluating a new library from an outside perspective and saw 
that, I'd say nope and walk away.  Any language that can handle json can handle 
nested maps, so at the very least that seems like a better lowest common 
denominator than a string.

Again, I'm not trying to be obstructionist here, I am generally in favor of 
doing the simplest thing that works.  But I really have very little confidence 
that doing the expedient thing now isn't going to prevent us from doing the 
right thing later.
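
For illustration, this is the kind of JSON end users would otherwise have to 
assemble by hand; a tiny helper over nested maps (hypothetical, not an existing 
Spark API) would hide it entirely:

{code}
// Build the JSON string form being discussed from a
// (topic, partition) -> starting-offset map, so users never hand-write JSON.
def offsetsToJson(offsets: Map[(String, Int), Long]): String = {
  offsets
    .groupBy { case ((topic, _), _) => topic }
    .map { case (topic, parts) =>
      val partitions = parts
        .map { case ((_, partition), offset) => s""""$partition": $offset""" }
        .mkString(", ")
      s""""$topic": {$partitions}"""
    }
    .mkString("{", ", ", "}")
}

// offsetsToJson(Map(("events", 0) -> 123L, ("events", 1) -> 456L))
// => {"events": {"0": 123, "1": 456}}   (ordering may vary)
{code}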

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-15406:
--
Description: 
This is the parent JIRA to track all the work for the building a Kafka source 
for Structured Streaming. Here is the design doc for an initial version of the 
Kafka Source.
https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing


== Old description =
Structured streaming doesn't have support for kafka yet.  I personally feel 
like time based indexing would make for a much better interface, but it's been 
pushed back to kafka 0.10.1

https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index

  was:
Structured streaming doesn't have support for kafka yet.  I personally feel 
like time based indexing would make for a much better interface, but it's been 
pushed back to kafka 0.10.1

https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index


> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> This is the parent JIRA to track all the work for the building a Kafka source 
> for Structured Streaming. Here is the design doc for an initial version of 
> the Kafka Source.
> https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
> == Old description =
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17528) MutableProjection should not cache content from the input row

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487907#comment-15487907
 ] 

Apache Spark commented on SPARK-17528:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15082

> MutableProjection should not cache content from the input row
> -
>
> Key: SPARK-17528
> URL: https://issues.apache.org/jira/browse/SPARK-17528
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17528) MutableProjection should not cache content from the input row

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17528:


Assignee: Apache Spark  (was: Wenchen Fan)

> MutableProjection should not cache content from the input row
> -
>
> Key: SPARK-17528
> URL: https://issues.apache.org/jira/browse/SPARK-17528
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17528) MutableProjection should not cache content from the input row

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17528:


Assignee: Wenchen Fan  (was: Apache Spark)

> MutableProjection should not cache content from the input row
> -
>
> Key: SPARK-17528
> URL: https://issues.apache.org/jira/browse/SPARK-17528
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487875#comment-15487875
 ] 

Michael Armbrust commented on SPARK-15406:
--

Hey Cody, thanks for the input and for sharing your prototype!  While I think 
it's fair to say that we don't have the resources to bring Python DStreams up 
to par with Scala at this time, that is not the case for the Spark SQL APIs.  
We already have pretty good parity between Scala and Python here, and I'd 
really like to maintain/improve that.

Can you help me understand better what specifically is broken with the stringly 
typed interface?  Ideally, we'd find ways to work around this, but that said, 
I'm not opposed to also having a typed interface in addition to the 
DataStreamReader/Writer.  We do this in a few exceptional cases already (e.g. 
we take an RDD[String] into the JSON data source).
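
The "exceptional case" mentioned above, roughly (assuming an existing 
SparkContext `sc` and SparkSession `spark`):

{code}
// The JSON data source accepts an already-typed RDD[String] in addition to paths.
val jsonLines = sc.parallelize(Seq("""{"a": 1}""", """{"a": 2}"""))
val df = spark.read.json(jsonLines)   // typed input rather than a stringly-typed option
df.show()
{code}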

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17528) MutableProjection should not cache content from the input row

2016-09-13 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-17528:
---

 Summary: MutableProjection should not cache content from the input 
row
 Key: SPARK-17528
 URL: https://issues.apache.org/jira/browse/SPARK-17528
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17527) mergeSchema with `_OPTIONAL_` metadata fails

2016-09-13 Thread Gaurav Shah (JIRA)
Gaurav Shah created SPARK-17527:
---

 Summary: mergeSchema with `_OPTIONAL_` metadata fails
 Key: SPARK-17527
 URL: https://issues.apache.org/jira/browse/SPARK-17527
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
 Environment: mac osx 10.11.6, ubuntu 14, ubuntu 16.
spark 2.0.0, spark-catalyst 2.0.0
Reporter: Gaurav Shah


Spark added '_OPTIONAL_' metadata in 2.0.0 in the following commit: 
https://github.com/apache/spark/commit/4637fc08a3733ec313218fb7e4d05064d9a6262d

but merging metadata for data created by Spark 1.6.x and 2.0 fails with the 
following:

{code}
Exception in thread "main" java.lang.RuntimeException: could not merge 
metadata: key org.apache.spark.sql.parquet.row.metadata has conflicting values:
{code}
and the only difference between those values is that the metadata now has an 
extra "_OPTIONAL_" field.

{code:javascript}
{
  "name": "catalog",
  "type": {
    "type": "struct",
    "fields": [
      {
        "name": "category",
        "type": "string",
        "nullable": true,
        "metadata": {}
      },
      {
        "name": "department",
        "type": "string",
        "nullable": true,
        "metadata": {}
      }
    ]
  },
  "nullable": true,
  "metadata": {
    "_OPTIONAL_": true
  }
}
{code}

vs
{code:javascript}
{
  "name": "catalog",
  "type": {
    "type": "struct",
    "fields": [
      {
        "name": "category",
        "type": "string",
        "nullable": true,
        "metadata": {}
      },
      {
        "name": "department",
        "type": "string",
        "nullable": true,
        "metadata": {}
      }
    ]
  },
  "nullable": true,
  "metadata": {}
}
{code}
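
A hypothetical repro sketch of the failure mode described above (paths are 
placeholders; assumes a SparkSession `spark`):

{code}
// Mix files whose footers carry the 1.6.x schema with files carrying the
// 2.0.0 schema (which adds the "_OPTIONAL_" metadata entry), then merge.
val merged = spark.read
  .option("mergeSchema", "true")                             // merge Parquet footers
  .parquet("/data/written-by-1.6", "/data/written-by-2.0")   // hypothetical paths; merge fails as above
merged.printSchema()
{code}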



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17525) SparkContext.clearFiles() still present in the PySpark bindings though the underlying Scala method was removed in Spark 2.0

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487789#comment-15487789
 ] 

Apache Spark commented on SPARK-17525:
--

User 'sjakthol' has created a pull request for this issue:
https://github.com/apache/spark/pull/15081

> SparkContext.clearFiles() still present in the PySpark bindings though the 
> underlying Scala method was removed in Spark 2.0
> ---
>
> Key: SPARK-17525
> URL: https://issues.apache.org/jira/browse/SPARK-17525
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sami Jaktholm
>Priority: Trivial
>
> STR: In PySpark shell, run sc.clearFiles()
> What happens:
> {noformat}
> py4j.protocol.Py4JError: An error occurred while calling o74.clearFiles. 
> Trace:
> py4j.Py4JException: Method clearFiles([]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
>   at py4j.Gateway.invoke(Gateway.java:272)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Apparently the old and deprecated SparkContext.clearFiles() was removed from 
> Spark 2.0 but it's still present in the PySpark API. It should be removed 
> from there too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17525) SparkContext.clearFiles() still present in the PySpark bindings though the underlying Scala method was removed in Spark 2.0

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17525:


Assignee: (was: Apache Spark)

> SparkContext.clearFiles() still present in the PySpark bindings though the 
> underlying Scala method was removed in Spark 2.0
> ---
>
> Key: SPARK-17525
> URL: https://issues.apache.org/jira/browse/SPARK-17525
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sami Jaktholm
>Priority: Trivial
>
> STR: In PySpark shell, run sc.clearFiles()
> What happens:
> {noformat}
> py4j.protocol.Py4JError: An error occurred while calling o74.clearFiles. 
> Trace:
> py4j.Py4JException: Method clearFiles([]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
>   at py4j.Gateway.invoke(Gateway.java:272)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Apparently the old and deprecated SparkContext.clearFiles() was removed from 
> Spark 2.0 but it's still present in the PySpark API. It should be removed 
> from there too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17525) SparkContext.clearFiles() still present in the PySpark bindings though the underlying Scala method was removed in Spark 2.0

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17525:


Assignee: Apache Spark

> SparkContext.clearFiles() still present in the PySpark bindings though the 
> underlying Scala method was removed in Spark 2.0
> ---
>
> Key: SPARK-17525
> URL: https://issues.apache.org/jira/browse/SPARK-17525
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sami Jaktholm
>Assignee: Apache Spark
>Priority: Trivial
>
> STR: In PySpark shell, run sc.clearFiles()
> What happens:
> {noformat}
> py4j.protocol.Py4JError: An error occurred while calling o74.clearFiles. 
> Trace:
> py4j.Py4JException: Method clearFiles([]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
>   at py4j.Gateway.invoke(Gateway.java:272)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Apparently the old and deprecated SparkContext.clearFiles() was removed from 
> Spark 2.0 but it's still present in the PySpark API. It should be removed 
> from there too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Frederick Reiss (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487780#comment-15487780
 ] 

Frederick Reiss commented on SPARK-15406:
-

+1 for taking the simple route in the short term. I'm seeing a lot of people at 
my work who are just kicking the tires on Structured Streaming at this point. 
Right now, we just need a Kafka connector that is correct and easy to use; 
performance and advanced functionality are of secondary importance. 

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> Structured streaming doesn't have support for kafka yet.  I personally feel 
> like time based indexing would make for a much better interface, but it's 
> been pushed back to kafka 0.10.1
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17252) Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors during query parsing

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487749#comment-15487749
 ] 

Apache Spark commented on SPARK-17252:
--

User 'sjakthol' has created a pull request for this issue:
https://github.com/apache/spark/pull/15081

> Performing arithmetic in VALUES can lead to ClassCastException / MatchErrors 
> during query parsing
> -
>
> Key: SPARK-17252
> URL: https://issues.apache.org/jira/browse/SPARK-17252
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
> Fix For: 2.0.1
>
>
> The following example fails with a ClassCastException:
> {code}
> create table t(d double);
> insert into t VALUES (1 * 1.0);
> {code}
>  Here's the error:
> {code}
> java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be 
> cast to java.lang.Integer
>   at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
>   at scala.math.Numeric$IntIsIntegral$.times(Numeric.scala:57)
>   at 
> org.apache.spark.sql.catalyst.expressions.Multiply.nullSafeEval(arithmetic.scala:207)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:416)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypeCreator.scala:198)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypeCreator.scala:198)
>   at 
> org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:320)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:677)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1$$anonfun$39.apply(AstBuilder.scala:674)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:674)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitInlineTable$1.apply(AstBuilder.scala:658)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:658)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitInlineTable(AstBuilder.scala:43)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableContext.accept(SqlBaseParser.java:9358)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitInlineTableDefault1(SqlBaseBaseVisitor.java:608)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$InlineTableDefault1Context.accept(SqlBaseParser.java:7073)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitQueryTermDefault(SqlBaseBaseVisitor.java:580)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$QueryTermDefaultContext.accept(SqlBaseParser.java:6895)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:47)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.plan(AstBuilder.scala:83)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:158)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder$$anonfun$visitSingleInsertQuery$1.apply(AstBuilder.scala:162)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:96)
>   at 
> 

[jira] [Assigned] (SPARK-17526) Display the executor log links with the job failure message on Spark UI and Console

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17526:


Assignee: (was: Apache Spark)

> Display the executor log links with the job failure message on Spark UI and 
> Console
> ---
>
> Key: SPARK-17526
> URL: https://issues.apache.org/jira/browse/SPARK-17526
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Zhan Zhang
>Priority: Minor
>
>  Display the executor log links with the job failure message on Spark UI and 
> Console
> "Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 0.0 (TID 3, HostName): 
> java.lang.Exception: foo"
> To make this failure message more helpful, we should include a link to the log 
> of the executor on which the task failed, in both the driver log and the web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17526) Display the executor log links with the job failure message on Spark UI and Console

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17526:


Assignee: Apache Spark

> Display the executor log links with the job failure message on Spark UI and 
> Console
> ---
>
> Key: SPARK-17526
> URL: https://issues.apache.org/jira/browse/SPARK-17526
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Zhan Zhang
>Assignee: Apache Spark
>Priority: Minor
>
>  Display the executor log links with the job failure message on Spark UI and 
> Console
> "Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 0.0 (TID 3, HostName): 
> java.lang.Exception: foo"
> To make this failure message more helpful, we should include a link to the log 
> of the executor on which the task failed, in both the driver log and the web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17526) Display the executor log links with the job failure message on Spark UI and Console

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487705#comment-15487705
 ] 

Apache Spark commented on SPARK-17526:
--

User 'zhzhan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15080

> Display the executor log links with the job failure message on Spark UI and 
> Console
> ---
>
> Key: SPARK-17526
> URL: https://issues.apache.org/jira/browse/SPARK-17526
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Zhan Zhang
>Priority: Minor
>
>  Display the executor log links with the job failure message on Spark UI and 
> Console
> "Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most 
> recent failure: Lost task 0.3 in stage 0.0 (TID 3, HostName): 
> java.lang.Exception: foo"
> To make this failure message more helpful, we should include a link to the log 
> of the executor on which the task failed, in both the driver log and the web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17397) Show example of what to do when awaitTermination() throws an Exception

2016-09-13 Thread Spiro Michaylov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487691#comment-15487691
 ] 

Spiro Michaylov edited comment on SPARK-17397 at 9/13/16 4:37 PM:
--

In the form of a PR? Sure, I can do that. If I don't get to it in the next 
couple of days it'll take a few weeks, but I'll eventually get to it. Feel free 
to assign it to me if that's the process. (Looks like I can't do it myself.)


was (Author: spirom):
In the form of a PR? Sure, I can do that. If I don't get to it in the next 
couple of days it'll take a few weeks, but I'll eventually get to it. I'll 
assign it to myself but if someone wants to yank it back that's fine. 

> Show example of what to do when awaitTermination() throws an Exception
> --
>
> Key: SPARK-17397
> URL: https://issues.apache.org/jira/browse/SPARK-17397
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 2.0.0
> Environment: Linux, Scala but probably general
>Reporter: Spiro Michaylov
>Priority: Minor
>
> When awaitTermination propagates an exception that was thrown in processing a 
> batch, the StreamingContext keeps running. Perhaps this is by design, but I 
> don't see any mention of it in the API docs or the streaming programming 
> guide. It's not clear what idiom should be used to block the thread until the 
> context HAS been stopped in a situation where stream processing is throwing 
> lots of exceptions. 
> For example, in the following, streaming takes the full 30 seconds to 
> terminate. My hope in asking this is to improve my own understanding and 
> perhaps inspire documentation improvements. I'm not filing a bug because it's 
> not clear to me whether this is working as intended. 
> {code}
> val conf = new 
> SparkConf().setAppName("ExceptionPropagation").setMaster("local[4]")
> val sc = new SparkContext(conf)
> // streams will produce data every second
> val ssc = new StreamingContext(sc, Seconds(1))
> val qm = new QueueMaker(sc, ssc)
> // create the stream
> val stream = // create some stream
> // register for data
> stream
>   .map(x => { throw new SomeException("something"); x} )
>   .foreachRDD(r => println("*** count = " + r.count()))
> // start streaming
> ssc.start()
> new Thread("Delayed Termination") {
>   override def run() {
> Thread.sleep(30000)
> ssc.stop()
>   }
> }.start()
> println("*** producing data")
> // start producing data
> qm.populateQueue()
> try {
>   ssc.awaitTermination()
>   println("*** streaming terminated")
> } catch {
>   case e: Exception => {
> println("*** streaming exception caught in monitor thread")
>   }
> }
> // if the above goes down the exception path, there seems no 
> // good way to block here until the streaming context is stopped 
> println("*** done")
> {code}
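
One possible idiom for the question above (an assumption on my part, not 
official guidance from this issue): on the exception path, explicitly stop the 
context, since stop() only returns after the shutdown has completed.

{code}
// Sketch only: variation of the catch block above.
try {
  ssc.awaitTermination()
  println("*** streaming terminated")
} catch {
  case e: Exception =>
    println("*** streaming exception caught in monitor thread")
    ssc.stop(stopSparkContext = false)   // blocks until the streaming context is stopped
}
println("*** done - the context is stopped on both paths")
{code}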



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10408) Autoencoder

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10408:


Assignee: Apache Spark  (was: Alexander Ulanov)

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Apache Spark
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10408) Autoencoder

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487696#comment-15487696
 ] 

Apache Spark commented on SPARK-10408:
--

User 'avulanov' has created a pull request for this issue:
https://github.com/apache/spark/pull/13621

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10408) Autoencoder

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10408:


Assignee: Alexander Ulanov  (was: Apache Spark)

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1)Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0..1], real in [-inf, +inf] 
> 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3)Denoising autoencoder 
> 4)Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17397) Show example of what to do when awaitTermination() throws an Exception

2016-09-13 Thread Spiro Michaylov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487691#comment-15487691
 ] 

Spiro Michaylov commented on SPARK-17397:
-

In the form of a PR? Sure, I can do that. If I don't get to it in the next 
couple of days it'll take a few weeks, but I'll eventually get to it. I'll 
assign it to myself but if someone wants to yank it back that's fine. 

> Show example of what to do when awaitTermination() throws an Exception
> --
>
> Key: SPARK-17397
> URL: https://issues.apache.org/jira/browse/SPARK-17397
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Streaming
>Affects Versions: 2.0.0
> Environment: Linux, Scala but probably general
>Reporter: Spiro Michaylov
>Priority: Minor
>
> When awaitTermination propagates an exception that was thrown in processing a 
> batch, the StreamingContext keeps running. Perhaps this is by design, but I 
> don't see any mention of it in the API docs or the streaming programming 
> guide. It's not clear what idiom should be used to block the thread until the 
> context HAS been stopped in a situation where stream processing is throwing 
> lots of exceptions. 
> For example, in the following, streaming takes the full 30 seconds to 
> terminate. My hope in asking this is to improve my own understanding and 
> perhaps inspire documentation improvements. I'm not filing a bug because it's 
> not clear to me whether this is working as intended. 
> {code}
> val conf = new 
> SparkConf().setAppName("ExceptionPropagation").setMaster("local[4]")
> val sc = new SparkContext(conf)
> // streams will produce data every second
> val ssc = new StreamingContext(sc, Seconds(1))
> val qm = new QueueMaker(sc, ssc)
> // create the stream
> val stream = // create some stream
> // register for data
> stream
>   .map(x => { throw new SomeException("something"); x} )
>   .foreachRDD(r => println("*** count = " + r.count()))
> // start streaming
> ssc.start()
> new Thread("Delayed Termination") {
>   override def run() {
> Thread.sleep(30000)
> ssc.stop()
>   }
> }.start()
> println("*** producing data")
> // start producing data
> qm.populateQueue()
> try {
>   ssc.awaitTermination()
>   println("*** streaming terminated")
> } catch {
>   case e: Exception => {
> println("*** streaming exception caught in monitor thread")
>   }
> }
> // if the above goes down the exception path, there seems no 
> // good way to block here until the streaming context is stopped 
> println("*** done")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17526) Display the executor log links with the job failure message on Spark UI and Console

2016-09-13 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-17526:
--

 Summary: Display the executor log links with the job failure 
message on Spark UI and Console
 Key: SPARK-17526
 URL: https://issues.apache.org/jira/browse/SPARK-17526
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Zhan Zhang
Priority: Minor


 Display the executor log links with the job failure message on Spark UI and 
Console

"Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most 
recent failure: Lost task 0.3 in stage 0.0 (TID 3, HostName): 
java.lang.Exception: foo"

To make this failure message more helpful, we should include a link to the log 
of the executor on which the task failed, in both the driver log and the web UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17142) Complex query triggers binding error in HashAggregateExec

2016-09-13 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-17142.
---
   Resolution: Fixed
 Assignee: Jiang Xingbo
Fix Version/s: 2.1.0

> Complex query triggers binding error in HashAggregateExec
> -
>
> Key: SPARK-17142
> URL: https://issues.apache.org/jira/browse/SPARK-17142
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Jiang Xingbo
>Priority: Blocker
> Fix For: 2.1.0
>
>
> The following example runs successfully on Spark 2.0.0 but fails in the 
> current master (as of b72bb62d421840f82d663c6b8e3922bd14383fbb, if not 
> earlier):
> {code}
> spark.sql("set spark.sql.crossJoin.enabled=true")
> sc.parallelize(Seq(0)).toDF("int_col_1").createOrReplaceTempView("table_4")
> sc.parallelize(Seq(0)).toDF("bigint_col_2").createOrReplaceTempView("table_2")
> val query = """
> SELECT
> ((t2.int_col) + (t1.bigint_col_2)) + ((t2.int_col) + (t1.bigint_col_2)) AS 
> int_col_1
> FROM table_2 t1
> INNER JOIN (
> SELECT
> LEAST(IF(False, LAG(0) OVER (ORDER BY t2.int_col_1 DESC), -230), 
> -991) AS int_col,
> (t2.int_col_1) + (t1.int_col_1) AS int_col_2,
> (t1.int_col_1) + (t2.int_col_1) AS int_col_3,
> t2.int_col_1
> FROM
> table_4 t1,
> table_4 t2
> GROUP BY
> (t1.int_col_1) + (t2.int_col_1),
> t2.int_col_1
> ) t2
> WHERE (t2.int_col_3) NOT IN (t2.int_col, t2.int_col_1)
> GROUP BY (t2.int_col) + (t1.bigint_col_2)
> """
> spark.sql(query).collect()
> {code}
> This fails with the following exception:
> {code}
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: bigint_col_2#65
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:279)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:278)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:320)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:284)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:268)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:455)
>   at 
> org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$32.apply(HashAggregateExec.scala:454)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> 

[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka

2016-09-13 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487310#comment-15487310
 ] 

Cody Koeninger commented on SPARK-15406:


So we can do the easiest thing possible for 2.0.1, which in this case is 
probably taking my patch and replacing the hardcoded String, String types for 
the message key/value with byte arrays.

But if we do that, are you actually willing to allow the changes necessary for 
people to get functionality comparable to the existing DStream / RDD interface 
in 2.x.x? Or is this going to be another situation where you say "well, now 
people are using this interface, so we're stuck with it"?

I'm worried about shoehorning part of the existing Kafka functionality into the 
SourceProvider interface without considering changes to that interface, 
especially if Kafka or Kafka-like sources are the only thing intended to work 
with it.  For instance, why are we sticking ourselves with a stringly-typed 
interface and all the problems that causes, when Reynold has already admitted 
that Python is always going to be a second-class citizen with regard to 
streaming (e.g. SPARK-16534)?
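
As an editorial aside (not part of the ticket discussion): a minimal sketch of 
what a byte-array-based contract could look like from the user side, assuming a 
structured streaming Kafka source that exposes binary {{key}}/{{value}} columns; 
the broker address, topic name, and option names below are illustrative.
{code}
// Sketch only, run from spark-shell where a SparkSession named `spark` exists.
val raw = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")   // assumed broker address
  .option("subscribe", "events")                        // assumed topic name
  .load()

// The application decides how to deserialize the byte arrays, e.g. cast to string.
val decoded = raw.selectExpr(
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value")
{code}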

> Structured streaming support for consuming from Kafka
> -
>
> Key: SPARK-15406
> URL: https://issues.apache.org/jira/browse/SPARK-15406
> Project: Spark
>  Issue Type: New Feature
>Reporter: Cody Koeninger
>
> Structured streaming doesn't have support for Kafka yet.  I personally feel 
> like time-based indexing would make for a much better interface, but it has 
> been pushed back to Kafka 0.10.1:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled

2016-09-13 Thread Dhruve Ashar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487224#comment-15487224
 ] 

Dhruve Ashar commented on SPARK-16441:
--

A patch that reduces the deadlocking was contributed recently. You can check the 
details here: 
https://github.com/apache/spark/commit/a0aac4b775bc8c275f96ad0fbf85c9d8a3690588

I am also working on a patch that improves this further and should be able to 
complete it soon. It's a WIP and I have a PR up for it: 
[https://github.com/apache/spark/pull/14926]. 

Let us know if your issue is resolved or reduced with that patch.
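
As an editorial aside: a minimal configuration sketch (assumed executor limits, 
not from the ticket) for the dynamic-allocation code path being discussed; on 
YARN the external shuffle service must also be enabled for dynamic allocation.
{code}
// Configuration sketch only; values are assumptions for illustration.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-allocation-sketch")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")      // required for dynamic allocation
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .getOrCreate()
{code}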



> Spark application hang when dynamic allocation is enabled
> -
>
> Key: SPARK-16441
> URL: https://issues.apache.org/jira/browse/SPARK-16441
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
> Environment: hadoop 2.7.2  spark1.6.2
>Reporter: cen yuhai
>
> The Spark application waits for an RPC response all the time, and the Spark 
> listener is blocked by dynamic allocation. Executors cannot connect to the 
> driver and are lost.
> "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 
> tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000]
>java.lang.Thread.State: TIMED_WAITING (parking)
>   at sun.misc.Unsafe.park(Native Method)
>   - parking to wait for  <0x00070fdb94f8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
>   at 
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
>   at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
>   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
>   at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
>   at scala.concurrent.Await$.result(package.scala:107)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
>   at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
>   at 
> org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436)
>   - locked <0x828a8960> (a 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend)
>   at 
> org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438)
>   at 
> org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359)
>   at 
> org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264)
>   - locked <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223)
> "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 
> nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000]
>java.lang.Thread.State: BLOCKED (on object monitor)
>   at 
> org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618)
>   - waiting to lock <0x880e6308> (a 
> org.apache.spark.ExecutorAllocationManager)
>   at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>   at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
>   at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
>   at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
>   at 
> 

[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs

2016-09-13 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487203#comment-15487203
 ] 

Thomas Graves commented on SPARK-17321:
---

Yes, that makes sense, and as I stated I think the fix for this should be that 
the Spark shuffle service doesn't use the backup database at all if NM recovery 
(and the Spark config) aren't enabled.  Thus you wouldn't have any disk 
errors.  If NM recovery isn't enabled, the Spark DB isn't going to do you any 
good, because the NM is going to shoot any running containers on restart.

If you are up for making those changes, please go ahead and put up a patch.
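
As an editorial aside: a hypothetical sketch of the guard being proposed, where 
the recovery database is only opened when NM recovery is actually enabled. The 
names ({{recoveryEnabled}}, {{initRegisteredExecutorsDb}}) are illustrative, not 
the actual YarnShuffleService code, and the sketch is in Scala although the real 
service is Java.
{code}
// Hypothetical sketch only.
import java.io.File

def initRegisteredExecutorsDb(recoveryEnabled: Boolean, recoveryDir: Option[File]): Option[File] = {
  if (recoveryEnabled && recoveryDir.isDefined) {
    // With NM recovery on, keep executor registrations on disk so they
    // survive an NM restart.
    Some(new File(recoveryDir.get, "registeredExecutors.ldb"))
  } else {
    // Without NM recovery the NM kills running containers on restart anyway,
    // so skip the backup database and avoid touching the local dirs at all.
    None
  }
}
{code}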

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2, 2.0.0
>Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed that 
> some Spark applications failed randomly due to YarnShuffleService.
> From the logs I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
> at 
> org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77)
> at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote} 
> This is caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose 
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use another good 
> disk if the first one is broken?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16938) Cannot resolve column name after a join

2016-09-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487185#comment-15487185
 ] 

Dongjoon Hyun edited comment on SPARK-16938 at 9/13/16 1:23 PM:


Hi, [~cloud_fan] .
Could you review this issue and PR?


was (Author: dongjoon):
Hi, @cloud-fan .
Could you review this issue and PR?

> Cannot resolve column name after a join
> ---
>
> Key: SPARK-16938
> URL: https://issues.apache.org/jira/browse/SPARK-16938
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Mathieu D
>Priority: Minor
>
> We found a change of behavior in spark-2.0.0 which breaks a query in our code 
> base.
> The following works on previous Spark versions, 1.6.1 up to 2.0.0-preview:
> {code}
> val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa")
> val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb")
> dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", 
> "dfb.id"))
> {code}
> but fails on spark-2.0.0 with the exception:
> {code}
> Cannot resolve column name "dfa.id" among (id, a, id, b); 
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" 
> among (id, a, id, b);
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840)
> ...
> {code}
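
As an editorial aside (not from the ticket): one possible workaround sketch is 
to rename the ambiguous {{id}} columns before the join so that 
{{dropDuplicates}} can resolve them by plain name; the renamed column names 
below are illustrative.
{code}
// Workaround sketch with illustrative names; assumes spark-shell (implicits in scope).
val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").withColumnRenamed("id", "id_a")
val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").withColumnRenamed("id", "id_b")

val joined = dfa.join(dfb, dfa("id_a") === dfb("id_b"))
// Unique column names avoid the "Cannot resolve column name" error seen in 2.0.0.
val deduped = joined.dropDuplicates(Array("id_a", "id_b"))
{code}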



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16938) Cannot resolve column name after a join

2016-09-13 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487185#comment-15487185
 ] 

Dongjoon Hyun commented on SPARK-16938:
---

Hi, @cloud-fan .
Could you review this issue and PR?

> Cannot resolve column name after a join
> ---
>
> Key: SPARK-16938
> URL: https://issues.apache.org/jira/browse/SPARK-16938
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Mathieu D
>Priority: Minor
>
> We found a change of behavior in spark-2.0.0 which breaks a query in our code 
> base.
> The following works on previous Spark versions, 1.6.1 up to 2.0.0-preview:
> {code}
> val dfa = Seq((1, 2), (2, 3)).toDF("id", "a").alias("dfa")
> val dfb = Seq((1, 0), (1, 1)).toDF("id", "b").alias("dfb")
> dfa.join(dfb, dfa("id") === dfb("id")).dropDuplicates(Array("dfa.id", 
> "dfb.id"))
> {code}
> but fails on spark-2.0.0 with the exception:
> {code}
> Cannot resolve column name "dfa.id" among (id, a, id, b); 
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "dfa.id" 
> among (id, a, id, b);
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36$$anonfun$apply$12.apply(Dataset.scala:1819)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1818)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1$$anonfun$36.apply(Dataset.scala:1817)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1817)
>   at 
> org.apache.spark.sql.Dataset$$anonfun$dropDuplicates$1.apply(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.withTypedPlan(Dataset.scala:2594)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1814)
>   at org.apache.spark.sql.Dataset.dropDuplicates(Dataset.scala:1840)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs

2016-09-13 Thread Alexander Kasper (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487166#comment-15487166
 ] 

Alexander Kasper commented on SPARK-17321:
--

No, we're not using NM recovery. What we observed is the following:
- NM runs with a list of local dirs on various disks
- The shuffle service puts its data into one of these local dirs
- One disk fails, making that local dir unusable; incidentally, it is the one 
where the shuffle service put its data
- NM recognizes the disk failure but keeps running happily (it did not have any 
critical data there? no idea...)
- The shuffle service can't access its data and stops working on this node

So in essence, the NM has some way of recognizing that a local dir is no longer 
usable and can keep operating. The shuffle service lacks this functionality and 
becomes unusable. Does that make sense?

> YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
> --
>
> Key: SPARK-17321
> URL: https://issues.apache.org/jira/browse/SPARK-17321
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.6.2, 2.0.0
>Reporter: yunjiong zhao
>
> We run Spark on YARN. After enabling Spark dynamic allocation, we noticed that 
> some Spark applications failed randomly due to YarnShuffleService.
> From the logs I found:
> {quote}
> 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: 
> Error while initializing Netty pipeline
> java.lang.NullPointerException
> at 
> org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77)
> at 
> org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159)
> at 
> org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123)
> at 
> org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116)
> at 
> io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> {quote} 
> This is caused by the first disk in yarn.nodemanager.local-dirs being broken.
> If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose 
> hundreds of nodes, which is unacceptable.
> We have 12 disks in yarn.nodemanager.local-dirs, so why not use another good 
> disk if the first one is broken?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17525) SparkContext.clearFiles() still present in the PySpark bindings though the underlying Scala method was removed in Spark 2.0

2016-09-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487117#comment-15487117
 ] 

Sean Owen commented on SPARK-17525:
---

Oops, yeah, this was missed in 
https://github.com/apache/spark/commit/8ce645d4eeda203cf5e100c4bdba2d71edd44e6a#diff-364713d7776956cb8b0a771e9b62f82d
I think you can open a PR for this for 2.0.1+.

> SparkContext.clearFiles() still present in the PySpark bindings though the 
> underlying Scala method was removed in Spark 2.0
> ---
>
> Key: SPARK-17525
> URL: https://issues.apache.org/jira/browse/SPARK-17525
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sami Jaktholm
>Priority: Trivial
>
> STR: In PySpark shell, run sc.clearFiles()
> What happens:
> {noformat}
> py4j.protocol.Py4JError: An error occurred while calling o74.clearFiles. 
> Trace:
> py4j.Py4JException: Method clearFiles([]) does not exist
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
>   at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
>   at py4j.Gateway.invoke(Gateway.java:272)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Apparently the old and deprecated SparkContext.clearFiles() was removed in 
> Spark 2.0, but it's still present in the PySpark API. It should be removed 
> from there too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17524) RowBasedKeyValueBatchSuite always uses 64 mb page size

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17524:


Assignee: Apache Spark

> RowBasedKeyValueBatchSuite always uses 64 mb page size
> --
>
> Key: SPARK-17524
> URL: https://issues.apache.org/jira/browse/SPARK-17524
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Assignee: Apache Spark
>Priority: Minor
>
> The appendRowUntilExceedingPageSize test at 
> sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
>  always uses the default page size, which is 64 MB, when running the test.
> Users with less powerful machines (e.g. those with two cores) may opt to 
> choose a smaller spark.buffer.pageSize value in order to prevent problems 
> acquiring memory.
> If this size is reduced, say to 1 MB, the test will fail. Here is the 
> problem scenario:
> We run with a page size of 1,048,576 bytes (1 MB); the default is 67,108,864 
> bytes (64 MB).
> The test fails with: java.lang.AssertionError: expected:<14563> but was:<932067>
> 932,067 is roughly 64x bigger than 14,563, and the default page size is 64x 
> bigger than 1 MB (which is when we see the failure).
> The failure is at
> Assert.assertEquals(batch.numRows(), numRows);
> This minor improvement has the test use whatever page size the user has 
> specified (it looks for spark.buffer.pageSize) to prevent this problem from 
> occurring for anyone testing Apache Spark on a box with a reduced page size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17524) RowBasedKeyValueBatchSuite always uses 64 mb page size

2016-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487093#comment-15487093
 ] 

Apache Spark commented on SPARK-17524:
--

User 'a-roberts' has created a pull request for this issue:
https://github.com/apache/spark/pull/15079

> RowBasedKeyValueBatchSuite always uses 64 mb page size
> --
>
> Key: SPARK-17524
> URL: https://issues.apache.org/jira/browse/SPARK-17524
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> The appendRowUntilExceedingPageSize test at 
> sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
>  always uses the default page size, which is 64 MB, when running the test.
> Users with less powerful machines (e.g. those with two cores) may opt to 
> choose a smaller spark.buffer.pageSize value in order to prevent problems 
> acquiring memory.
> If this size is reduced, say to 1 MB, the test will fail. Here is the 
> problem scenario:
> We run with a page size of 1,048,576 bytes (1 MB); the default is 67,108,864 
> bytes (64 MB).
> The test fails with: java.lang.AssertionError: expected:<14563> but was:<932067>
> 932,067 is roughly 64x bigger than 14,563, and the default page size is 64x 
> bigger than 1 MB (which is when we see the failure).
> The failure is at
> Assert.assertEquals(batch.numRows(), numRows);
> This minor improvement has the test use whatever page size the user has 
> specified (it looks for spark.buffer.pageSize) to prevent this problem from 
> occurring for anyone testing Apache Spark on a box with a reduced page size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17524) RowBasedKeyValueBatchSuite always uses 64 mb page size

2016-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17524:


Assignee: (was: Apache Spark)

> RowBasedKeyValueBatchSuite always uses 64 mb page size
> --
>
> Key: SPARK-17524
> URL: https://issues.apache.org/jira/browse/SPARK-17524
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> The appendRowUntilExceedingPageSize test at 
> sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
>  always uses the default page size, which is 64 MB, when running the test.
> Users with less powerful machines (e.g. those with two cores) may opt to 
> choose a smaller spark.buffer.pageSize value in order to prevent problems 
> acquiring memory.
> If this size is reduced, say to 1 MB, the test will fail. Here is the 
> problem scenario:
> We run with a page size of 1,048,576 bytes (1 MB); the default is 67,108,864 
> bytes (64 MB).
> The test fails with: java.lang.AssertionError: expected:<14563> but was:<932067>
> 932,067 is roughly 64x bigger than 14,563, and the default page size is 64x 
> bigger than 1 MB (which is when we see the failure).
> The failure is at
> Assert.assertEquals(batch.numRows(), numRows);
> This minor improvement has the test use whatever page size the user has 
> specified (it looks for spark.buffer.pageSize) to prevent this problem from 
> occurring for anyone testing Apache Spark on a box with a reduced page size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17525) SparkContext.clearFiles() still present in the PySpark bindings though the underlying Scala method was removed in Spark 2.0

2016-09-13 Thread Sami Jaktholm (JIRA)
Sami Jaktholm created SPARK-17525:
-

 Summary: SparkContext.clearFiles() still present in the PySpark 
bindings though the underlying Scala method was removed in Spark 2.0
 Key: SPARK-17525
 URL: https://issues.apache.org/jira/browse/SPARK-17525
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Sami Jaktholm
Priority: Trivial


STR: In PySpark shell, run sc.clearFiles()

What happens:
{noformat}
py4j.protocol.Py4JError: An error occurred while calling o74.clearFiles. Trace:
py4j.Py4JException: Method clearFiles([]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:272)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:211)
at java.lang.Thread.run(Thread.java:745)
{noformat}

Apparently the old and deprecated SparkContext.clearFiles() was removed in Spark 
2.0, but it's still present in the PySpark API. It should be removed from there 
too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17521) Error when I use sparkContext.makeRDD(Seq())

2016-09-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487089#comment-15487089
 ] 

Sean Owen commented on SPARK-17521:
---

Reaching way back to page [~pwendell] -- do you have any view on whether it's 
probably right to deprecate makeRDD() at this point in favor of parallelize()?

> Error when I use sparkContext.makeRDD(Seq())
> 
>
> Key: SPARK-17521
> URL: https://issues.apache.org/jira/browse/SPARK-17521
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>Priority: Trivial
>  Labels: easyfix
>
> When I use sc.makeRDD as below:
> ```
> val data3 = sc.makeRDD(Seq())
> println(data3.partitions.length)
> ```
> I get an error:
> Exception in thread "main" java.lang.IllegalArgumentException: Positive 
> number of slices required
> We can fix this bug by just modifying the last line of the method below to do 
> a check of seq.size:
> ```
>   def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
>     assertNotStopped()
>     val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
>     new ParallelCollectionRDD[T](this, seq.map(_._1), seq.size, indexToPrefs)
>   }
> ```
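
As an editorial aside (not from the ticket): while the guard is being discussed, 
a user-side workaround sketch is to pass an explicit positive slice count or to 
ask for an empty RDD directly.
{code}
// Workaround sketch; assumes spark-shell where `sc` already exists.
val a = sc.parallelize(Seq.empty[Int], numSlices = 1)  // explicit positive slice count
val b = sc.emptyRDD[Int]                               // or request an empty RDD directly
println(a.partitions.length + " " + b.partitions.length)
{code}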



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17524) RowBasedKeyValueBatchSuite always uses 64 mb page size

2016-09-13 Thread Adam Roberts (JIRA)
Adam Roberts created SPARK-17524:


 Summary: RowBasedKeyValueBatchSuite always uses 64 mb page size
 Key: SPARK-17524
 URL: https://issues.apache.org/jira/browse/SPARK-17524
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 2.1.0
Reporter: Adam Roberts
Priority: Minor


The appendRowUntilExceedingPageSize test at 
sql/catalyst/src/test/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatchSuite.java
 always uses the default page size, which is 64 MB, when running the test.

Users with less powerful machines (e.g. those with two cores) may opt to choose 
a smaller spark.buffer.pageSize value in order to prevent problems acquiring 
memory.

If this size is reduced, say to 1 MB, the test will fail. Here is the problem 
scenario:

We run with a page size of 1,048,576 bytes (1 MB); the default is 67,108,864 
bytes (64 MB).
The test fails with: java.lang.AssertionError: expected:<14563> but was:<932067>

932,067 is roughly 64x bigger than 14,563, and the default page size is 64x 
bigger than 1 MB (which is when we see the failure).

The failure is at
Assert.assertEquals(batch.numRows(), numRows);

This minor improvement has the test use whatever page size the user has 
specified (it looks for spark.buffer.pageSize) to prevent this problem from 
occurring for anyone testing Apache Spark on a box with a reduced page size.
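
As an editorial aside: a minimal sketch (in Scala for illustration; the actual 
suite is Java) of reading the configured page size with a 64 MB fallback instead 
of hard-coding the default.
{code}
// Sketch only: derive the page size from the user's configuration.
import org.apache.spark.SparkConf

val conf = new SparkConf()
// getSizeAsBytes falls back to the 64 MB default when spark.buffer.pageSize is unset.
val pageSizeBytes: Long = conf.getSizeAsBytes("spark.buffer.pageSize", "64m")
// The expected row count can then be derived from pageSizeBytes rather than
// assuming the 64 MB default.
println(s"using page size: $pageSizeBytes bytes")
{code}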



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class

2016-09-13 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15487035#comment-15487035
 ] 

Yanbo Liang commented on SPARK-17471:
-

[~sethah] I'm sorry, but I have some urgent matters to deal with these days, so 
please feel free to take over this task. Thanks!

> Add compressed method for Matrix class
> --
>
> Key: SPARK-17471
> URL: https://issues.apache.org/jira/browse/SPARK-17471
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Seth Hendrickson
>
> Vectors in Spark have a {{compressed}} method which selects either a sparse or 
> a dense representation by minimizing storage requirements. Matrices should also 
> have this method, which is now explicitly needed in {{LogisticRegression}} 
> since we have implemented multiclass regression.
> The compressed method should also give the option to store row major or 
> column major, and if nothing is specified it should select the representation 
> with the lower storage requirement (for the sparse case).
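
As an editorial aside: a rough sketch of the kind of decision a 
{{Matrix.compressed}} could make, comparing dense storage against a sparse (CSC) 
layout and keeping the smaller. The byte-count heuristic is an assumption adapted 
from the {{Vector.compressed}} idea, not the eventual implementation.
{code}
// Sketch only; the size heuristic below is illustrative.
import org.apache.spark.ml.linalg.{DenseMatrix, Matrix}

def compressedSketch(m: DenseMatrix): Matrix = {
  val nnz = m.values.count(_ != 0.0)
  val denseBytes  = 8L * m.numRows * m.numCols          // 8 bytes per double
  val sparseBytes = 12L * nnz + 4L * (m.numCols + 1)    // values + row indices + column pointers
  if (sparseBytes < denseBytes) m.toSparse else m
}
{code}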



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


