[jira] [Assigned] (SPARK-18395) Evaluate common subexpression like lazy variable with a function approach

2016-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18395:


Assignee: Apache Spark

> Evaluate common subexpression like lazy variable with a function approach
> -
>
> Key: SPARK-18395
> URL: https://issues.apache.org/jira/browse/SPARK-18395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> As per the discussion at pr 15807, we need to change the way we do subexpression 
> elimination.
> In the current approach, common subexpressions are evaluated whether or not they 
> are actually used later. E.g., in the following generated code, {{subexpr2}} is 
> evaluated even if only the {{if}} branch is run.
> {code}
> if (isNull(subexpr)) {
>   ...
> } else {
>   AssertNotNull(subexpr)  // subexpr2
>   
>   SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2)
> }
> {code}
> Besides a possible performance regression, an expression like 
> {{AssertNotNull}} can throw an exception, so wrongly evaluating {{subexpr2}} 
> will throw that exception unexpectedly.
> With this patch, common subexpressions are not evaluated until they are 
> used. We create a function for each common subexpression which evaluates it and 
> stores the result in a member variable, together with an initialization status 
> variable that records whether the subexpression has been evaluated.
> Thus, when an expression that uses the subexpression is about to be evaluated, we 
> check whether the subexpression is initialized: if yes, we return the cached 
> result directly; if no, we call the function to evaluate it.






[jira] [Assigned] (SPARK-18395) Evaluate common subexpression like lazy variable with a function approach

2016-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18395:


Assignee: (was: Apache Spark)

> Evaluate common subexpression like lazy variable with a function approach
> -
>
> Key: SPARK-18395
> URL: https://issues.apache.org/jira/browse/SPARK-18395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> As per the discussion at pr 15807, we need to change the way we do subexpression 
> elimination.
> In the current approach, common subexpressions are evaluated whether or not they 
> are actually used later. E.g., in the following generated code, {{subexpr2}} is 
> evaluated even if only the {{if}} branch is run.
> {code}
> if (isNull(subexpr)) {
>   ...
> } else {
>   AssertNotNull(subexpr)  // subexpr2
>   
>   SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2)
> }
> {code}
> Besides a possible performance regression, an expression like 
> {{AssertNotNull}} can throw an exception, so wrongly evaluating {{subexpr2}} 
> will throw that exception unexpectedly.
> With this patch, common subexpressions are not evaluated until they are 
> used. We create a function for each common subexpression which evaluates it and 
> stores the result in a member variable, together with an initialization status 
> variable that records whether the subexpression has been evaluated.
> Thus, when an expression that uses the subexpression is about to be evaluated, we 
> check whether the subexpression is initialized: if yes, we return the cached 
> result directly; if no, we call the function to evaluate it.






[jira] [Commented] (SPARK-18395) Evaluate common subexpression like lazy variable with a function approach

2016-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653351#comment-15653351
 ] 

Apache Spark commented on SPARK-18395:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/15837

> Evaluate common subexpression like lazy variable with a function approach
> -
>
> Key: SPARK-18395
> URL: https://issues.apache.org/jira/browse/SPARK-18395
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> As per the discussion at pr 15807, we need to change the way we do subexpression 
> elimination.
> In the current approach, common subexpressions are evaluated whether or not they 
> are actually used later. E.g., in the following generated code, {{subexpr2}} is 
> evaluated even if only the {{if}} branch is run.
> {code}
> if (isNull(subexpr)) {
>   ...
> } else {
>   AssertNotNull(subexpr)  // subexpr2
>   
>   SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2)
> }
> {code}
> Besides a possible performance regression, an expression like 
> {{AssertNotNull}} can throw an exception, so wrongly evaluating {{subexpr2}} 
> will throw that exception unexpectedly.
> With this patch, common subexpressions are not evaluated until they are 
> used. We create a function for each common subexpression which evaluates it and 
> stores the result in a member variable, together with an initialization status 
> variable that records whether the subexpression has been evaluated.
> Thus, when an expression that uses the subexpression is about to be evaluated, we 
> check whether the subexpression is initialized: if yes, we return the cached 
> result directly; if no, we call the function to evaluate it.






[jira] [Created] (SPARK-18395) Evaluate common subexpression like lazy variable with a function approach

2016-11-09 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-18395:
---

 Summary: Evaluate common subexpression like lazy variable with a 
function approach
 Key: SPARK-18395
 URL: https://issues.apache.org/jira/browse/SPARK-18395
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


As per the discussion at pr 15807, we need to change the way we do subexpression 
elimination.

In the current approach, common subexpressions are evaluated whether or not they are 
actually used later. E.g., in the following generated code, {{subexpr2}} is evaluated 
even if only the {{if}} branch is run.

{code}
if (isNull(subexpr)) {
  ...
} else {
  AssertNotNull(subexpr)  // subexpr2
  
  SomeExpr(AssertNotNull(subexpr)) // SomeExpr(subexpr2)
}
{code}

Besides a possible performance regression, an expression like {{AssertNotNull}} 
can throw an exception, so wrongly evaluating {{subexpr2}} will throw that 
exception unexpectedly.

With this patch, common subexpressions are not evaluated until they are used. We 
create a function for each common subexpression which evaluates it and stores the 
result in a member variable, together with an initialization status variable that 
records whether the subexpression has been evaluated.

Thus, when an expression that uses the subexpression is about to be evaluated, we 
check whether the subexpression is initialized: if yes, we return the cached result 
directly; if no, we call the function to evaluate it.
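To make the intended evaluation scheme concrete, here is a minimal Scala sketch of the 
pattern described above. The names ({{subExprLoaded}}, {{subExprValue}}, {{evalSubExpr}}) 
are illustrative, not the actual generated-code identifiers:

{code}
// Sketch only: the common subexpression is evaluated at most once, and only
// when some consumer actually needs it.
class LazySubExprExample(input: String) {
  private var subExprLoaded = false      // initialization status variable
  private var subExprValue: Int = 0      // member variable caching the result

  // One function per common subexpression: evaluate on first use, then cache.
  private def evalSubExpr(): Int = {
    if (!subExprLoaded) {
      subExprValue = input.length        // stand-in for the real subexpression
      subExprLoaded = true
    }
    subExprValue
  }

  // Consumers call evalSubExpr() instead of reading an eagerly computed value,
  // so a branch that never touches the subexpression never evaluates it.
  def branchA: Int = if (input == null) 0 else evalSubExpr() + 1
  def branchB: Int = evalSubExpr() * 2
}
{code}

The function plus the status flag gives lazy-variable semantics in the generated code 
without relying on Scala's {{lazy val}}.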







[jira] [Commented] (SPARK-17691) Add aggregate function to collect list with maximum number of elements

2016-11-09 Thread Assaf Mendelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653344#comment-15653344
 ] 

Assaf Mendelson commented on SPARK-17691:
-

While you can use mutable buffers with the aggregator interface, you either 
need a constant buffer size (e.g. a buffer of 5 integers) or an array type. 
The array type is immutable, which means that any reordering or change made to 
it causes a copy. As I said, this would be slow.

I have tried implementing this issue myself (I am still learning, which is why 
I have not asked to be assigned the issue) by extending 
TypedImperativeAggregate, and initial testing shows it is much faster.

As for this function being too specific, I have to disagree. There are many cases 
where one needs to simply collect a list and hand it to a third-party tool; 
that is why collect_list is there. The problem is that in many cases we are 
more interested in the small occurrences than in the large ones. Consider for 
example a time series: we don't need the entire series (especially if, for some 
keys, it is really large) but just part of it.

Sure, you could probably aggregate a count, filter on it, and then use that to 
filter the original dataframe, or use some other means, but I believe that in most 
cases where collect_list is currently used, a version which limits the results 
per key is actually a good fit.
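For reference, a minimal sketch of a size-bounded collect using the user-level 
{{Aggregator}} interface discussed above (class and column names are hypothetical; 
this is the simpler, slower variant rather than a {{TypedImperativeAggregate}} 
implementation):

{code}
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

// Collects at most `maxSize` values per key ("pick whatever" strategy).
class BoundedCollectList(maxSize: Int)
    extends Aggregator[(String, String), Seq[String], Seq[String]] {
  def zero: Seq[String] = Seq.empty
  def reduce(buf: Seq[String], row: (String, String)): Seq[String] =
    if (buf.size < maxSize) buf :+ row._2 else buf
  def merge(b1: Seq[String], b2: Seq[String]): Seq[String] = (b1 ++ b2).take(maxSize)
  def finish(buf: Seq[String]): Seq[String] = buf
  def bufferEncoder: Encoder[Seq[String]] = ExpressionEncoder[Seq[String]]()
  def outputEncoder: Encoder[Seq[String]] = ExpressionEncoder[Seq[String]]()
}

// Usage on a Dataset[(String, String)] of (key, value) pairs,
// with spark.implicits._ in scope:
// ds.groupByKey(_._1).agg(new BoundedCollectList(5).toColumn.name("bounded_list")).show()
{code}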

> Add aggregate function to collect list with maximum number of elements
> --
>
> Key: SPARK-17691
> URL: https://issues.apache.org/jira/browse/SPARK-17691
> Project: Spark
>  Issue Type: New Feature
>Reporter: Assaf Mendelson
>Priority: Minor
>
> One of the aggregate functions we have today is the collect_list function. 
> This is a useful tool to do a "catch all" aggregation which doesn't really 
> fit anywhere else.
> The problem with collect_list is that it is unbounded. I would like to see a 
> way to do a collect_list where we limit the maximum number of elements.
> I would expect the input for this to be the maximum number of elements 
> to keep and the method of choosing (pick whatever, pick the top N, pick the 
> bottom N).






[jira] [Closed] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.

2016-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan closed SPARK-18048.
---
Resolution: Invalid

> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> 
>
> Key: SPARK-18048
> URL: https://issues.apache.org/jira/browse/SPARK-18048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Priyanka Garg
>
> The If expression behaves differently if the true and false expressions are 
> interchanged and have different data types.
> For example,
>   checkEvaluation(
>     If(Literal.create(true, BooleanType),
>       Literal.create(identity(1), DateType),
>       Literal.create(identity(2L), TimestampType)),
>     identity(1))
> throws an error, while
>   checkEvaluation(
>     If(Literal.create(true, BooleanType),
>       Literal.create(identity(1L), TimestampType),
>       Literal.create(identity(2), DateType)),
>     identity(1L))
> works fine.
> The reason is that the If expression's dataType only considers 
> trueValue.dataType.
> Also,
>   checkEvaluation(
>     If(Literal.create(true, BooleanType),
>       Literal.create(identity(1), DateType),
>       Literal.create(identity(2L), TimestampType)),
>     identity(1))
> breaks only in the case of generated mutable projection and unsafe 
> projection; for all other evaluation paths it works fine.
> Either both should work or neither should.
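To illustrate the asymmetry described above, a tiny self-contained Scala sketch 
(all names hypothetical, not Spark's actual {{If}} expression):

{code}
// Simplified model: the result type of an If-like expression is taken from the
// true branch only, so swapping branches with mismatched types changes which
// type is reported even though the evaluated value is the same.
sealed trait SqlType
case object DateT extends SqlType
case object TimestampT extends SqlType

case class Lit(value: Any, tpe: SqlType)

case class SimpleIf(cond: Boolean, trueValue: Lit, falseValue: Lit) {
  def dataType: SqlType = trueValue.tpe   // falseValue.tpe is never consulted
  def eval: Any = if (cond) trueValue.value else falseValue.value
}

// SimpleIf(true, Lit(1, DateT), Lit(2L, TimestampT)).dataType  == DateT
// SimpleIf(true, Lit(1L, TimestampT), Lit(2, DateT)).dataType  == TimestampT
{code}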






[jira] [Comment Edited] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.

2016-11-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653331#comment-15653331
 ] 

Wenchen Fan edited comment on SPARK-18048 at 11/10/16 7:45 AM:
---

according to the discussion in the PR, this ticket is invalid, I'm closing it.


was (Author: cloud_fan):
according to the discussion in the PR, this ticket is invalid, I'm clsoing it.

> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> 
>
> Key: SPARK-18048
> URL: https://issues.apache.org/jira/browse/SPARK-18048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Priyanka Garg
>
> The If expression behaves differently if the true and false expressions are 
> interchanged and have different data types.
> For example,
>   checkEvaluation(
>     If(Literal.create(true, BooleanType),
>       Literal.create(identity(1), DateType),
>       Literal.create(identity(2L), TimestampType)),
>     identity(1))
> throws an error, while
>   checkEvaluation(
>     If(Literal.create(true, BooleanType),
>       Literal.create(identity(1L), TimestampType),
>       Literal.create(identity(2), DateType)),
>     identity(1L))
> works fine.
> The reason is that the If expression's dataType only considers 
> trueValue.dataType.
> Also,
>   checkEvaluation(
>     If(Literal.create(true, BooleanType),
>       Literal.create(identity(1), DateType),
>       Literal.create(identity(2L), TimestampType)),
>     identity(1))
> breaks only in the case of generated mutable projection and unsafe 
> projection; for all other evaluation paths it works fine.
> Either both should work or neither should.






[jira] [Commented] (SPARK-18048) If expression behaves differently if true and false expression are interchanged in case of different data types.

2016-11-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653331#comment-15653331
 ] 

Wenchen Fan commented on SPARK-18048:
-

according to the discussion in the PR, this ticket is invalid, I'm closing it.

> If expression behaves differently if true and false expression are 
> interchanged in case of different data types.
> 
>
> Key: SPARK-18048
> URL: https://issues.apache.org/jira/browse/SPARK-18048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Priyanka Garg
>
> The If expression behaves differently if the true and false expressions are 
> interchanged and have different data types.
> For example,
>   checkEvaluation(
>     If(Literal.create(true, BooleanType),
>       Literal.create(identity(1), DateType),
>       Literal.create(identity(2L), TimestampType)),
>     identity(1))
> throws an error, while
>   checkEvaluation(
>     If(Literal.create(true, BooleanType),
>       Literal.create(identity(1L), TimestampType),
>       Literal.create(identity(2), DateType)),
>     identity(1L))
> works fine.
> The reason is that the If expression's dataType only considers 
> trueValue.dataType.
> Also,
>   checkEvaluation(
>     If(Literal.create(true, BooleanType),
>       Literal.create(identity(1), DateType),
>       Literal.create(identity(2L), TimestampType)),
>     identity(1))
> breaks only in the case of generated mutable projection and unsafe 
> projection; for all other evaluation paths it works fine.
> Either both should work or neither should.






[jira] [Commented] (SPARK-18050) spark 2.0.1 enable hive throw AlreadyExistsException(message:Database default already exists)

2016-11-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653322#comment-15653322
 ] 

Wenchen Fan commented on SPARK-18050:
-

can you put the stacktrace here too?

> spark 2.0.1 enable hive throw AlreadyExistsException(message:Database default 
> already exists)
> -
>
> Key: SPARK-18050
> URL: https://issues.apache.org/jira/browse/SPARK-18050
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: jdk1.8, macOs,spark 2.0.1
>Reporter: todd.chen
>
> In Spark 2.0.1, I enable Hive support, and when the sqlContext is initialized it 
> throws an AlreadyExistsException(message:Database default already exists), same as 
> https://www.mail-archive.com/dev@spark.apache.org/msg15306.html. My code is
> {code}
>   private val master = "local[*]"
>   private val appName = "xqlServerSpark"
>   val fileSystem = FileSystem.get()
>   val sparkConf = new SparkConf().setMaster(master)
>     .setAppName(appName)
>     .set("spark.sql.warehouse.dir",
>       s"${fileSystem.getUri.toASCIIString}/user/hive/warehouse")
>   val hiveContext =
>     SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate().sqlContext
>   print(sparkConf.get("spark.sql.warehouse.dir"))
>   hiveContext.sql("show tables").show()
> {code}
> The result is correct, but the exception is also thrown by the code.






[jira] [Commented] (SPARK-14450) Python OneVsRest should train multiple models at once

2016-11-09 Thread fanlu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653309#comment-15653309
 ] 

fanlu commented on SPARK-14450:
---

Why does the Scala version not need to use parallelization?

> Python OneVsRest should train multiple models at once
> -
>
> Key: SPARK-14450
> URL: https://issues.apache.org/jira/browse/SPARK-14450
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> [SPARK-7861] adds a Python wrapper for OneVsRest.  Because of possible issues 
> related to using existing libraries like {{multiprocessing}}, we are not 
> training multiple models in parallel initially.
> This issue is for prototyping, testing, and implementing a way to train 
> multiple models at once.  Speaking with [~joshrosen], a good option might be 
> the concurrent.futures package:
> * Python 3.x: 
> [https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures]
> * Python 2.x: [https://pypi.python.org/pypi/futures]
> We will *not* add this for Spark 2.0, but it will be good to investigate for 
> 2.1.
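As a rough illustration of what "train multiple models at once" means here, a generic 
Scala sketch using plain futures follows. It is not Spark's actual OneVsRest code in 
either language, and the {{trainAll}}/{{fitOne}} names are made up:

{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Fit one binary model per class label concurrently and wait for all of them.
def trainAll[M](classLabels: Seq[Int], fitOne: Int => M): Seq[M] = {
  val fits = classLabels.map(label => Future(fitOne(label)))  // start all fits
  Await.result(Future.sequence(fits), Duration.Inf)           // collect the trained models
}
{code}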






[jira] [Commented] (SPARK-18064) Spark SQL can't load default config file

2016-11-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653305#comment-15653305
 ] 

Wenchen Fan commented on SPARK-18064:
-

since this ticket has no description and the reporter has not responded for a 
while, I'm closing it.

> Spark SQL can't load default config file 
> -
>
> Key: SPARK-18064
> URL: https://issues.apache.org/jira/browse/SPARK-18064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: darion yaphet
>







[jira] [Closed] (SPARK-18064) Spark SQL can't load default config file

2016-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan closed SPARK-18064.
---
Resolution: Invalid

> Spark SQL can't load default config file 
> -
>
> Key: SPARK-18064
> URL: https://issues.apache.org/jira/browse/SPARK-18064
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: darion yaphet
>







[jira] [Commented] (SPARK-18168) Revert the change of SPARK-18167

2016-11-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653256#comment-15653256
 ] 

Wenchen Fan commented on SPARK-18168:
-

it's already reverted, right?

> Revert the change of SPARK-18167
> 
>
> Key: SPARK-18168
> URL: https://issues.apache.org/jira/browse/SPARK-18168
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Blocker
>
> We need to revert the change for 
> https://github.com/apache/spark/pull/15676/files before the release. That 
> jira is used to investigate a flaky test.






[jira] [Commented] (SPARK-18220) ClassCastException occurs when using select query on ORC file

2016-11-09 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653246#comment-15653246
 ] 

Wenchen Fan commented on SPARK-18220:
-

Is it an external ORC file, or one written by Spark SQL?

> ClassCastException occurs when using select query on ORC file
> -
>
> Key: SPARK-18220
> URL: https://issues.apache.org/jira/browse/SPARK-18220
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Jerryjung
>  Labels: orcfile, sql
>
> Error message is below.
> ==
> 16/11/02 16:38:09 INFO ReaderImpl: Reading ORC rows from 
> hdfs://xxx/part-00022 with {include: [true], offset: 0, length: 
> 9223372036854775807}
> 16/11/02 16:38:09 INFO Executor: Finished task 17.0 in stage 22.0 (TID 42). 
> 1220 bytes result sent to driver
> 16/11/02 16:38:09 INFO TaskSetManager: Finished task 17.0 in stage 22.0 (TID 
> 42) in 116 ms on localhost (executor driver) (19/20)
> 16/11/02 16:38:09 ERROR Executor: Exception in task 10.0 in stage 22.0 (TID 
> 35)
> java.lang.ClassCastException: 
> org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to 
> org.apache.hadoop.io.Text
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObjectInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:41)
>   at 
> org.apache.spark.sql.hive.HiveInspectors$$anonfun$unwrapperFor$23.apply(HiveInspectors.scala:526)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14$$anonfun$apply$15.apply(TableReader.scala:419)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:435)
>   at 
> org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$2.apply(TableReader.scala:426)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:804)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> ORC dump info.
> ==
> File Version: 0.12 with HIVE_8732
> 16/11/02 16:39:21 INFO orc.ReaderImpl: Reading ORC rows from 
> hdfs://XXX/part-0 with {include: null, offset: 0, length: 
> 9223372036854775807}
> 16/11/02 16:39:21 INFO orc.RecordReaderFactory: Schema is not specified on 
> read. Using file schema.
> Rows: 7
> Compression: ZLIB
> Compression size: 262144
> Type: 
> struct






[jira] [Commented] (SPARK-16628) OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files

2016-11-09 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653228#comment-15653228
 ] 

Dongjoon Hyun commented on SPARK-16628:
---

Hi, is there any progress on this issue?

> OrcConversions should not convert an ORC table represented by 
> MetastoreRelation to HadoopFsRelation if metastore schema does not match 
> schema stored in ORC files
> -
>
> Key: SPARK-16628
> URL: https://issues.apache.org/jira/browse/SPARK-16628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> When {{spark.sql.hive.convertMetastoreOrc}} is enabled, we will convert an ORC 
> table represented by a MetastoreRelation to a HadoopFsRelation that uses 
> Spark's OrcFileFormat internally. This conversion aims to make table scans 
> faster, since at runtime the code path that scans a HadoopFsRelation performs 
> better. However, OrcFileFormat's implementation assumes that ORC files store 
> their schema with correct column names, and before Hive 2.0 an ORC table 
> created by Hive does not store column names correctly in the ORC files 
> (HIVE-4243). So, for this kind of ORC dataset, we cannot really convert the 
> code path.
> Right now, if ORC tables were created by Hive 1.x or 0.x, enabling 
> {{spark.sql.hive.convertMetastoreOrc}} will introduce a runtime exception for 
> non-partitioned ORC tables and drop the metastore schema for partitioned ORC 
> tables.
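For context, the configuration flag involved can be toggled per session. A hedged 
sketch of the workaround side of this, assuming {{spark}} is an existing SparkSession 
(the table name below is made up):

{code}
// Disable the metastore ORC conversion so that tables whose ORC files lack
// correct column names are read through the Hive serde path instead of
// Spark's OrcFileFormat.
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.sql("SELECT * FROM legacy_hive_orc_table").show()
{code}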






[jira] [Closed] (SPARK-12998) Enable OrcRelation when connecting via spark thrift server

2016-11-09 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-12998.
-
Resolution: Duplicate

Hi, [~rajesh.balamohan].
I'll close this issue since the PR is closed and the issue seems to be resolved 
by another issue, SPARK-14070.

> Enable OrcRelation when connecting via spark thrift server
> --
>
> Key: SPARK-12998
> URL: https://issues.apache.org/jira/browse/SPARK-12998
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Rajesh Balamohan
>
> When a user connects via the spark-thrift server to execute SQL, PPD is not 
> enabled with ORC: the query ends up creating a MetastoreRelation, which does not 
> support ORC PPD. The purpose of this JIRA is to convert MetastoreRelation to 
> OrcRelation in HiveMetastoreCatalog, so that users can benefit from PPD even 
> when connecting to the spark-thrift server.
> {noformat}
> For example, "explain select count(1) from  tpch_flat_orc_1000.lineitem where 
> l_shipdate = '1990-04-18'", current plan is 
> +--+--+
> |   plan  
>  |
> +--+--+
> | == Physical Plan == 
>  |
> | TungstenAggregate(key=[], 
> functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#17L]) 
>  |
> | +- Exchange SinglePartition, None   
>  |
> |+- WholeStageCodegen 
>  |
> |   :  +- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#20L])  |
> |   : +- Project  
>  |
> |   :+- Filter (l_shipdate#11 = 1990-04-18)   
>  |
> |   :   +- INPUT  
>  |
> |   +- HiveTableScan [l_shipdate#11], MetastoreRelation tpch_1000, 
> lineitem, None |
> +--+--+
> It would be good to change it to OrcRelation to do PPD with ORC, which 
> reduces the runtime by large margin.
>  
> +---+--+
> | 
> plan  
> |
> +---+--+
> | == Physical Plan == 
>   
> |
> | TungstenAggregate(key=[], 
> functions=[(count(1),mode=Final,isDistinct=false)], output=[_c0#70L]) 
>   
> |
> | +- Exchange SinglePartition, None   
>   
> |
> |+- WholeStageCodegen 
>   
> |
> |   :  +- TungstenAggregate(key=[], 
> functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#106L])
>   |
> |   : +- Project  
>   
> |
> |   :+- Filter (_col10#64 = 1990-04-18)   
>   
> |
> |   :   +- INPUT  

[jira] [Closed] (SPARK-18271) hash udf in HiveSessionCatalog.hiveFunctions seq is redundant

2016-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan closed SPARK-18271.
---
Resolution: Won't Fix

> hash udf in HiveSessionCatalog.hiveFunctions seq is redundant
> -
>
> Key: SPARK-18271
> URL: https://issues.apache.org/jira/browse/SPARK-18271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Song Jun
>Priority: Minor
>
> When looking up a function in HiveSessionCatalog, we first look in Spark's built-in 
> functionRegistry; if it does not exist there, we then look up Hive's built-in 
> functions, which are listed in the hiveFunctions Seq.
> But the hash function is already in Spark's built-in functionRegistry, so the 
> lookup will never reach Hive's hiveFunctions list for it. The hash entry in 
> hiveFunctions is therefore redundant, and we can remove it.
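A minimal sketch of the lookup order described above (names and types are 
illustrative, not the actual HiveSessionCatalog code):

{code}
// Spark's built-in registry is consulted first, so an entry duplicated in the
// Hive fallback list (such as "hash") is never reached there.
def lookup(name: String,
           sparkBuiltins: Map[String, String],
           hiveFallback: Seq[String]): Option[String] =
  sparkBuiltins.get(name)                          // "hash" resolves here
    .orElse(hiveFallback.find(_ == name))          // only consulted on a miss
{code}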






[jira] [Closed] (SPARK-18344) TRUNCATE TABLE should fail if no partition is matched for the given non-partial partition spec

2016-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan closed SPARK-18344.
---
Resolution: Duplicate

> TRUNCATE TABLE should fail if no partition is matched for the given 
> non-partial partition spec
> --
>
> Key: SPARK-18344
> URL: https://issues.apache.org/jira/browse/SPARK-18344
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Updated] (SPARK-18075) UDF doesn't work on non-local spark

2016-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18075:

Component/s: SQL

> UDF doesn't work on non-local spark
> ---
>
> Key: SPARK-18075
> URL: https://issues.apache.org/jira/browse/SPARK-18075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Nick Orka
>
> I have this issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz).
> According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 
> I've given all Spark dependencies PROVIDED scope. I use exactly the same 
> versions of Spark in the app as on the Spark server. 
> Here is my pom:
> {code:title=pom.xml}
> <properties>
>   <maven.compiler.source>1.6</maven.compiler.source>
>   <maven.compiler.target>1.6</maven.compiler.target>
>   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
>   <scala.version>2.11.8</scala.version>
>   <spark.version>2.0.0</spark.version>
>   <hadoop.version>2.7.0</hadoop.version>
> </properties>
>
> <dependencies>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.11</artifactId>
>     <version>${spark.version}</version>
>     <scope>provided</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.11</artifactId>
>     <version>${spark.version}</version>
>     <scope>provided</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-hive_2.11</artifactId>
>     <version>${spark.version}</version>
>     <scope>provided</scope>
>   </dependency>
> </dependencies>
> {code}
> As you can see all spark dependencies have provided scope
> And this is a code for reproduction:
> {code:title=udfTest.scala}
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.apache.spark.sql.{Row, SparkSession}
> /**
>   * Created by nborunov on 10/19/16.
>   */
> object udfTest {
>   class Seq extends Serializable {
> var i = 0
> def getVal: Int = {
>   i = i + 1
>   i
> }
>   }
>   def main(args: Array[String]) {
> val spark = SparkSession
>   .builder()
> .master("spark://nborunov-mbp.local:7077")
> //  .master("local")
>   .getOrCreate()
> val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two")))
> val schema = StructType(Array(StructField("name", StringType)))
> val df = spark.createDataFrame(rdd, schema)
> df.show()
> spark.udf.register("func", (name: String) => name.toUpperCase)
> import org.apache.spark.sql.functions.expr
> val newDf = df.withColumn("upperName", expr("func(name)"))
> newDf.show()
> val seq = new Seq
> spark.udf.register("seq", () => seq.getVal)
> val seqDf = df.withColumn("id", expr("seq()"))
> seqDf.show()
> df.createOrReplaceTempView("df")
> spark.sql("select *, seq() as sql_id from df").show()
>   }
> }
> {code}
> When .master("local") - everything works fine. When 
> .master("spark://...:7077"), it fails on line:
> {code}
> newDf.show()
> {code}
> The error is exactly the same:
> {code}
> scala> udfTest.main(Array())
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0
> 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov
> 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov
> 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: 
> 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: 
> 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(nborunov); 
> groups with view permissions: Set(); users  with modify permissions: 
> Set(nborunov); groups with modify permissions: Set()
> 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on 
> port 57828.
> 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker
> 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster
> 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at 
> /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264
> 16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 
> MB
> 16/10/19 19:37:53 INFO SparkEnv: Registering OutputCommitCoordinator
> 16/10/19 19:37:54 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 16/10/19 19:37:54 INFO SparkUI: Bound SparkUI to 

[jira] [Updated] (SPARK-18075) UDF doesn't work on non-local spark

2016-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18075:

Labels:   (was: sql)

> UDF doesn't work on non-local spark
> ---
>
> Key: SPARK-18075
> URL: https://issues.apache.org/jira/browse/SPARK-18075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Nick Orka
>
> I have this issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz).
> According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 
> I've given all Spark dependencies PROVIDED scope. I use exactly the same 
> versions of Spark in the app as on the Spark server. 
> Here is my pom:
> {code:title=pom.xml}
> <properties>
>   <maven.compiler.source>1.6</maven.compiler.source>
>   <maven.compiler.target>1.6</maven.compiler.target>
>   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
>   <scala.version>2.11.8</scala.version>
>   <spark.version>2.0.0</spark.version>
>   <hadoop.version>2.7.0</hadoop.version>
> </properties>
>
> <dependencies>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.11</artifactId>
>     <version>${spark.version}</version>
>     <scope>provided</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.11</artifactId>
>     <version>${spark.version}</version>
>     <scope>provided</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-hive_2.11</artifactId>
>     <version>${spark.version}</version>
>     <scope>provided</scope>
>   </dependency>
> </dependencies>
> {code}
> As you can see all spark dependencies have provided scope
> And this is a code for reproduction:
> {code:title=udfTest.scala}
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.apache.spark.sql.{Row, SparkSession}
> /**
>   * Created by nborunov on 10/19/16.
>   */
> object udfTest {
>   class Seq extends Serializable {
> var i = 0
> def getVal: Int = {
>   i = i + 1
>   i
> }
>   }
>   def main(args: Array[String]) {
> val spark = SparkSession
>   .builder()
> .master("spark://nborunov-mbp.local:7077")
> //  .master("local")
>   .getOrCreate()
> val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two")))
> val schema = StructType(Array(StructField("name", StringType)))
> val df = spark.createDataFrame(rdd, schema)
> df.show()
> spark.udf.register("func", (name: String) => name.toUpperCase)
> import org.apache.spark.sql.functions.expr
> val newDf = df.withColumn("upperName", expr("func(name)"))
> newDf.show()
> val seq = new Seq
> spark.udf.register("seq", () => seq.getVal)
> val seqDf = df.withColumn("id", expr("seq()"))
> seqDf.show()
> df.createOrReplaceTempView("df")
> spark.sql("select *, seq() as sql_id from df").show()
>   }
> }
> {code}
> When .master("local") - everything works fine. When 
> .master("spark://...:7077"), it fails on line:
> {code}
> newDf.show()
> {code}
> The error is exactly the same:
> {code}
> scala> udfTest.main(Array())
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0
> 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov
> 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov
> 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: 
> 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: 
> 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(nborunov); 
> groups with view permissions: Set(); users  with modify permissions: 
> Set(nborunov); groups with modify permissions: Set()
> 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on 
> port 57828.
> 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker
> 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster
> 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at 
> /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264
> 16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 
> MB
> 16/10/19 19:37:53 INFO SparkEnv: Registering OutputCommitCoordinator
> 16/10/19 19:37:54 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 16/10/19 19:37:54 INFO SparkUI: Bound SparkUI to 

[jira] [Updated] (SPARK-18172) AnalysisException in first/last during aggregation

2016-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18172:

Component/s: SQL

> AnalysisException in first/last during aggregation
> --
>
> Key: SPARK-18172
> URL: https://issues.apache.org/jira/browse/SPARK-18172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Emlyn Corrin
>
> Since Spark 2.0.1, the following pyspark snippet fails with 
> {{AnalysisException: The second argument of First should be a boolean 
> literal}} (it's not restricted to Python; similar code in Java fails 
> in the same way).
> It worked in Spark 2.0.0, so I believe it may be related to the fix for 
> SPARK-16648.
> {code}
> from pyspark.sql import functions as F
> ds = spark.createDataFrame(sc.parallelize([[1, 1, 2], [1, 2, 3], [1, 3, 4]]))
> ds.groupBy(ds._1).agg(F.first(ds._2), F.countDistinct(ds._2), 
> F.countDistinct(ds._2, ds._3)).show()
> {code}
> It works if any of the three arguments to {{.agg}} is removed.
> The stack trace is:
> {code}
> Py4JJavaError Traceback (most recent call last)
>  in ()
> > 1 
> ds.groupBy(ds._1).agg(F.first(ds._2),F.countDistinct(ds._2),F.countDistinct(ds._2,
>  ds._3)).show()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/dataframe.py 
> in show(self, n, truncate)
> 285 +---+-+
> 286 """
> --> 287 print(self._jdf.showString(n, truncate))
> 288
> 289 def __repr__(self):
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>1131 answer = self.gateway_client.send_command(command)
>1132 return_value = get_return_value(
> -> 1133 answer, self.gateway_client, self.target_id, self.name)
>1134
>1135 for temp_arg in temp_args:
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py in 
> deco(*a, **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> /usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
> 317 raise Py4JJavaError(
> 318 "An error occurred while calling {0}{1}{2}.\n".
> --> 319 format(target_id, ".", name), value)
> 320 else:
> 321 raise Py4JError(
> Py4JJavaError: An error occurred while calling o76.showString.
> : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, 
> tree: first(_2#1L)()
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:387)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:256)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.org$apache$spark$sql$catalyst$optimizer$RewriteDistinctAggregates$$patchAggregateFunctionChildren$1(RewriteDistinctAggregates.scala:140)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:182)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$16.apply(RewriteDistinctAggregates.scala:180)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$.rewrite(RewriteDistinctAggregates.scala:180)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:105)
>   at 
> org.apache.spark.sql.catalyst.optimizer.RewriteDistinctAggregates$$anonfun$apply$1.applyOrElse(RewriteDistinctAggregates.scala:104)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> 

[jira] [Updated] (SPARK-18075) UDF doesn't work on non-local spark

2016-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18075:

Labels: sql  (was: )

> UDF doesn't work on non-local spark
> ---
>
> Key: SPARK-18075
> URL: https://issues.apache.org/jira/browse/SPARK-18075
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Nick Orka
>  Labels: sql
>
> I have this issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz).
> According to this ticket https://issues.apache.org/jira/browse/SPARK-9219 
> I've given all Spark dependencies PROVIDED scope. I use exactly the same 
> versions of Spark in the app as on the Spark server. 
> Here is my pom:
> {code:title=pom.xml}
> <properties>
>   <maven.compiler.source>1.6</maven.compiler.source>
>   <maven.compiler.target>1.6</maven.compiler.target>
>   <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
>   <scala.version>2.11.8</scala.version>
>   <spark.version>2.0.0</spark.version>
>   <hadoop.version>2.7.0</hadoop.version>
> </properties>
>
> <dependencies>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.11</artifactId>
>     <version>${spark.version}</version>
>     <scope>provided</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.11</artifactId>
>     <version>${spark.version}</version>
>     <scope>provided</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-hive_2.11</artifactId>
>     <version>${spark.version}</version>
>     <scope>provided</scope>
>   </dependency>
> </dependencies>
> {code}
> As you can see all spark dependencies have provided scope
> And this is a code for reproduction:
> {code:title=udfTest.scala}
> import org.apache.spark.sql.types.{StringType, StructField, StructType}
> import org.apache.spark.sql.{Row, SparkSession}
> /**
>   * Created by nborunov on 10/19/16.
>   */
> object udfTest {
>   class Seq extends Serializable {
> var i = 0
> def getVal: Int = {
>   i = i + 1
>   i
> }
>   }
>   def main(args: Array[String]) {
> val spark = SparkSession
>   .builder()
> .master("spark://nborunov-mbp.local:7077")
> //  .master("local")
>   .getOrCreate()
> val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two")))
> val schema = StructType(Array(StructField("name", StringType)))
> val df = spark.createDataFrame(rdd, schema)
> df.show()
> spark.udf.register("func", (name: String) => name.toUpperCase)
> import org.apache.spark.sql.functions.expr
> val newDf = df.withColumn("upperName", expr("func(name)"))
> newDf.show()
> val seq = new Seq
> spark.udf.register("seq", () => seq.getVal)
> val seqDf = df.withColumn("id", expr("seq()"))
> seqDf.show()
> df.createOrReplaceTempView("df")
> spark.sql("select *, seq() as sql_id from df").show()
>   }
> }
> {code}
> When .master("local") - everything works fine. When 
> .master("spark://...:7077"), it fails on line:
> {code}
> newDf.show()
> {code}
> The error is exactly the same:
> {code}
> scala> udfTest.main(Array())
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0
> 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov
> 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov
> 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: 
> 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: 
> 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: Set(nborunov); 
> groups with view permissions: Set(); users  with modify permissions: 
> Set(nborunov); groups with modify permissions: Set()
> 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on 
> port 57828.
> 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker
> 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster
> 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at 
> /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264
> 16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 
> MB
> 16/10/19 19:37:53 INFO SparkEnv: Registering OutputCommitCoordinator
> 16/10/19 19:37:54 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 16/10/19 19:37:54 INFO SparkUI: Bound SparkUI to 

[jira] [Resolved] (SPARK-18147) Broken Spark SQL Codegen

2016-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18147.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15807
[https://github.com/apache/spark/pull/15807]

> Broken Spark SQL Codegen
> 
>
> Key: SPARK-18147
> URL: https://issues.apache.org/jira/browse/SPARK-18147
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: koert kuipers
>Priority: Critical
> Fix For: 2.1.0
>
>
> This is me deliberately trying to break Spark SQL codegen to uncover potential 
> issues, by creating arbitrarily complex data structures using primitives, 
> strings, basic collections (map, seq, option), tuples, and case classes.
> First example: nested case classes.
> Code:
> {noformat}
> class ComplexResultAgg[B: TypeTag, C: TypeTag](val zero: B, result: C) 
> extends Aggregator[Row, B, C] {
>   override def reduce(b: B, input: Row): B = b
>   override def merge(b1: B, b2: B): B = b1
>   override def finish(reduction: B): C = result
>   override def bufferEncoder: Encoder[B] = ExpressionEncoder[B]()
>   override def outputEncoder: Encoder[C] = ExpressionEncoder[C]()
> }
> case class Struct2(d: Double = 0.0, s1: Seq[Double] = Seq.empty, s2: 
> Seq[Long] = Seq.empty)
> case class Struct3(a: Struct2 = Struct2(), b: Struct2 = Struct2())
> val df1 = Seq(("a", "aa"), ("a", "aa"), ("b", "b"), ("b", null)).toDF("x", 
> "y").groupBy("x").agg(
>   new ComplexResultAgg("boo", Struct3()).toColumn
> )
> df1.printSchema
> df1.show
> {noformat}
> the result is:
> {noformat}
> [info]   Cause: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 33, Column 12: Expression "isNull1" is not an rvalue
> [info] /* 001 */ public java.lang.Object generate(Object[] references) {
> [info] /* 002 */   return new SpecificMutableProjection(references);
> [info] /* 003 */ }
> [info] /* 004 */
> [info] /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> [info] /* 006 */
> [info] /* 007 */   private Object[] references;
> [info] /* 008 */   private MutableRow mutableRow;
> [info] /* 009 */   private Object[] values;
> [info] /* 010 */   private java.lang.String errMsg;
> [info] /* 011 */   private Object[] values1;
> [info] /* 012 */   private java.lang.String errMsg1;
> [info] /* 013 */   private boolean[] argIsNulls;
> [info] /* 014 */   private scala.collection.Seq argValue;
> [info] /* 015 */   private java.lang.String errMsg2;
> [info] /* 016 */   private boolean[] argIsNulls1;
> [info] /* 017 */   private scala.collection.Seq argValue1;
> [info] /* 018 */   private java.lang.String errMsg3;
> [info] /* 019 */   private java.lang.String errMsg4;
> [info] /* 020 */   private Object[] values2;
> [info] /* 021 */   private java.lang.String errMsg5;
> [info] /* 022 */   private boolean[] argIsNulls2;
> [info] /* 023 */   private scala.collection.Seq argValue2;
> [info] /* 024 */   private java.lang.String errMsg6;
> [info] /* 025 */   private boolean[] argIsNulls3;
> [info] /* 026 */   private scala.collection.Seq argValue3;
> [info] /* 027 */   private java.lang.String errMsg7;
> [info] /* 028 */   private boolean isNull_0;
> [info] /* 029 */   private InternalRow value_0;
> [info] /* 030 */
> [info] /* 031 */   private void apply_1(InternalRow i) {
> [info] /* 032 */
> [info] /* 033 */ if (isNull1) {
> [info] /* 034 */   throw new RuntimeException(errMsg3);
> [info] /* 035 */ }
> [info] /* 036 */
> [info] /* 037 */ boolean isNull24 = false;
> [info] /* 038 */ final com.tresata.spark.sql.Struct2 value24 = isNull24 ? 
> null : (com.tresata.spark.sql.Struct2) value1.a();
> [info] /* 039 */ isNull24 = value24 == null;
> [info] /* 040 */
> [info] /* 041 */ boolean isNull23 = isNull24;
> [info] /* 042 */ final scala.collection.Seq value23 = isNull23 ? null : 
> (scala.collection.Seq) value24.s2();
> [info] /* 043 */ isNull23 = value23 == null;
> [info] /* 044 */ argIsNulls1[0] = isNull23;
> [info] /* 045 */ argValue1 = value23;
> [info] /* 046 */
> [info] /* 047 */
> [info] /* 048 */
> [info] /* 049 */ boolean isNull22 = false;
> [info] /* 050 */ for (int idx = 0; idx < 1; idx++) {
> [info] /* 051 */   if (argIsNulls1[idx]) { isNull22 = true; break; }
> [info] /* 052 */ }
> [info] /* 053 */
> [info] /* 054 */ final ArrayData value22 = isNull22 ? null : new 
> org.apache.spark.sql.catalyst.util.GenericArrayData(argValue1);
> [info] /* 055 */ if (isNull22) {
> [info] /* 056 */   values1[2] = null;
> [info] /* 057 */ } else {
> [info] /* 058 */   values1[2] = value22;
> [info] /* 

[jira] [Updated] (SPARK-18147) Broken Spark SQL Codegen

2016-11-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-18147:

Assignee: Liang-Chi Hsieh

> Broken Spark SQL Codegen
> 
>
> Key: SPARK-18147
> URL: https://issues.apache.org/jira/browse/SPARK-18147
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: koert kuipers
>Assignee: Liang-Chi Hsieh
>Priority: Critical
> Fix For: 2.1.0
>
>
> This is me deliberately trying to break Spark SQL codegen to uncover potential 
> issues, by creating arbitrarily complex data structures using primitives, 
> strings, basic collections (map, seq, option), tuples, and case classes.
> First example: nested case classes.
> Code:
> {noformat}
> class ComplexResultAgg[B: TypeTag, C: TypeTag](val zero: B, result: C) 
> extends Aggregator[Row, B, C] {
>   override def reduce(b: B, input: Row): B = b
>   override def merge(b1: B, b2: B): B = b1
>   override def finish(reduction: B): C = result
>   override def bufferEncoder: Encoder[B] = ExpressionEncoder[B]()
>   override def outputEncoder: Encoder[C] = ExpressionEncoder[C]()
> }
> case class Struct2(d: Double = 0.0, s1: Seq[Double] = Seq.empty, s2: 
> Seq[Long] = Seq.empty)
> case class Struct3(a: Struct2 = Struct2(), b: Struct2 = Struct2())
> val df1 = Seq(("a", "aa"), ("a", "aa"), ("b", "b"), ("b", null)).toDF("x", 
> "y").groupBy("x").agg(
>   new ComplexResultAgg("boo", Struct3()).toColumn
> )
> df1.printSchema
> df1.show
> {noformat}
> the result is:
> {noformat}
> [info]   Cause: java.util.concurrent.ExecutionException: java.lang.Exception: 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 33, Column 12: Expression "isNull1" is not an rvalue
> [info] /* 001 */ public java.lang.Object generate(Object[] references) {
> [info] /* 002 */   return new SpecificMutableProjection(references);
> [info] /* 003 */ }
> [info] /* 004 */
> [info] /* 005 */ class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
> [info] /* 006 */
> [info] /* 007 */   private Object[] references;
> [info] /* 008 */   private MutableRow mutableRow;
> [info] /* 009 */   private Object[] values;
> [info] /* 010 */   private java.lang.String errMsg;
> [info] /* 011 */   private Object[] values1;
> [info] /* 012 */   private java.lang.String errMsg1;
> [info] /* 013 */   private boolean[] argIsNulls;
> [info] /* 014 */   private scala.collection.Seq argValue;
> [info] /* 015 */   private java.lang.String errMsg2;
> [info] /* 016 */   private boolean[] argIsNulls1;
> [info] /* 017 */   private scala.collection.Seq argValue1;
> [info] /* 018 */   private java.lang.String errMsg3;
> [info] /* 019 */   private java.lang.String errMsg4;
> [info] /* 020 */   private Object[] values2;
> [info] /* 021 */   private java.lang.String errMsg5;
> [info] /* 022 */   private boolean[] argIsNulls2;
> [info] /* 023 */   private scala.collection.Seq argValue2;
> [info] /* 024 */   private java.lang.String errMsg6;
> [info] /* 025 */   private boolean[] argIsNulls3;
> [info] /* 026 */   private scala.collection.Seq argValue3;
> [info] /* 027 */   private java.lang.String errMsg7;
> [info] /* 028 */   private boolean isNull_0;
> [info] /* 029 */   private InternalRow value_0;
> [info] /* 030 */
> [info] /* 031 */   private void apply_1(InternalRow i) {
> [info] /* 032 */
> [info] /* 033 */ if (isNull1) {
> [info] /* 034 */   throw new RuntimeException(errMsg3);
> [info] /* 035 */ }
> [info] /* 036 */
> [info] /* 037 */ boolean isNull24 = false;
> [info] /* 038 */ final com.tresata.spark.sql.Struct2 value24 = isNull24 ? 
> null : (com.tresata.spark.sql.Struct2) value1.a();
> [info] /* 039 */ isNull24 = value24 == null;
> [info] /* 040 */
> [info] /* 041 */ boolean isNull23 = isNull24;
> [info] /* 042 */ final scala.collection.Seq value23 = isNull23 ? null : 
> (scala.collection.Seq) value24.s2();
> [info] /* 043 */ isNull23 = value23 == null;
> [info] /* 044 */ argIsNulls1[0] = isNull23;
> [info] /* 045 */ argValue1 = value23;
> [info] /* 046 */
> [info] /* 047 */
> [info] /* 048 */
> [info] /* 049 */ boolean isNull22 = false;
> [info] /* 050 */ for (int idx = 0; idx < 1; idx++) {
> [info] /* 051 */   if (argIsNulls1[idx]) { isNull22 = true; break; }
> [info] /* 052 */ }
> [info] /* 053 */
> [info] /* 054 */ final ArrayData value22 = isNull22 ? null : new 
> org.apache.spark.sql.catalyst.util.GenericArrayData(argValue1);
> [info] /* 055 */ if (isNull22) {
> [info] /* 056 */   values1[2] = null;
> [info] /* 057 */ } else {
> [info] /* 058 */   values1[2] = value22;
> [info] /* 059 */ }
> [info] /* 060 */   }
> [info] /* 061 */
> [info] /* 

[jira] [Comment Edited] (SPARK-18389) Disallow cyclic view reference

2016-11-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653009#comment-15653009
 ] 

Xiao Li edited comment on SPARK-18389 at 11/10/16 4:36 AM:
---

The above example should be fixed now. Let me submit a PR for it.


was (Author: smilegator):
This should be fixed now. Let me submit a PR for it.

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18389) Disallow cyclic view reference

2016-11-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15653009#comment-15653009
 ] 

Xiao Li commented on SPARK-18389:
-

This should be fixed now. Let me submit a PR for it.

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18389) Disallow cyclic view reference

2016-11-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652978#comment-15652978
 ] 

Xiao Li commented on SPARK-18389:
-

If we do not allow cyclic view references, we need to detect them in {{CREATE 
VIEW}} too. For example:

{code}
CREATE VIEW w AS WITH w AS (SELECT 1 AS n) SELECT n FROM w
{code}

The above is allowed in the current master branch. However, it is not allowed in 
DB2, where the CTE name has to be different from the view name. 

BTW, concurrency control is missing in our catalog support. 
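
A minimal sketch of the kind of check this would need (a hypothetical helper, not 
Spark's analyzer): walk the view dependency graph and reject a definition that can 
reach itself.

{code}
// Hypothetical cycle check over a map from view name to the views it references.
def hasCycle(deps: Map[String, Set[String]], start: String): Boolean = {
  def reaches(v: String, seen: Set[String]): Boolean =
    deps.getOrElse(v, Set.empty).exists { d =>
      d == start || (!seen(d) && reaches(d, seen + d))
    }
  reaches(start, Set(start))
}

// With the ALTER VIEW from the issue description applied, the graph is cyclic:
val deps = Map("testView" -> Set("testView2"), "testView2" -> Set("testView"))
hasCycle(deps, "testView")  // true, so the ALTER VIEW should be rejected
{code}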

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18367) limit() makes the lame walk again

2016-11-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652968#comment-15652968
 ] 

Nicholas Chammas commented on SPARK-18367:
--

Even if I cut the number of records I'm processing with this new code base to 
literally 10-20 records, I'm still seeing Spark use north of 6K files. This 
smells like a resource leak or a bug.

I will try to boil this down to a minimal repro that I can share.
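
For a rough sense of why the file count tracks partitions rather than rows, here is a 
back-of-the-envelope sketch. The stage count is a made-up assumption; the only facts 
relied on are that sort-based shuffle writes one data file plus one index file per map 
task, and that spark.sql.shuffle.partitions defaults to 200.

{code}
val shuffleStages   = 6    // assumed: a UDF plus two self-joins can plan several exchanges
val tasksPerStage   = 200  // default spark.sql.shuffle.partitions
val filesPerMapTask = 2    // one .data file and one .index file per map task
println(s"shuffle files written to disk: ${shuffleStages * tasksPerStage * filesPerMapTask}")
// => 2400, whether the input is 20 rows or 20 million.
{code}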

> limit() makes the lame walk again
> -
>
> Key: SPARK-18367
> URL: https://issues.apache.org/jira/browse/SPARK-18367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
> Attachments: plan-with-limit.txt, plan-without-limit.txt
>
>
> I have a complex DataFrame query that fails to run normally but succeeds if I 
> add a dummy {{limit()}} upstream in the query tree.
> The failure presents itself like this:
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial 
> writes to file 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException: 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
>  (Too many open files in system)
> {code}
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
> macOS. However, I don't think that's the issue, since if I add a dummy 
> {{limit()}} early on the query tree -- dummy as in it does not actually 
> reduce the number of rows queried -- then the same query works.
> I've diffed the physical query plans to see what this {{limit()}} is actually 
> doing, and the diff is as follows:
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> <:  : : +- *GlobalLimit 100
> <:  : :+- Exchange SinglePartition
> <:  : :   +- *LocalLimit 100
> <:  : :  +- *Project [...]
> <:  : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >:  : : +- *Scan orc [...] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> 49,53c45
> <   : : +- *GlobalLimit 100
> <   : :+- Exchange SinglePartition
> <   : :   +- *LocalLimit 100
> <   : :  +- *Project [...]
> <   : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >   : : +- *Scan orc [] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> {code}
> Does this give any clues as to why this {{limit()}} is helping? Again, the 
> 100 limit you can see in the plan is much higher than the cardinality of 
> the dataset I'm reading, so there is no theoretical impact on the output. You 
> can see the full query plans attached to this ticket.
> Unfortunately, I don't have a minimal reproduction for this issue, but I can 
> work towards one with some clues.
> I'm seeing this behavior on 2.0.1 and on master at commit 
> {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18367) limit() makes the lame walk again

2016-11-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652406#comment-15652406
 ] 

Nicholas Chammas edited comment on SPARK-18367 at 11/10/16 3:24 AM:


I've spent the day trying to narrow down what is happening, but I haven't made 
much progress. All I've really found is that adding a {{coalesce()}} at the 
"right" place in the query tree (dunno if that's the right terminology) can 
reduce the number of open files enough that things succeed, like how the 
{{limit()}} is helping things. Maybe Spark just needs that many files and there 
is no issue here? I dunno.

Is there a rough rule of thumb I can use to determine how many files Spark 
should be opening? Just a rough way to determine the order of magnitude of open 
files based on something I'm doing. In my case, I have a DataFrame with no more 
than a few partitions that I'm applying a UDF to and then joining to itself 
twice. The resulting DataFrame has no more than a dozen or so partitions.

Is it conceivable that this would somehow make Spark spawn more than 10K files, 
which is the maximum number of files macOS will allow open per process? This is 
the limit I'm hitting which is causing my job to fail. Even if I coalesce the 
DataFrame to 1 partition I still see Spark spawn up to 7K files. Is this normal?


was (Author: nchammas):
I've spent the day trying to narrow down what is happening, but I haven't made 
much progress. All I've really found is that adding a {{coalesce()}} at the 
"right" place in the query tree (dunno if that's the right terminology) can 
reduce the number of open files enough that things succeed, like how the 
{{limit()}} is helping things. Maybe Spark just needs that many files and there 
is no issue here? I dunno.

Is there a rough rule of thumb I can use to determine how many files Spark 
should be opening? Just a rough way to determine the order of magnitude of open 
files based on something I'm doing. In my case, I have a DataFrame with no more 
than a few partitions that I'm applying a UDF to and then joining to itself 
twice. The resulting DataFrame has no more than a dozen or so partitions.

Is it conceivable that this would somehow make Spark spawn more than 10K files, 
which is the maximum number of files macOS will allow open per process? Even if 
I coalesce the DataFrame to 1 partition I still see Spark spawn up to 7K files. 
Is this normal?

> limit() makes the lame walk again
> -
>
> Key: SPARK-18367
> URL: https://issues.apache.org/jira/browse/SPARK-18367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
> Attachments: plan-with-limit.txt, plan-without-limit.txt
>
>
> I have a complex DataFrame query that fails to run normally but succeeds if I 
> add a dummy {{limit()}} upstream in the query tree.
> The failure presents itself like this:
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial 
> writes to file 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException: 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
>  (Too many open files in system)
> {code}
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
> macOS. However, I don't think that's the issue, since if I add a dummy 
> {{limit()}} early on the query tree -- dummy as in it does not actually 
> reduce the number of rows queried -- then the same query works.
> I've diffed the physical query plans to see what this {{limit()}} is actually 
> doing, and the diff is as follows:
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> <:  : : +- *GlobalLimit 100
> <:  : :+- Exchange SinglePartition
> <:  : :   +- *LocalLimit 100
> <:  : :  +- *Project [...]
> <:  : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >:  : : +- *Scan orc [...] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> 49,53c45
> <   : : +- *GlobalLimit 100
> <   : :+- Exchange SinglePartition
> <   : :   

[jira] [Commented] (SPARK-18367) limit() makes the lame walk again

2016-11-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652881#comment-15652881
 ] 

Nicholas Chammas commented on SPARK-18367:
--

To provide some context, this code base I'm struggling with, which is 
triggering the above behavior, is a rewrite of a legacy application. The legacy 
application mostly uses RDDs and runs on 1.6. The new and shiny is 
DataFrame-only and runs on 2.0.

I ran the legacy app on the same dataset and saw Spark use less than 2K files. 
So either due to the different APIs, or due to the different Spark version, or 
due to some combination thereof, there is a large difference in the number of 
files Spark uses between the legacy app and the rewrite that I'm currently 
struggling with.

[~hvanhovell] - Can you give me a general sense of whether the open file count 
is more likely to just be my fault (from poor API usage, for example) or if 
it's a sign that perhaps Spark is doing something wrong? Is there anything you 
think I should look into?

> limit() makes the lame walk again
> -
>
> Key: SPARK-18367
> URL: https://issues.apache.org/jira/browse/SPARK-18367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
> Attachments: plan-with-limit.txt, plan-without-limit.txt
>
>
> I have a complex DataFrame query that fails to run normally but succeeds if I 
> add a dummy {{limit()}} upstream in the query tree.
> The failure presents itself like this:
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial 
> writes to file 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException: 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
>  (Too many open files in system)
> {code}
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
> macOS. However, I don't think that's the issue, since if I add a dummy 
> {{limit()}} early on the query tree -- dummy as in it does not actually 
> reduce the number of rows queried -- then the same query works.
> I've diffed the physical query plans to see what this {{limit()}} is actually 
> doing, and the diff is as follows:
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> <:  : : +- *GlobalLimit 100
> <:  : :+- Exchange SinglePartition
> <:  : :   +- *LocalLimit 100
> <:  : :  +- *Project [...]
> <:  : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >:  : : +- *Scan orc [...] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> 49,53c45
> <   : : +- *GlobalLimit 100
> <   : :+- Exchange SinglePartition
> <   : :   +- *LocalLimit 100
> <   : :  +- *Project [...]
> <   : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >   : : +- *Scan orc [] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> {code}
> Does this give any clues as to why this {{limit()}} is helping? Again, the 
> 100 limit you can see in the plan is much higher than the cardinality of 
> the dataset I'm reading, so there is no theoretical impact on the output. You 
> can see the full query plans attached to this ticket.
> Unfortunately, I don't have a minimal reproduction for this issue, but I can 
> work towards one with some clues.
> I'm seeing this behavior on 2.0.1 and on master at commit 
> {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18318) ML, Graph 2.1 QA: API: New Scala APIs, docs

2016-11-09 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-18318:
---

Assignee: Yanbo Liang

> ML, Graph 2.1 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-18318
> URL: https://issues.apache.org/jira/browse/SPARK-18318
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17691) Add aggregate function to collect list with maximum number of elements

2016-11-09 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652723#comment-15652723
 ] 

Michael Armbrust commented on SPARK-17691:
--

I think you should be able to use mutable buffers with the aggregator 
interface (and if there is bad performance there, we should fix it). Depending 
on what you are trying to do, I'd imagine groupByKey and mapGroups would also 
be fast. You could also collect the top N per group using window functions.

Basically, this function sounds pretty specific (correct me if I'm wrong and 
this is a common thing that other systems support). So I think it makes more 
sense to find fast/general mechanisms that let you build something specific 
like this, rather than adding yet another aggregate function.
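
A sketch of the window-function route mentioned above (illustrative only; {{df}}, 
"key", and "value" are placeholder names for the caller's DataFrame and columns):

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy(col("key")).orderBy(col("value").desc)
val topN = df
  .withColumn("rank", row_number().over(w))
  .where(col("rank") <= 3)                          // keep at most 3 rows per key
  .groupBy(col("key"))
  .agg(collect_list(col("value")).as("top_values"))
{code}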

> Add aggregate function to collect list with maximum number of elements
> --
>
> Key: SPARK-17691
> URL: https://issues.apache.org/jira/browse/SPARK-17691
> Project: Spark
>  Issue Type: New Feature
>Reporter: Assaf Mendelson
>Priority: Minor
>
> One of the aggregate functions we have today is the collect_list function. 
> This is a useful tool to do a "catch all" aggregation which doesn't really 
> fit anywhere else.
> The problem with collect_list is that it is unbounded. I would like to see a 
> means to do a collect_list where we limit the maximum number of elements.
> I would see the inputs for this being the maximum number of elements to keep 
> and the method of choosing them (pick arbitrarily, pick the top N, pick the 
> bottom N).
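
A sketch of the bounded-collect idea as a typed Aggregator, which is one of the 
general mechanisms suggested in the comment above. The class name, the Long element 
type, and the "keep the first maxSize values" policy are all assumptions for 
illustration, not a proposed API.

{code}
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.expressions.Aggregator

class BoundedCollect(maxSize: Int) extends Aggregator[Long, Seq[Long], Seq[Long]] {
  def zero: Seq[Long] = Seq.empty
  def reduce(buf: Seq[Long], x: Long): Seq[Long] =
    if (buf.size < maxSize) buf :+ x else buf       // stop growing past the bound
  def merge(b1: Seq[Long], b2: Seq[Long]): Seq[Long] = (b1 ++ b2).take(maxSize)
  def finish(buf: Seq[Long]): Seq[Long] = buf
  def bufferEncoder: Encoder[Seq[Long]] = ExpressionEncoder[Seq[Long]]()
  def outputEncoder: Encoder[Seq[Long]] = ExpressionEncoder[Seq[Long]]()
}
{code}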



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18391) Openstack deployment scenarios

2016-11-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-18391.
---
Resolution: Not A Problem

> Openstack deployment scenarios
> --
>
> Key: SPARK-18391
> URL: https://issues.apache.org/jira/browse/SPARK-18391
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 
> 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
> Environment: Openstack
>Reporter: Oleg Borisenko
> Fix For: 2.0.1, 2.0.0, 1.6.2, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0, 
> 1.4.1, 1.4.0, 1.3.1, 1.3.0, 1.2.2, 1.2.1, 1.2.0, 1.1.1, 1.1.0, 1.0.1, 1.0.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> At the moment there is no reliable way to deploy any version of Apache Spark 
> on an Openstack cloud except through Openstack Sahara.
> However, Openstack Sahara is very slow, it can deploy only a small subset of 
> Spark versions, its Spark support is tied to the current Openstack release, 
> etc.
> We provide a way to do it on any Openstack cloud since Juno, with any Apache 
> Spark/Hadoop combination since Spark 1.0, and with additional on-demand tools 
> (Apache YARN, Apache Ignite, Jupyter, NFS mounts, Ganglia).
> We provide our solution under the Apache 2.0 license.
> (We look forward to submitting a pull request soon.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18394) Executing the same query twice in a row results in CodeGenerator cache misses

2016-11-09 Thread Jonny Serencsa (JIRA)
Jonny Serencsa created SPARK-18394:
--

 Summary: Executing the same query twice in a row results in 
CodeGenerator cache misses
 Key: SPARK-18394
 URL: https://issues.apache.org/jira/browse/SPARK-18394
 Project: Spark
  Issue Type: Bug
  Components: SQL
 Environment: HiveThriftServer2 running on branch-2.0 on Mac laptop
Reporter: Jonny Serencsa


Executing the query:
{noformat}
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem_1_row
where
l_shipdate <= date_sub('1998-12-01', '90')
group by
l_returnflag,
l_linestatus
;
{noformat}
twice in succession results in CodeGenerator cache misses in BOTH executions. 
Since the query is identical, I would expect the same code to be generated. 

It turns out the generated code is not exactly the same, resulting in cache 
misses when performing the lookup in the CodeGenerator cache, even though the 
code is semantically equivalent. 
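
A minimal sketch of why "equivalent but not identical" code misses a cache keyed on 
source text (a toy cache for illustration, not Spark's CodeGenerator internals):

{code}
val cache = scala.collection.mutable.HashMap.empty[String, Int]
def compile(src: String): Int =
  cache.getOrElseUpdate(src, { println(s"cache miss for: $src"); src.length })

compile("double value4 = accessor.getDouble();")  // miss
compile("double value7 = accessor.getDouble();")  // miss again: same logic, fresh identifier
{code}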

Below is (some portion of the) generated code for two runs of the query:

run-1
{noformat}
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import scala.collection.Iterator;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder;
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter;
import org.apache.spark.sql.execution.columnar.MutableUnsafeRow;

public SpecificColumnarIterator generate(Object[] references) {
return new SpecificColumnarIterator();
}

class SpecificColumnarIterator extends 
org.apache.spark.sql.execution.columnar.ColumnarIterator {

private ByteOrder nativeOrder = null;
private byte[][] buffers = null;
private UnsafeRow unsafeRow = new UnsafeRow(7);
private BufferHolder bufferHolder = new BufferHolder(unsafeRow);
private UnsafeRowWriter rowWriter = new UnsafeRowWriter(bufferHolder, 7);
private MutableUnsafeRow mutableRow = null;

private int currentRow = 0;
private int numRowsInBatch = 0;

private scala.collection.Iterator input = null;
private DataType[] columnTypes = null;
private int[] columnIndexes = null;

private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor accessor;
private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor accessor1;
private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor accessor2;
private org.apache.spark.sql.execution.columnar.StringColumnAccessor accessor3;
private org.apache.spark.sql.execution.columnar.DoubleColumnAccessor accessor4;
private org.apache.spark.sql.execution.columnar.StringColumnAccessor accessor5;
private org.apache.spark.sql.execution.columnar.StringColumnAccessor accessor6;

public SpecificColumnarIterator() {
this.nativeOrder = ByteOrder.nativeOrder();
this.buffers = new byte[7][];
this.mutableRow = new MutableUnsafeRow(rowWriter);
}

public void initialize(Iterator input, DataType[] columnTypes, int[] 
columnIndexes) {
this.input = input;
this.columnTypes = columnTypes;
this.columnIndexes = columnIndexes;
}



public boolean hasNext() {
if (currentRow < numRowsInBatch) {
return true;
}
if (!input.hasNext()) {
return false;
}

org.apache.spark.sql.execution.columnar.CachedBatch batch = 
(org.apache.spark.sql.execution.columnar.CachedBatch) input.next();
currentRow = 0;
numRowsInBatch = batch.numRows();
for (int i = 0; i < columnIndexes.length; i ++) {
buffers[i] = batch.buffers()[columnIndexes[i]];
}
accessor = new 
org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[0]).order(nativeOrder));
accessor1 = new 
org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[1]).order(nativeOrder));
accessor2 = new 
org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[2]).order(nativeOrder));
accessor3 = new 
org.apache.spark.sql.execution.columnar.StringColumnAccessor(ByteBuffer.wrap(buffers[3]).order(nativeOrder));
accessor4 = new 
org.apache.spark.sql.execution.columnar.DoubleColumnAccessor(ByteBuffer.wrap(buffers[4]).order(nativeOrder));
accessor5 = new 
org.apache.spark.sql.execution.columnar.StringColumnAccessor(ByteBuffer.wrap(buffers[5]).order(nativeOrder));
accessor6 = new 
org.apache.spark.sql.execution.columnar.StringColumnAccessor(ByteBuffer.wrap(buffers[6]).order(nativeOrder));

return hasNext();
}

public InternalRow next() {
currentRow += 1;
bufferHolder.reset();
rowWriter.zeroOutNullBytes();
accessor.extractTo(mutableRow, 0);
accessor1.extractTo(mutableRow, 1);
accessor2.extractTo(mutableRow, 2);
accessor3.extractTo(mutableRow, 3);
accessor4.extractTo(mutableRow, 4);

[jira] [Commented] (SPARK-18353) spark.rpc.askTimeout default value is not 120s

2016-11-09 Thread Jason Pan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652585#comment-15652585
 ] 

Jason Pan commented on SPARK-18353:
---

Whatever the default value ends up being, I think we need a way to configure 
it.

> spark.rpc.askTimeout default value is not 120s
> --
>
> Key: SPARK-18353
> URL: https://issues.apache.org/jira/browse/SPARK-18353
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
> Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jason Pan
>Priority: Critical
>
> In http://spark.apache.org/docs/latest/configuration.html, spark.rpc.askTimeout 
> is listed as 120s ("Duration for an RPC ask operation to wait before timing 
> out"), so the default value is 120s as documented.
> However, when I run "spark-submit" in standalone cluster mode, the launch 
> command is:
> Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
> "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
> "-Xmx1024M" "-Dspark.eventLog.enabled=true" 
> "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
> "-Dspark.app.name=org.apache.spark.examples.SparkPi" 
> "-Dspark.submit.deployMode=cluster" 
> "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
> "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
> "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
> "org.apache.spark.deploy.worker.DriverWrapper" 
> "spark://Worker@9.111.159.127:7103" 
> "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "org.apache.spark.examples.SparkPi" "1000"
> Note the "-Dspark.rpc.askTimeout=10" in the launch command above: the value is 
> 10, which does not match the documented default.
> Also note: when I submit to the REST URL, this issue does not occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18353) spark.rpc.askTimeout default value is not 120s

2016-11-09 Thread Jason Pan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652576#comment-15652576
 ] 

Jason Pan commented on SPARK-18353:
---

Appending the submit commands:

spark-submit --class org.apache.spark.examples.SparkPi --master 
spark://9.111.159.127:7101 --conf  spark.rpc.askTimeout=150 --deploy-mode 
cluster /opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar 
10
The parameter does not take effect.

Using REST (port 6068 is the REST port):
spark-submit --class org.apache.spark.examples.SparkPi --master 
spark://9.111.159.127:6068 --conf  spark.rpc.askTimeout=150 --deploy-mode 
cluster /opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar 
10
The parameter takes effect.

We now submit to the REST URL as a workaround; otherwise "spark.rpc.askTimeout" 
is always 10 due to the hardcoded value.

Thanks.
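
A sketch of the difference between forcing a value and providing a default. Whether 
the standalone cluster-mode code path does exactly the former is an assumption, based 
only on the "-Dspark.rpc.askTimeout=10" visible in the launch command quoted in the 
issue description.

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
conf.set("spark.rpc.askTimeout", "10")            // clobbers a user-supplied --conf value
conf.setIfMissing("spark.rpc.askTimeout", "10")   // would respect --conf spark.rpc.askTimeout=150
{code}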

> spark.rpc.askTimeout default value is not 120s
> --
>
> Key: SPARK-18353
> URL: https://issues.apache.org/jira/browse/SPARK-18353
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
> Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jason Pan
>Priority: Critical
>
> In http://spark.apache.org/docs/latest/configuration.html, spark.rpc.askTimeout 
> is listed as 120s ("Duration for an RPC ask operation to wait before timing 
> out"), so the default value is 120s as documented.
> However, when I run "spark-submit" in standalone cluster mode, the launch 
> command is:
> Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
> "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
> "-Xmx1024M" "-Dspark.eventLog.enabled=true" 
> "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
> "-Dspark.app.name=org.apache.spark.examples.SparkPi" 
> "-Dspark.submit.deployMode=cluster" 
> "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
> "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
> "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
> "org.apache.spark.deploy.worker.DriverWrapper" 
> "spark://Worker@9.111.159.127:7103" 
> "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "org.apache.spark.examples.SparkPi" "1000"
> Note the "-Dspark.rpc.askTimeout=10" in the launch command above: the value is 
> 10, which does not match the documented default.
> Also note: when I submit to the REST URL, this issue does not occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18353) spark.rpc.askTimeout default value is not 120s

2016-11-09 Thread Jason Pan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652543#comment-15652543
 ] 

Jason Pan commented on SPARK-18353:
---

Hi Sean.

I was using "--conf" to set the parameter when submitting. It didn't work.

> spark.rpc.askTimeout default value is not 120s
> --
>
> Key: SPARK-18353
> URL: https://issues.apache.org/jira/browse/SPARK-18353
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
> Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jason Pan
>Priority: Critical
>
> In http://spark.apache.org/docs/latest/configuration.html, spark.rpc.askTimeout 
> is listed as 120s ("Duration for an RPC ask operation to wait before timing 
> out"), so the default value is 120s as documented.
> However, when I run "spark-submit" in standalone cluster mode, the launch 
> command is:
> Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
> "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
> "-Xmx1024M" "-Dspark.eventLog.enabled=true" 
> "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
> "-Dspark.app.name=org.apache.spark.examples.SparkPi" 
> "-Dspark.submit.deployMode=cluster" 
> "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
> "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
> "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
> "org.apache.spark.deploy.worker.DriverWrapper" 
> "spark://Worker@9.111.159.127:7103" 
> "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "org.apache.spark.examples.SparkPi" "1000"
> Note the "-Dspark.rpc.askTimeout=10" in the launch command above: the value is 
> 10, which does not match the documented default.
> Also note: when I submit to the REST URL, this issue does not occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18353) spark.rpc.askTimeout default value is not 120s

2016-11-09 Thread Jason Pan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Pan updated SPARK-18353:
--
Comment: was deleted

(was: Hi Sean.

--conf also didn't make it work.

Thanks.)

> spark.rpc.askTimeout default value is not 120s
> --
>
> Key: SPARK-18353
> URL: https://issues.apache.org/jira/browse/SPARK-18353
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
> Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jason Pan
>Priority: Critical
>
> In http://spark.apache.org/docs/latest/configuration.html, spark.rpc.askTimeout 
> is listed as 120s ("Duration for an RPC ask operation to wait before timing 
> out"), so the default value is 120s as documented.
> However, when I run "spark-submit" in standalone cluster mode, the launch 
> command is:
> Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
> "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
> "-Xmx1024M" "-Dspark.eventLog.enabled=true" 
> "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
> "-Dspark.app.name=org.apache.spark.examples.SparkPi" 
> "-Dspark.submit.deployMode=cluster" 
> "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
> "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
> "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
> "org.apache.spark.deploy.worker.DriverWrapper" 
> "spark://Worker@9.111.159.127:7103" 
> "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "org.apache.spark.examples.SparkPi" "1000"
> Note the "-Dspark.rpc.askTimeout=10" in the launch command above: the value is 
> 10, which does not match the documented default.
> Also note: when I submit to the REST URL, this issue does not occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18353) spark.rpc.askTimeout default value is not 120s

2016-11-09 Thread Jason Pan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652537#comment-15652537
 ] 

Jason Pan commented on SPARK-18353:
---

Hi Sean.

--conf also didn't make it work.

Thanks.

> spark.rpc.askTimeout default value is not 120s
> --
>
> Key: SPARK-18353
> URL: https://issues.apache.org/jira/browse/SPARK-18353
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
> Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jason Pan
>Priority: Critical
>
> In http://spark.apache.org/docs/latest/configuration.html, spark.rpc.askTimeout 
> is listed as 120s ("Duration for an RPC ask operation to wait before timing 
> out"), so the default value is 120s as documented.
> However, when I run "spark-submit" in standalone cluster mode, the launch 
> command is:
> Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
> "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
> "-Xmx1024M" "-Dspark.eventLog.enabled=true" 
> "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
> "-Dspark.app.name=org.apache.spark.examples.SparkPi" 
> "-Dspark.submit.deployMode=cluster" 
> "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
> "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
> "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
> "org.apache.spark.deploy.worker.DriverWrapper" 
> "spark://Worker@9.111.159.127:7103" 
> "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "org.apache.spark.examples.SparkPi" "1000"
> Note the "-Dspark.rpc.askTimeout=10" in the launch command above: the value is 
> 10, which does not match the documented default.
> Also note: when I submit to the REST URL, this issue does not occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18343) FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write

2016-11-09 Thread Luke Miner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Miner resolved SPARK-18343.

Resolution: Not A Bug

This was due to some clash in versions between the libraries I was using. 
Bumping all libraries up to their latest versions, as outlined in my comment, 
fixed the issue.

> FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write
> --
>
> Key: SPARK-18343
> URL: https://issues.apache.org/jira/browse/SPARK-18343
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1
> Hadoop 2.7.1
> Mesos 1.0.1
> Ubuntu 14.04
>Reporter: Luke Miner
>
> I have a driver program where I read data in from Cassandra using Spark, 
> perform some operations, and then write out to JSON on S3. The program 
> runs fine when I use Spark 1.6.1 and the spark-cassandra-connector 1.6.0-M1.
> However, if I try to upgrade to Spark 2.0.1 (hadoop 2.7.1) and 
> spark-cassandra-connector 2.0.0-M3, the program completes in the sense that 
> all the expected files are written to S3, but the program never terminates.
> I do run `sc.stop()` at the end of the program. I am also using Mesos 1.0.1. 
> In both cases I use the default output committer.
> From the thread dump (included below) it seems like it could be waiting on: 
> `org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner`
> Code snippet:
> {code}
> // get MongoDB oplog operations
> val operations = sc.cassandraTable[JsonOperation](keyspace, namespace)
>   .where("ts >= ? AND ts < ?", minTimestamp, maxTimestamp)
> 
> // replay oplog operations into documents
> val documents = operations
>   .spanBy(op => op.id)
>   .map { case (id: String, ops: Iterable[T]) => (id, apply(ops)) }
>   .filter { case (id, result) => result.isInstanceOf[Document] }
>   .map { case (id, document) => MergedDocument(id = id, document = 
> document
> .asInstanceOf[Document])
>   }
> 
> // write documents to json on s3
> documents
>   .map(document => document.toJson)
>   .coalesce(partitions)
>   .saveAsTextFile(path, classOf[GzipCodec])
> sc.stop()
> {code}
> build.sbt
> {code}
> scalaVersion := "2.11.8"
> ivyScala := ivyScala.value map {
>   _.copy(overrideScalaVersion = true)
> }
> resolvers += Resolver.mavenLocal
> // spark
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % 
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % 
> "provided"
> // spark cassandra connector
> libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % 
> "2.0.0-M3" excludeAll
>   ExclusionRule(organization = "org.apache.spark")
> // other libraries
> libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"
> libraryDependencies += "org.rogach" %% "scallop" % "2.0.1"
> libraryDependencies += "com.github.nscala-time" %% "nscala-time" % "1.8.0"
> // test
> libraryDependencies += "org.scalatest" % "scalatest_2.11" % "2.2.6" % "test"
> libraryDependencies += "junit" % "junit" % "4.11" % "test"
> libraryDependencies += "com.novocode" % "junit-interface" % "0.11" % "test"
> assemblyOption in assembly := (assemblyOption in 
> assembly).value.copy(includeScala = false)
> net.virtualvoid.sbt.graph.Plugin.graphSettings
> fork := true
> assemblyShadeRules in assembly := Seq(
> ShadeRule.rename("com.google.common.**" -> "shadedguava.@1").inAll
>   )
>   
> assemblyMergeStrategy in assembly := {
>   case PathList(ps@_*) if ps.last endsWith ".properties" => 
> MergeStrategy.first
>   case x =>
> val oldStrategy = (assemblyMergeStrategy in assembly).value
> oldStrategy(x)
> }
> scalacOptions += "-deprecation"
> {code}
> Thread dump on the driver:
> {code}
> 60  context-cleaner-periodic-gc TIMED_WAITING
> 46  dag-scheduler-event-loopWAITING
> 4389DestroyJavaVM   RUNNABLE
> 12  dispatcher-event-loop-0 WAITING
> 13  dispatcher-event-loop-1 WAITING
> 14  dispatcher-event-loop-2 WAITING
> 15  dispatcher-event-loop-3 WAITING
> 47  driver-revive-threadTIMED_WAITING
> 3   Finalizer   WAITING
> 82  ForkJoinPool-1-worker-17WAITING
> 43  heartbeat-receiver-event-loop-threadTIMED_WAITING
> 93  java-sdk-http-connection-reaper TIMED_WAITING
> 4387java-sdk-progress-listener-callback-thread  WAITING
> 25  map-output-dispatcher-0 WAITING
> 26  map-output-dispatcher-1 WAITING
> 27  map-output-dispatcher-2 WAITING
> 28  map-output-dispatcher-3 WAITING
> 29  map-output-dispatcher-4 WAITING
> 30  map-output-dispatcher-5 WAITING
> 31  map-output-dispatcher-6 WAITING
> 32  map-output-dispatcher-7 WAITING
> 48  

[jira] [Commented] (SPARK-18343) FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write

2016-11-09 Thread Luke Miner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652487#comment-15652487
 ] 

Luke Miner commented on SPARK-18343:


Updating some of those libraries to their latest versions fixed it. In 
particular, I updated:
* spark: 2.0.0 to 2.0.1
* json4s: 3.2.11 to 3.5.0
* scallop: 2.0.1 to 2.0.5
* nscala-time: 1.8.0 to 2.14.0
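
The versions above, applied to the build.sbt lines from the issue description (a 
sketch; only the version strings come from this comment):

{code}
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-sql"  % "2.0.1" % "provided"
libraryDependencies += "org.json4s" %% "json4s-native" % "3.5.0"
libraryDependencies += "org.rogach" %% "scallop" % "2.0.5"
libraryDependencies += "com.github.nscala-time" %% "nscala-time" % "2.14.0"
{code}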

> FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write
> --
>
> Key: SPARK-18343
> URL: https://issues.apache.org/jira/browse/SPARK-18343
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1
> Hadoop 2.7.1
> Mesos 1.0.1
> Ubuntu 14.04
>Reporter: Luke Miner
>
> I have a driver program where I read data in from Cassandra using Spark, 
> perform some operations, and then write out to JSON on S3. The program 
> runs fine when I use Spark 1.6.1 and the spark-cassandra-connector 1.6.0-M1.
> However, if I try to upgrade to Spark 2.0.1 (hadoop 2.7.1) and 
> spark-cassandra-connector 2.0.0-M3, the program completes in the sense that 
> all the expected files are written to S3, but the program never terminates.
> I do run `sc.stop()` at the end of the program. I am also using Mesos 1.0.1. 
> In both cases I use the default output committer.
> From the thread dump (included below) it seems like it could be waiting on: 
> `org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner`
> Code snippet:
> {code}
> // get MongoDB oplog operations
> val operations = sc.cassandraTable[JsonOperation](keyspace, namespace)
>   .where("ts >= ? AND ts < ?", minTimestamp, maxTimestamp)
> 
> // replay oplog operations into documents
> val documents = operations
>   .spanBy(op => op.id)
>   .map { case (id: String, ops: Iterable[T]) => (id, apply(ops)) }
>   .filter { case (id, result) => result.isInstanceOf[Document] }
>   .map { case (id, document) => MergedDocument(id = id, document = 
> document
> .asInstanceOf[Document])
>   }
> 
> // write documents to json on s3
> documents
>   .map(document => document.toJson)
>   .coalesce(partitions)
>   .saveAsTextFile(path, classOf[GzipCodec])
> sc.stop()
> {code}
> build.sbt
> {code}
> scalaVersion := "2.11.8"
> ivyScala := ivyScala.value map {
>   _.copy(overrideScalaVersion = true)
> }
> resolvers += Resolver.mavenLocal
> // spark
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % 
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % 
> "provided"
> // spark cassandra connector
> libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % 
> "2.0.0-M3" excludeAll
>   ExclusionRule(organization = "org.apache.spark")
> // other libraries
> libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"
> libraryDependencies += "org.rogach" %% "scallop" % "2.0.1"
> libraryDependencies += "com.github.nscala-time" %% "nscala-time" % "1.8.0"
> // test
> libraryDependencies += "org.scalatest" % "scalatest_2.11" % "2.2.6" % "test"
> libraryDependencies += "junit" % "junit" % "4.11" % "test"
> libraryDependencies += "com.novocode" % "junit-interface" % "0.11" % "test"
> assemblyOption in assembly := (assemblyOption in 
> assembly).value.copy(includeScala = false)
> net.virtualvoid.sbt.graph.Plugin.graphSettings
> fork := true
> assemblyShadeRules in assembly := Seq(
> ShadeRule.rename("com.google.common.**" -> "shadedguava.@1").inAll
>   )
>   
> assemblyMergeStrategy in assembly := {
>   case PathList(ps@_*) if ps.last endsWith ".properties" => 
> MergeStrategy.first
>   case x =>
> val oldStrategy = (assemblyMergeStrategy in assembly).value
> oldStrategy(x)
> }
> scalacOptions += "-deprecation"
> {code}
> Thread dump on the driver:
> {code}
> 60  context-cleaner-periodic-gc TIMED_WAITING
> 46  dag-scheduler-event-loopWAITING
> 4389DestroyJavaVM   RUNNABLE
> 12  dispatcher-event-loop-0 WAITING
> 13  dispatcher-event-loop-1 WAITING
> 14  dispatcher-event-loop-2 WAITING
> 15  dispatcher-event-loop-3 WAITING
> 47  driver-revive-threadTIMED_WAITING
> 3   Finalizer   WAITING
> 82  ForkJoinPool-1-worker-17WAITING
> 43  heartbeat-receiver-event-loop-threadTIMED_WAITING
> 93  java-sdk-http-connection-reaper TIMED_WAITING
> 4387java-sdk-progress-listener-callback-thread  WAITING
> 25  map-output-dispatcher-0 WAITING
> 26  map-output-dispatcher-1 WAITING
> 27  map-output-dispatcher-2 WAITING
> 28  map-output-dispatcher-3 WAITING
> 29  map-output-dispatcher-4 WAITING
> 30  map-output-dispatcher-5 WAITING
> 31  map-output-dispatcher-6 WAITING
> 32  

[jira] [Issue Comment Deleted] (SPARK-18343) FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write

2016-11-09 Thread Luke Miner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luke Miner updated SPARK-18343:
---
Comment: was deleted

(was: Any suggestions on how one might hunt down that library? I've included my 
build.sbt.)

> FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write
> --
>
> Key: SPARK-18343
> URL: https://issues.apache.org/jira/browse/SPARK-18343
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1
> Hadoop 2.7.1
> Mesos 1.0.1
> Ubuntu 14.04
>Reporter: Luke Miner
>
> I have a driver program where I read data in from Cassandra using Spark, 
> perform some operations, and then write out to JSON on S3. The program 
> runs fine when I use Spark 1.6.1 and the spark-cassandra-connector 1.6.0-M1.
> However, if I try to upgrade to Spark 2.0.1 (hadoop 2.7.1) and 
> spark-cassandra-connector 2.0.0-M3, the program completes in the sense that 
> all the expected files are written to S3, but the program never terminates.
> I do run `sc.stop()` at the end of the program. I am also using Mesos 1.0.1. 
> In both cases I use the default output committer.
> From the thread dump (included below) it seems like it could be waiting on: 
> `org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner`
> Code snippet:
> {code}
> // get MongoDB oplog operations
> val operations = sc.cassandraTable[JsonOperation](keyspace, namespace)
>   .where("ts >= ? AND ts < ?", minTimestamp, maxTimestamp)
> 
> // replay oplog operations into documents
> val documents = operations
>   .spanBy(op => op.id)
>   .map { case (id: String, ops: Iterable[T]) => (id, apply(ops)) }
>   .filter { case (id, result) => result.isInstanceOf[Document] }
>   .map { case (id, document) => MergedDocument(id = id, document = 
> document
> .asInstanceOf[Document])
>   }
> 
> // write documents to json on s3
> documents
>   .map(document => document.toJson)
>   .coalesce(partitions)
>   .saveAsTextFile(path, classOf[GzipCodec])
> sc.stop()
> {code}
> build.sbt
> {code}
> scalaVersion := "2.11.8"
> ivyScala := ivyScala.value map {
>   _.copy(overrideScalaVersion = true)
> }
> resolvers += Resolver.mavenLocal
> // spark
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % 
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % 
> "provided"
> // spark cassandra connector
> libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % 
> "2.0.0-M3" excludeAll
>   ExclusionRule(organization = "org.apache.spark")
> // other libraries
> libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"
> libraryDependencies += "org.rogach" %% "scallop" % "2.0.1"
> libraryDependencies += "com.github.nscala-time" %% "nscala-time" % "1.8.0"
> // test
> libraryDependencies += "org.scalatest" % "scalatest_2.11" % "2.2.6" % "test"
> libraryDependencies += "junit" % "junit" % "4.11" % "test"
> libraryDependencies += "com.novocode" % "junit-interface" % "0.11" % "test"
> assemblyOption in assembly := (assemblyOption in 
> assembly).value.copy(includeScala = false)
> net.virtualvoid.sbt.graph.Plugin.graphSettings
> fork := true
> assemblyShadeRules in assembly := Seq(
> ShadeRule.rename("com.google.common.**" -> "shadedguava.@1").inAll
>   )
>   
> assemblyMergeStrategy in assembly := {
>   case PathList(ps@_*) if ps.last endsWith ".properties" => 
> MergeStrategy.first
>   case x =>
> val oldStrategy = (assemblyMergeStrategy in assembly).value
> oldStrategy(x)
> }
> scalacOptions += "-deprecation"
> {code}
> Thread dump on the driver:
> {code}
> 60  context-cleaner-periodic-gc TIMED_WAITING
> 46  dag-scheduler-event-loopWAITING
> 4389DestroyJavaVM   RUNNABLE
> 12  dispatcher-event-loop-0 WAITING
> 13  dispatcher-event-loop-1 WAITING
> 14  dispatcher-event-loop-2 WAITING
> 15  dispatcher-event-loop-3 WAITING
> 47  driver-revive-threadTIMED_WAITING
> 3   Finalizer   WAITING
> 82  ForkJoinPool-1-worker-17WAITING
> 43  heartbeat-receiver-event-loop-threadTIMED_WAITING
> 93  java-sdk-http-connection-reaper TIMED_WAITING
> 4387java-sdk-progress-listener-callback-thread  WAITING
> 25  map-output-dispatcher-0 WAITING
> 26  map-output-dispatcher-1 WAITING
> 27  map-output-dispatcher-2 WAITING
> 28  map-output-dispatcher-3 WAITING
> 29  map-output-dispatcher-4 WAITING
> 30  map-output-dispatcher-5 WAITING
> 31  map-output-dispatcher-6 WAITING
> 32  map-output-dispatcher-7 WAITING
> 48  MesosCoarseGrainedSchedulerBackend-mesos-driver RUNNABLE
> 44  

[jira] [Created] (SPARK-18393) DataFrame pivot output column names should respect aliases

2016-11-09 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18393:
--

 Summary: DataFrame pivot output column names should respect aliases
 Key: SPARK-18393
 URL: https://issues.apache.org/jira/browse/SPARK-18393
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Eric Liang
Priority: Minor


For example

{code}
val df = spark.range(100).selectExpr("id % 5 as x", "id % 2 as a", "id as b")
df
  .groupBy('x)
  .pivot("a", Seq(0, 1))
  .agg(expr("sum(b)").as("blah"), expr("count(b)").as("foo"))
  .show()
+---+--------------------+---------------------+--------------------+---------------------+
|  x|0_sum(`b`) AS `blah`|0_count(`b`) AS `foo`|1_sum(`b`) AS `blah`|1_count(`b`) AS `foo`|
+---+--------------------+---------------------+--------------------+---------------------+
|  0|                 450|                   10|                 500|                   10|
|  1|                 510|                   10|                 460|                   10|
|  3|                 530|                   10|                 480|                   10|
|  2|                 470|                   10|                 520|                   10|
|  4|                 490|                   10|                 540|                   10|
+---+--------------------+---------------------+--------------------+---------------------+
{code}

The column names here are quite hard to read. Ideally we would respect the 
aliases and generate column names like 0_blah, 0_foo, 1_blah, 1_foo instead.
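
A possible workaround in the meantime (a sketch; {{pivoted}} stands for the DataFrame 
produced by the groupBy/pivot/agg chain above, and the regex assumes the current 
naming scheme):

{code}
val pattern = "^(\\d+)_.*AS `(.+)`$".r
val renamed = pivoted.toDF(pivoted.columns.map { c =>
  pattern.replaceAllIn(c, m => s"${m.group(1)}_${m.group(2)}")
}: _*)
// columns become: x, 0_blah, 0_foo, 1_blah, 1_foo
{code}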



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2016-11-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652453#comment-15652453
 ] 

Joseph K. Bradley commented on SPARK-18392:
---

There are a few items still being discussed:
* terminology (noted above)
* class names
* MinHash hashDistance

What do you think?  [~sethah] [~karlhigley] [~mlnick] [~yunn]

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2016-11-09 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-18392:
-

 Summary: LSH API, algorithm, and documentation follow-ups
 Key: SPARK-18392
 URL: https://issues.apache.org/jira/browse/SPARK-18392
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley


This JIRA summarizes discussions from the initial LSH PR 
[https://github.com/apache/spark/pull/15148] as well as the follow-up for hash 
distance [https://github.com/apache/spark/pull/15800].  This will be broken 
into subtasks:
* API changes (targeted for 2.1)
* algorithmic fixes (targeted for 2.1)
* documentation improvements (ideally 2.1, but could slip)

The major issues we have mentioned are as follows:
* OR vs AND amplification
** Need to make API flexible enough to support both types of amplification in 
the future
** Need to clarify which we support, including in each model function 
(transform, similarity join, neighbors)
* Need to clarify which algorithms we have implemented, improve docs and 
references, and fix the algorithms if needed.

These major issues are broken down into detailed issues below.

h3. LSH abstraction

* Rename {{outputDim}} to something indicative of OR-amplification.
** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used in 
the future for AND amplification (Thanks [~mlnick]!)

* transform
** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This is 
the "raw" output of all hash functions, i.e., with no aggregation for 
amplification.
** Clarify meaning of output in terms of multiple hash functions and 
amplification.
** Note: We will _not_ worry about users using this output for dimensionality 
reduction; if anything, that use case can be explained in the User Guide.

* Documentation
** Clarify terminology used everywhere (a short illustrative sketch follows this list)
*** hash function {{h_i}}: basic hash function without amplification
*** hash value {{h_i(key)}}: output of a hash function
*** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
AND-amplification using K base hash functions
*** compound hash function value {{g(key)}}: vector-valued output
*** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
OR-amplification using L compound hash functions
*** hash table value {{H(key)}}: output of array of vectors
*** This terminology is largely pulled from Wang et al.'s survey and the 
multi-probe LSH paper.
** Link clearly to documentation (Wikipedia or papers) which matches our 
terminology and what we implemented
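
To make the terminology above concrete, here is a minimal sketch in plain Scala. It is not the proposed API; the function shapes are illustrative only.

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}

// A compound hash function g = (h_0, ..., h_{K-1}): AND-amplification over K base
// hash functions, producing one vector-valued hash per key.
def compound(hs: Seq[Vector => Double])(key: Vector): Vector =
  Vectors.dense(hs.map(h => h(key)).toArray)

// A hash table H = (g_0, ..., g_{L-1}): OR-amplification over L compound hash
// functions, producing the proposed Array-of-Vector transform output.
def hashTable(gs: Seq[Vector => Vector])(key: Vector): Array[Vector] =
  gs.map(g => g(key)).toArray
{code}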

h3. RandomProjection (or P-Stable Distributions)

* Rename {{RandomProjection}}
** Options include: {{ScalarRandomProjectionLSH}}, 
{{BucketedRandomProjectionLSH}}, {{PStableLSH}}

* API privacy
** Make randUnitVectors private

* hashFunction
** Currently, this uses OR-amplification for single probing, as we intended.
** It does *not* do multiple probing, at least not in the sense of the original 
MPLSH paper.  We should fix that or at least document its behavior.

* Documentation
** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
** Also link to the multi-probe LSH paper since that explains this method very 
clearly.
** Clarify hash function and distance metric

h3. MinHash

* Rename {{MinHash}} -> {{MinHashLSH}}

* API privacy
** Make randCoefficients, numEntries private

* hashDistance (used in approxNearestNeighbors)
** Update to use average of indicators of hash collisions [SPARK-18334]
** See [Wikipedia | 
https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
reference

h3. All references

I'm just listing references I looked at.

Papers
* [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
* [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
* [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
* [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe LSH 
paper

Wikipedia
* 
[https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
* [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17916) CSV data source treats empty string as null no matter what nullValue option is

2016-11-09 Thread Eric Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652414#comment-15652414
 ] 

Eric Liang commented on SPARK-17916:


In our case, a user wants the empty string (whether actually missing, e.g. {{,,}}, 
or quoted {{,""}}) to resolve as the empty string. It should only turn into null 
if {{nullValue}} is set to {{""}}. There doesn't currently appear to be any option 
combination that allows this.
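
For reference, a minimal sketch of the desired read (the path is made up, and this shows the behavior we would like, not what 2.0.1 currently does):

{code}
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("nullValue", "-")       // only the configured token should become null
  .load("/tmp/spark-17916.csv")   // hypothetical file with rows: 1,"-" and 2,""

// Desired: row 1 -> col2 = null, row 2 -> col2 = "" (empty string, not null).
df.show()
{code}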

> CSV data source treats empty string as null no matter what nullValue option is
> --
>
> Key: SPARK-17916
> URL: https://issues.apache.org/jira/browse/SPARK-17916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Hossein Falaki
>
> When a user configures {{nullValue}} in the CSV data source, in addition to those 
> values, all empty string values are also converted to null.
> {code}
> data:
> col1,col2
> 1,"-"
> 2,""
> {code}
> {code}
> spark.read.format("csv").option("nullValue", "-")
> {code}
> We will find a null in both rows.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18367) limit() makes the lame walk again

2016-11-09 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652406#comment-15652406
 ] 

Nicholas Chammas commented on SPARK-18367:
--

I've spent the day trying to narrow down what is happening, but I haven't made 
much progress. All I've really found is that adding a {{coalesce()}} at the 
"right" place in the query tree (dunno if that's the right terminology) can 
reduce the number of open files enough that things succeed, like how the 
{{limit()}} is helping things. Maybe Spark just needs that many files and there 
is no issue here? I dunno.

Is there a rough rule of thumb I can use to determine how many files Spark 
should be opening? Just a rough way to determine the order of magnitude of open 
files based on something I'm doing. In my case, I have a DataFrame with no more 
than a few partitions that I'm applying a UDF to and then joining to itself 
twice. The resulting DataFrame has no more than a dozen or so partitions.

Is it conceivable that this would somehow make Spark spawn more than 10K files, 
which is the maximum number of files macOS will allow open per process? Even if 
I coalesce the DataFrame to 1 partition I still see Spark spawn up to 7K files. 
Is this normal?
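
One knob worth noting, though it is not a full answer to the rule-of-thumb question: the shuffle parallelism can directly affect how many temp_shuffle files are open at once. A rough sketch, with made-up DataFrame names and an illustrative value:

{code}
// Fewer shuffle partitions can mean fewer temp_shuffle files held open at a time
// (the SQL default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "64")

// Coalescing after the self-joins also caps the number of output partitions.
val joined = df.join(other, Seq("id")).join(other2, Seq("id")).coalesce(8)
{code}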

> limit() makes the lame walk again
> -
>
> Key: SPARK-18367
> URL: https://issues.apache.org/jira/browse/SPARK-18367
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1, 2.1.0
> Environment: Python 3.5, Java 8
>Reporter: Nicholas Chammas
> Attachments: plan-with-limit.txt, plan-without-limit.txt
>
>
> I have a complex DataFrame query that fails to run normally but succeeds if I 
> add a dummy {{limit()}} upstream in the query tree.
> The failure presents itself like this:
> {code}
> ERROR DiskBlockObjectWriter: Uncaught exception while reverting partial 
> writes to file 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
> java.io.FileNotFoundException: 
> /private/var/folders/f5/t48vxz555b51mr3g6jjhxv40gq/T/blockmgr-1e908314-1d49-47ba-8c95-fa43ff43cee4/31/temp_shuffle_6b939273-c6ce-44a0-99b1-db668e6c89dc
>  (Too many open files in system)
> {code}
> My {{ulimit -n}} is already set to 10,000, and I can't set it much higher on 
> macOS. However, I don't think that's the issue, since if I add a dummy 
> {{limit()}} early on the query tree -- dummy as in it does not actually 
> reduce the number of rows queried -- then the same query works.
> I've diffed the physical query plans to see what this {{limit()}} is actually 
> doing, and the diff is as follows:
> {code}
> diff plan-with-limit.txt plan-without-limit.txt
> 24,28c24
> <:  : : +- *GlobalLimit 100
> <:  : :+- Exchange SinglePartition
> <:  : :   +- *LocalLimit 100
> <:  : :  +- *Project [...]
> <:  : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >:  : : +- *Scan orc [...] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> 49,53c45
> <   : : +- *GlobalLimit 100
> <   : :+- Exchange SinglePartition
> <   : :   +- *LocalLimit 100
> <   : :  +- *Project [...]
> <   : : +- *Scan orc [...] 
> Format: ORC, InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct<...
> ---
> >   : : +- *Scan orc [] Format: ORC, 
> > InputPaths: file:/..., PartitionFilters: [], PushedFilters: [], ReadSchema: 
> > struct<...
> {code}
> Does this give any clues as to why this {{limit()}} is helping? Again, the 
> 100 limit you can see in the plan is much higher than the cardinality of 
> the dataset I'm reading, so there is no theoretical impact on the output. You 
> can see the full query plans attached to this ticket.
> Unfortunately, I don't have a minimal reproduction for this issue, but I can 
> work towards one with some clues.
> I'm seeing this behavior on 2.0.1 and on master at commit 
> {{26e1c53aceee37e3687a372ff6c6f05463fd8a94}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Assigned] (SPARK-18391) Openstack deployment scenarios

2016-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18391:


Assignee: Apache Spark

> Openstack deployment scenarios
> --
>
> Key: SPARK-18391
> URL: https://issues.apache.org/jira/browse/SPARK-18391
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 
> 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
> Environment: Openstack
>Reporter: Oleg Borisenko
>Assignee: Apache Spark
> Fix For: 1.0.0, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 
> 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> At the moment there is no reliable way to deploy an arbitrary version of Apache Spark 
> in an Openstack cloud except through Openstack Sahara.
> Nevertheless, Openstack Sahara is very slow, it supports only a small subset of Spark 
> versions, and its Spark support is tied to the current Openstack release.
> We provide a way to deploy on any Openstack cloud since Juno, with any Apache 
> Spark/Hadoop combination since Spark 1.0, and with additional on-demand tools (Apache 
> YARN, Apache Ignite, Jupyter, NFS mounts, Ganglia).
> We provide our solution under the Apache 2.0 license.
> (A pull request will follow soon.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18391) Openstack deployment scenarios

2016-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652387#comment-15652387
 ] 

Apache Spark commented on SPARK-18391:
--

User 'al-indigo' has created a pull request for this issue:
https://github.com/apache/spark/pull/15836

> Openstack deployment scenarios
> --
>
> Key: SPARK-18391
> URL: https://issues.apache.org/jira/browse/SPARK-18391
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 
> 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
> Environment: Openstack
>Reporter: Oleg Borisenko
> Fix For: 1.0.0, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 
> 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> At the moment there is no reliable way to deploy an arbitrary version of Apache Spark 
> in an Openstack cloud except through Openstack Sahara.
> Nevertheless, Openstack Sahara is very slow, it supports only a small subset of Spark 
> versions, and its Spark support is tied to the current Openstack release.
> We provide a way to deploy on any Openstack cloud since Juno, with any Apache 
> Spark/Hadoop combination since Spark 1.0, and with additional on-demand tools (Apache 
> YARN, Apache Ignite, Jupyter, NFS mounts, Ganglia).
> We provide our solution under the Apache 2.0 license.
> (A pull request will follow soon.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18391) Openstack deployment scenarios

2016-11-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18391:


Assignee: (was: Apache Spark)

> Openstack deployment scenarios
> --
>
> Key: SPARK-18391
> URL: https://issues.apache.org/jira/browse/SPARK-18391
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 
> 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
> Environment: Openstack
>Reporter: Oleg Borisenko
> Fix For: 1.0.0, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 
> 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> At the moment there is no reliable way to deploy an arbitrary version of Apache Spark 
> in an Openstack cloud except through Openstack Sahara.
> Nevertheless, Openstack Sahara is very slow, it supports only a small subset of Spark 
> versions, and its Spark support is tied to the current Openstack release.
> We provide a way to deploy on any Openstack cloud since Juno, with any Apache 
> Spark/Hadoop combination since Spark 1.0, and with additional on-demand tools (Apache 
> YARN, Apache Ignite, Jupyter, NFS mounts, Ganglia).
> We provide our solution under the Apache 2.0 license.
> (A pull request will follow soon.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18391) Openstack deployment scenarios

2016-11-09 Thread Oleg Borisenko (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652386#comment-15652386
 ] 

Oleg Borisenko commented on SPARK-18391:


https://github.com/apache/spark/pull/15836

> Openstack deployment scenarios
> --
>
> Key: SPARK-18391
> URL: https://issues.apache.org/jira/browse/SPARK-18391
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 
> 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
> Environment: Openstack
>Reporter: Oleg Borisenko
> Fix For: 1.0.0, 1.0.1, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 
> 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 2.0.0, 2.0.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> At the moment there is no reliable way to deploy an arbitrary version of Apache Spark 
> in an Openstack cloud except through Openstack Sahara.
> Nevertheless, Openstack Sahara is very slow, it supports only a small subset of Spark 
> versions, and its Spark support is tied to the current Openstack release.
> We provide a way to deploy on any Openstack cloud since Juno, with any Apache 
> Spark/Hadoop combination since Spark 1.0, and with additional on-demand tools (Apache 
> YARN, Apache Ignite, Jupyter, NFS mounts, Ganglia).
> We provide our solution under the Apache 2.0 license.
> (A pull request will follow soon.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652366#comment-15652366
 ] 

Xiangrui Meng commented on SPARK-18390:
---

This is a bug because the user didn't ask for a cartesian join. Anyway, this has 
been fixed.

> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>Assignee: Srinath
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior, because it was not the user who 
> suggested using a cartesian product; SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-18390.
---
Resolution: Duplicate

> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>Assignee: Srinath
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior, because it was not the user who 
> suggested using a cartesian product; SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18391) Openstack deployment scenarios

2016-11-09 Thread Oleg Borisenko (JIRA)
Oleg Borisenko created SPARK-18391:
--

 Summary: Openstack deployment scenarios
 Key: SPARK-18391
 URL: https://issues.apache.org/jira/browse/SPARK-18391
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.0.1, 2.0.0, 1.6.2, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0, 
1.4.1, 1.4.0, 1.3.1, 1.3.0, 1.2.2, 1.2.1, 1.2.0, 1.1.1, 1.1.0, 1.0.1, 1.0.0
 Environment: Openstack
Reporter: Oleg Borisenko
 Fix For: 2.0.1, 2.0.0, 1.6.2, 1.6.1, 1.6.0, 1.5.2, 1.5.1, 1.5.0, 
1.4.1, 1.4.0, 1.3.1, 1.3.0, 1.2.2, 1.2.1, 1.2.0, 1.1.1, 1.1.0, 1.0.1, 1.0.0


At the moment there is no reliable way to deploy an arbitrary version of Apache Spark 
in an Openstack cloud except through Openstack Sahara.

Nevertheless, Openstack Sahara is very slow, it supports only a small subset of Spark 
versions, and its Spark support is tied to the current Openstack release.

We provide a way to deploy on any Openstack cloud since Juno, with any Apache 
Spark/Hadoop combination since Spark 1.0, and with additional on-demand tools (Apache 
YARN, Apache Ignite, Jupyter, NFS mounts, Ganglia).

We provide our solution under the Apache 2.0 license.

(A pull request will follow soon.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17829) Stable format for offset log

2016-11-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17829.
--
Resolution: Fixed

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
> Fix For: 2.1.0
>
>
> Currently we use java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user readable serialization and 
> use that instead.  JSON is probably a good option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17829) Stable format for offset log

2016-11-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17829:
-
Fix Version/s: 2.1.0

> Stable format for offset log
> 
>
> Key: SPARK-17829
> URL: https://issues.apache.org/jira/browse/SPARK-17829
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Michael Armbrust
>Assignee: Tyson Condie
> Fix For: 2.1.0
>
>
> Currently we use java serialization for the WAL that stores the offsets 
> contained in each batch.  This has two main issues:
>  - It can break across spark releases (though this is not the only thing 
> preventing us from upgrading a running query)
>  - It is unnecessarily opaque to the user.
> I'd propose we require offsets to provide a user readable serialization and 
> use that instead.  JSON is probably a good option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12333) Support shuffle spill encryption in Spark

2016-11-09 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652280#comment-15652280
 ] 

Marcelo Vanzin commented on SPARK-12333:


{{blockManager.getDiskWriter}} uses the block manager's serializer manager, 
which does encryption.

> Support shuffle spill encryption in Spark
> -
>
> Key: SPARK-12333
> URL: https://issues.apache.org/jira/browse/SPARK-12333
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: Ferdinand Xu
>
> Like shuffle file encryption in SPARK-5682, spilled data should also be 
> encrypted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12333) Support shuffle spill encryption in Spark

2016-11-09 Thread Krish Dey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652263#comment-15652263
 ] 

Krish Dey commented on SPARK-12333:
---

The constructor still seems to be unchanged. Doesn't it need to be changed to 
accommodate encryption of the spill to disk? Moreover, instead of always passing 
DummySerializerInstance, it should be possible to pass any Serializer.

{code}
public UnsafeSorterSpillWriter(BlockManager blockManager, int fileBufferSize,
    ShuffleWriteMetrics writeMetrics, int numRecordsToWrite) throws IOException {
  final Tuple2<TempLocalBlockId, File> spilledFileInfo =
    blockManager.diskBlockManager().createTempLocalBlock();
  this.file = spilledFileInfo._2();
  this.blockId = spilledFileInfo._1();
  this.numRecordsToWrite = numRecordsToWrite;
  // Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
  // Our write path doesn't actually use this serializer (since we end up calling the `write()`
  // OutputStream methods), but DiskBlockObjectWriter still calls some methods on it. To work
  // around this, we pass a dummy no-op serializer.
  writer = blockManager.getDiskWriter(
    blockId, file, DummySerializerInstance.INSTANCE, fileBufferSize, writeMetrics);
  // Write the number of records
  writeIntToBuffer(numRecordsToWrite, 0);
  writer.write(writeBuffer, 0, 4);
}
{code}


> Support shuffle spill encryption in Spark
> -
>
> Key: SPARK-12333
> URL: https://issues.apache.org/jira/browse/SPARK-12333
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: Ferdinand Xu
>
> Like shuffle file encryption in SPARK-5682, spilled data should also be 
> encrypted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5682) Add encrypted shuffle in spark

2016-11-09 Thread Krish Dey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652217#comment-15652217
 ] 

Krish Dey edited comment on SPARK-5682 at 11/9/16 10:29 PM:


The constructor still seems to be unchanged. Doesn't it need to be changed to 
accommodate encryption of the spill to disk? Moreover, instead of always passing 
DummySerializerInstance, it should be possible to pass any Serializer.

{code}
public UnsafeSorterSpillWriter(BlockManager blockManager, int fileBufferSize,
    ShuffleWriteMetrics writeMetrics, int numRecordsToWrite) throws IOException {
  final Tuple2<TempLocalBlockId, File> spilledFileInfo =
    blockManager.diskBlockManager().createTempLocalBlock();
  this.file = spilledFileInfo._2();
  this.blockId = spilledFileInfo._1();
  this.numRecordsToWrite = numRecordsToWrite;
  // Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
  // Our write path doesn't actually use this serializer (since we end up calling the `write()`
  // OutputStream methods), but DiskBlockObjectWriter still calls some methods on it. To work
  // around this, we pass a dummy no-op serializer.
  writer = blockManager.getDiskWriter(
    blockId, file, DummySerializerInstance.INSTANCE, fileBufferSize, writeMetrics);
  // Write the number of records
  writeIntToBuffer(numRecordsToWrite, 0);
  writer.write(writeBuffer, 0, 4);
}
{code}



was (Author: krish.dey):
The method still seems to be unchanged. Doesn't it need to be changed to 
accommodate encryption of the spill to disk?

{code}
public UnsafeSorterSpillWriter(BlockManager blockManager, int fileBufferSize,
    ShuffleWriteMetrics writeMetrics, int numRecordsToWrite) throws IOException {
  final Tuple2<TempLocalBlockId, File> spilledFileInfo =
    blockManager.diskBlockManager().createTempLocalBlock();
  this.file = spilledFileInfo._2();
  this.blockId = spilledFileInfo._1();
  this.numRecordsToWrite = numRecordsToWrite;
  // Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
  // Our write path doesn't actually use this serializer (since we end up calling the `write()`
  // OutputStream methods), but DiskBlockObjectWriter still calls some methods on it. To work
  // around this, we pass a dummy no-op serializer.
  writer = blockManager.getDiskWriter(
    blockId, file, DummySerializerInstance.INSTANCE, fileBufferSize, writeMetrics);
  // Write the number of records
  writeIntToBuffer(numRecordsToWrite, 0);
  writer.write(writeBuffer, 0, 4);
}
{code}


> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
>Assignee: Ferdinand Xu
> Fix For: 2.1.0
>
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data path 
> safer. This feature is necessary in Spark. AES is a specification for the 
> encryption of electronic data. There are 5 common modes in AES; CTR is 
> one of them. We use two codecs, JceAesCtrCryptoCodec and 
> OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used 
> in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the 
> JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL 
> provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark

2016-11-09 Thread Krish Dey (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652217#comment-15652217
 ] 

Krish Dey commented on SPARK-5682:
--

The method still seems to be unchanged. Doesn't it need to be changed to 
accommodate encryption of the spill to disk?

{code}
public UnsafeSorterSpillWriter(BlockManager blockManager, int fileBufferSize,
    ShuffleWriteMetrics writeMetrics, int numRecordsToWrite) throws IOException {
  final Tuple2<TempLocalBlockId, File> spilledFileInfo =
    blockManager.diskBlockManager().createTempLocalBlock();
  this.file = spilledFileInfo._2();
  this.blockId = spilledFileInfo._1();
  this.numRecordsToWrite = numRecordsToWrite;
  // Unfortunately, we need a serializer instance in order to construct a DiskBlockObjectWriter.
  // Our write path doesn't actually use this serializer (since we end up calling the `write()`
  // OutputStream methods), but DiskBlockObjectWriter still calls some methods on it. To work
  // around this, we pass a dummy no-op serializer.
  writer = blockManager.getDiskWriter(
    blockId, file, DummySerializerInstance.INSTANCE, fileBufferSize, writeMetrics);
  // Write the number of records
  writeIntToBuffer(numRecordsToWrite, 0);
  writer.write(writeBuffer, 0, 4);
}
{code}


> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
>Assignee: Ferdinand Xu
> Fix For: 2.1.0
>
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data path 
> safer. This feature is necessary in Spark. AES is a specification for the 
> encryption of electronic data. There are 5 common modes in AES; CTR is 
> one of them. We use two codecs, JceAesCtrCryptoCodec and 
> OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also used 
> in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the 
> JDK provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL 
> provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652202#comment-15652202
 ] 

Cheng Lian commented on SPARK-18390:


I think this issue has already been fixed by SPARK-17298 and 
https://github.com/apache/spark/pull/14866

> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>Assignee: Srinath
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior, because it was not the user who 
> suggested using a cartesian product; SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18388) Running aggregation on many columns throws SOE

2016-11-09 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652104#comment-15652104
 ] 

Herman van Hovell commented on SPARK-18388:
---

Could you try this on master? We added an optimizer rule that collapses similar 
windows (SPARK-17739): 
https://github.com/apache/spark/commit/aef506e39a41cfe7198162c324a11ef2f01136c3

> Running aggregation on many columns throws SOE
> --
>
> Key: SPARK-18388
> URL: https://issues.apache.org/jira/browse/SPARK-18388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
> Environment: PySpark 2.0.1, Jupyter
>Reporter: Raviteja Lokineni
>Priority: Critical
> Attachments: spark-bug-jupyter.py, spark-bug-stacktrace.txt, 
> spark-bug.csv
>
>
> Usecase: I am generating weekly aggregates of every column of data
> {code}
> from pyspark.sql.window import Window
> from pyspark.sql.functions import *
> timeSeries = sqlContext.read.option("header", 
> "true").format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").load("file:///tmp/spark-bug.csv")
> # Hive timestamp is interpreted as UNIX timestamp in seconds*
> days = lambda i: i * 86400
> w = (Window()
>  .partitionBy("id")
>  .orderBy(col("dt").cast("timestamp").cast("long"))
>  .rangeBetween(-days(6), 0))
> cols = ["id", "dt"]
> skipCols = ["id", "dt"]
> for col in timeSeries.columns:
> if col in skipCols:
> continue
> cols.append(mean(col).over(w).alias("mean_7_"+col))
> cols.append(count(col).over(w).alias("count_7_"+col))
> cols.append(sum(col).over(w).alias("sum_7_"+col))
> cols.append(min(col).over(w).alias("min_7_"+col))
> cols.append(max(col).over(w).alias("max_7_"+col))
> df = timeSeries.select(cols)
> df.orderBy('id', 
> 'dt').write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").save("file:///tmp/spark-bug-out.csv")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18388) Running aggregation on many columns throws SOE

2016-11-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-18388:
-
Component/s: (was: Spark Core)
 SQL

> Running aggregation on many columns throws SOE
> --
>
> Key: SPARK-18388
> URL: https://issues.apache.org/jira/browse/SPARK-18388
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
> Environment: PySpark 2.0.1, Jupyter
>Reporter: Raviteja Lokineni
>Priority: Critical
> Attachments: spark-bug-jupyter.py, spark-bug-stacktrace.txt, 
> spark-bug.csv
>
>
> Usecase: I am generating weekly aggregates of every column of data
> {code}
> from pyspark.sql.window import Window
> from pyspark.sql.functions import *
> timeSeries = sqlContext.read.option("header", 
> "true").format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").load("file:///tmp/spark-bug.csv")
> # Hive timestamp is interpreted as UNIX timestamp in seconds*
> days = lambda i: i * 86400
> w = (Window()
>  .partitionBy("id")
>  .orderBy(col("dt").cast("timestamp").cast("long"))
>  .rangeBetween(-days(6), 0))
> cols = ["id", "dt"]
> skipCols = ["id", "dt"]
> for col in timeSeries.columns:
> if col in skipCols:
> continue
> cols.append(mean(col).over(w).alias("mean_7_"+col))
> cols.append(count(col).over(w).alias("count_7_"+col))
> cols.append(sum(col).over(w).alias("sum_7_"+col))
> cols.append(min(col).over(w).alias("min_7_"+col))
> cols.append(max(col).over(w).alias("max_7_"+col))
> df = timeSeries.select(cols)
> df.orderBy('id', 
> 'dt').write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").save("file:///tmp/spark-bug-out.csv")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18374) Incorrect words in StopWords/english.txt

2016-11-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652076#comment-15652076
 ] 

Sean Owen commented on SPARK-18374:
---

It's a fair point indeed, because it would be much better to omit "won't" than 
incorrectly omit "won", rare as that might be. Right now, simply removing those 
words causes a different problem because, I presume, you'd find "won" and "t" 
in your tokens. (Worth testing?) If you can see an easy way to improve the 
tokenization, that could become the topic of this issue.
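
A quick way to see the concern in practice, if anyone wants to test it. This is only a sketch: the regex pattern below is chosen to mimic a tokenizer that strips special characters, it is not the default.

{code}
import org.apache.spark.ml.feature.{RegexTokenizer, StopWordsRemover}

val df = spark.createDataFrame(Seq((0, "I won't go"))).toDF("id", "text")

val tokenizer = new RegexTokenizer()
  .setInputCol("text").setOutputCol("tokens")
  .setPattern("\\W+")   // split on non-word chars, so "won't" -> "won", "t"

val remover = new StopWordsRemover()
  .setInputCol("tokens").setOutputCol("filtered")

// Shows which tokens survive; per the discussion above, "won't" has already
// become "won" and "t" by the time the stop-word list is applied.
remover.transform(tokenizer.transform(df)).show(false)
{code}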

> Incorrect words in StopWords/english.txt
> 
>
> Key: SPARK-18374
> URL: https://issues.apache.org/jira/browse/SPARK-18374
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.1
>Reporter: nirav patel
>
> I was just double-checking english.txt's list of stopwords, as I felt it was 
> taking out valid tokens like 'won'. I think the issue is that the english.txt list is 
> missing the apostrophe character and all characters after the apostrophe. So "won't" 
> became "won" in that list, and "wouldn't" became "wouldn".
> Here are some incorrect tokens in this list:
> won
> wouldn
> ma
> mightn
> mustn
> needn
> shan
> shouldn
> wasn
> weren
> I think the ideal list should have both styles, i.e. both won't and wont should be 
> part of english.txt, as some tokenizers might remove special characters. But 
> 'won' obviously shouldn't be in this list.
> Here's list of snowball english stop words:
> http://snowball.tartarus.org/algorithms/english/stop.txt



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17059) Allow FileFormat to specify partition pruning strategy

2016-11-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652068#comment-15652068
 ] 

Apache Spark commented on SPARK-17059:
--

User 'pwoody' has created a pull request for this issue:
https://github.com/apache/spark/pull/15835

> Allow FileFormat to specify partition pruning strategy
> --
>
> Key: SPARK-17059
> URL: https://issues.apache.org/jira/browse/SPARK-17059
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
>
> Allow Spark to have pluggable pruning of input files for FileSourceScanExec 
> by allowing FileFormats to specify a format-specific filterPartitions method.
> This is especially useful for Parquet as Spark does not currently make use of 
> the summary metadata, instead reading the footer of all part files for a 
> Parquet data source. This can lead to massive speedups when reading a 
> filtered chunk of a dataset, especially when using remote storage (S3).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18389) Disallow cyclic view reference

2016-11-09 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652039#comment-15652039
 ] 

Nattavut Sutyanyong edited comment on SPARK-18389 at 11/9/16 9:10 PM:
--

In CREATE VIEW, if we check that the objects referenced in the view definition 
exist in the Catalog, then CREATE VIEW cannot create a cyclic reference. Only 
ALTER VIEW can create this problem. We will need to expand the objects referenced, 
and any descendant views, recursively during the processing of ALTER VIEW.
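
A minimal sketch of that recursive expansion, just to illustrate the check; the {{viewDependencies}} lookup is hypothetical, not an existing catalog API.

{code}
// Returns true if pointing `altered` at `newDeps` would introduce a cycle:
// the ALTER is cyclic iff `altered` is reachable from one of its new dependencies.
def createsCycle(altered: String,
                 newDeps: Set[String],
                 viewDependencies: String => Set[String]): Boolean = {
  def reachable(v: String, seen: Set[String]): Set[String] =
    if (seen.contains(v)) seen
    else viewDependencies(v).foldLeft(seen + v)((acc, d) => reachable(d, acc))

  newDeps.exists(d => reachable(d, Set.empty).contains(altered))
}
{code}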

I have a side question regarding how Spark deals with the concurrency and 
locking of catalog changes.

{code}
time|   session 1  |  session 2 
T1  | CREATE TAB T1|
T2  | CREATE VIEW V1 -> T1 |
T3  | CREATE VIEW V2 -> V1 |
T4  | CREATE VIEW V3 -> V2 | ALTER VIEW V2-> V3 
{code}

Does the current code have a mechanism to lock the entry of V2 during the 
compilation of VIEW V3 in time T4 to prevent any change made to the definition 
of V2 in session 2?


was (Author: nsyca):
In CREATE VIEW, if we check that the objects referenced in the view definition 
exist in the Catalog, then CREATE VIEW cannot create a cyclic reference. Only 
ALTER VIEW can create this problem. We will need to expand the objects referenced, 
and any descendant views, recursively during the processing of ALTER VIEW.

I have a side question regarding how Spark deals with the concurrency and 
locking of catalog changes.


time|   session 1  |  session 2 

T1  | CREATE TAB T1|
T2  | CREATE VIEW V1 -> T1 |
T3  | CREATE VIEW V2 -> V1 |
T4  | CREATE VIEW V3 -> V2 | ALTER VIEW V2-> V3 


Does the current code have a mechanism to lock the entry of V2 during the 
compilation of VIEW V3 in time T4 to prevent any change made to the definition 
of V2 in session 2?

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18389) Disallow cyclic view reference

2016-11-09 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652039#comment-15652039
 ] 

Nattavut Sutyanyong commented on SPARK-18389:
-

In CREATE VIEW, if we check that the objects referenced in the view definition 
exist in the Catalog, then CREATE VIEW cannot create a cyclic reference. Only 
ALTER VIEW can create this problem. We will need to expand the objects referenced, 
and any descendant views, recursively during the processing of ALTER VIEW.

I have a side question regarding how Spark deals with the concurrency and 
locking of catalog changes.


time|   session 1  |  session 2 

T1  | CREATE TAB T1|
T2  | CREATE VIEW V1 -> T1 |
T3  | CREATE VIEW V2 -> V1 |
T4  | CREATE VIEW V3 -> V2 | ALTER VIEW V2-> V3 


Does the current code have a mechanism to lock the entry of V2 during the 
compilation of VIEW V3 in time T4 to prevent any change made to the definition 
of V2 in session 2?

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652022#comment-15652022
 ] 

Herman van Hovell commented on SPARK-18390:
---

I don't think this is a bug. It is doing exactly what you expect it to do: 
preventing you from accidentally using a cartesian join.

> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>Assignee: Srinath
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior, because it was not the user who 
> suggested using a cartesian product; SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18363) Connected component for large graph result is wrong

2016-11-09 Thread Philip Adetiloye (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652009#comment-15652009
 ] 

Philip Adetiloye commented on SPARK-18363:
--

I logged a similar issue with GraphFrames, but the problem also exists in GraphX.

Basically I'm trying to cluster a hierarchical dataset. This works fine for a 
small dataset; I can cluster the data into separate clusters.

However, for a large hierarchical dataset (about 1.6 million vertices) the 
result seems wrong. The resulting clusters from connected components have many 
intersections. This should not be the case: I expect the hierarchical dataset to 
be clustered into separate smaller clusters.

{code}
val vertices = universe.map(u => (u.id, u.username, u.age, u.gamescore))
  .toDF("id", "username", "age", "gamescore")
  .alias("v")

val lookup =
  sparkSession.sparkContext.broadcast(universeMap.rdd.collectAsMap())

def buildEdges(src: String, dest: String) = {
  Edge(lookup.value.get(src).get, lookup.value.get(dest).get, 0)
}

val edges = similarityDatasetNoJboss.mapPartitions(_.map(s =>
    buildEdges(s.username1, s.username2)))
  .toDF("src", "dst", "default")

val graph = GraphFrame(vertices, edges)

val cc = graph.connectedComponents.run().select("id", "component")
{code}

Some validation testing:

{code}
SELECT id, count(component)
FROM cc
GROUP BY id
{code}

I expect each {{id}} to belong to one cluster/component (count = 1); instead, each 
{{id}} belongs to multiple clusters/components.

> Connected component for large graph result is wrong
> ---
>
> Key: SPARK-18363
> URL: https://issues.apache.org/jira/browse/SPARK-18363
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.0.1
>Reporter: Philip Adetiloye
>
> The clustering done by the GraphX connected components algorithm doesn't seem to 
> work correctly with large graphs.
> It only works correctly on a small graph.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Srinath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651980#comment-15651980
 ] 

Srinath commented on SPARK-18390:
-

FYI, these are in branch 2.1
{noformat}
commit e6132a6cf10df8b12af8dd8d1a2c563792b5cc5a
Author: Srinath Shankar 
Date:   Sat Sep 3 00:20:43 2016 +0200

[SPARK-17298][SQL] Require explicit CROSS join for cartesian products

{noformat}
and
{noformat}
commit 2d96d35dc0fed6df249606d9ce9272c0f0109fa2
Author: Srinath Shankar 
Date:   Fri Oct 14 18:24:47 2016 -0700

[SPARK-17946][PYSPARK] Python crossJoin API similar to Scala
{noformat}

With the above two changes, if a user requests a cross join (with the crossJoin 
API), the join will always be performed regardless of the physical plan chosen.
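
For completeness, a small sketch of what that looks like from user code, reusing the df2/df3 from the description (illustrative only):

{code}
// Explicitly request the cartesian product via the crossJoin API (SPARK-17946) ...
val crossed = df3.crossJoin(df2)

// ... or opt back into implicit cartesian products globally, as the error suggests:
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df3.join(df2, df3("id") === df2("one")).count()
{code}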

> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>Assignee: Srinath
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior, because it was not the user who 
> suggested using a cartesian product; SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18370) InsertIntoHadoopFsRelationCommand should keep track of its table

2016-11-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-18370.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> InsertIntoHadoopFsRelationCommand should keep track of its table
> 
>
> Key: SPARK-18370
> URL: https://issues.apache.org/jira/browse/SPARK-18370
> Project: Spark
>  Issue Type: Bug
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.1.0
>
>
> When we plan an {{InsertIntoHadoopFsRelationCommand}}, we drop the {{Table}} 
> name. This is quite annoying when debugging plans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651947#comment-15651947
 ] 

Herman van Hovell commented on SPARK-18390:
---

[~vssrinath] Can you take a look?

> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>Assignee: Srinath
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior, because it was not the user who 
> suggested using a cartesian product; SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell updated SPARK-18390:
--
Assignee: Srinath

> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>Assignee: Srinath
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior because it was not the user who 
> suggested using cartesian product. SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18343) FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write

2016-11-09 Thread Luke Miner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651942#comment-15651942
 ] 

Luke Miner commented on SPARK-18343:


Any suggestions on how one might hunt down that library? I've included my 
build.sbt.
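
One way to narrow this down, as a rough diagnostic sketch (assuming it is run in the driver right after {{sc.stop()}}; the printed thread names and stack frames usually point at the library that created them):

{code}
// Rough diagnostic sketch: list the non-daemon threads that can keep the
// driver JVM from exiting after sc.stop().
import scala.collection.JavaConverters._

val blockers = Thread.getAllStackTraces.keySet.asScala
  .filter(t => t.isAlive && !t.isDaemon && t != Thread.currentThread())

blockers.foreach { t =>
  println(s"still alive: ${t.getName} (state=${t.getState})")
  t.getStackTrace.take(5).foreach(frame => println(s"    at $frame"))
}
{code}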

> FileSystem$Statistics$StatisticsDataReferenceCleaner hangs on s3 write
> --
>
> Key: SPARK-18343
> URL: https://issues.apache.org/jira/browse/SPARK-18343
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.1
> Environment: Spark 2.0.1
> Hadoop 2.7.1
> Mesos 1.0.1
> Ubuntu 14.04
>Reporter: Luke Miner
>
> I have a driver program where I read data in from Cassandra using 
> Spark, perform some operations, and then write out to JSON on S3. The program 
> runs fine when I use Spark 1.6.1 and the spark-cassandra-connector 1.6.0-M1.
> However, if I try to upgrade to Spark 2.0.1 (hadoop 2.7.1) and 
> spark-cassandra-connector 2.0.0-M3, the program completes in the sense that 
> all the expected files are written to S3, but the program never terminates.
> I do run `sc.stop()` at the end of the program. I am also using Mesos 1.0.1. 
> In both cases I use the default output committer.
> From the thread dump (included below) it seems like it could be waiting on: 
> `org.apache.hadoop.fs.FileSystem$Statistics$StatisticsDataReferenceCleaner`
> Code snippet:
> {code}
> // get MongoDB oplog operations
> val operations = sc.cassandraTable[JsonOperation](keyspace, namespace)
>   .where("ts >= ? AND ts < ?", minTimestamp, maxTimestamp)
> 
> // replay oplog operations into documents
> val documents = operations
>   .spanBy(op => op.id)
>   .map { case (id: String, ops: Iterable[T]) => (id, apply(ops)) }
>   .filter { case (id, result) => result.isInstanceOf[Document] }
>   .map { case (id, document) => MergedDocument(id = id, document = 
> document
> .asInstanceOf[Document])
>   }
> 
> // write documents to json on s3
> documents
>   .map(document => document.toJson)
>   .coalesce(partitions)
>   .saveAsTextFile(path, classOf[GzipCodec])
> sc.stop()
> {code}
> build.sbt
> {code}
> scalaVersion := "2.11.8"
> ivyScala := ivyScala.value map {
>   _.copy(overrideScalaVersion = true)
> }
> resolvers += Resolver.mavenLocal
> // spark
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % 
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % 
> "provided"
> // spark cassandra connector
> libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % 
> "2.0.0-M3" excludeAll
>   ExclusionRule(organization = "org.apache.spark")
> // other libraries
> libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.11"
> libraryDependencies += "org.rogach" %% "scallop" % "2.0.1"
> libraryDependencies += "com.github.nscala-time" %% "nscala-time" % "1.8.0"
> // test
> libraryDependencies += "org.scalatest" % "scalatest_2.11" % "2.2.6" % "test"
> libraryDependencies += "junit" % "junit" % "4.11" % "test"
> libraryDependencies += "com.novocode" % "junit-interface" % "0.11" % "test"
> assemblyOption in assembly := (assemblyOption in 
> assembly).value.copy(includeScala = false)
> net.virtualvoid.sbt.graph.Plugin.graphSettings
> fork := true
> assemblyShadeRules in assembly := Seq(
> ShadeRule.rename("com.google.common.**" -> "shadedguava.@1").inAll
>   )
>   
> assemblyMergeStrategy in assembly := {
>   case PathList(ps@_*) if ps.last endsWith ".properties" => 
> MergeStrategy.first
>   case x =>
> val oldStrategy = (assemblyMergeStrategy in assembly).value
> oldStrategy(x)
> }
> scalacOptions += "-deprecation"
> {code}
> Thread dump on the driver:
> {code}
> 60  context-cleaner-periodic-gc TIMED_WAITING
> 46  dag-scheduler-event-loopWAITING
> 4389DestroyJavaVM   RUNNABLE
> 12  dispatcher-event-loop-0 WAITING
> 13  dispatcher-event-loop-1 WAITING
> 14  dispatcher-event-loop-2 WAITING
> 15  dispatcher-event-loop-3 WAITING
> 47  driver-revive-threadTIMED_WAITING
> 3   Finalizer   WAITING
> 82  ForkJoinPool-1-worker-17WAITING
> 43  heartbeat-receiver-event-loop-threadTIMED_WAITING
> 93  java-sdk-http-connection-reaper TIMED_WAITING
> 4387java-sdk-progress-listener-callback-thread  WAITING
> 25  map-output-dispatcher-0 WAITING
> 26  map-output-dispatcher-1 WAITING
> 27  map-output-dispatcher-2 WAITING
> 28  map-output-dispatcher-3 WAITING
> 29  map-output-dispatcher-4 WAITING
> 30  map-output-dispatcher-5 WAITING
> 31  map-output-dispatcher-6 WAITING
> 32  map-output-dispatcher-7 WAITING
> 48  MesosCoarseGrainedSchedulerBackend-mesos-driver RUNNABLE
> 44  

[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651927#comment-15651927
 ] 

Yin Huai commented on SPARK-18390:
--

Can you do an explain?
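
For reference, a minimal sketch of what that would look like with the snippet from the description (df2/df3 as defined there):

{code}
// Minimal sketch: print the parsed/analyzed/optimized/physical plans for the
// failing join (extended = true shows all phases).
df3.join(df2, df3("id") === df2("one")).explain(true)
{code}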

> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior because it was not the user who 
> suggested using cartesian product. SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18387) Test that expressions can be serialized

2016-11-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18387:

Target Version/s: 2.0.3, 2.1.0  (was: 2.1.0)

> Test that expressions can be serialized
> ---
>
> Key: SPARK-18387
> URL: https://issues.apache.org/jira/browse/SPARK-18387
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Ryan Blue
>Priority: Blocker
>
> SPARK-18368 fixes regexp_replace when it is serialized. One of the reviews 
> requested updating the tests so that all expressions that are tested using 
> checkEvaluation are first serialized. That caused several new [test 
> failures], so this issue is to add serialization to the tests ([pick commit 
> 10805059|https://github.com/apache/spark/commit/10805059]) and fix the bugs 
> serialization exposes.
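
As a rough illustration of the kind of round trip such tests add (plain Java serialization before evaluation; {{roundTrip}} is a hypothetical helper, not the actual change to {{checkEvaluation}}):

{code}
// Illustrative sketch only: serialize an object and read it back before use,
// so non-serializable expressions fail the test instead of failing at runtime.
import java.io._

def roundTrip[T <: AnyRef](obj: T): T = {
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(obj)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
  in.readObject().asInstanceOf[T]
}

// checkEvaluation(roundTrip(expression), expectedValue)   // hypothetical usage
{code}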



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18387) Test that expressions can be serialized

2016-11-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-18387:

Target Version/s: 2.1.0
Priority: Blocker  (was: Major)

> Test that expressions can be serialized
> ---
>
> Key: SPARK-18387
> URL: https://issues.apache.org/jira/browse/SPARK-18387
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Ryan Blue
>Priority: Blocker
>
> SPARK-18368 fixes regexp_replace when it is serialized. One of the reviews 
> requested updating the tests so that all expressions that are tested using 
> checkEvaluation are first serialized. That caused several new [test 
> failures], so this issue is to add serialization to the tests ([pick commit 
> 10805059|https://github.com/apache/spark/commit/10805059]) and fix the bugs 
> serialization exposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18389) Disallow cyclic view reference

2016-11-09 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651865#comment-15651865
 ] 

Nattavut Sutyanyong commented on SPARK-18389:
-

Yes. CREATE/ALTER is not the place where we should do extensive checking; maybe 
syntax checking, but we should leave the relation name as an unresolved relation. 
Otherwise, you would need to do nested expansion of the objects in CREATE/ALTER.

Would my proposal fit well with the way Spark is implemented today?

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18389) Disallow cyclic view reference

2016-11-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651873#comment-15651873
 ] 

Reynold Xin edited comment on SPARK-18389 at 11/9/16 7:51 PM:
--

It'd make more sense to do this check during the command execution so we fail 
early.

We can do the full expansion in the analyzer. Doing full expansion in analyzer 
doesn't mean we need to use the fully expanded plan to generate the SQL for the 
view.


was (Author: rxin):
It'd make more sense to do this check during the command execution so we fail 
early.

We can do the full expansion in the analyzer.


> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18389) Disallow cyclic view reference

2016-11-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651873#comment-15651873
 ] 

Reynold Xin commented on SPARK-18389:
-

It'd make more sense to do this check during the command execution so we fail 
early.

We can do the full expansion in the analyzer.


> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-18390:
---
Description: 
{code}
val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
val df3 = spark.range(1e9.toInt)
df3.join(df2, df3("id") === df2("one")).count()
{code}

throws

bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
prohibitively expensive and are disabled by default. To explicitly enable them, 
please set spark.sql.crossJoin.enabled = true;

This is probably not the right behavior because it was not the user who 
suggested using cartesian product. SQL picked it while knowing it is not 
enabled.

  was:
I hit this error when I tried to test skewed joins.

{code}
val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
val df3 = spark.range(1e9.toInt)
df3.join(df2, df3("id") === df2("one")).count()
{code}

throws

{code}
org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively 
expensive and are disabled by default. To explicitly enable them, please set 
spark.sql.crossJoin.enabled = true;
{code}

This is probably not the right behavior because it was not the user who 
suggested using cartesian product. SQL picked it while knowing it is not 
enabled.


> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> bq. org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> This is probably not the right behavior because it was not the user who 
> suggested using cartesian product. SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-18390:
--
Description: 
I hit this error when I tried to test skewed joins.

{code}
val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
val df3 = spark.range(1e9.toInt)
df3.join(df2, df3("id") === df2("one")).count()
{code}

throws

{code}
org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively 
expensive and are disabled by default. To explicitly enable them, please set 
spark.sql.crossJoin.enabled = true;
{code}

This is probably not the right behavior because it was not the user who 
suggested using cartesian product. SQL picked it while knowing it is not 
enabled.

  was:
I hit this error when I tried to test skewed joins.

{code}
val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
val df3 = spark.range(1e9.toInt)
df3.join(df2, df3("id") === df2("one")).count()
{code}

throws

{noformat}
org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively 
expensive and are disabled by default. To explicitly enable them, please set 
spark.sql.crossJoin.enabled = true;
{noformat}

This is probably not the right behavior because it was not the user who 
suggested using cartesian product. SQL picked it while knowing it is not 
enabled.


> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>
> I hit this error when I tried to test skewed joins.
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> {code}
> org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> {code}
> This is probably not the right behavior because it was not the user who 
> suggested using cartesian product. SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-18390:
--
Description: 
I hit this error when I tried to test skewed joins.

{code}
val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
val df3 = spark.range(1e9.toInt)
df3.join(df2, df3("id") === df2("one")).count()
{code}

throws

{noformat}
org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively 
expensive and are disabled by default. To explicitly enable them, please set 
spark.sql.crossJoin.enabled = true;
{noformat}

This is probably not the right behavior because it was not the user who 
suggested using cartesian product. SQL picked it while knowing it is not 
enabled.

  was:
{code}
val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
val df3 = spark.range(1e9.toInt)
df3.join(df2, df3("id") === df2("one")).count()
{code}

throws

{noformat}
org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively 
expensive and are disabled by default. To explicitly enable them, please set 
spark.sql.crossJoin.enabled = true;
{noformat}

This is probably not the right behavior because it was not the user who 
suggested using cartesian product. SQL picked it while knowing it is not 
enabled.


> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>
> I hit this error when I tried to test skewed joins.
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> {noformat}
> org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> {noformat}
> This is probably not the right behavior because it was not the user who 
> suggested using cartesian product. SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651859#comment-15651859
 ] 

Xiangrui Meng commented on SPARK-18390:
---

cc: [~yhuai] [~lian cheng]

> Optimized plan tried to use Cartesian join when it is not enabled
> -
>
> Key: SPARK-18390
> URL: https://issues.apache.org/jira/browse/SPARK-18390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Xiangrui Meng
>
> {code}
> val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
> val df3 = spark.range(1e9.toInt)
> df3.join(df2, df3("id") === df2("one")).count()
> {code}
> throws
> {noformat}
> org.apache.spark.sql.AnalysisException: Cartesian joins could be 
> prohibitively expensive and are disabled by default. To explicitly enable 
> them, please set spark.sql.crossJoin.enabled = true;
> {noformat}
> This is probably not the right behavior because it was not the user who 
> suggested using cartesian product. SQL picked it while knowing it is not 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18390) Optimized plan tried to use Cartesian join when it is not enabled

2016-11-09 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-18390:
-

 Summary: Optimized plan tried to use Cartesian join when it is not 
enabled
 Key: SPARK-18390
 URL: https://issues.apache.org/jira/browse/SPARK-18390
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1
Reporter: Xiangrui Meng


{code}
val df2 = spark.range(1e9.toInt).withColumn("one", lit(1))
val df3 = spark.range(1e9.toInt)
df3.join(df2, df3("id") === df2("one")).count()
{code}

throws

{noformat}
org.apache.spark.sql.AnalysisException: Cartesian joins could be prohibitively 
expensive and are disabled by default. To explicitly enable them, please set 
spark.sql.crossJoin.enabled = true;
{noformat}

This is probably not the right behavior because it was not the user who 
suggested using cartesian product. SQL picked it while knowing it is not 
enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18389) Disallow cyclic view reference

2016-11-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651832#comment-15651832
 ] 

Xiao Li commented on SPARK-18389:
-

Are you saying we do not need to detect it at `CREATE VIEW` and `ALTER VIEW`?

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18336) SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2

2016-11-09 Thread Egor Pahomov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Egor Pahomov closed SPARK-18336.

Resolution: Invalid

Moving from 1.6 to 2.0 forced me to use spark-submit instead of the Spark 
assembly jar. I thought that if I defined driver memory in spark-defaults.conf, 
spark-submit would read it before creating the JVM and would create the driver 
with the configured memory. I was wrong - I needed to specify it through a 
parameter. It's counterintuitive, but I was wrong to make such an assumption. 

> SQL started to fail with OOM and etc. after move from 1.6.2 to 2.0.2
> 
>
> Key: SPARK-18336
> URL: https://issues.apache.org/jira/browse/SPARK-18336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>
> I had several(~100) quires, which were run one after another in single spark 
> context. I can provide code of runner - it's very simple. It worked fine on 
> 1.6.2, than I moved to 2551d959a6c9fb27a54d38599a2301d735532c24 (branch-2.0 
> on 31.10.2016 17:04:12). It started to fail with OOM and other errors. When I 
> separate my 100 quires to 2 sets and run set after set it works fine. I would 
> suspect problems with memory on driver, but nothing points to that. 
> My conf: 
> {code}
> lazy val sparkConfTemplate = new SparkConf()
> .setMaster("yarn-client")
> .setAppName(appName)
> .set("spark.executor.memory", "25g")
> .set("spark.executor.instances", "40")
> .set("spark.dynamicAllocation.enabled", "false")
> .set("spark.yarn.executor.memoryOverhead", "3000")
> .set("spark.executor.cores", "6")
> .set("spark.driver.memory", "25g")
> .set("spark.driver.cores", "5")
> .set("spark.yarn.am.memory", "20g")
> .set("spark.shuffle.io.numConnectionsPerPeer", "5")
> .set("spark.sql.autoBroadcastJoinThreshold", "10")
> .set("spark.network.timeout", "4000s")
> .set("spark.driver.maxResultSize", "5g")
> .set("spark.sql.parquet.compression.codec", "gzip")
> .set("spark.kryoserializer.buffer.max", "1200m")
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> .set("spark.yarn.driver.memoryOverhead", "1000")
> .set("spark.scheduler.mode", "FIFO")
> .set("spark.sql.broadcastTimeout", "2")
> .set("spark.akka.frameSize", "200")
> .set("spark.sql.shuffle.partitions", partitions)
> .set("spark.network.timeout", "1000s")
> 
> .setJars(List(this.getClass.getProtectionDomain().getCodeSource().getLocation().toURI().getPath()))
> {code}
> Errors, which started to happen:
> {code}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f04c6cf3ea8, pid=17479, tid=139658116687616
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x64bea8]  
> InstanceKlass::oop_follow_contents(ParCompactionManager*, oopDesc*)+0x88
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # An error report file with more information is saved as:
> # /home/egor/hs_err_pid17479.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> {code}
> {code}
> Exception in thread "refresh progress" java.lang.OutOfMemoryError: Java heap 
> space
>   at scala.collection.immutable.Iterable$.newBuilder(Iterable.scala:44)
>   at scala.collection.Iterable$.newBuilder(Iterable.scala:50)
>   at 
> scala.collection.generic.GenericTraversableTemplate$class.genericBuilder(GenericTraversableTemplate.scala:70)
>   at 
> scala.collection.AbstractTraversable.genericBuilder(Traversable.scala:104)
>   at 
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:57)
>   at 
> scala.collection.generic.GenTraversableFactory$GenericCanBuildFrom.apply(GenTraversableFactory.scala:52)
>   at 
> scala.collection.TraversableLike$class.builder$1(TraversableLike.scala:229)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.SparkStatusTracker.getActiveStageIds(SparkStatusTracker.scala:61)
>   at 
> org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:66)
>   at 
> org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:54)
>   at java.util.TimerThread.mainLoop(Unknown Source)
>   at 

[jira] [Commented] (SPARK-18389) Disallow cyclic view reference

2016-11-09 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651834#comment-15651834
 ] 

Xiao Li commented on SPARK-18389:
-

Are you saying we do not need to detect it at `CREATE VIEW` and `ALTER VIEW`?

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-18389) Disallow cyclic view reference

2016-11-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-18389:

Comment: was deleted

(was: Are you saying we do not need to detect it at `CREATE VIEW` and `ALTER 
VIEW`?)

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18211) Spark SQL ignores split.size

2016-11-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-18211:


Assignee: Michael Armbrust

> Spark SQL ignores split.size
> 
>
> Key: SPARK-18211
> URL: https://issues.apache.org/jira/browse/SPARK-18211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: lostinoverflow
>Assignee: Michael Armbrust
>
> I expect that RDD and DataFrame will have the same number of partitions 
> (worked in 1.6) but it looks like Spark SQL ignores Hadoop configuration.
> {code}
> import org.apache.spark.sql.SparkSession
> object App {
>   def main(args: Array[String]) {
> val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .appName("split size")
>   .getOrCreate()
> spark.sparkContext.hadoopConfiguration.setInt("mapred.min.split.size", 
> args(0).toInt)
> spark.sparkContext.hadoopConfiguration.setInt("mapred.max.split.size", 
> args(0).toInt)
> println(spark.sparkContext.textFile(args(1)).partitions.size)
> println(spark.read.textFile(args(1)).rdd.partitions.size)
> spark.stop()
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18211) Spark SQL ignores split.size

2016-11-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-18211.
--
Resolution: Not A Problem

As of Spark 2.0 we do our own splitting/bin-packing of files into tasks, rather 
than relying on Hadoop InputFormats. The configuration you are probably looking 
for is {{spark.sql.files.maxPartitionBytes}}, which you can set on the 
SQLContext.
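
A minimal sketch of setting it on a 2.x session (the 64 MB value and path are illustrative only):

{code}
// Illustrative sketch: shrink the per-task file partition size and observe the
// resulting partition count. 64 MB below is an arbitrary example value.
spark.conf.set("spark.sql.files.maxPartitionBytes", 64L * 1024 * 1024)
println(spark.read.textFile("file:///tmp/data.txt").rdd.partitions.size)
{code}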

> Spark SQL ignores split.size
> 
>
> Key: SPARK-18211
> URL: https://issues.apache.org/jira/browse/SPARK-18211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: lostinoverflow
>
> I expect that RDD and DataFrame will have the same number of partitions 
> (worked in 1.6) but it looks like Spark SQL ignores Hadoop configuration.
> {code}
> import org.apache.spark.sql.SparkSession
> object App {
>   def main(args: Array[String]) {
> val spark = SparkSession
>   .builder()
>   .master("local[*]")
>   .appName("split size")
>   .getOrCreate()
> spark.sparkContext.hadoopConfiguration.setInt("mapred.min.split.size", 
> args(0).toInt)
> spark.sparkContext.hadoopConfiguration.setInt("mapred.max.split.size", 
> args(0).toInt)
> println(spark.sparkContext.textFile(args(1)).partitions.size)
> println(spark.read.textFile(args(1)).rdd.partitions.size)
> spark.stop()
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18389) Disallow cyclic view reference

2016-11-09 Thread Nattavut Sutyanyong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651816#comment-15651816
 ] 

Nattavut Sutyanyong commented on SPARK-18389:
-

The ALTER VIEW should run successfully, keeping the new definition of testView 
in a textual format. It is when we expand the definition of testView in the 
Analyzer that we need to keep the DAG of all the derived definitions of testView 
and flag an error when the DAG becomes a cyclic graph.
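
As a rough sketch of the check described above (the dependency map is a hypothetical stand-in for whatever the Analyzer collects while expanding views; it is not an existing Spark API):

{code}
// Illustrative sketch: depth-first walk over a view-dependency map, returning
// one cyclic path if the expansion ever revisits a view already on the path.
def findCycle(deps: Map[String, Set[String]]): Option[Seq[String]] = {
  def visit(view: String, path: Vector[String]): Option[Seq[String]] = {
    if (path.contains(view)) {
      Some(path.dropWhile(_ != view) :+ view)
    } else {
      deps.getOrElse(view, Set.empty).iterator
        .map(child => visit(child, path :+ view))
        .collectFirst { case Some(cycle) => cycle }
    }
  }
  deps.keys.iterator.map(v => visit(v, Vector.empty))
    .collectFirst { case Some(cycle) => cycle }
}

// findCycle(Map("testView" -> Set("testView2"), "testView2" -> Set("testView")))
// => Some(Vector(testView, testView2, testView))
{code}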

> Disallow cyclic view reference
> --
>
> Key: SPARK-18389
> URL: https://issues.apache.org/jira/browse/SPARK-18389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> The following should not be allowed:
> {code}
> CREATE VIEW testView AS SELECT id FROM jt
> CREATE VIEW testView2 AS SELECT id FROM testView
> ALTER VIEW testView AS SELECT * FROM testView2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18131) Support returning Vector/Dense Vector from backend

2016-11-09 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651811#comment-15651811
 ] 

Felix Cheung commented on SPARK-18131:
--

I think it's good to have a wrapper, but as you say we should make sure that 
with "head" or "collect" (or collect(col) or head(col) on Column, which is a 
pending PR) it can convert to a native R matrix when available.


> Support returning Vector/Dense Vector from backend
> --
>
> Key: SPARK-18131
> URL: https://issues.apache.org/jira/browse/SPARK-18131
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> For `spark.logit`, there is a `probabilityCol`, which is a vector in the 
> backend (Scala side). When we do collect(select(df, "probabilityCol")), the 
> backend returns the Java object handle (memory address). We need to implement 
> a method to convert a Vector/DenseVector column to an R vector, which can be 
> read in SparkR. It is a follow-up JIRA of adding `spark.logit`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-09 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651806#comment-15651806
 ] 

Saikat Kanjilal edited comment on SPARK-9487 at 11/9/16 7:22 PM:
-

Ok, for some odd reason my local branch had the changes but they weren't 
committed; the PR is here:  
https://github.com/skanjila/spark/commit/ec0b2a81dc8362e84e70457873560d997a7cb244

I added the change to local[4] to both streaming and repl. Based on what I'm 
seeing locally, all Java/Scala changes should be accounted for and the unit 
tests pass. The only exception is the code inside the Spark examples in 
PageViewStream.scala - should I change this? It seems like it doesn't belong as 
part of the unit tests.


My next TODOs:
1) Change the example code if it makes sense in PageViewStream
2) Start the code changes to fix the python unit tests

Let me know thoughts or concerns.


was (Author: kanjilal):
Ok , for some odd reason my local branch had the changes but weren't committed, 
PR is here:  
https://github.com/skanjila/spark/commit/ec0b2a81dc8362e84e70457873560d997a7cb244

I added the change to local[4] to both streaming as well as repl, based on what 
I'm seeing locally all Java/Scala changes should be accounted for except for 
spark examples with the code inside PageViewStream.scala, should I change this, 
seems like it doesn't belong as part of the unit tests.


My next TODOs:
1) Change the example code if it makes sense in PageViewStream
2) Start the code changes to fix the python unit tests

Let me know thoughts or concerns.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10816) EventTime based sessionization

2016-11-09 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10816:
-
Target Version/s:   (was: 2.2.0)

> EventTime based sessionization
> --
>
> Key: SPARK-10816
> URL: https://issues.apache.org/jira/browse/SPARK-10816
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Reporter: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9487) Use the same num. worker threads in Scala/Python unit tests

2016-11-09 Thread Saikat Kanjilal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651806#comment-15651806
 ] 

Saikat Kanjilal commented on SPARK-9487:


Ok, for some odd reason my local branch had the changes but they weren't 
committed; the PR is here:  
https://github.com/skanjila/spark/commit/ec0b2a81dc8362e84e70457873560d997a7cb244

I added the change to local[4] to both streaming and repl. Based on what I'm 
seeing locally, all Java/Scala changes should be accounted for, except for the 
Spark examples code inside PageViewStream.scala - should I change this? It 
seems like it doesn't belong as part of the unit tests.


My next TODOs:
1) Change the example code if it makes sense in PageViewStream
2) Start the code changes to fix the python unit tests

Let me know thoughts or concerns.

> Use the same num. worker threads in Scala/Python unit tests
> ---
>
> Key: SPARK-9487
> URL: https://issues.apache.org/jira/browse/SPARK-9487
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core, SQL, Tests
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>  Labels: starter
> Attachments: ContextCleanerSuiteResults, HeartbeatReceiverSuiteResults
>
>
> In Python we use `local[4]` for unit tests, while in Scala/Java we use 
> `local[2]` and `local` for some unit tests in SQL, MLLib, and other 
> components. If the operation depends on partition IDs, e.g., random number 
> generator, this will lead to different result in Python and Scala/Java. It 
> would be nice to use the same number in all unit tests.
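
For reference, a minimal sketch of what aligning a Scala/Java suite on four worker threads looks like (the names are illustrative, not a specific test in the PR):

{code}
// Illustrative sketch: pin a test SparkContext to four local worker threads so
// partition-dependent behaviour matches the Python suites.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setMaster("local[4]").setAppName("unit-test")
val sc = new SparkContext(conf)
try {
  assert(sc.parallelize(1 to 8).getNumPartitions == 4)  // defaultParallelism = 4
} finally {
  sc.stop()
}
{code}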



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18353) spark.rpc.askTimeout default value is not 120s

2016-11-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651807#comment-15651807
 ] 

Sean Owen commented on SPARK-18353:
---

BTW [~JasonPan] it looks like you're setting JVM props and not using --conf? 
Does --conf make it work?

> spark.rpc.askTimeout default value is not 120s
> --
>
> Key: SPARK-18353
> URL: https://issues.apache.org/jira/browse/SPARK-18353
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.1
> Environment: Linux zzz 3.10.0-327.el7.x86_64 #1 SMP Thu Oct 29 
> 17:29:29 EDT 2015 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Jason Pan
>Priority: Critical
>
> In http://spark.apache.org/docs/latest/configuration.html, 
> spark.rpc.askTimeout is documented as: default 120s, "Duration for an RPC ask 
> operation to wait before timing out".
> So the default value is 120s as documented.
> However, when I run "spark-submit" in standalone cluster mode:
> the cmd is:
> Launch Command: "/opt/jdk1.8.0_102/bin/java" "-cp" 
> "/opt/spark-2.0.1-bin-hadoop2.7/conf/:/opt/spark-2.0.1-bin-hadoop2.7/jars/*" 
> "-Xmx1024M" "-Dspark.eventLog.enabled=true" 
> "-Dspark.master=spark://9.111.159.127:7101" "-Dspark.driver.supervise=false" 
> "-Dspark.app.name=org.apache.spark.examples.SparkPi" 
> "-Dspark.submit.deployMode=cluster" 
> "-Dspark.jars=file:/opt/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "-Dspark.history.ui.port=18087" "-Dspark.rpc.askTimeout=10" 
> "-Dspark.history.fs.logDirectory=file:/opt/tmp/spark-event" 
> "-Dspark.eventLog.dir=file:///opt/tmp/spark-event" 
> "org.apache.spark.deploy.worker.DriverWrapper" 
> "spark://Worker@9.111.159.127:7103" 
> "/opt/spark-2.0.1-bin-hadoop2.7/work/driver-20161109031939-0002/spark-examples-1.6.1-hadoop2.6.0.jar"
>  "org.apache.spark.examples.SparkPi" "1000"
> -Dspark.rpc.askTimeout=10
> The value is 10, which does not match the documentation.
> Note: when I submit to the REST URL, this issue does not occur.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18209) More robust view canonicalization without full SQL expansion

2016-11-09 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651798#comment-15651798
 ] 

Reynold Xin commented on SPARK-18209:
-

Here is a ticket https://issues.apache.org/jira/browse/SPARK-18389

> More robust view canonicalization without full SQL expansion
> 
>
> Key: SPARK-18209
> URL: https://issues.apache.org/jira/browse/SPARK-18209
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> Spark SQL currently stores views by analyzing the provided SQL and then 
> generating fully expanded SQL out of the analyzed logical plan. This is 
> actually a very error prone way of doing it, because:
> 1. It is non-trivial to guarantee that the generated SQL is correct without 
> being extremely verbose, given the current set of operators.
> 2. We need extensive testing for all combination of operators.
> 3. Whenever we introduce a new logical plan operator, we need to be super 
> careful because it might break SQL generation. This is the main reason 
> broadcast join hint has taken forever to be merged because it is very 
> difficult to guarantee correctness.
> Given that the two primary reasons to do view canonicalization are to provide 
> the context for the database as well as star expansion, I think we can do this 
> through a simpler approach: take the user-given SQL, analyze it, and just wrap 
> the original SQL with a SELECT clause at the outer level and store the 
> database as a hint.
> For example, given the following view creation SQL:
> {code}
> USE DATABASE my_db;
> CREATE TABLE my_table (id int, name string);
> CREATE VIEW my_view AS SELECT * FROM my_table WHERE id > 10;
> {code}
> We store the following SQL instead:
> {code}
> SELECT /*+ current_db: `my_db` */ id, name FROM (SELECT * FROM my_table WHERE 
> id > 10);
> {code}
> During parsing time, we expand the view along using the provided database 
> context.
> (We don't need to follow exactly the same hint, as I'm merely illustrating 
> the high level approach here.)
> Note that there is a chance that the underlying base table(s)' schema change 
> and the stored schema of the view might differ from the actual SQL schema. In 
> that case, I think we should throw an exception at runtime to warn users. 
> This exception can be controlled by a flag.
> Update 1: based on the discussion below, we don't even need to put the view 
> definition in a sub query. We can just add it via a logical plan at the end.
> Update 2: we should make sure permanent views do not depend on temporary 
> objects (views, tables, or functions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18388) Running aggregation on many columns throws SOE

2016-11-09 Thread Raviteja Lokineni (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Raviteja Lokineni updated SPARK-18388:
--
Description: 
Usecase: I am generating weekly aggregates of every column of data

{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import *

timeSeries = sqlContext.read.option("header", 
"true").format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").load("file:///tmp/spark-bug.csv")

# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400

w = (Window()
 .partitionBy("id")
 .orderBy(col("dt").cast("timestamp").cast("long"))
 .rangeBetween(-days(6), 0))

cols = ["id", "dt"]
skipCols = ["id", "dt"]

for col in timeSeries.columns:
if col in skipCols:
continue
cols.append(mean(col).over(w).alias("mean_7_"+col))
cols.append(count(col).over(w).alias("count_7_"+col))
cols.append(sum(col).over(w).alias("sum_7_"+col))
cols.append(min(col).over(w).alias("min_7_"+col))
cols.append(max(col).over(w).alias("max_7_"+col))

df = timeSeries.select(cols)
df.orderBy('id', 
'dt').write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").save("file:///tmp/spark-bug-out.csv")
{code}

  was:
Usecase: I am generating weekly aggregates of every column of data

{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import *

timeSeries = sqlContext.read.option("header", 
"true").format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").load("file:///tmp/spark-bug.csv")

# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400

w = (Window()
 .partitionBy("id")
 .orderBy(col("dt").cast("timestamp").cast("long"))
 .rangeBetween(-days(6), 0))

cols = ["id", "dt"]
skipCols = ["id", "dt"]

for col in timeSeries.columns:
if col in skipCols:
continue
cols.append(mean(col).over(w).alias("mean_7_"+col))
cols.append(count(col).over(w).alias("count_7_"+col))
cols.append(sum(col).over(w).alias("sum_7_"+col))
cols.append(min(col).over(w).alias("min_7_"+col))
cols.append(max(col).over(w).alias("max_7_"+col))

df = timeSeries.select(cols)
df.orderBy('id', 'dt').write\
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")\
.save("file:///tmp/spark-bug-out.csv")
{code}


> Running aggregation on many columns throws SOE
> --
>
> Key: SPARK-18388
> URL: https://issues.apache.org/jira/browse/SPARK-18388
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.2, 1.6.2, 2.0.1
> Environment: PySpark 2.0.1, Jupyter
>Reporter: Raviteja Lokineni
>Priority: Critical
> Attachments: spark-bug-jupyter.py, spark-bug-stacktrace.txt, 
> spark-bug.csv
>
>
> Usecase: I am generating weekly aggregates of every column of data
> {code}
> from pyspark.sql.window import Window
> from pyspark.sql.functions import *
> timeSeries = sqlContext.read.option("header", 
> "true").format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").load("file:///tmp/spark-bug.csv")
> # Hive timestamp is interpreted as UNIX timestamp in seconds*
> days = lambda i: i * 86400
> w = (Window()
>  .partitionBy("id")
>  .orderBy(col("dt").cast("timestamp").cast("long"))
>  .rangeBetween(-days(6), 0))
> cols = ["id", "dt"]
> skipCols = ["id", "dt"]
> for col in timeSeries.columns:
> if col in skipCols:
> continue
> cols.append(mean(col).over(w).alias("mean_7_"+col))
> cols.append(count(col).over(w).alias("count_7_"+col))
> cols.append(sum(col).over(w).alias("sum_7_"+col))
> cols.append(min(col).over(w).alias("min_7_"+col))
> cols.append(max(col).over(w).alias("max_7_"+col))
> df = timeSeries.select(cols)
> df.orderBy('id', 
> 'dt').write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").save("file:///tmp/spark-bug-out.csv")
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18332) SparkR 2.1 QA: Programming guide, migration guide, vignettes updates

2016-11-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651801#comment-15651801
 ] 

Joseph K. Bradley commented on SPARK-18332:
---

I like splitting apart programming guide updates as:
* Updates for new features
* Major updates affecting the whole Spark component

I feel like it helps break the PRs into smaller tasks which are easier to 
parallelize.

> SparkR 2.1 QA: Programming guide, migration guide, vignettes updates
> 
>
> Key: SPARK-18332
> URL: https://issues.apache.org/jira/browse/SPARK-18332
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Before the release, we need to update the SparkR Programming Guide, its 
> migration guide, and the R vignettes.  Updates will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs and [SPARK-17692].
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Update R vignettes
> Note: This task is for large changes to the guides.  New features are handled 
> in [SPARK-18330].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


