[jira] [Commented] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)
[ https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393283#comment-15393283 ]

Hyukjin Kwon commented on SPARK-14536:
--------------------------------------

FYI, it seems {{ArrayType}} is not supported for JDBC for now (SPARK-8500), and therefore the array handling in https://github.com/apache/spark/blob/7ffd99ec5f267730734431097cbb700ad074bebe/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L411-L451 is dead code for now.

> NPE in JDBCRDD when array column contains nulls (postgresql)
> ------------------------------------------------------------
>
>                 Key: SPARK-14536
>                 URL: https://issues.apache.org/jira/browse/SPARK-14536
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Jeremy Smith
>              Labels: NullPointerException
>
> At https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
> it is assumed that the JDBC driver will always return a non-null `Array`
> object from the call to `getArray`, and that in the event of a null array it
> will return a non-null `Array` object with a null underlying array. But as
> you can see here
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
> that isn't the case, at least for PostgreSQL. This causes a
> `NullPointerException` whenever an array column contains null values. The
> PostgreSQL JDBC driver is probably doing the wrong thing, but even so there
> should be a null check in JDBCRDD. I'm happy to submit a PR if that would be
> helpful.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
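[Editor's note] The null check the reporter proposes could look like the following sketch. This is a hypothetical Java helper, not Spark's actual JDBCRDD code; the method name `toElements` is invented for illustration, and it assumes the driver returns an object array (not a primitive array) for non-null values.

```java
import java.sql.SQLException;

public class NullSafeArray {
    // Hypothetical null-safe handling of ResultSet.getArray's result: guard
    // against drivers (e.g. PostgreSQL's PgResultSet) that return null for a
    // SQL NULL column instead of an Array object wrapping a null.
    static Object[] toElements(java.sql.Array sqlArray) throws SQLException {
        if (sqlArray == null) {
            return null;                       // column value was SQL NULL
        }
        // Assumes an object array; elements themselves may still be null.
        return (Object[]) sqlArray.getArray();
    }

    public static void main(String[] args) throws SQLException {
        // With a null JDBC Array (what PgResultSet returns for a NULL column),
        // the guard avoids the NullPointerException described above.
        System.out.println(toElements(null)); // prints "null"
    }
}
```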
[jira] [Commented] (SPARK-8500) Support for array types in JDBCRDD
[ https://issues.apache.org/jira/browse/SPARK-8500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393282#comment-15393282 ]

Hyukjin Kwon commented on SPARK-8500:
-------------------------------------

FYI, this is still happening in 2.0.0 and the current master. The main problem is that we can't know the element type of the array before actually reading and accessing the array (see https://docs.oracle.com/javase/7/docs/api/java/sql/Array.html#getBaseType()). I did a bit of research but could not find a proper way to determine the element type of the array from the {{MetaData}}. This could easily be done if there were a way to find the element type, so that we could return a complete array type from {{JDBCRDD.getCatalystType}}.

> Support for array types in JDBCRDD
> ----------------------------------
>
>                 Key: SPARK-8500
>                 URL: https://issues.apache.org/jira/browse/SPARK-8500
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.0
>        Environment: MacOSX 10.10.3, Postgres 9.3.5, Spark 1.4 hadoop 2.6,
>                     Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
>                     spark-shell --driver-class-path ./postgresql-9.3-1103.jdbc41.jar
>            Reporter: michal pisanko
>
> Loading a table with a text[] column via sqlContext causes an error.
> sqlContext.load("jdbc", Map("url" -> "jdbc:postgresql://localhost/my_db", "dbtable" -> "table"))
> Table has a column:
> my_col | text[] |
> Stacktrace: https://gist.github.com/8b163bf5fdc2aea7dbb6.git
> The same occurs in the pyspark shell.
> Loading another table without a text array column works all right.
> Possible hint:
> https://github.com/apache/spark/blob/d986fb9a378416248768828e6e6c7405697f9a5a/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L57
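[Editor's note] To make the constraint concrete: even if {{java.sql.Array.getBaseType()}} were available up front, Spark would still need a mapping from the JDBC base-type code to a Catalyst element type when building the schema in {{JDBCRDD.getCatalystType}}. A minimal sketch of such a mapping follows; the method name, the type-name strings, and the fallback choice are all invented for illustration and are not Spark's actual code.

```java
import java.sql.Types;

public class ArrayBaseType {
    // Hypothetical mapping from a JDBC base-type code (as returned by
    // java.sql.Array.getBaseType(), which is only available *after* reading
    // a value) to a Catalyst element type name. The schema, however, must be
    // known before any row is read -- which is the problem described above.
    static String catalystElementType(int jdbcType) {
        switch (jdbcType) {
            case Types.INTEGER: return "IntegerType";
            case Types.BIGINT:  return "LongType";
            case Types.DOUBLE:  return "DoubleType";
            case Types.VARCHAR: return "StringType";
            default:            return "StringType"; // crude fallback for this sketch
        }
    }

    public static void main(String[] args) {
        System.out.println(catalystElementType(Types.VARCHAR)); // prints "StringType"
        System.out.println(catalystElementType(Types.INTEGER)); // prints "IntegerType"
    }
}
```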
[jira] [Commented] (SPARK-16730) Spark 2.0 breaks various Hive cast functions
[ https://issues.apache.org/jira/browse/SPARK-16730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393218#comment-15393218 ]

Apache Spark commented on SPARK-16730:
--------------------------------------

User 'petermaxlee' has created a pull request for this issue:
https://github.com/apache/spark/pull/14362

> Spark 2.0 breaks various Hive cast functions
> --------------------------------------------
>
>                 Key: SPARK-16730
>                 URL: https://issues.apache.org/jira/browse/SPARK-16730
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Peter Lee
>
> In Spark 1.x, it is possible to use "int", "string", and other functions to
> perform type cast. This functionality is broken in Spark 2.0, because Spark
> no longer falls back to Hive for these functions.
[jira] [Assigned] (SPARK-16730) Spark 2.0 breaks various Hive cast functions
[ https://issues.apache.org/jira/browse/SPARK-16730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16730:
------------------------------------

    Assignee:     (was: Apache Spark)

> Spark 2.0 breaks various Hive cast functions
> --------------------------------------------
>
>                 Key: SPARK-16730
>                 URL: https://issues.apache.org/jira/browse/SPARK-16730
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Peter Lee
>
> In Spark 1.x, it is possible to use "int", "string", and other functions to
> perform type cast. This functionality is broken in Spark 2.0, because Spark
> no longer falls back to Hive for these functions.
[jira] [Assigned] (SPARK-16730) Spark 2.0 breaks various Hive cast functions
[ https://issues.apache.org/jira/browse/SPARK-16730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-16730:
------------------------------------

    Assignee: Apache Spark

> Spark 2.0 breaks various Hive cast functions
> --------------------------------------------
>
>                 Key: SPARK-16730
>                 URL: https://issues.apache.org/jira/browse/SPARK-16730
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Peter Lee
>            Assignee: Apache Spark
>
> In Spark 1.x, it is possible to use "int", "string", and other functions to
> perform type cast. This functionality is broken in Spark 2.0, because Spark
> no longer falls back to Hive for these functions.
[jira] [Updated] (SPARK-16686) Dataset.sample with seed: result seems to depend on downstream usage
[ https://issues.apache.org/jira/browse/SPARK-16686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-16686:
--------------------------------

    Assignee: Liang-Chi Hsieh

> Dataset.sample with seed: result seems to depend on downstream usage
> --------------------------------------------------------------------
>
>                 Key: SPARK-16686
>                 URL: https://issues.apache.org/jira/browse/SPARK-16686
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.2, 2.0.0
>        Environment: Spark 1.6.2 and Spark 2.0 - RC4
>                     Standalone
>                     Single-worker cluster
>            Reporter: Joseph K. Bradley
>            Assignee: Liang-Chi Hsieh
>             Fix For: 2.1.0
>
>         Attachments: DataFrame.sample bug - 2.0.html
>
> Summary to reproduce bug:
> * Create a DataFrame DF, and sample it with a fixed seed.
> * Collect that DataFrame -> result1
> * Call a particular UDF on that DataFrame -> result2
> You would expect results 1 and 2 to use the same rows from DF, but they
> appear not to.
> Note: result1 and result2 are both deterministic.
> See the attached notebook for details. Cells in the notebook were executed
> in order.
[jira] [Commented] (SPARK-16720) Loading CSV file with 2k+ columns fails during attribute resolution on action
[ https://issues.apache.org/jira/browse/SPARK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393182#comment-15393182 ]

Hyukjin Kwon commented on SPARK-16720:
--------------------------------------

[~holdenk] I just tried to reproduce this with the code below:

{code}
val path = "/tmp/test.csv"
val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load(path)
df.take(1)
{code}

This gives me the following error:

{code}
Reference 'Daily Total check-ins' is ambiguous, could be: Daily Total check-ins#1776, Daily Total check-ins#1779.;
org.apache.spark.sql.AnalysisException: Reference 'Daily Total check-ins' is ambiguous, could be: Daily Total check-ins#1776, Daily Total check-ins#1779.;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:158)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:130)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
{code}

It seems the header has duplicated names:

{code}
root
 ...
 |-- Daily Total check-ins: integer (nullable = true)
 |-- Weekly Total check-ins: integer (nullable = true)
 |-- 28 Days Total check-ins: integer (nullable = true)
 |-- Daily Total check-ins: integer (nullable = true)
 |-- Weekly Total check-ins: integer (nullable = true)
 |-- 28 Days Total check-ins: integer (nullable = true)
 |-- Daily Total check-ins using mobile devices: integer (nullable = true)
 |-- Weekly Total check-ins using mobile devices: integer (nullable = true)
 |-- 28 Days Total check-ins using mobile devices: integer (nullable = true)
 |-- Daily Total check-ins using mobile devices: integer (nullable = true)
 ...
{code}

If I don't use the header, it seems okay:

{code}
val path = "/tmp/test.csv"
val df = spark.read.format("csv")
  .option("inferSchema", "true")
  .load(path)
df.take(1)
{code}

> Loading CSV file with 2k+ columns fails during attribute resolution on action
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-16720
>                 URL: https://issues.apache.org/jira/browse/SPARK-16720
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: holdenk
>
> Example shell for repro:
> {quote}
> scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/home/holden/Downloads/ex*.csv")
> df: org.apache.spark.sql.DataFrame = [Date: string, Lifetime Total Likes: int ... 2125 more fields]
> scala> df.schema
> res0: org.apache.spark.sql.types.StructType = StructType(StructField(Date,StringType,true), StructField(Lifetime Total Likes,IntegerType,true), StructField(Daily New Likes,IntegerType,true), StructField(Daily Unlikes,IntegerType,true), StructField(Daily Page Engaged Users,IntegerType,true), StructField(Weekly Page Engaged Users,IntegerType,true), StructField(28 Days Page Engaged Users,IntegerType,true), StructField(Daily Like Sources - On Your Page,IntegerType,true), StructField(Daily Total Reach,IntegerType,true), StructField(Weekly Total Reach,IntegerType,true), StructField(28 Days Total Reach,IntegerType,true), StructField(Daily Organic Reach,IntegerType,true), StructField(Weekly Organic Reach,IntegerType,true), StructField(28 Days Organic Reach,IntegerType,true), StructField(Daily T...
> scala> df.take(1)
> [GIANT LIST OF COLUMNS]
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at
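[Editor's note] A hypothetical user-side workaround for the duplicated header names (not something Spark's CSV reader does itself) is to deduplicate the header before handing a schema to the reader. A sketch in Java, with an invented `_<n>` suffix scheme:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DedupHeader {
    // Rename duplicated header columns by appending a running suffix, so
    // that every column name becomes unique and unambiguous to resolve.
    static List<String> dedup(List<String> header) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (String col : header) {
            int n = seen.merge(col, 1, Integer::sum); // occurrence count so far
            out.add(n == 1 ? col : col + "_" + n);    // "x", "x_2", "x_3", ...
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(dedup(List.of("a", "b", "a"))); // prints "[a, b, a_2]"
    }
}
```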
[jira] [Resolved] (SPARK-16642) ResolveWindowFrame should not be triggered on UnresolvedFunctions.
[ https://issues.apache.org/jira/browse/SPARK-16642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-16642.
------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0
                   2.0.1

Issue resolved by pull request 14284
[https://github.com/apache/spark/pull/14284]

> ResolveWindowFrame should not be triggered on UnresolvedFunctions.
> ------------------------------------------------------------------
>
>                 Key: SPARK-16642
>                 URL: https://issues.apache.org/jira/browse/SPARK-16642
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>             Fix For: 2.0.1, 2.1.0
>
> The case at
> https://github.com/apache/spark/blob/75146be6ba5e9f559f5f15430310bb476ee0812c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L1790-L1792
> is shown below:
> {code}
> case we @ WindowExpression(e, s @ WindowSpecDefinition(_, o, UnspecifiedFrame)) =>
>   val frame = SpecifiedWindowFrame.defaultWindowFrame(o.nonEmpty, acceptWindowFrame = true)
>   we.copy(windowSpec = s.copy(frameSpecification = frame))
> {code}
> This case is triggered even when the function is unresolved. So, when
> functions like lead are used, we may see errors like {{Window Frame RANGE
> BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW must match the required frame
> ROWS BETWEEN 1 FOLLOWING AND 1 FOLLOWING.}} because we wrongly set the
> frame specification.
[jira] [Resolved] (SPARK-16633) lag/lead using constant input values does not return the default value when the offset row does not exist
[ https://issues.apache.org/jira/browse/SPARK-16633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-16633.
------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0
                   2.0.1

Issue resolved by pull request 14284
[https://github.com/apache/spark/pull/14284]

> lag/lead using constant input values does not return the default value when
> the offset row does not exist
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-16633
>                 URL: https://issues.apache.org/jira/browse/SPARK-16633
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Critical
>             Fix For: 2.0.1, 2.1.0
>
>         Attachments: window_function_bug.html
>
> Please see the attached notebook. It seems lag/lead somehow fail to recognize
> that an offset row does not exist, and generate wrong results.
[jira] [Resolved] (SPARK-16721) Lead/lag needs to respect nulls
[ https://issues.apache.org/jira/browse/SPARK-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai resolved SPARK-16721.
------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0
                   2.0.1

Issue resolved by pull request 14284
[https://github.com/apache/spark/pull/14284]

> Lead/lag needs to respect nulls
> -------------------------------
>
>                 Key: SPARK-16721
>                 URL: https://issues.apache.org/jira/browse/SPARK-16721
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>             Fix For: 2.0.1, 2.1.0
>
> Seems 2.0.0 changes the behavior of lead and lag to ignore nulls. This PR is
> changing the behavior back to 1.6's behavior, which is respecting nulls.
> For example:
> {code}
> SELECT
>   b,
>   lag(a, 1, 321) OVER (ORDER BY b) as lag,
>   lead(a, 1, 321) OVER (ORDER BY b) as lead
> FROM (SELECT cast(null as int) as a, 1 as b
>       UNION ALL
>       select cast(null as int) as id, 2 as b) tmp
> {code}
> This query should return
> {code}
> +---+----+----+
> |  b| lag|lead|
> +---+----+----+
> |  1| 321|null|
> |  2|null| 321|
> +---+----+----+
> {code}
> instead of
> {code}
> +---+---+----+
> |  b|lag|lead|
> +---+---+----+
> |  1|321| 321|
> |  2|321| 321|
> +---+---+----+
> {code}
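[Editor's note] The "respect nulls" semantics in the issue above can be sketched outside of Spark as a plain offset lookup over an ordered partition: an out-of-range offset yields the default, while an in-range null is returned as-is rather than skipped. A minimal Java sketch (the method names are illustrative, not Spark's API):

```java
import java.util.Arrays;
import java.util.List;

public class LagLead {
    // "Respect nulls" lag: look exactly `offset` rows back; return the
    // default only when that row does not exist, never when it is null.
    static Integer lag(List<Integer> rows, int i, int offset, Integer dflt) {
        int j = i - offset;
        return (j >= 0 && j < rows.size()) ? rows.get(j) : dflt;
    }

    // lead is lag with a negated offset (look forward instead of back).
    static Integer lead(List<Integer> rows, int i, int offset, Integer dflt) {
        return lag(rows, i, -offset, dflt);
    }

    public static void main(String[] args) {
        // Mirrors the query above: a = [null, null] ordered by b = [1, 2].
        List<Integer> a = Arrays.asList(null, null);
        System.out.println(lag(a, 0, 1, 321));  // row b=1: prints "321"
        System.out.println(lead(a, 0, 1, 321)); // row b=1: prints "null"
        System.out.println(lag(a, 1, 1, 321));  // row b=2: prints "null"
        System.out.println(lead(a, 1, 1, 321)); // row b=2: prints "321"
    }
}
```

This reproduces the "should return" table from the issue: the 321 default only appears where the offset row falls outside the partition.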
[jira] [Created] (SPARK-16730) Spark 2.0 breaks various Hive cast functions
Peter Lee created SPARK-16730:
---------------------------------

             Summary: Spark 2.0 breaks various Hive cast functions
                 Key: SPARK-16730
                 URL: https://issues.apache.org/jira/browse/SPARK-16730
             Project: Spark
          Issue Type: Bug
          Components: SQL
            Reporter: Peter Lee


In Spark 1.x, it is possible to use "int", "string", and other functions to perform type cast. This functionality is broken in Spark 2.0, because Spark no longer falls back to Hive for these functions.
[jira] [Commented] (SPARK-14536) NPE in JDBCRDD when array column contains nulls (postgresql)
[ https://issues.apache.org/jira/browse/SPARK-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393161#comment-15393161 ]

Hyukjin Kwon commented on SPARK-14536:
--------------------------------------

Hi [~jeremyrsmith], are you working on this?

> NPE in JDBCRDD when array column contains nulls (postgresql)
> ------------------------------------------------------------
>
>                 Key: SPARK-14536
>                 URL: https://issues.apache.org/jira/browse/SPARK-14536
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.1
>            Reporter: Jeremy Smith
>              Labels: NullPointerException
>
> At https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L453
> it is assumed that the JDBC driver will always return a non-null `Array`
> object from the call to `getArray`, and that in the event of a null array it
> will return a non-null `Array` object with a null underlying array. But as
> you can see here
> https://github.com/pgjdbc/pgjdbc/blob/master/pgjdbc/src/main/java/org/postgresql/jdbc/PgResultSet.java#L387
> that isn't the case, at least for PostgreSQL. This causes a
> `NullPointerException` whenever an array column contains null values. The
> PostgreSQL JDBC driver is probably doing the wrong thing, but even so there
> should be a null check in JDBCRDD. I'm happy to submit a PR if that would be
> helpful.
[jira] [Resolved] (SPARK-16724) Expose DefinedByConstructorParams
[ https://issues.apache.org/jira/browse/SPARK-16724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-16724.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0
                   2.0.1

> Expose DefinedByConstructorParams
> ---------------------------------
>
>                 Key: SPARK-16724
>                 URL: https://issues.apache.org/jira/browse/SPARK-16724
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Michael Armbrust
>            Assignee: Michael Armbrust
>             Fix For: 2.0.1, 2.1.0
>
> Generally we don't mark things in catalyst/execution as private. Instead
> they are not included in scala doc as they are not considered stable APIs.
[jira] [Resolved] (SPARK-16672) SQLBuilder should not raise exceptions on EXISTS queries
[ https://issues.apache.org/jira/browse/SPARK-16672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-16672.
---------------------------------
       Resolution: Fixed
         Assignee: Dongjoon Hyun
    Fix Version/s: 2.1.0
                   2.0.1

> SQLBuilder should not raise exceptions on EXISTS queries
> --------------------------------------------------------
>
>                 Key: SPARK-16672
>                 URL: https://issues.apache.org/jira/browse/SPARK-16672
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>             Fix For: 2.0.1, 2.1.0
>
> Currently, `SQLBuilder` raises `empty.reduceLeft` exceptions on
> **unoptimized** `EXISTS` queries. We had better prevent this.
> {code}
> scala> sql("CREATE TABLE t1(a int)")
> scala> val df = sql("select * from t1 b where exists (select * from t1 a)")
> scala> new org.apache.spark.sql.catalyst.SQLBuilder(df).toSQL
> java.lang.UnsupportedOperationException: empty.reduceLeft
> {code}
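[Editor's note] The failure mode in the issue above is a general one: `reduceLeft` on an empty collection throws. A guarded reduction returns a fallback instead. The following is a generic Java sketch of the shape of such a fix, not SQLBuilder's actual code; the method name and the `TRUE` fallback are assumptions for illustration.

```java
import java.util.List;
import java.util.Optional;

public class SafeReduce {
    // Join predicate strings with AND. Optional-based reduce never throws on
    // an empty list (unlike Scala's reduceLeft); it yields a fallback instead.
    static String joinConditions(List<String> conds) {
        Optional<String> joined = conds.stream().reduce((a, b) -> a + " AND " + b);
        return joined.orElse("TRUE"); // fallback when there are no conditions
    }

    public static void main(String[] args) {
        System.out.println(joinConditions(List.of("a > 1", "b = 2"))); // prints "a > 1 AND b = 2"
        System.out.println(joinConditions(List.of()));                 // prints "TRUE"
    }
}
```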
[jira] [Comment Edited] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393068#comment-15393068 ]

Asmaa Ali edited comment on SPARK-16723 at 7/26/16 2:23 AM:
-----------------------------------------------------------

Yes, I've copied everything I got from the command; no other info appeared.

was (Author: soma):
Yes, I've copied everything I got from the command; no other info appeared. I've accessed the NM's log file. Is there a way to get logs for a specific date, or must I just scroll down until I find it?

> exception in thread main org.apache.spark.sparkexception application finished with failed status
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16723
>                 URL: https://issues.apache.org/jira/browse/SPARK-16723
>             Project: Spark
>          Issue Type: Question
>          Components: Streaming
>    Affects Versions: 1.6.2
>        Environment: Dataprock cluster from google
>            Reporter: Asmaa Ali
>              Labels: beginner
>   Original Estimate: 60h
>  Remaining Estimate: 60h
>
> What is the reason for this exception?
>
> cancerdetector@cluster-cancerdetector-m:~/SparkBWA/build$ spark-submit --class SparkBWA --master yarn-cluster --conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar --driver-memory 1500m --executor-memory 1500m --executor-cores 1 --archives ./bwa.zip --verbose ./SparkBWA.jar -algorithm mem -reads paired -index /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastq Output_ERR000589
> Using properties file: /usr/lib/spark/conf/spark-defaults.conf
> Adding default property: spark.executor.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar
> Adding default property: spark.history.fs.logDirectory=hdfs://cluster-cancerdetector-m/user/spark/eventlog
> Adding default property: spark.eventLog.enabled=true
> Adding default property: spark.driver.maxResultSize=1920m
> Adding default property: spark.shuffle.service.enabled=true
> Adding default property: spark.yarn.historyServer.address=cluster-cancerdetector-m:18080
> Adding default property: spark.sql.parquet.cacheMetadata=false
> Adding default property: spark.driver.memory=3840m
> Adding default property: spark.dynamicAllocation.maxExecutors=1
> Adding default property: spark.scheduler.minRegisteredResourcesRatio=0.0
> Adding default property: spark.yarn.am.memoryOverhead=558
> Adding default property: spark.yarn.am.memory=5586m
> Adding default property: spark.driver.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar
> Adding default property: spark.master=yarn-cluster
> Adding default property: spark.executor.memory=5586m
> Adding default property: spark.eventLog.dir=hdfs://cluster-cancerdetector-m/user/spark/eventlog
> Adding default property: spark.dynamicAllocation.enabled=true
> Adding default property: spark.executor.cores=2
> Adding default property: spark.yarn.executor.memoryOverhead=558
> Adding default property: spark.dynamicAllocation.minExecutors=1
> Adding default property: spark.dynamicAllocation.initialExecutors=1
> Adding default property: spark.akka.frameSize=512
> Parsed arguments:
>   master                  yarn-cluster
>   deployMode              null
>   executorMemory          1500m
>   executorCores           1
>   totalExecutorCores      null
>   propertiesFile          /usr/lib/spark/conf/spark-defaults.conf
>   driverMemory            1500m
>   driverCores             null
>   driverExtraClassPath    null
>   driverExtraLibraryPath  null
>   driverExtraJavaOptions  -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar
>   supervise               false
>   queue                   null
>   numExecutors            null
>   files                   null
>   pyFiles                 null
>   archives                file:/home/cancerdetector/SparkBWA/build/./bwa.zip
>   mainClass               SparkBWA
>   primaryResource         file:/home/cancerdetector/SparkBWA/build/./SparkBWA.jar
>   name                    SparkBWA
>   childArgs               [-algorithm mem -reads paired -index /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastq Output_ERR000589]
>   jars                    null
>   packages                null
>   packagesExclusions      null
>   repositories            null
>   verbose                 true
> Spark properties used, including those specified through --conf and those from the properties file /usr/lib/spark/conf/spark-defaults.conf:
>   spark.yarn.am.memoryOverhead -> 558
>   spark.driver.memory -> 1500m
>   spark.yarn.jar -> hdfs:///user/spark/spark-assembly.jar
>   spark.executor.memory -> 5586m
>   spark.yarn.historyServer.address -> cluster-cancerdetector-m:18080
>   spark.eventLog.enabled -> true
>   spark.scheduler.minRegisteredResourcesRatio -> 0.0
>   spark.dynamicAllocation.maxExecutors -> 1
>   spark.akka.frameSize -> 512
>   spark.executor.extraJavaOptions -> -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar
>   spark.sql.parquet.cacheMetadata -> false
>   spark.shuffle.service.enabled -> true
[jira] [Commented] (SPARK-16719) RandomForest: communicate fewer trees on each iteration
[ https://issues.apache.org/jira/browse/SPARK-16719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393075#comment-15393075 ]

Apache Spark commented on SPARK-16719:
--------------------------------------

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/14359

> RandomForest: communicate fewer trees on each iteration
> -------------------------------------------------------
>
>                 Key: SPARK-16719
>                 URL: https://issues.apache.org/jira/browse/SPARK-16719
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>            Priority: Critical
>
> RandomForest currently sends the entire forest to each worker on each
> iteration. This is because (a) the node queue is FIFO and (b) the closure
> references the entire array of trees ({{topNodes}}). (a) causes RFs to
> handle splits in many trees, especially early on in learning. (b) sends all
> trees explicitly.
> Proposal:
> (a) Change the RF node queue to be FILO, so that RFs tend to focus on 1 or a
> few trees before focusing on others.
> (b) Change topNodes to pass only the trees required on that iteration.
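[Editor's note] The effect of proposal (a) can be sketched with a double-ended queue: the same buffer behaves as FIFO or LIFO ("FILO") depending on which end is polled, and polling the most recently pushed end keeps work concentrated in one tree at a time. This is illustrative Java, not Spark's implementation:

```java
import java.util.ArrayDeque;
import java.util.List;

public class NodeQueue {
    // Pull the next node to split. FIFO takes the oldest entry (spreading
    // work across many trees); LIFO takes the newest (finishing the current
    // tree's freshly pushed children first, so fewer trees are in flight).
    static String nextNode(ArrayDeque<String> queue, boolean lifo) {
        return lifo ? queue.pollLast() : queue.pollFirst();
    }

    public static void main(String[] args) {
        ArrayDeque<String> q =
            new ArrayDeque<>(List.of("tree0-node", "tree1-node", "tree2-node"));
        System.out.println(nextNode(q, false)); // FIFO: prints "tree0-node"
        System.out.println(nextNode(q, true));  // LIFO: prints "tree2-node"
    }
}
```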
[jira] [Commented] (SPARK-16720) Loading CSV file with 2k+ columns fails during attribute resolution on action
[ https://issues.apache.org/jira/browse/SPARK-16720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393069#comment-15393069 ]

Hyukjin Kwon commented on SPARK-16720:
--------------------------------------

Hi [~holdenk], this part seems familiar to me. Do you mind if I look into this and work on it?

> Loading CSV file with 2k+ columns fails during attribute resolution on action
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-16720
>                 URL: https://issues.apache.org/jira/browse/SPARK-16720
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: holdenk
>
> Example shell for repro:
> {quote}
> scala> val df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/home/holden/Downloads/ex*.csv")
> df: org.apache.spark.sql.DataFrame = [Date: string, Lifetime Total Likes: int ... 2125 more fields]
> scala> df.schema
> res0: org.apache.spark.sql.types.StructType = StructType(StructField(Date,StringType,true), StructField(Lifetime Total Likes,IntegerType,true), StructField(Daily New Likes,IntegerType,true), StructField(Daily Unlikes,IntegerType,true), StructField(Daily Page Engaged Users,IntegerType,true), StructField(Weekly Page Engaged Users,IntegerType,true), StructField(28 Days Page Engaged Users,IntegerType,true), StructField(Daily Like Sources - On Your Page,IntegerType,true), StructField(Daily Total Reach,IntegerType,true), StructField(Weekly Total Reach,IntegerType,true), StructField(28 Days Total Reach,IntegerType,true), StructField(Daily Organic Reach,IntegerType,true), StructField(Weekly Organic Reach,IntegerType,true), StructField(28 Days Organic Reach,IntegerType,true), StructField(Daily T...
> scala> df.take(1)
> [GIANT LIST OF COLUMNS]
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
> at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
> at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
> at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
> at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
> at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
> at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
> at
[jira] [Commented] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393068#comment-15393068 ] Asmaa Ali commented on SPARK-16723: Yes, I've copied everything I got from the command; no other info appeared. I've accessed the NM's log file. Is there a way to get the logs for a specific date, or must I just scroll down until I find them? > exception in thread main org.apache.spark.sparkexception application finished > with failed status > > > Key: SPARK-16723 > URL: https://issues.apache.org/jira/browse/SPARK-16723 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.2 > Environment: Dataprock cluster from google >Reporter: Asmaa Ali > Labels: beginner > Original Estimate: 60h > Remaining Estimate: 60h > > What is the reason of this exception ?! > cancerdetector@cluster-cancerdetector-m:~/SparkBWA/build$ spark-submit > --class SparkBWA --master yarn-cluster -- > conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar --driver-memory > 1500m --executor-memory 1500m --executor-cores 1 --archives ./bwa.zip > --verbose ./SparkBWA.jar -algorithm mem -reads paired -index > /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq > ERR000589_2.filt.fastq Output_ERR000589 > Using properties file: /usr/lib/spark/conf/spark-defaults.conf > Adding default property: > spark.executor.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > Adding default property: > spark.history.fs.logDirectory=hdfs://cluster-cancerdetector-m/user/spark/eventlog > Adding default property: spark.eventLog.enabled=true > Adding default property: spark.driver.maxResultSize=1920m > Adding default property: spark.shuffle.service.enabled=true > Adding default property: > spark.yarn.historyServer.address=cluster-cancerdetector-m:18080 > Adding default property: spark.sql.parquet.cacheMetadata=false > Adding default property: spark.driver.memory=3840m > Adding default 
property: spark.dynamicAllocation.maxExecutors=1 > Adding default property: spark.scheduler.minRegisteredResourcesRatio=0.0 > Adding default property: spark.yarn.am.memoryOverhead=558 > Adding default property: spark.yarn.am.memory=5586m > Adding default property: > spark.driver.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > Adding default property: spark.master=yarn-cluster > Adding default property: spark.executor.memory=5586m > Adding default property: > spark.eventLog.dir=hdfs://cluster-cancerdetector-m/user/spark/eventlog > Adding default property: spark.dynamicAllocation.enabled=true > Adding default property: spark.executor.cores=2 > Adding default property: spark.yarn.executor.memoryOverhead=558 > Adding default property: spark.dynamicAllocation.minExecutors=1 > Adding default property: spark.dynamicAllocation.initialExecutors=1 > Adding default property: spark.akka.frameSize=512 > Parsed arguments: > master yarn-cluster > deployMode null > executorMemory 1500m > executorCores 1 > totalExecutorCores null > propertiesFile /usr/lib/spark/conf/spark-defaults.conf > driverMemory 1500m > driverCores null > driverExtraClassPath null > driverExtraLibraryPath null > driverExtraJavaOptions > -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > supervise false > queue null > numExecutors null > files null > pyFiles null > archives file:/home/cancerdetector/SparkBWA/build/./bwa.zip > mainClass SparkBWA > primaryResource file:/home/cancerdetector/SparkBWA/build/./SparkBWA.jar > name SparkBWA > childArgs [-algorithm mem -reads paired -index /Data/HumanBase/hg38 > -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastq Output_ERR000589] > jars null > packages null > packagesExclusions null > repositories null > verbose true > Spark properties used, including those specified through > --conf and those from the properties file > /usr/lib/spark/conf/spark-defaults.conf: > 
spark.yarn.am.memoryOverhead -> 558 > spark.driver.memory -> 1500m > spark.yarn.jar -> hdfs:///user/spark/spark-assembly.jar > spark.executor.memory -> 5586m > spark.yarn.historyServer.address -> cluster-cancerdetector-m:18080 > spark.eventLog.enabled -> true > spark.scheduler.minRegisteredResourcesRatio -> 0.0 > spark.dynamicAllocation.maxExecutors -> 1 > spark.akka.frameSize -> 512 > spark.executor.extraJavaOptions -> > -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > spark.sql.parquet.cacheMetadata -> false > spark.shuffle.service.enabled -> true > spark.history.fs.logDirectory -> > hdfs://cluster-cancerdetector-m/user/spark/eventlog > spark.dynamicAllocation.initialExecutors -> 1 >
[jira] [Commented] (SPARK-16729) Spark should throw analysis exception for invalid casts to date type
[ https://issues.apache.org/jira/browse/SPARK-16729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393064#comment-15393064 ] Apache Spark commented on SPARK-16729: -- User 'petermaxlee' has created a pull request for this issue: https://github.com/apache/spark/pull/14358 > Spark should throw analysis exception for invalid casts to date type > > > Key: SPARK-16729 > URL: https://issues.apache.org/jira/browse/SPARK-16729 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Peter Lee > > Spark currently throws exceptions for invalid casts for all other data types > except date type. Somehow date type returns null. It should be consistent and > throw an analysis exception as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16729) Spark should throw analysis exception for invalid casts to date type
[ https://issues.apache.org/jira/browse/SPARK-16729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16729: Assignee: (was: Apache Spark) > Spark should throw analysis exception for invalid casts to date type > > > Key: SPARK-16729 > URL: https://issues.apache.org/jira/browse/SPARK-16729 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Peter Lee
[jira] [Assigned] (SPARK-16729) Spark should throw analysis exception for invalid casts to date type
[ https://issues.apache.org/jira/browse/SPARK-16729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16729: Assignee: Apache Spark > Spark should throw analysis exception for invalid casts to date type > > > Key: SPARK-16729 > URL: https://issues.apache.org/jira/browse/SPARK-16729 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Peter Lee >Assignee: Apache Spark
[jira] [Comment Edited] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393062#comment-15393062 ] Hong Shen edited comment on SPARK-16709 at 7/26/16 2:10 AM: It's different; the task has no successful attempt. This happens when a task attempt fails during performCommit(): the following attempts can't commit anymore, because only one task is allowed to performCommit(), even though that first attempt failed at performCommit(). was (Author: shenhong): It different, the task has no attempt succeed. this happen when a task attempt failed when performCommit(), the following attempts can commit any more, because just one task can performCommit(), even though the first attempt failed at performCommit(). > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > Attachments: commit failed.png > > > In our cluster, we set spark.speculation=true, but when a task throws an > exception at SparkHadoopMapRedUtil.performCommit(), the task can retry > infinitely. > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83
[jira] [Commented] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true
[ https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393062#comment-15393062 ] Hong Shen commented on SPARK-16709: --- It's different; the task has no successful attempt. This happens when a task attempt fails during performCommit(): the following attempts can't commit anymore, because only one task is allowed to performCommit(), even though that first attempt failed at performCommit(). > Task with commit failed will retry infinite when speculation set to true > > > Key: SPARK-16709 > URL: https://issues.apache.org/jira/browse/SPARK-16709 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > Attachments: commit failed.png
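Hong Shen's point above can be sketched with a toy model. The following Python snippet is a hypothetical, heavily simplified stand-in for Spark's commit coordination (the real logic lives in Scala, around SparkHadoopMapRedUtil.performCommit()); it only illustrates why, once the single authorized attempt dies mid-commit, every later attempt is denied and the task keeps being resubmitted:

```python
# Toy model (hypothetical, simplified) of the behavior described above:
# exactly one task attempt is ever authorized to commit a partition's output.
class CommitCoordinator:
    def __init__(self):
        # (stage, partition) -> the attempt number that holds commit rights
        self.authorized = {}

    def can_commit(self, stage, partition, attempt):
        key = (stage, partition)
        if key not in self.authorized:
            # The first attempt to ask wins the commit rights, permanently --
            # nothing releases them if that attempt later fails mid-commit.
            self.authorized[key] = attempt
            return True
        return self.authorized[key] == attempt

coord = CommitCoordinator()
# Attempt 0 is authorized, then dies inside the commit without succeeding.
assert coord.can_commit(stage=1, partition=0, attempt=0)
# Every retried or speculative attempt is now refused, so with
# spark.speculation=true the task is resubmitted indefinitely.
denied = [coord.can_commit(1, 0, a) for a in (1, 2, 3)]
print(denied)  # [False, False, False]
```

In this sketch the fix would amount to releasing the authorization when the holding attempt is known to have failed, so a later attempt can acquire it.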
[jira] [Created] (SPARK-16729) Spark should throw analysis exception for invalid casts to date type
Peter Lee created SPARK-16729: - Summary: Spark should throw analysis exception for invalid casts to date type Key: SPARK-16729 URL: https://issues.apache.org/jira/browse/SPARK-16729 Project: Spark Issue Type: Bug Components: SQL Reporter: Peter Lee Spark currently throws exceptions for invalid casts for all other data types except date type. Somehow date type returns null. It should be consistent and throw an analysis exception as well.
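The inconsistency the issue describes can be illustrated abstractly. This is a toy Python sketch, not Spark code (the function names are hypothetical), contrasting the two policies: silently mapping an invalid cast to null, versus failing loudly the way the issue proposes invalid date casts should:

```python
from datetime import date

# Toy illustration (not Spark's implementation) of the two cast policies.
def cast_to_date_silent(s):
    """Reported date-cast behavior: invalid input silently becomes null."""
    try:
        return date.fromisoformat(s)
    except ValueError:
        return None

def cast_to_date_strict(s):
    """Proposed behavior: invalid input raises, consistent with other types."""
    try:
        return date.fromisoformat(s)
    except ValueError:
        raise ValueError(f"cannot cast {s!r} to date")

print(cast_to_date_silent("not-a-date"))   # None -- the silent-null behavior
print(cast_to_date_silent("2016-07-26"))   # a real date parses either way
```

The silent variant hides bad data until downstream nulls surface; the strict variant surfaces the error at the cast itself, which is what an analysis exception would do.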
[jira] [Commented] (SPARK-16708) ExecutorAllocationManager.numRunningTasks can be negative when stage retry
[ https://issues.apache.org/jira/browse/SPARK-16708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393054#comment-15393054 ] Hong Shen commented on SPARK-16708: --- It's different: a negative ExecutorAllocationManager.numRunningTasks can cause ExecutorAllocationManager.maxNumExecutorsNeeded to be negative; sometimes this leaves many tasks pending while ExecutorAllocationManager allocates no executors. > ExecutorAllocationManager.numRunningTasks can be negative when stage retry > -- > > Key: SPARK-16708 > URL: https://issues.apache.org/jira/browse/SPARK-16708 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > When a task fetch fails, the stage completes and retries; when the stage > completes, ExecutorAllocationManager.numRunningTasks is set to 0. Here is > the code: > {code} > override def onStageCompleted(stageCompleted: > SparkListenerStageCompleted): Unit = { > val stageId = stageCompleted.stageInfo.stageId > allocationManager.synchronized { > stageIdToNumTasks -= stageId > stageIdToTaskIndices -= stageId > stageIdToExecutorPlacementHints -= stageId > // Update the executor placement hints > updateExecutorPlacementHints() > // If this is the last stage with pending tasks, mark the scheduler > queue as empty > // This is needed in case the stage is aborted for any reason > if (stageIdToNumTasks.isEmpty) { > allocationManager.onSchedulerQueueEmpty() > if (numRunningTasks != 0) { > logWarning("No stages are running, but numRunningTasks != 0") > numRunningTasks = 0 > } > } > } > } > {code} > But when the stage's still-running tasks finish, numRunningTasks is > decremented for each one, so numRunningTasks becomes negative, which can > make maxNeeded negative. 
> {code} > override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = { > val executorId = taskEnd.taskInfo.executorId > val taskId = taskEnd.taskInfo.taskId > val taskIndex = taskEnd.taskInfo.index > val stageId = taskEnd.stageId > allocationManager.synchronized { > numRunningTasks -= 1 > {code}
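The interaction between the two listeners above can be replayed with a trivial counter model (plain Python, not the actual ExecutorAllocationManager): onStageCompleted zeroes the counter while the retried stage's old tasks are still running, and each late onTaskEnd then decrements it below zero.

```python
# Trivial model of the race described above (hypothetical, not Spark code).
num_running_tasks = 3   # tasks actually running when the fetch failure hits

# onStageCompleted fires for the failed stage: the listener resets the counter.
num_running_tasks = 0

# The 3 old tasks nevertheless finish later, each firing onTaskEnd,
# and each event decrements the already-zeroed counter.
for _ in range(3):
    num_running_tasks -= 1

print(num_running_tasks)  # -3; anything derived from it, like
                          # maxNumExecutorsNeeded, goes negative too
```

This is why clamping at zero on reset is not enough: the late task-end events also need to be recognized as belonging to the completed stage and ignored.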
[jira] [Updated] (SPARK-16677) Strange Error when Issuing Load Table Against A View
[ https://issues.apache.org/jira/browse/SPARK-16677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16677: Assignee: Xiao Li > Strange Error when Issuing Load Table Against A View > > > Key: SPARK-16677 > URL: https://issues.apache.org/jira/browse/SPARK-16677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > Users should not be allowed to issue LOAD DATA against a view. Currently, > when users do so, they get a very strange runtime error: > For example, > {noformat} > LOAD DATA LOCAL INPATH "$testData" INTO TABLE $viewName > {noformat} > {noformat} > java.lang.reflect.InvocationTargetException was thrown. > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:680) > {noformat}
[jira] [Resolved] (SPARK-16677) Strange Error when Issuing Load Table Against A View
[ https://issues.apache.org/jira/browse/SPARK-16677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16677. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14314 [https://github.com/apache/spark/pull/14314] > Strange Error when Issuing Load Table Against A View > > > Key: SPARK-16677 > URL: https://issues.apache.org/jira/browse/SPARK-16677 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.1.0
[jira] [Commented] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393033#comment-15393033 ] Saisai Shao commented on SPARK-16723: - Maybe the application has not yet started on the YARN side, so there's no application log there. I'm not sure whether the stack trace attached above is the only one you saw; if not, please provide the others, otherwise it is hard to tell why the application failed. > exception in thread main org.apache.spark.sparkexception application finished > with failed status > > Key: SPARK-16723 > URL: https://issues.apache.org/jira/browse/SPARK-16723 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.2 > Environment: Dataprock cluster from google >Reporter: Asmaa Ali > Labels: beginner
[jira] [Commented] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393021#comment-15393021 ] Asmaa Ali commented on SPARK-16723: I have set yarn.log-aggregation-enable to true, but the command still doesn't work. Do I need to do anything else to make it work? > exception in thread main org.apache.spark.sparkexception application finished > with failed status > > Key: SPARK-16723 > URL: https://issues.apache.org/jira/browse/SPARK-16723 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.2 > Environment: Dataprock cluster from google >Reporter: Asmaa Ali > Labels: beginner
[jira] [Commented] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393003#comment-15393003 ] Saisai Shao commented on SPARK-16723: - Did you enable log aggregation in YARN? If not, this command will not work, and you have to manually access the NM's log dir to see the detailed application logs. In the meantime, if you could provide some more useful stack traces, that would be helpful. > exception in thread main org.apache.spark.sparkexception application finished > with failed status > > Key: SPARK-16723 > URL: https://issues.apache.org/jira/browse/SPARK-16723 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.2 > Environment: Dataprock cluster from google >Reporter: Asmaa Ali > Labels: beginner
[jira] [Comment Edited] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15393003#comment-15393003 ] Saisai Shao edited comment on SPARK-16723 at 7/26/16 1:36 AM: -- Did you enable log aggregation in YARN, if not this command is not worked, you have to manually access into NM's log dir to see the detailed application logs. In the meantime if you could provide some more useful stack trace, that would be helpful. was (Author: jerryshao): Did you enable log aggregation in YARN, if not this command is worked, you have to manually access into NM's log dir to see the detailed application logs. In the meantime if you could provide some more useful stack trace, that would be helpful. > exception in thread main org.apache.spark.sparkexception application finished > with failed status > > > Key: SPARK-16723 > URL: https://issues.apache.org/jira/browse/SPARK-16723 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.2 > Environment: Dataprock cluster from google >Reporter: Asmaa Ali > Labels: beginner > Original Estimate: 60h > Remaining Estimate: 60h > > What is the reason of this exception ?! 
> cancerdetector@cluster-cancerdetector-m:~/SparkBWA/build$ spark-submit > --class SparkBWA --master yarn-cluster -- > conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar --driver-memory > 1500m --executor-memory 1500m --executor-cores 1 --archives ./bwa.zip > --verbose ./SparkBWA.jar -algorithm mem -reads paired -index > /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq > ERR000589_2.filt.fastq Output_ERR000589 > Using properties file: /usr/lib/spark/conf/spark-defaults.conf > Adding default property: > spark.executor.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > Adding default property: > spark.history.fs.logDirectory=hdfs://cluster-cancerdetector-m/user/spark/eventlog > Adding default property: spark.eventLog.enabled=true > Adding default property: spark.driver.maxResultSize=1920m > Adding default property: spark.shuffle.service.enabled=true > Adding default property: > spark.yarn.historyServer.address=cluster-cancerdetector-m:18080 > Adding default property: spark.sql.parquet.cacheMetadata=false > Adding default property: spark.driver.memory=3840m > Adding default property: spark.dynamicAllocation.maxExecutors=1 > Adding default property: spark.scheduler.minRegisteredResourcesRatio=0.0 > Adding default property: spark.yarn.am.memoryOverhead=558 > Adding default property: spark.yarn.am.memory=5586m > Adding default property: > spark.driver.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > Adding default property: spark.master=yarn-cluster > Adding default property: spark.executor.memory=5586m > Adding default property: > spark.eventLog.dir=hdfs://cluster-cancerdetector-m/user/spark/eventlog > Adding default property: spark.dynamicAllocation.enabled=true > Adding default property: spark.executor.cores=2 > Adding default property: spark.yarn.executor.memoryOverhead=558 > Adding default property: spark.dynamicAllocation.minExecutors=1 > Adding default 
property: spark.dynamicAllocation.initialExecutors=1 > Adding default property: spark.akka.frameSize=512 > Parsed arguments: > master yarn-cluster > deployMode null > executorMemory 1500m > executorCores 1 > totalExecutorCores null > propertiesFile /usr/lib/spark/conf/spark-defaults.conf > driverMemory 1500m > driverCores null > driverExtraClassPath null > driverExtraLibraryPath null > driverExtraJavaOptions > -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > supervise false > queue null > numExecutors null > files null > pyFiles null > archives file:/home/cancerdetector/SparkBWA/build/./bwa.zip > mainClass SparkBWA > primaryResource file:/home/cancerdetector/SparkBWA/build/./SparkBWA.jar > name SparkBWA > childArgs [-algorithm mem -reads paired -index /Data/HumanBase/hg38 > -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastq Output_ERR000589] > jars null > packages null > packagesExclusions null > repositories null > verbose true > Spark properties used, including those specified through > --conf and those from the properties file > /usr/lib/spark/conf/spark-defaults.conf: > spark.yarn.am.memoryOverhead -> 558 > spark.driver.memory -> 1500m > spark.yarn.jar -> hdfs:///user/spark/spark-assembly.jar > spark.executor.memory -> 5586m > spark.yarn.historyServer.address -> cluster-cancerdetector-m:18080 > spark.eventLog.enabled -> true > spark.scheduler.minRegisteredResourcesRatio -> 0.0 > spark.dynamicAllocation.maxExecutors -> 1 > spark.akka.frameSize ->
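For reference, the log-aggregation switch the comment above refers to lives in yarn-site.xml; a minimal fragment (illustrative, your cluster's config layout may differ):

```xml
<!-- yarn-site.xml: must be true for `yarn logs -applicationId <id>` to return
     aggregated container logs; otherwise the logs stay in each NodeManager's
     local log directory and have to be read there directly. -->
<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
```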
[jira] [Resolved] (SPARK-16678) Disallow Creating/Altering a View when the same-name Table Exists
[ https://issues.apache.org/jira/browse/SPARK-16678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16678. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14314 [https://github.com/apache/spark/pull/14314] > Disallow Creating/Altering a View when the same-name Table Exists > - > > Key: SPARK-16678 > URL: https://issues.apache.org/jira/browse/SPARK-16678 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > Fix For: 2.1.0 > > > When we create OR alter a view, we check whether the view already exists. In > the current implementation, if a table with the same name exists, we treat it > as a view. However, this is not the right behavior. We should follow what > Hive does. For example, > {noformat} > hive> CREATE TABLE tab1 (id int); > OK > Time taken: 0.196 seconds > hive> CREATE OR REPLACE VIEW tab1 AS SELECT * FROM t1; > FAILED: SemanticException [Error 10218]: Existing table is not a view > The following is an existing table, not a view: default.tab1 > hive> ALTER VIEW tab1 AS SELECT * FROM t1; > FAILED: SemanticException [Error 10218]: Existing table is not a view > The following is an existing table, not a view: default.tab1 > hive> CREATE VIEW IF NOT EXISTS tab1 AS SELECT * FROM t1; > OK > Time taken: 0.678 seconds > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16678) Disallow Creating/Altering a View when the same-name Table Exists
[ https://issues.apache.org/jira/browse/SPARK-16678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-16678: Assignee: Xiao Li > Disallow Creating/Altering a View when the same-name Table Exists > - -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392999#comment-15392999 ] Asmaa Ali commented on SPARK-16723: Connecting to ResourceManager at cluster-cancerdetector-m/10.132.0.2:8032 /yarn-logs/cancerdetector/logs/application_1467990031555_0089 does not exist. Log aggregation has not completed or is not enabled. > exception in thread main org.apache.spark.sparkexception application finished > with failed status > > > Key: SPARK-16723 > URL: https://issues.apache.org/jira/browse/SPARK-16723 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.2 > Environment: Dataprock cluster from google >Reporter: Asmaa Ali > Labels: beginner > Original Estimate: 60h > Remaining Estimate: 60h > > What is the reason of this exception ?!
[jira] [Resolved] (SPARK-16722) Fix a StreamingContext leak in StreamingContextSuite when eventually fails
[ https://issues.apache.org/jira/browse/SPARK-16722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-16722. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14354 [https://github.com/apache/spark/pull/14354] > Fix a StreamingContext leak in StreamingContextSuite when eventually fails > -- > > Key: SPARK-16722 > URL: https://issues.apache.org/jira/browse/SPARK-16722 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.1, 2.1.0 > > > This patch moves `ssc.stop()` into `finally` for > `StreamingContextSuite.createValidCheckpoint` to avoid leaking a > StreamingContext, since a leaked StreamingContext will fail a lot of tests > and make it hard to find the real failing one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
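The try/finally move described above can be sketched without any Spark dependency. `FakeContext` below is a hypothetical stand-in for StreamingContext, not the actual suite code; the point is only that the cleanup runs whether or not the body throws:

```scala
// Stand-in for StreamingContext: records whether stop() was called.
class FakeContext {
  var stopped = false
  def stop(): Unit = { stopped = true }
}

// Shape of the fix: run the body inside try, stop the context in finally,
// so a failing assertion (e.g. a failed `eventually`) cannot leak it.
def createValidCheckpoint(ctx: FakeContext)(body: => String): String =
  try body finally ctx.stop()

val ctx = new FakeContext
val failed =
  try { createValidCheckpoint(ctx)(throw new RuntimeException("eventually failed")); false }
  catch { case _: RuntimeException => true }

assert(failed)       // the exception still propagated to the caller
assert(ctx.stopped)  // but the context was stopped anyway
```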
[jira] [Commented] (SPARK-16707) TransportClientFactory.createClient may throw NPE
[ https://issues.apache.org/jira/browse/SPARK-16707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392987#comment-15392987 ] Hong Shen commented on SPARK-16707: --- It happened many times in our cluster (with thousands of machines), but it's hard to reproduce; I can only retry in the way the stackoverflow.com post describes. > TransportClientFactory.createClient may throw NPE > - > > Key: SPARK-16707 > URL: https://issues.apache.org/jira/browse/SPARK-16707 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0 >Reporter: Hong Shen > > I have encountered a NullPointerException from > TransportClientFactory.createClient in my cluster; here is the > stack trace. > {code} > org.apache.spark.shuffle.FetchFailedException > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:326) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:303) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:53) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:511) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.(TungstenAggregationIterator.scala:686) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95) > at > 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:741) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:337) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:301) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:215) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:744) > Caused by: java.lang.NullPointerException > at > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:144) > at > org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:107) > at > 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:146) > at > org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:126) > at > org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:155) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:319) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:299) > ... 32 more > {code} > The code is at >
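The NPE above being hard to reproduce is typical of a lookup-then-use race on a shared client pool. The following is a hypothetical, dependency-free sketch of the defensive shape (not Spark's actual TransportClientFactory code): re-check the fetched entry before use and recreate it if it was evicted concurrently.

```scala
import java.util.concurrent.ConcurrentHashMap

// Toy "client pool" keyed by address; values stand in for transport clients.
val pool = new ConcurrentHashMap[String, String]()

def getOrCreate(key: String): String = {
  val cached = pool.get(key)              // may be null if evicted concurrently
  if (cached != null) cached
  else pool.computeIfAbsent(key, k => s"client-for-$k")  // atomic recreate
}

pool.clear()                              // simulate a concurrent eviction
assert(getOrCreate("host:port") == "client-for-host:port")
```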
[jira] [Updated] (SPARK-15959) Add the support of hive.metastore.warehouse.dir back
[ https://issues.apache.org/jira/browse/SPARK-15959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15959: Labels: (was: release_notes releasenotes) > Add the support of hive.metastore.warehouse.dir back > > > Key: SPARK-15959 > URL: https://issues.apache.org/jira/browse/SPARK-15959 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > Fix For: 2.0.0 > > > Right now, we do not load this value at all > (https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSharedState.scala#L35-L41). > Let's maintain backward compatibility by loading it if Spark's warehouse > conf is not set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
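The fallback rule described above (use the Hive setting only when Spark's own warehouse conf is absent) can be sketched as follows; the keys are the ones named in the issue, but treat the default value and the `Map`-based config as illustrative:

```scala
// Prefer Spark's warehouse conf, fall back to Hive's, then to a default.
def warehousePath(conf: Map[String, String]): String =
  conf.get("spark.sql.warehouse.dir")
    .orElse(conf.get("hive.metastore.warehouse.dir"))
    .getOrElse("spark-warehouse")  // illustrative default

// Hive setting is honored only when Spark's own conf is not set:
assert(warehousePath(Map("hive.metastore.warehouse.dir" -> "/user/hive/warehouse")) == "/user/hive/warehouse")
assert(warehousePath(Map("spark.sql.warehouse.dir" -> "/spark/wh",
                         "hive.metastore.warehouse.dir" -> "/user/hive/warehouse")) == "/spark/wh")
assert(warehousePath(Map.empty) == "spark-warehouse")
```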
[jira] [Updated] (SPARK-12544) Support window functions in SQLContext
[ https://issues.apache.org/jira/browse/SPARK-12544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-12544: Labels: (was: releasenotes) > Support window functions in SQLContext > -- > > Key: SPARK-12544 > URL: https://issues.apache.org/jira/browse/SPARK-12544 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Herman van Hovell > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392980#comment-15392980 ] Saisai Shao commented on SPARK-16723: - {{yarn logs -applicationId application_1467990031555_0089}} > exception in thread main org.apache.spark.sparkexception application finished > with failed status > > > Key: SPARK-16723 > URL: https://issues.apache.org/jira/browse/SPARK-16723 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.2 > Environment: Dataprock cluster from google >Reporter: Asmaa Ali > Labels: beginner > Original Estimate: 60h > Remaining Estimate: 60h > > What is the reason of this exception ?!
[jira] [Comment Edited] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392972#comment-15392972 ] Asmaa Ali edited comment on SPARK-16723 at 7/26/16 1:20 AM: - [~jerryshao] How can I check the AM and executor logs, please?! was (Author: soma): [~saisai_shao] How can I check the AM and executor logs, please?! > exception in thread main org.apache.spark.sparkexception application finished > with failed status > > > Key: SPARK-16723 > URL: https://issues.apache.org/jira/browse/SPARK-16723 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.2 > Environment: Dataprock cluster from google >Reporter: Asmaa Ali > Labels: beginner > Original Estimate: 60h > Remaining Estimate: 60h > > What is the reason of this exception ?!
[jira] [Commented] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392972#comment-15392972 ] Asmaa Ali commented on SPARK-16723: [~saisai_shao] How can I check the AM and executor logs, please?! > exception in thread main org.apache.spark.sparkexception application finished > with failed status > > > Key: SPARK-16723 > URL: https://issues.apache.org/jira/browse/SPARK-16723 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.2 > Environment: Dataprock cluster from google >Reporter: Asmaa Ali > Labels: beginner > Original Estimate: 60h > Remaining Estimate: 60h > > What is the reason of this exception ?!
[jira] [Commented] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392965#comment-15392965 ] Saisai Shao commented on SPARK-16723: - I think you should check the AM and executor logs to see the details, your description doesn't provide the useful information. > exception in thread main org.apache.spark.sparkexception application finished > with failed status > > > Key: SPARK-16723 > URL: https://issues.apache.org/jira/browse/SPARK-16723 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.2 > Environment: Dataprock cluster from google >Reporter: Asmaa Ali > Labels: beginner > Original Estimate: 60h > Remaining Estimate: 60h > > What is the reason of this exception ?! > cancerdetector@cluster-cancerdetector-m:~/SparkBWA/build$ spark-submit > --class SparkBWA --master yarn-cluster -- > conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar --driver-memory > 1500m --executor-memory 1500m --executor-cores 1 --archives ./bwa.zip > --verbose ./SparkBWA.jar -algorithm mem -reads paired -index > /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq > ERR000589_2.filt.fastq Output_ERR000589 > Using properties file: /usr/lib/spark/conf/spark-defaults.conf > Adding default property: > spark.executor.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > Adding default property: > spark.history.fs.logDirectory=hdfs://cluster-cancerdetector-m/user/spark/eventlog > Adding default property: spark.eventLog.enabled=true > Adding default property: spark.driver.maxResultSize=1920m > Adding default property: spark.shuffle.service.enabled=true > Adding default property: > spark.yarn.historyServer.address=cluster-cancerdetector-m:18080 > Adding default property: spark.sql.parquet.cacheMetadata=false > Adding default property: spark.driver.memory=3840m > Adding default property: spark.dynamicAllocation.maxExecutors=1 > Adding default property: 
spark.scheduler.minRegisteredResourcesRatio=0.0 > Adding default property: spark.yarn.am.memoryOverhead=558 > Adding default property: spark.yarn.am.memory=5586m > Adding default property: > spark.driver.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > Adding default property: spark.master=yarn-cluster > Adding default property: spark.executor.memory=5586m > Adding default property: > spark.eventLog.dir=hdfs://cluster-cancerdetector-m/user/spark/eventlog > Adding default property: spark.dynamicAllocation.enabled=true > Adding default property: spark.executor.cores=2 > Adding default property: spark.yarn.executor.memoryOverhead=558 > Adding default property: spark.dynamicAllocation.minExecutors=1 > Adding default property: spark.dynamicAllocation.initialExecutors=1 > Adding default property: spark.akka.frameSize=512 > Parsed arguments: > master yarn-cluster > deployMode null > executorMemory 1500m > executorCores 1 > totalExecutorCores null > propertiesFile /usr/lib/spark/conf/spark-defaults.conf > driverMemory 1500m > driverCores null > driverExtraClassPath null > driverExtraLibraryPath null > driverExtraJavaOptions > -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > supervise false > queue null > numExecutors null > files null > pyFiles null > archives file:/home/cancerdetector/SparkBWA/build/./bwa.zip > mainClass SparkBWA > primaryResource file:/home/cancerdetector/SparkBWA/build/./SparkBWA.jar > name SparkBWA > childArgs [-algorithm mem -reads paired -index /Data/HumanBase/hg38 > -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastq Output_ERR000589] > jars null > packages null > packagesExclusions null > repositories null > verbose true > Spark properties used, including those specified through > --conf and those from the properties file > /usr/lib/spark/conf/spark-defaults.conf: > spark.yarn.am.memoryOverhead -> 558 > spark.driver.memory -> 1500m > spark.yarn.jar -> 
hdfs:///user/spark/spark-assembly.jar > spark.executor.memory -> 5586m > spark.yarn.historyServer.address -> cluster-cancerdetector-m:18080 > spark.eventLog.enabled -> true > spark.scheduler.minRegisteredResourcesRatio -> 0.0 > spark.dynamicAllocation.maxExecutors -> 1 > spark.akka.frameSize -> 512 > spark.executor.extraJavaOptions -> > -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > spark.sql.parquet.cacheMetadata -> false > spark.shuffle.service.enabled -> true > spark.history.fs.logDirectory -> > hdfs://cluster-cancerdetector-m/user/spark/eventlog > spark.dynamicAllocation.initialExecutors -> 1 > spark.dynamicAllocation.minExecutors -> 1 > spark.yarn.executor.memoryOverhead -> 558 >
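Following the suggestion above, the AM and executor logs can be pulled with the YARN CLI once the application finishes. A minimal sketch, assuming YARN log aggregation is enabled; the application id here is a hypothetical placeholder (the real one is printed in the spark-submit/YARN output):

```python
import shlex

# Hypothetical application id -- substitute the one YARN reports for your job.
app_id = "application_1234567890123_0001"

# `yarn logs -applicationId <id>` dumps the aggregated AM and executor logs.
cmd = shlex.split(f"yarn logs -applicationId {app_id}")
print(" ".join(cmd))
# import subprocess; subprocess.run(cmd)  # run on a cluster node with the yarn CLI
```

The AM container's log is usually the first one in the dump and contains the actual failure reason behind "application finished with failed status".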
[jira] [Commented] (SPARK-16727) SparkR unit test fails - incorrect expected output
[ https://issues.apache.org/jira/browse/SPARK-16727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392959#comment-15392959 ] Apache Spark commented on SPARK-16727: -- User 'junyangq' has created a pull request for this issue: https://github.com/apache/spark/pull/14357 > SparkR unit test fails - incorrect expected output > -- > > Key: SPARK-16727 > URL: https://issues.apache.org/jira/browse/SPARK-16727 > Project: Spark > Issue Type: Bug >Reporter: Junyang Qian > > https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L1827 > When I run spark/R/run-tests.sh, the tests failed with the following message: > 1. Failure (at test_sparkSQL.R#1827): describe() and summarize() on a > DataFrame > collect(stats)[4, "name"] not equal to "Andy" > target is NULL, current is character > 2. Failure (at test_sparkSQL.R#1831): describe() and summarize() on a > DataFrame > collect(stats2)[4, "name"] not equal to "Andy" > target is NULL, current is character > Error: Test failures > Execution halted -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16727) SparkR unit test fails - incorrect expected output
[ https://issues.apache.org/jira/browse/SPARK-16727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16727: Assignee: (was: Apache Spark) > SparkR unit test fails - incorrect expected output > -- > > Key: SPARK-16727 > URL: https://issues.apache.org/jira/browse/SPARK-16727 > Project: Spark > Issue Type: Bug >Reporter: Junyang Qian > > https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L1827 > When I run spark/R/run-tests.sh, the tests failed with the following message: > 1. Failure (at test_sparkSQL.R#1827): describe() and summarize() on a > DataFrame > collect(stats)[4, "name"] not equal to "Andy" > target is NULL, current is character > 2. Failure (at test_sparkSQL.R#1831): describe() and summarize() on a > DataFrame > collect(stats2)[4, "name"] not equal to "Andy" > target is NULL, current is character > Error: Test failures > Execution halted -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16727) SparkR unit test fails - incorrect expected output
[ https://issues.apache.org/jira/browse/SPARK-16727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16727: Assignee: Apache Spark > SparkR unit test fails - incorrect expected output > -- > > Key: SPARK-16727 > URL: https://issues.apache.org/jira/browse/SPARK-16727 > Project: Spark > Issue Type: Bug >Reporter: Junyang Qian >Assignee: Apache Spark > > https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L1827 > When I run spark/R/run-tests.sh, the tests failed with the following message: > 1. Failure (at test_sparkSQL.R#1827): describe() and summarize() on a > DataFrame > collect(stats)[4, "name"] not equal to "Andy" > target is NULL, current is character > 2. Failure (at test_sparkSQL.R#1831): describe() and summarize() on a > DataFrame > collect(stats2)[4, "name"] not equal to "Andy" > target is NULL, current is character > Error: Test failures > Execution halted -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16724) Expose DefinedByConstructorParams
[ https://issues.apache.org/jira/browse/SPARK-16724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392953#comment-15392953 ] Apache Spark commented on SPARK-16724: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/14356 > Expose DefinedByConstructorParams > - > > Key: SPARK-16724 > URL: https://issues.apache.org/jira/browse/SPARK-16724 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > > Generally we don't mark things in catalyst/execution as private. Instead > they are not included in scala doc as they are not considered stable APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16724) Expose DefinedByConstructorParams
[ https://issues.apache.org/jira/browse/SPARK-16724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16724: Assignee: Apache Spark (was: Michael Armbrust) > Expose DefinedByConstructorParams > - > > Key: SPARK-16724 > URL: https://issues.apache.org/jira/browse/SPARK-16724 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Apache Spark > > Generally we don't mark things in catalyst/execution as private. Instead > they are not included in scala doc as they are not considered stable APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16724) Expose DefinedByConstructorParams
[ https://issues.apache.org/jira/browse/SPARK-16724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16724: Assignee: Michael Armbrust (was: Apache Spark) > Expose DefinedByConstructorParams > - > > Key: SPARK-16724 > URL: https://issues.apache.org/jira/browse/SPARK-16724 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > > Generally we don't mark things in catalyst/execution as private. Instead > they are not included in scala doc as they are not considered stable APIs -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16702) Driver hangs after executors are lost
[ https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392938#comment-15392938 ] Angus Gerry edited comment on SPARK-16702 at 7/26/16 12:54 AM: --- I'm not so sure about SPARK-12419. SPARK-16533 however definitely looks the same. The logs in my scenario are similar to what's described there. Effectively it's just repetitions of: {noformat} WARN ExecutorAllocationManager: Uncaught exception in thread spark-dynamic-executor-allocation org.apache.spark.SparkException: Error sending message [message = RequestExecutors(...)] WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(383,Container container_e12_1466755357617_0813_01_002077 on host: ... was preempted.)] in 3 attempts WARN NettyRpcEndpointRef: Error sending message [message = KillExecutors(List(450))] in 1 attempts {noformat} was (Author: ango...@gmail.com): I'm not so sure about SPARK-12419. SPARK-16355 however definitely looks the same. The logs in my scenario are similar to what's described there. Effectively it's just repetitions of: {noformat} WARN ExecutorAllocationManager: Uncaught exception in thread spark-dynamic-executor-allocation org.apache.spark.SparkException: Error sending message [message = RequestExecutors(...)] WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(383,Container container_e12_1466755357617_0813_01_002077 on host: ... was preempted.)] in 3 attempts WARN NettyRpcEndpointRef: Error sending message [message = KillExecutors(List(450))] in 1 attempts {noformat} > Driver hangs after executors are lost > - > > Key: SPARK-16702 > URL: https://issues.apache.org/jira/browse/SPARK-16702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Angus Gerry > Attachments: SparkThreadsBlocked.txt > > > It's my first time, please be kind. 
> I'm still trying to debug this error locally - at this stage I'm pretty > convinced that it's a weird deadlock/livelock problem due to the use of > {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is > possibly tangentially related to the issues discussed in SPARK-1560 around > the use of blocking calls within locks. > h4. Observed Behavior > When running a spark job, and executors are lost, the job occassionally goes > into a state where it makes no progress with tasks. Most commonly it seems > that the issue occurs when executors are preempted by yarn, but I'm not > confident enough to state that it's restricted to just this scenario. > Upon inspecting a thread dump from the driver, the following stack traces > seem noteworthy (a full thread dump is attached): > {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)} > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033) > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:190) > org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447) > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423) > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) >
[jira] [Commented] (SPARK-16702) Driver hangs after executors are lost
[ https://issues.apache.org/jira/browse/SPARK-16702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392938#comment-15392938 ] Angus Gerry commented on SPARK-16702: - I'm not so sure about SPARK-12419. SPARK-16355 however definitely looks the same. The logs in my scenario are similar to what's described there. Effectively it's just repetitions of: {noformat} WARN ExecutorAllocationManager: Uncaught exception in thread spark-dynamic-executor-allocation org.apache.spark.SparkException: Error sending message [message = RequestExecutors(...)] WARN NettyRpcEndpointRef: Error sending message [message = RemoveExecutor(383,Container container_e12_1466755357617_0813_01_002077 on host: ... was preempted.)] in 3 attempts WARN NettyRpcEndpointRef: Error sending message [message = KillExecutors(List(450))] in 1 attempts {noformat} > Driver hangs after executors are lost > - > > Key: SPARK-16702 > URL: https://issues.apache.org/jira/browse/SPARK-16702 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Angus Gerry > Attachments: SparkThreadsBlocked.txt > > > It's my first time, please be kind. > I'm still trying to debug this error locally - at this stage I'm pretty > convinced that it's a weird deadlock/livelock problem due to the use of > {{scheduleAtFixedRate}} within {{ExecutorAllocationManager}}. This problem is > possibly tangentially related to the issues discussed in SPARK-1560 around > the use of blocking calls within locks. > h4. Observed Behavior > When running a spark job, and executors are lost, the job occassionally goes > into a state where it makes no progress with tasks. Most commonly it seems > that the issue occurs when executors are preempted by yarn, but I'm not > confident enough to state that it's restricted to just this scenario. 
> Upon inspecting a thread dump from the driver, the following stack traces > seem noteworthy (a full thread dump is attached): > {noformat:title=Thread 178: spark-dynamic-executor-allocation (TIMED_WAITING)} > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033) > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:190) > org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:447) > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1423) > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) > java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {noformat} > {noformat:title=Thread 22: dispatcher-event-loop-10 (BLOCKED)} > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverEndpoint.disableExecutor(CoarseGrainedSchedulerBackend.scala:289) > org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnDriverEndpoint$$anonfun$onDisconnected$1.apply(YarnSchedulerBackend.scala:121) >
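The pattern in the two stack traces above (one thread awaiting a synchronous RPC while holding the scheduler backend's lock, a dispatcher thread blocked on that same lock) can be reduced to a small, hypothetical Python sketch; the thread names mirror the dump but nothing here is Spark code:

```python
import threading
import time

state_lock = threading.Lock()   # stands in for the scheduler backend's monitor

def allocation_thread():
    # spark-dynamic-executor-allocation: takes the lock, then blocks on a
    # synchronous "RPC" whose reply never arrives (cf. askWithRetry above).
    with state_lock:
        threading.Event().wait(timeout=0.3)

def dispatcher_thread(result):
    # dispatcher-event-loop: disableExecutor() needs the same lock and stalls.
    acquired = state_lock.acquire(timeout=0.1)
    result["acquired"] = acquired
    if acquired:
        state_lock.release()

result = {}
t1 = threading.Thread(target=allocation_thread)
t2 = threading.Thread(target=dispatcher_thread, args=(result,))
t1.start()
time.sleep(0.05)                # let the allocation thread take the lock first
t2.start()
t1.join(); t2.join()
# While the RPC is pending, the dispatcher cannot make progress.
```

With `scheduleAtFixedRate` re-arming the allocation task, the pattern repeats indefinitely, which matches the observed hang.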
[jira] [Created] (SPARK-16728) migrate internal API for MLlib trees from spark.mllib to spark.ml
Vladimir Feinberg created SPARK-16728: - Summary: migrate internal API for MLlib trees from spark.mllib to spark.ml Key: SPARK-16728 URL: https://issues.apache.org/jira/browse/SPARK-16728 Project: Spark Issue Type: Sub-task Reporter: Vladimir Feinberg Currently, spark.ml trees rely on spark.mllib implementations. There are two issues with this: 1. spark.ml's GBT TreeBoost algorithm requires storing additional information (the previous ensemble's prediction, for instance) inside the TreePoints (this is necessary to have loss-based splits for complex loss functions). 2. The old impurity API only lets you use summary statistics up to the 2nd order. These are useless for several impurity measures and inadequate for others (e.g., absolute loss or Huber loss). It needs some renovation.
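Point 2 can be made concrete: second-order summary statistics (count, sum, sum of squares) are sufficient for variance-based (L2) impurity, but not for the median that an absolute-loss split would need. A hypothetical sketch of the limitation:

```python
def variance_from_moments(n, s1, s2):
    # L2 impurity needs only 2nd-order sufficient statistics per split candidate:
    # count n, sum s1, and sum of squares s2.
    mean = s1 / n
    return s2 / n - mean * mean

labels = [1.0, 2.0, 6.0]
n, s1, s2 = len(labels), sum(labels), sum(y * y for y in labels)
var = variance_from_moments(n, s1, s2)

# Absolute (L1) loss is minimized by the median, which no fixed-order moment
# summary can recover -- hence the impurity API needs renovation.
median = sorted(labels)[len(labels) // 2]
```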
[jira] [Resolved] (SPARK-15590) Paginate Job Table in Jobs tab
[ https://issues.apache.org/jira/browse/SPARK-15590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-15590. -- Resolution: Fixed Fix Version/s: 2.1.0 > Paginate Job Table in Jobs tab > -- > > Key: SPARK-15590 > URL: https://issues.apache.org/jira/browse/SPARK-15590 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Reporter: Yin Huai >Assignee: Tao Lin > Fix For: 2.1.0 > >
[jira] [Created] (SPARK-16727) SparkR unit test fails - incorrect expected output
Junyang Qian created SPARK-16727: Summary: SparkR unit test fails - incorrect expected output Key: SPARK-16727 URL: https://issues.apache.org/jira/browse/SPARK-16727 Project: Spark Issue Type: Bug Reporter: Junyang Qian https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L1827 When I run spark/R/run-tests.sh, the tests failed with the following message: 1. Failure (at test_sparkSQL.R#1827): describe() and summarize() on a DataFrame collect(stats)[4, "name"] not equal to "Andy" target is NULL, current is character 2. Failure (at test_sparkSQL.R#1831): describe() and summarize() on a DataFrame collect(stats2)[4, "name"] not equal to "Andy" target is NULL, current is character Error: Test failures Execution halted -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16726) Improve error message for `Union` queries on incompatible types
[ https://issues.apache.org/jira/browse/SPARK-16726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16726: -- Description: Currently, `UNION` query on incompatible types shows a misleading error message, e.g., `unresolved operator Union`. We had better show a more correct message. This will help users in the situation of [SPARK-16704|https://issues.apache.org/jira/browse/SPARK-16704] h4. Before {code} scala> sql("select 1 union (select array(1))") org.apache.spark.sql.AnalysisException: unresolved operator 'Union; {code} h4. After {code} scala> sql("select 1 union (select array(1))") org.apache.spark.sql.AnalysisException: Unions can only be performed on tables with the compatible column types, but one table has '[IntegerType]' and another table has '[ArrayType(IntegerType,false)]'; {code} was: Currently, `UNION` query on incompatible types shows a misleading error message, e.g., `unresolved operator Union`. We had better show a more correct message. h4. Before {code} scala> sql("select 1 union (select array(1))") org.apache.spark.sql.AnalysisException: unresolved operator 'Union; {code} h4. After {code} scala> sql("select 1 union (select array(1))") org.apache.spark.sql.AnalysisException: Unions can only be performed on tables with the compatible column types, but one table has '[IntegerType]' and another table has '[ArrayType(IntegerType,false)]'; {code} > Improve error message for `Union` queries on incompatible types > --- > > Key: SPARK-16726 > URL: https://issues.apache.org/jira/browse/SPARK-16726 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Priority: Minor > > Currently, `UNION` query on incompatible types shows a misleading error > message, e.g., `unresolved operator Union`. We had better show a more correct > message. This will help users in the situation of > [SPARK-16704|https://issues.apache.org/jira/browse/SPARK-16704] > h4. 
Before > {code} > scala> sql("select 1 union (select array(1))") > org.apache.spark.sql.AnalysisException: unresolved operator 'Union; > {code} > h4. After > {code} > scala> sql("select 1 union (select array(1))") > org.apache.spark.sql.AnalysisException: Unions can only be performed on > tables with the compatible column types, but one table has '[IntegerType]' > and another table has '[ArrayType(IntegerType,false)]'; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16718) gbm-style treeboost
[ https://issues.apache.org/jira/browse/SPARK-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392914#comment-15392914 ] Vladimir Feinberg commented on SPARK-16718: --- L1 support for loss-based impurity will be delayed until there's a new internal API for GBTs in spark.ml > gbm-style treeboost > --- > > Key: SPARK-16718 > URL: https://issues.apache.org/jira/browse/SPARK-16718 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Vladimir Feinberg >Assignee: Vladimir Feinberg > > As an initial minimal change, we should provide TreeBoost as implemented in > GBM for L1, L2, and logistic losses: by introducing a new "loss-based" > impurity, tree leaves in GBTs can have loss-optimal predictions for their > partition of the data. > The commit should include evidence of accuracy improvement.
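"Loss-optimal predictions" for a leaf's partition of the data can be illustrated minimally: the constant minimizing L2 loss is the mean, and for L1 loss it is the median. This is a sketch of the idea only, not the spark.ml implementation:

```python
def optimal_leaf_value(residuals, loss):
    # The constant prediction minimizing the given loss over a leaf's rows:
    # mean for squared error (L2), median for absolute error (L1).
    r = sorted(residuals)
    if loss == "l2":
        return sum(r) / len(r)
    if loss == "l1":
        mid = len(r) // 2
        return r[mid] if len(r) % 2 else (r[mid - 1] + r[mid]) / 2.0
    raise ValueError(f"unsupported loss: {loss}")

# An outlier pulls the L2 leaf value toward it but leaves the L1 value alone:
l2 = optimal_leaf_value([1.0, 2.0, 9.0], "l2")
l1 = optimal_leaf_value([1.0, 2.0, 9.0], "l1")
```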
[jira] [Updated] (SPARK-16726) Improve error message for `Union` queries on incompatible types
[ https://issues.apache.org/jira/browse/SPARK-16726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16726: -- Summary: Improve error message for `Union` queries on incompatible types (was: Improve error message for `Union` queries for incompatible types) > Improve error message for `Union` queries on incompatible types > --- > > Key: SPARK-16726 > URL: https://issues.apache.org/jira/browse/SPARK-16726 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Priority: Minor > > Currently, `UNION` query on incompatible types shows a misleading error > message, e.g., `unresolved operator Union`. We had better show a more correct > message. > h4. Before > {code} > scala> sql("select 1 union (select array(1))") > org.apache.spark.sql.AnalysisException: unresolved operator 'Union; > {code} > h4. After > {code} > scala> sql("select 1 union (select array(1))") > org.apache.spark.sql.AnalysisException: Unions can only be performed on > tables with the compatible column types, but one table has '[IntegerType]' > and another table has '[ArrayType(IntegerType,false)]'; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16726) Improve error message for `Union` queries for incompatible types
[ https://issues.apache.org/jira/browse/SPARK-16726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392899#comment-15392899 ] Apache Spark commented on SPARK-16726: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/14355 > Improve error message for `Union` queries for incompatible types > > > Key: SPARK-16726 > URL: https://issues.apache.org/jira/browse/SPARK-16726 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Priority: Minor > > Currently, `UNION` query on incompatible types shows a misleading error > message, e.g., `unresolved operator Union`. We had better show a more correct > message. > h4. Before > {code} > scala> sql("select 1 union (select array(1))") > org.apache.spark.sql.AnalysisException: unresolved operator 'Union; > {code} > h4. After > {code} > scala> sql("select 1 union (select array(1))") > org.apache.spark.sql.AnalysisException: Unions can only be performed on > tables with the compatible column types, but one table has '[IntegerType]' > and another table has '[ArrayType(IntegerType,false)]'; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16726) Improve error message for `Union` queries for incompatible types
[ https://issues.apache.org/jira/browse/SPARK-16726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16726: Assignee: (was: Apache Spark) > Improve error message for `Union` queries for incompatible types > > > Key: SPARK-16726 > URL: https://issues.apache.org/jira/browse/SPARK-16726 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Priority: Minor > > Currently, `UNION` query on incompatible types shows a misleading error > message, e.g., `unresolved operator Union`. We had better show a more correct > message. > h4. Before > {code} > scala> sql("select 1 union (select array(1))") > org.apache.spark.sql.AnalysisException: unresolved operator 'Union; > {code} > h4. After > {code} > scala> sql("select 1 union (select array(1))") > org.apache.spark.sql.AnalysisException: Unions can only be performed on > tables with the compatible column types, but one table has '[IntegerType]' > and another table has '[ArrayType(IntegerType,false)]'; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16726) Improve error message for `Union` queries for incompatible types
[ https://issues.apache.org/jira/browse/SPARK-16726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16726: Assignee: Apache Spark > Improve error message for `Union` queries for incompatible types > > > Key: SPARK-16726 > URL: https://issues.apache.org/jira/browse/SPARK-16726 > Project: Spark > Issue Type: Improvement >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > Currently, `UNION` query on incompatible types shows a misleading error > message, e.g., `unresolved operator Union`. We had better show a more correct > message. > h4. Before > {code} > scala> sql("select 1 union (select array(1))") > org.apache.spark.sql.AnalysisException: unresolved operator 'Union; > {code} > h4. After > {code} > scala> sql("select 1 union (select array(1))") > org.apache.spark.sql.AnalysisException: Unions can only be performed on > tables with the compatible column types, but one table has '[IntegerType]' > and another table has '[ArrayType(IntegerType,false)]'; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16726) Improve error message for `Union` queries for incompatible types
Dongjoon Hyun created SPARK-16726: - Summary: Improve error message for `Union` queries for incompatible types Key: SPARK-16726 URL: https://issues.apache.org/jira/browse/SPARK-16726 Project: Spark Issue Type: Improvement Reporter: Dongjoon Hyun Priority: Minor Currently, a `UNION` query on incompatible types shows a misleading error message, e.g., `unresolved operator Union`. We should show a more accurate message. h4. Before {code} scala> sql("select 1 union (select array(1))") org.apache.spark.sql.AnalysisException: unresolved operator 'Union; {code} h4. After {code} scala> sql("select 1 union (select array(1))") org.apache.spark.sql.AnalysisException: Unions can only be performed on tables with the compatible column types, but one table has '[IntegerType]' and another table has '[ArrayType(IntegerType,false)]'; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
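The shape of the improved check can be illustrated with a small stand-alone sketch (plain Python with hypothetical names, not Spark's actual analyzer code): compare the two tables' column types and raise with both schemas in the message instead of a bare "unresolved operator".

```python
# Sketch of the column-type compatibility check behind the improved message.
# Hypothetical stand-alone code; Spark's real check lives in the analyzer.

def check_union_types(left_types, right_types):
    """Raise a descriptive error when two tables cannot be unioned."""
    if left_types != right_types:
        raise ValueError(
            "Unions can only be performed on tables with the compatible "
            f"column types, but one table has '{left_types}' "
            f"and another table has '{right_types}'"
        )

# The query `select 1 union (select array(1))` compares these schemas:
try:
    check_union_types(["IntegerType"], ["ArrayType(IntegerType,false)"])
except ValueError as e:
    print(e)  # message now names both mismatched schemas
```

The point of the change is only the error text: the query still fails, but the user can see which column types disagree.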
[jira] [Created] (SPARK-16725) Migrate Guava to 16+
Min Wei created SPARK-16725: --- Summary: Migrate Guava to 16+ Key: SPARK-16725 URL: https://issues.apache.org/jira/browse/SPARK-16725 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.0.1 Reporter: Min Wei Fix For: 2.0.1 Currently Spark depends on an old version of Guava, version 14. However Spark-cassandra driver asserts on Guava version 16 and above. It would be great to update the Guava dependency to version 16+ diff --git a/core/src/main/scala/org/apache/spark/SecurityManager.scala b/core/src/main/scala/org/apache/spark/SecurityManager.scala index f72c7de..abddafe 100644 --- a/core/src/main/scala/org/apache/spark/SecurityManager.scala +++ b/core/src/main/scala/org/apache/spark/SecurityManager.scala @@ -23,7 +23,7 @@ import java.security.{KeyStore, SecureRandom} import java.security.cert.X509Certificate import javax.net.ssl._ -import com.google.common.hash.HashCodes +import com.google.common.hash.HashCode import com.google.common.io.Files import org.apache.hadoop.io.Text @@ -432,7 +432,7 @@ private[spark] class SecurityManager(sparkConf: SparkConf) val secret = new Array[Byte](length) rnd.nextBytes(secret) -val cookie = HashCodes.fromBytes(secret).toString() +val cookie = HashCode.fromBytes(secret).toString() SparkHadoopUtil.get.addSecretKeyToUserCredentials(SECRET_LOOKUP_KEY, cookie) cookie } else { diff --git a/core/src/main/scala/org/apache/spark/SparkEnv.scala b/core/src/main/scala/org/apache/spark/SparkEnv.scala index af50a6d..02545ae 100644 --- a/core/src/main/scala/org/apache/spark/SparkEnv.scala +++ b/core/src/main/scala/org/apache/spark/SparkEnv.scala @@ -72,7 +72,7 @@ class SparkEnv ( // A general, soft-reference map for metadata needed during HadoopRDD split computation // (e.g., HadoopFileRDD uses this to cache JobConfs and InputFormats). 
- private[spark] val hadoopJobMetadata = new MapMaker().softValues().makeMap[String, Any]() + private[spark] val hadoopJobMetadata = new MapMaker().weakValues().makeMap[String, Any]() private[spark] var driverTmpDir: Option[String] = None diff --git a/pom.xml b/pom.xml index d064cb5..7c3e036 100644 --- a/pom.xml +++ b/pom.xml @@ -368,8 +368,7 @@ com.google.guava guava -14.0.1 -provided +19.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16724) Expose DefinedByConstructorParams
Michael Armbrust created SPARK-16724: Summary: Expose DefinedByConstructorParams Key: SPARK-16724 URL: https://issues.apache.org/jira/browse/SPARK-16724 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Generally we don't mark things in catalyst/execution as private. Instead, they are not included in the Scaladoc, as they are not considered stable APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asmaa Ali updated SPARK-16723: --- Summary: exception in thread main org.apache.spark.sparkexception application finished with failed status (was: exception in thread main org.apache.spark.sparkexception application finished with failed status #19) > exception in thread main org.apache.spark.sparkexception application finished > with failed status > > > Key: SPARK-16723 > URL: https://issues.apache.org/jira/browse/SPARK-16723 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.2 > Environment: Dataprock cluster from google >Reporter: Asmaa Ali > Labels: beginner > Original Estimate: 60h > Remaining Estimate: 60h > > What is the reason of this exception ?! > cancerdetector@cluster-cancerdetector-m:~/SparkBWA/build$ spark-submit > --class SparkBWA --master yarn-cluster -- > conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar --driver-memory > 1500m --executor-memory 1500m --executor-cores 1 --archives ./bwa.zip > --verbose ./SparkBWA.jar -algorithm mem -reads paired -index > /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq > ERR000589_2.filt.fastq Output_ERR000589 > Using properties file: /usr/lib/spark/conf/spark-defaults.conf > Adding default property: > spark.executor.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > Adding default property: > spark.history.fs.logDirectory=hdfs://cluster-cancerdetector-m/user/spark/eventlog > Adding default property: spark.eventLog.enabled=true > Adding default property: spark.driver.maxResultSize=1920m > Adding default property: spark.shuffle.service.enabled=true > Adding default property: > spark.yarn.historyServer.address=cluster-cancerdetector-m:18080 > Adding default property: spark.sql.parquet.cacheMetadata=false > Adding default property: spark.driver.memory=3840m > Adding default property: 
spark.dynamicAllocation.maxExecutors=1 > Adding default property: spark.scheduler.minRegisteredResourcesRatio=0.0 > Adding default property: spark.yarn.am.memoryOverhead=558 > Adding default property: spark.yarn.am.memory=5586m > Adding default property: > spark.driver.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > Adding default property: spark.master=yarn-cluster > Adding default property: spark.executor.memory=5586m > Adding default property: > spark.eventLog.dir=hdfs://cluster-cancerdetector-m/user/spark/eventlog > Adding default property: spark.dynamicAllocation.enabled=true > Adding default property: spark.executor.cores=2 > Adding default property: spark.yarn.executor.memoryOverhead=558 > Adding default property: spark.dynamicAllocation.minExecutors=1 > Adding default property: spark.dynamicAllocation.initialExecutors=1 > Adding default property: spark.akka.frameSize=512 > Parsed arguments: > master yarn-cluster > deployMode null > executorMemory 1500m > executorCores 1 > totalExecutorCores null > propertiesFile /usr/lib/spark/conf/spark-defaults.conf > driverMemory 1500m > driverCores null > driverExtraClassPath null > driverExtraLibraryPath null > driverExtraJavaOptions > -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > supervise false > queue null > numExecutors null > files null > pyFiles null > archives file:/home/cancerdetector/SparkBWA/build/./bwa.zip > mainClass SparkBWA > primaryResource file:/home/cancerdetector/SparkBWA/build/./SparkBWA.jar > name SparkBWA > childArgs [-algorithm mem -reads paired -index /Data/HumanBase/hg38 > -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastq Output_ERR000589] > jars null > packages null > packagesExclusions null > repositories null > verbose true > Spark properties used, including those specified through > --conf and those from the properties file > /usr/lib/spark/conf/spark-defaults.conf: > spark.yarn.am.memoryOverhead -> 
558 > spark.driver.memory -> 1500m > spark.yarn.jar -> hdfs:///user/spark/spark-assembly.jar > spark.executor.memory -> 5586m > spark.yarn.historyServer.address -> cluster-cancerdetector-m:18080 > spark.eventLog.enabled -> true > spark.scheduler.minRegisteredResourcesRatio -> 0.0 > spark.dynamicAllocation.maxExecutors -> 1 > spark.akka.frameSize -> 512 > spark.executor.extraJavaOptions -> > -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > spark.sql.parquet.cacheMetadata -> false > spark.shuffle.service.enabled -> true > spark.history.fs.logDirectory -> > hdfs://cluster-cancerdetector-m/user/spark/eventlog > spark.dynamicAllocation.initialExecutors -> 1 > spark.dynamicAllocation.minExecutors -> 1 >
[jira] [Created] (SPARK-16723) exception in thread main org.apache.spark.sparkexception application finished with failed status #19
Asmaa Ali created SPARK-16723: -- Summary: exception in thread main org.apache.spark.sparkexception application finished with failed status #19 Key: SPARK-16723 URL: https://issues.apache.org/jira/browse/SPARK-16723 Project: Spark Issue Type: Question Components: Streaming Affects Versions: 1.6.2 Environment: Dataprock cluster from google Reporter: Asmaa Ali What is the reason of this exception ?! cancerdetector@cluster-cancerdetector-m:~/SparkBWA/build$ spark-submit --class SparkBWA --master yarn-cluster -- conf spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar --driver-memory 1500m --executor-memory 1500m --executor-cores 1 --archives ./bwa.zip --verbose ./SparkBWA.jar -algorithm mem -reads paired -index /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastq Output_ERR000589 Using properties file: /usr/lib/spark/conf/spark-defaults.conf Adding default property: spark.executor.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar Adding default property: spark.history.fs.logDirectory=hdfs://cluster-cancerdetector-m/user/spark/eventlog Adding default property: spark.eventLog.enabled=true Adding default property: spark.driver.maxResultSize=1920m Adding default property: spark.shuffle.service.enabled=true Adding default property: spark.yarn.historyServer.address=cluster-cancerdetector-m:18080 Adding default property: spark.sql.parquet.cacheMetadata=false Adding default property: spark.driver.memory=3840m Adding default property: spark.dynamicAllocation.maxExecutors=1 Adding default property: spark.scheduler.minRegisteredResourcesRatio=0.0 Adding default property: spark.yarn.am.memoryOverhead=558 Adding default property: spark.yarn.am.memory=5586m Adding default property: spark.driver.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar Adding default property: spark.master=yarn-cluster Adding default property: spark.executor.memory=5586m Adding default 
property: spark.eventLog.dir=hdfs://cluster-cancerdetector-m/user/spark/eventlog Adding default property: spark.dynamicAllocation.enabled=true Adding default property: spark.executor.cores=2 Adding default property: spark.yarn.executor.memoryOverhead=558 Adding default property: spark.dynamicAllocation.minExecutors=1 Adding default property: spark.dynamicAllocation.initialExecutors=1 Adding default property: spark.akka.frameSize=512 Parsed arguments: master yarn-cluster deployMode null executorMemory 1500m executorCores 1 totalExecutorCores null propertiesFile /usr/lib/spark/conf/spark-defaults.conf driverMemory 1500m driverCores null driverExtraClassPath null driverExtraLibraryPath null driverExtraJavaOptions -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar supervise false queue null numExecutors null files null pyFiles null archives file:/home/cancerdetector/SparkBWA/build/./bwa.zip mainClass SparkBWA primaryResource file:/home/cancerdetector/SparkBWA/build/./SparkBWA.jar name SparkBWA childArgs [-algorithm mem -reads paired -index /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastq Output_ERR000589] jars null packages null packagesExclusions null repositories null verbose true Spark properties used, including those specified through --conf and those from the properties file /usr/lib/spark/conf/spark-defaults.conf: spark.yarn.am.memoryOverhead -> 558 spark.driver.memory -> 1500m spark.yarn.jar -> hdfs:///user/spark/spark-assembly.jar spark.executor.memory -> 5586m spark.yarn.historyServer.address -> cluster-cancerdetector-m:18080 spark.eventLog.enabled -> true spark.scheduler.minRegisteredResourcesRatio -> 0.0 spark.dynamicAllocation.maxExecutors -> 1 spark.akka.frameSize -> 512 spark.executor.extraJavaOptions -> -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar spark.sql.parquet.cacheMetadata -> false spark.shuffle.service.enabled -> true spark.history.fs.logDirectory -> 
hdfs://cluster-cancerdetector-m/user/spark/eventlog spark.dynamicAllocation.initialExecutors -> 1 spark.dynamicAllocation.minExecutors -> 1 spark.yarn.executor.memoryOverhead -> 558 spark.driver.extraJavaOptions -> -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar spark.eventLog.dir -> hdfs://cluster-cancerdetector-m/user/spark/eventlog spark.yarn.am.memory -> 5586m spark.driver.maxResultSize -> 1920m spark.master -> yarn-cluster spark.dynamicAllocation.enabled -> true spark.executor.cores -> 2 Main class: org.apache.spark.deploy.yarn.Client Arguments: --name SparkBWA --driver-memory 1500m --executor-memory 1500m --executor-cores 1 --archives file:/home/cancerdetector/SparkBWA/build/./bwa.zip --jar file:/home/cancerdetector/SparkBWA/build/./SparkBWA.jar --class SparkBWA --arg -algorithm --arg mem
[jira] [Assigned] (SPARK-16722) Fix a StreamingContext leak in StreamingContextSuite when eventually fails
[ https://issues.apache.org/jira/browse/SPARK-16722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16722: Assignee: Shixiong Zhu (was: Apache Spark) > Fix a StreamingContext leak in StreamingContextSuite when eventually fails > -- > > Key: SPARK-16722 > URL: https://issues.apache.org/jira/browse/SPARK-16722 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > This patch moves `ssc.stop()` into `finally` for > `StreamingContextSuite.createValidCheckpoint` to avoid leaking a > StreamingContext since leaking a StreamingContext will fail a lot of tests > and make us hard to find the real failure one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16722) Fix a StreamingContext leak in StreamingContextSuite when eventually fails
Shixiong Zhu created SPARK-16722: Summary: Fix a StreamingContext leak in StreamingContextSuite when eventually fails Key: SPARK-16722 URL: https://issues.apache.org/jira/browse/SPARK-16722 Project: Spark Issue Type: Test Components: Tests Reporter: Shixiong Zhu Assignee: Shixiong Zhu This patch moves `ssc.stop()` into `finally` for `StreamingContextSuite.createValidCheckpoint` to avoid leaking a StreamingContext, since a leaked StreamingContext will fail a lot of tests and make it hard to find the real failure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
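The fix described above is the standard try/finally cleanup pattern: stop the context even when the checkpoint step throws. A minimal sketch (plain Python with a toy stand-in class, not Spark's code):

```python
# Toy stand-in for StreamingContext, to show why stop() belongs in `finally`.

class FakeStreamingContext:
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

def create_valid_checkpoint(ssc, fail=False):
    """Mirrors the fixed test helper: stop() runs on success *and* failure."""
    try:
        if fail:
            raise RuntimeError("eventually failed")  # simulated test failure
        return "checkpoint-dir"
    finally:
        ssc.stop()  # always reached, so the context is never leaked
```

Without the `finally`, a failing `eventually` would skip `stop()` and the leaked context would poison later tests, which is exactly the symptom the JIRA describes.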
[jira] [Commented] (SPARK-16722) Fix a StreamingContext leak in StreamingContextSuite when eventually fails
[ https://issues.apache.org/jira/browse/SPARK-16722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392842#comment-15392842 ] Apache Spark commented on SPARK-16722: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/14354 > Fix a StreamingContext leak in StreamingContextSuite when eventually fails > -- > > Key: SPARK-16722 > URL: https://issues.apache.org/jira/browse/SPARK-16722 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > This patch moves `ssc.stop()` into `finally` for > `StreamingContextSuite.createValidCheckpoint` to avoid leaking a > StreamingContext since leaking a StreamingContext will fail a lot of tests > and make us hard to find the real failure one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16722) Fix a StreamingContext leak in StreamingContextSuite when eventually fails
[ https://issues.apache.org/jira/browse/SPARK-16722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16722: Assignee: Apache Spark (was: Shixiong Zhu) > Fix a StreamingContext leak in StreamingContextSuite when eventually fails > -- > > Key: SPARK-16722 > URL: https://issues.apache.org/jira/browse/SPARK-16722 > Project: Spark > Issue Type: Test > Components: Tests >Reporter: Shixiong Zhu >Assignee: Apache Spark > > This patch moves `ssc.stop()` into `finally` for > `StreamingContextSuite.createValidCheckpoint` to avoid leaking a > StreamingContext since leaking a StreamingContext will fail a lot of tests > and make us hard to find the real failure one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16721) Lead/lag needs to respect nulls
Yin Huai created SPARK-16721: Summary: Lead/lag needs to respect nulls Key: SPARK-16721 URL: https://issues.apache.org/jira/browse/SPARK-16721 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Yin Huai It seems 2.0.0 changed the behavior of lead and lag to ignore nulls. This PR changes the behavior back to 1.6's behavior, which respects nulls. For example {code} SELECT b, lag(a, 1, 321) OVER (ORDER BY b) as lag, lead(a, 1, 321) OVER (ORDER BY b) as lead FROM (SELECT cast(null as int) as a, 1 as b UNION ALL select cast(null as int) as id, 2 as b) tmp {code} This query should return {code} +---+----+----+ | b| lag|lead| +---+----+----+ | 1| 321|null| | 2|null| 321| +---+----+----+ {code} instead of {code} +---+---+----+ | b|lag|lead| +---+---+----+ | 1|321| 321| | 2|321| 321| +---+---+----+ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
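The distinction can be sketched outside Spark. In these simplified Python stand-ins (not Spark's implementation), the default is used only when the offset row does not exist; a stored null is returned as-is, which is the "respect nulls" behavior the issue asks for:

```python
# Simplified lag/lead that *respect* nulls (None): the default replaces a
# missing row, never a null value stored in an existing row.

def lag(col, i, offset=1, default=None):
    return col[i - offset] if i - offset >= 0 else default

def lead(col, i, offset=1, default=None):
    return col[i + offset] if i + offset < len(col) else default

a = [None, None]  # column a: two null values, ordered by b = 1, 2
for i, b in enumerate([1, 2]):
    print(b, lag(a, i, default=321), lead(a, i, default=321))
# 1 321 None
# 2 None 321
```

This reproduces the "should return" table above: row b=1 has no previous row (default 321) and a null next value; row b=2 is the mirror image.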
[jira] [Commented] (SPARK-16633) lag/lead using constant input values does not return the default value when the offset row does not exist
[ https://issues.apache.org/jira/browse/SPARK-16633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392824#comment-15392824 ] Yin Huai commented on SPARK-16633: -- https://issues.apache.org/jira/browse/SPARK-16721 > lag/lead using constant input values does not return the default value when > the offset row does not exist > - > > Key: SPARK-16633 > URL: https://issues.apache.org/jira/browse/SPARK-16633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > Attachments: window_function_bug.html > > > Please see the attached notebook. Seems lag/lead somehow fail to recognize > that a offset row does not exist and generate wrong results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16721) Lead/lag needs to respect nulls
[ https://issues.apache.org/jira/browse/SPARK-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-16721: Assignee: Yin Huai > Lead/lag needs to respect nulls > > > Key: SPARK-16721 > URL: https://issues.apache.org/jira/browse/SPARK-16721 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Yin Huai > > Seems 2.0.0 changes the behavior of lead and lag to ignore nulls. This PR is > changing the behavior back to 1.6's behavior, which is respecting nulls. > For example > {code} > SELECT > b, > lag(a, 1, 321) OVER (ORDER BY b) as lag, > lead(a, 1, 321) OVER (ORDER BY b) as lead > FROM (SELECT cast(null as int) as a, 1 as b > UNION ALL > select cast(null as int) as id, 2 as b) tmp > {code} > This query should return > {code} > +---+++ > | b| lag|lead| > +---+++ > | 1| 321|null| > | 2|null| 321| > +---+++ > {code} > instead of > {code} > +---+---++ > | b|lag|lead| > +---+---++ > | 1|321| 321| > | 2|321| 321| > +---+---++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16721) Lead/lag needs to respect nulls
[ https://issues.apache.org/jira/browse/SPARK-16721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392826#comment-15392826 ] Apache Spark commented on SPARK-16721: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/14284 > Lead/lag needs to respect nulls > > > Key: SPARK-16721 > URL: https://issues.apache.org/jira/browse/SPARK-16721 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Yin Huai > > Seems 2.0.0 changes the behavior of lead and lag to ignore nulls. This PR is > changing the behavior back to 1.6's behavior, which is respecting nulls. > For example > {code} > SELECT > b, > lag(a, 1, 321) OVER (ORDER BY b) as lag, > lead(a, 1, 321) OVER (ORDER BY b) as lead > FROM (SELECT cast(null as int) as a, 1 as b > UNION ALL > select cast(null as int) as id, 2 as b) tmp > {code} > This query should return > {code} > +---+++ > | b| lag|lead| > +---+++ > | 1| 321|null| > | 2|null| 321| > +---+++ > {code} > instead of > {code} > +---+---++ > | b|lag|lead| > +---+---++ > | 1|321| 321| > | 2|321| 321| > +---+---++ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16715) Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic equals and hash"
[ https://issues.apache.org/jira/browse/SPARK-16715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-16715. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 > Fix a potential ExprId conflict for SubexpressionEliminationSuite."Semantic > equals and hash" > > > Key: SPARK-16715 > URL: https://issues.apache.org/jira/browse/SPARK-16715 > Project: Spark > Issue Type: Bug > Components: Tests >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > Fix For: 2.0.1, 2.1.0 > > > SubexpressionEliminationSuite."Semantic equals and hash" assumes the default > AttributeReference's exprId won't be "ExprId(1)". However, that depends on > when this test runs. It may happen to use "ExprId(1)". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16633) lag/lead using constant input values does not return the default value when the offset row does not exist
[ https://issues.apache.org/jira/browse/SPARK-16633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-16633: - Summary: lag/lead using constant input values does not return the default value when the offset row does not exist (was: lag/lead does not return the default value when the offset row does not exist) > lag/lead using constant input values does not return the default value when > the offset row does not exist > - > > Key: SPARK-16633 > URL: https://issues.apache.org/jira/browse/SPARK-16633 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > Attachments: window_function_bug.html > > > Please see the attached notebook. Seems lag/lead somehow fail to recognize > that a offset row does not exist and generate wrong results. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16717) Dataframe (jdbc) is missing a way to link an external function to get a connection
[ https://issues.apache.org/jira/browse/SPARK-16717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392814#comment-15392814 ] Dongjoon Hyun commented on SPARK-16717: --- Hi, [~rjtokenring]. This doesn't seem to be about Spark 1.3.0. Did you really mean 1.3.0? > Dataframe (jdbc) is missing a way to link an external function to get a > connection > --- > > Key: SPARK-16717 > URL: https://issues.apache.org/jira/browse/SPARK-16717 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.3.0 >Reporter: Marco Colombo > > In JdbcRDD it was possible to use a function to get a JDBC connection. This > allowed external handling of the connections, which is no longer possible > with DataFrames. > Please consider an addition to DataFrames for using an externally provided > connectionFactory (such as a connection pool) in order to make data loading > more efficient, avoiding connection close/recreation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
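The JdbcRDD-style pattern the reporter refers to — the caller supplies a zero-argument connection factory, so pooling and reuse stay under the caller's control — can be sketched in plain Python (all names here are hypothetical illustrations, not a real Spark or JDBC API):

```python
# Sketch of an externally supplied connection factory backed by a toy pool,
# so the data-loading code never opens or closes connections itself.

class PooledConnections:
    """Toy pool: hands out and reclaims opaque connection objects."""
    def __init__(self, size):
        self.free = [object() for _ in range(size)]

    def acquire(self):
        return self.free.pop()

    def release(self, conn):
        self.free.append(conn)

def load_partition(get_connection, release):
    """Reads one partition using a caller-provided factory (hypothetical)."""
    conn = get_connection()  # factory call, not a hard-coded DriverManager
    try:
        return f"rows read via connection {id(conn)}"
    finally:
        release(conn)        # connection returns to the pool, not closed

pool = PooledConnections(size=2)
result = load_partition(pool.acquire, pool.release)
assert len(pool.free) == 2   # nothing leaked; connection was reused, not destroyed
```

The design point is the inversion: because `load_partition` only sees callables, the same loading code works with a pool, a single shared connection, or a fresh-connection-per-call factory.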
[jira] [Assigned] (SPARK-16714) Fail to create decimal arrays with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16714: Assignee: Apache Spark > Fail to create decimal arrays with literals having different inferred > precisions and scales > -- > > Key: SPARK-16714 > URL: https://issues.apache.org/jira/browse/SPARK-16714 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai >Assignee: Apache Spark > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > > {code} > select array(0.001, 0.02) > {code} > causes > {code} > org.apache.spark.sql.AnalysisException: cannot resolve 'array(CAST(0.001 AS > DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))' due to data type mismatch: input > to function array should all be the same type, but it's [decimal(3,3), > decimal(2,2)]; line 1 pos 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3162: - Target Version/s: 2.1.0 > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
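Option (2) above — interleave local and distributed training as individual branches become small enough — hinges on a per-node decision at each frontier. A minimal sketch of that decision (plain Python, hypothetical names and threshold; not Spark's planner):

```python
# Sketch of the per-frontier planning step for option (2): nodes whose
# matched training data fits in one worker's memory train locally; the
# rest continue with distributed level-by-level training.

def plan_frontier(frontier, max_local_rows):
    """frontier: list of (node_id, rows_matched) pairs; returns (local, distributed)."""
    local, distributed = [], []
    for node_id, rows in frontier:
        (local if rows <= max_local_rows else distributed).append(node_id)
    return local, distributed

local, dist = plan_frontier([("n1", 500), ("n2", 50_000)], max_local_rows=10_000)
print(local, dist)  # ['n1'] ['n2']
```

In a real implementation `max_local_rows` would be derived from worker memory and the per-row footprint; the sketch only shows where the branch point sits.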
[jira] [Updated] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation
[ https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3723: - Target Version/s: 2.1.0 > DecisionTree, RandomForest: Add more instrumentation > > > Key: SPARK-3723 > URL: https://issues.apache.org/jira/browse/SPARK-3723 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Some simple instrumentation would help advanced users understand performance, > and to check whether parameters (such as maxMemoryInMB) need to be tuned. > Most important instrumentation (simple): > * min, avg, max nodes per group > * number of groups (passes over data) > More advanced instrumentation: > * For each tree (or averaged over trees), training set accuracy after > training each level. This would be useful for visualizing learning behavior > (to convince oneself that model selection was being done correctly). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16714) Fail to create decimal arrays with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392799#comment-15392799 ] Apache Spark commented on SPARK-16714: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/14353 > Fail to create decimal arrays with literals having different inferred > precisions and scales > -- > > Key: SPARK-16714 > URL: https://issues.apache.org/jira/browse/SPARK-16714 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > > {code} > select array(0.001, 0.02) > {code} > causes > {code} > org.apache.spark.sql.AnalysisException: cannot resolve 'array(CAST(0.001 AS > DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))' due to data type mismatch: input > to function array should all be the same type, but it's [decimal(3,3), > decimal(2,2)]; line 1 pos 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16714) Fail to create a decimal array with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16714: Assignee: (was: Apache Spark) > Fail to create a decimal arrays with literals having different inferred > precessions and scales > -- > > Key: SPARK-16714 > URL: https://issues.apache.org/jira/browse/SPARK-16714 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > > {code} > select array(0.001, 0.02) > {code} > causes > {code} > org.apache.spark.sql.AnalysisException: cannot resolve 'array(CAST(0.001 AS > DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))' due to data type mismatch: input > to function array should all be the same type, but it's [decimal(3,3), > decimal(2,2)]; line 1 pos 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
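The mismatch arises because Spark infers a separate precision and scale for each literal, while `array` requires one common element type. A minimal sketch of the usual decimal-widening rule (result keeps the larger scale and the larger count of integral digits) — this is the common rule for unifying decimal types, assumed here rather than read from Spark's `DecimalPrecision` source:

```python
def widen_decimal(p1, s1, p2, s2):
    """Unify decimal(p1, s1) and decimal(p2, s2) into one type able
    to hold values of both (assumed common widening rule)."""
    scale = max(s1, s2)                  # keep the larger fractional part
    integral = max(p1 - s1, p2 - s2)     # keep the larger integral part
    return (integral + scale, scale)

# 0.001 is inferred as decimal(3,3); 0.02 as decimal(2,2).
# A single type able to hold both is decimal(3,3).
print(widen_decimal(3, 3, 2, 2))  # -> (3, 3)
```

As a user-side workaround until the analyzer unifies the types itself, both literals can be cast explicitly, e.g. `select array(cast(0.001 as decimal(3,3)), cast(0.02 as decimal(3,3)))`.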
[jira] [Issue Comment Deleted] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3717: - Comment: was deleted (was: [~manishamde][~josephkb] This is very annoying from you people.I have put forward a working implementation and there is no response from you people.I know that my implementation may not be a perfect one,but I can work on it to improve that .I'm ready to take up any suggestions from you people regarding the improvements. But the way you people are responding after asking for architecture and me uploading the implementation is not encouraging .This will certainly discourage people from contributing to spark.) > DecisionTree, RandomForest: Partition by feature > > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > h1. Summary > Currently, data are partitioned by row/instance for DecisionTree and > RandomForest. This JIRA argues for partitioning by feature for training deep > trees. This is especially relevant for random forests, which are often > trained to be deeper than single decision trees. > h1. Details > Dataset dimensions and the depth of the tree to be trained are the main > problem parameters determining whether it is better to partition features or > instances. For random forests (training many deep trees), partitioning > features could be much better. > Notation: > * P = # workers > * N = # instances > * M = # features > * D = depth of tree > h2. Partitioning Features > Algorithm sketch: > * Each worker stores: > ** a subset of columns (i.e., a subset of features). If a worker stores > feature j, then the worker stores the feature value for all instances (i.e., > the whole column). > ** all labels > * Train one level at a time. 
> * Invariants: > ** Each worker stores a mapping: instance → node in current level > * On each iteration: > ** Each worker: For each node in level, compute (best feature to split, info > gain). > ** Reduce (P x M) values to M values to find best split for each node. > ** Workers who have features used in best splits communicate left/right for > relevant instances. Gather total of N bits to master, then broadcast. > * Total communication: > ** Depth D iterations > ** On each iteration, reduce to M values (~8 bytes each), broadcast N values > (1 bit each). > ** Estimate: D * (M * 8 + N) > h2. Partitioning Instances > Algorithm sketch: > * Train one group of nodes at a time. > * Invariants: > * Each worker stores a mapping: instance → node > * On each iteration: > ** Each worker: For each instance, add to aggregate statistics. > ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) > *** (“# classes” is for classification. 3 for regression) > ** Reduce aggregate. > ** Master chooses best split for each node in group and broadcasts. > * Local training: Once all instances for a node fit on one machine, it can be > best to shuffle data and training subtrees locally. This can mean shuffling > the entire dataset for each tree trained. > * Summing over all iterations, reduce to total of: > ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) > ** Estimate: 2^D * M * B * C * 8 > h2. 
Comparing Partitioning Methods > Partitioning features cost < partitioning instances cost when: > * D * (M * 8 + N) < 2^D * M * B * C * 8 > * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the > right hand side) > * N < [ 2^D * M * B * C * 8 ] / D > Example: many instances: > * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = > 5) > * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 > * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
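The two cost estimates above can be reproduced in a few lines. The figures mirror the worked example in the description (note the description counts 6 levels for a depth-5 tree, so the feature-partitioning formula uses 6 iterations while the instance-partitioning formula uses 2^5 nodes):

```python
def feature_partition_cost(levels, num_features, num_instances):
    # Each of the D iterations reduces M ~8-byte values and
    # broadcasts N single-bit values: D * (M * 8 + N).
    return levels * (num_features * 8 + num_instances)

def instance_partition_cost(depth, num_features, num_bins, num_classes):
    # Roughly 2^D nodes in the tree, each aggregating
    # M x B x C ~8-byte statistics: 2^D * M * B * C * 8.
    return 2 ** depth * num_features * num_bins * num_classes * 8

# Example: 2M instances, 3500 features, 100 bins, 5 classes, depth 5.
print(feature_partition_cost(6, 3500, 2_000_000))   # ~1.2e7
print(instance_partition_cost(5, 3500, 100, 5))     # ~4.5e8
```

With many instances and deep trees the feature-partitioned estimate is more than an order of magnitude cheaper, which is the argument the JIRA makes.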
[jira] [Commented] (SPARK-16719) RandomForest: communicate fewer trees on each iteration
[ https://issues.apache.org/jira/browse/SPARK-16719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392794#comment-15392794 ] Joseph K. Bradley commented on SPARK-16719: --- I've found (for my tests) that this is one of the most important issues when training big forests. > RandomForest: communicate fewer trees on each iteration > --- > > Key: SPARK-16719 > URL: https://issues.apache.org/jira/browse/SPARK-16719 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > RandomForest currently sends the entire forest to each worker on each > iteration. This is because (a) the node queue is FIFO and (b) the closure > references the entire array of trees ({{topNodes}}). (a) causes RFs to > handle splits in many trees, especially early on in learning. (b) sends all > trees explicitly. > Proposal: > (a) Change the RF node queue to be FILO, so that RFs tend to focus on 1 or a > few trees before focusing on others. > (b) Change topNodes to pass only the trees required on that iteration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16719) RandomForest: communicate fewer trees on each iteration
[ https://issues.apache.org/jira/browse/SPARK-16719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16719: -- Priority: Critical (was: Major) > RandomForest: communicate fewer trees on each iteration > --- > > Key: SPARK-16719 > URL: https://issues.apache.org/jira/browse/SPARK-16719 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > RandomForest currently sends the entire forest to each worker on each > iteration. This is because (a) the node queue is FIFO and (b) the closure > references the entire array of trees ({{topNodes}}). (a) causes RFs to > handle splits in many trees, especially early on in learning. (b) sends all > trees explicitly. > Proposal: > (a) Change the RF node queue to be FILO, so that RFs tend to focus on 1 or a > few trees before focusing on others. > (b) Change topNodes to pass only the trees required on that iteration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
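The effect of change (a) can be seen with a toy scheduler. With a FIFO queue, the root nodes of every tree are handled before any tree's children, so early iterations touch all trees at once; with a FILO (LIFO) stack, work stays on the most recently split tree until it is finished. The tree/depth labels below are illustrative, not Spark's actual queue entries:

```python
from collections import deque

def process_order(roots, lifo):
    """Return the order (tree, depth) nodes are scheduled, where each
    node spawns one child per level down to depth 2."""
    pending = deque(roots)
    order = []
    while pending:
        tree, depth = pending.pop() if lifo else pending.popleft()
        order.append((tree, depth))
        if depth < 2:
            pending.append((tree, depth + 1))
    return order

roots = [(0, 0), (1, 0), (2, 0)]
print(process_order(roots, lifo=False))  # FIFO: interleaves all three trees
print(process_order(roots, lifo=True))   # LIFO: finishes tree 2, then 1, then 0
```

Under LIFO, only the trees currently near the top of the stack need to be shipped to workers on a given iteration, which is what makes sending a subset of `topNodes` possible.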
[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392792#comment-15392792 ] Joseph K. Bradley commented on SPARK-3162: -- [~yuhaoyan] [~MechCoder] This may be one of the most critical improvements for scaling trees. > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3162: - Priority: Critical (was: Major) > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392788#comment-15392788 ] Joseph K. Bradley commented on SPARK-3162: -- Linking [SPARK-14043] since local training might be a good way to solve [SPARK-14043]. Local training might not need indices. > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
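The switch to local training described above reduces to a size check per node: once the data matched with a node fits in one worker's memory, shuffle it there and finish the subtree locally. A sketch of that check — the per-value byte cost and memory budget here are illustrative assumptions, not Spark's actual values:

```python
def can_train_locally(num_instances, num_features,
                      bytes_per_value=8,
                      worker_memory_bytes=512 * 1024**2):
    """Heuristic: the node's slice of the dataset must fit on one worker."""
    estimated = num_instances * num_features * bytes_per_value
    return estimated <= worker_memory_bytes

# A node matched with 100k instances of 3500 features (~2.8 GB) stays
# distributed; one matched with 10k instances (~280 MB) could go local.
print(can_train_locally(100_000, 3500))  # -> False
print(can_train_locally(10_000, 3500))   # -> True
```

As the description notes, this threshold is crossed at different depths in different branches, so a real implementation must either wait until all remaining nodes qualify (option 1) or interleave local and distributed training (option 2).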
[jira] [Updated] (SPARK-16719) RandomForest: communicate fewer trees on each iteration
[ https://issues.apache.org/jira/browse/SPARK-16719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16719: -- Priority: Critical (was: Major) > RandomForest: communicate fewer trees on each iteration > --- > > Key: SPARK-16719 > URL: https://issues.apache.org/jira/browse/SPARK-16719 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > RandomForest currently sends the entire forest to each worker on each > iteration. This is because (a) the node queue is FIFO and (b) the closure > references the entire array of trees ({{topNodes}}). (a) causes RFs to > handle splits in many trees, especially early on in learning. (b) sends all > trees explicitly. > Proposal: > (a) Change the RF node queue to be FILO, so that RFs tend to focus on 1 or a > few trees before focusing on others. > (b) Change topNodes to pass only the trees required on that iteration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16719) RandomForest: communicate fewer trees on each iteration
[ https://issues.apache.org/jira/browse/SPARK-16719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-16719: -- Priority: Major (was: Critical) > RandomForest: communicate fewer trees on each iteration > --- > > Key: SPARK-16719 > URL: https://issues.apache.org/jira/browse/SPARK-16719 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > RandomForest currently sends the entire forest to each worker on each > iteration. This is because (a) the node queue is FIFO and (b) the closure > references the entire array of trees ({{topNodes}}). (a) causes RFs to > handle splits in many trees, especially early on in learning. (b) sends all > trees explicitly. > Proposal: > (a) Change the RF node queue to be FILO, so that RFs tend to focus on 1 or a > few trees before focusing on others. > (b) Change topNodes to pass only the trees required on that iteration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16714) Fail to create a decimal array with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392780#comment-15392780 ] Dongjoon Hyun commented on SPARK-16714: --- Thank you! :) > Fail to create a decimal arrays with literals having different inferred > precessions and scales > -- > > Key: SPARK-16714 > URL: https://issues.apache.org/jira/browse/SPARK-16714 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > > {code} > select array(0.001, 0.02) > {code} > causes > {code} > org.apache.spark.sql.AnalysisException: cannot resolve 'array(CAST(0.001 AS > DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))' due to data type mismatch: input > to function array should all be the same type, but it's [decimal(3,3), > decimal(2,2)]; line 1 pos 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13434) Reduce Spark RandomForest memory footprint
[ https://issues.apache.org/jira/browse/SPARK-13434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392779#comment-15392779 ] Joseph K. Bradley commented on SPARK-13434: --- I agree it's very important. That JIRA had gotten lost for a while, but it is now linked from the umbrella: [SPARK-3162] > Reduce Spark RandomForest memory footprint > -- > > Key: SPARK-13434 > URL: https://issues.apache.org/jira/browse/SPARK-13434 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.6.0 > Environment: Linux >Reporter: Ewan Higgs > Labels: decisiontree, mllib, randomforest > Attachments: heap-usage.log, rf-heap-usage.png > > > The RandomForest implementation can easily run out of memory on moderate > datasets. This was raised in the a user's benchmarking game on github > (https://github.com/szilard/benchm-ml/issues/19). I looked to see if there > was a tracking issue, but I couldn't fine one. > Using Spark 1.6, a user of mine is running into problems running the > RandomForest training on largish datasets on machines with 64G memory and the > following in {{spark-defaults.conf}}: > {code} > spark.executor.cores 2 > spark.executor.instances 199 > spark.executor.memory 10240M > {code} > I reproduced the excessive memory use from the benchmark example (using an > input CSV of 1.3G and 686 columns) in spark shell with {{spark-shell > --driver-memory 30G --executor-memory 30G}} and have a heap profile from a > single machine by running {{jmap -histo:live }}. 
I took a sample > every 5 seconds and at the peak it looks like this: > {code} > num #instances #bytes class name > -- >1: 5428073 8458773496 [D >2: 12293653 4124641992 [I >3: 32508964 1820501984 org.apache.spark.mllib.tree.model.Node >4: 53068426 1698189632 org.apache.spark.mllib.tree.model.Predict >5: 72853787 1165660592 scala.Some >6: 16263408 910750848 > org.apache.spark.mllib.tree.model.InformationGainStats >7: 72969 390492744 [B >8: 3327008 133080320 > org.apache.spark.mllib.tree.impl.DTStatsAggregator >9: 3754500 120144000 > scala.collection.immutable.HashMap$HashMap1 > 10: 3318349 106187168 org.apache.spark.mllib.tree.model.Split > 11: 3534946 84838704 > org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo > 12: 3764745 60235920 java.lang.Integer > 13: 3327008 53232128 > org.apache.spark.mllib.tree.impurity.EntropyAggregator > 14:380804 45361144 [C > 15:268887 34877128 > 16:268887 34431568 > 17:908377 34042760 [Lscala.collection.immutable.HashMap; > 18: 110 2640 > org.apache.spark.mllib.regression.LabeledPoint > 19: 110 2640 org.apache.spark.mllib.linalg.SparseVector > 20: 20206 25979864 > 21: 100 2400 org.apache.spark.mllib.tree.impl.TreePoint > 22: 100 2400 > org.apache.spark.mllib.tree.impl.BaggedPoint > 23:908332 21799968 > scala.collection.immutable.HashMap$HashTrieMap > 24: 20206 20158864 > 25: 17023 14380352 > 26:16 13308288 > [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator; > 27:445797 10699128 scala.Tuple2 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
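The histogram itself shows where the memory goes: dividing bytes by instance counts gives the per-object cost of the tree model classes, and the top three entries alone account for most of the heap. These figures are taken directly from the jmap output above:

```python
# Per-object sizes implied by the jmap histogram above.
node_bytes, node_count = 1_820_501_984, 32_508_964        # tree.model.Node
predict_bytes, predict_count = 1_698_189_632, 53_068_426  # tree.model.Predict

print(node_bytes // node_count)        # 56 bytes per Node
print(predict_bytes // predict_count)  # 32 bytes per Predict

# The top three classes ([D, [I, Node) account for ~13.4 GiB by themselves:
top3 = 8_458_773_496 + 4_124_641_992 + 1_820_501_984
print(round(top3 / 1024**3, 1))
```

With ~32.5 million `Node` objects and ~53 million `Predict` objects live at the peak, the per-object overhead of the model representation (not the training data) dominates the footprint, which is why a flatter node encoding reduces memory use.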
[jira] [Updated] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3162: - Component/s: (was: MLlib) ML > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392775#comment-15392775 ] Joseph K. Bradley commented on SPARK-3717: -- Note: Initial code for this is available here: [https://spark-packages.org/package/fabuzaid21/yggdrasil] > DecisionTree, RandomForest: Partition by feature > > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > h1. Summary > Currently, data are partitioned by row/instance for DecisionTree and > RandomForest. This JIRA argues for partitioning by feature for training deep > trees. This is especially relevant for random forests, which are often > trained to be deeper than single decision trees. > h1. Details > Dataset dimensions and the depth of the tree to be trained are the main > problem parameters determining whether it is better to partition features or > instances. For random forests (training many deep trees), partitioning > features could be much better. > Notation: > * P = # workers > * N = # instances > * M = # features > * D = depth of tree > h2. Partitioning Features > Algorithm sketch: > * Each worker stores: > ** a subset of columns (i.e., a subset of features). If a worker stores > feature j, then the worker stores the feature value for all instances (i.e., > the whole column). > ** all labels > * Train one level at a time. > * Invariants: > ** Each worker stores a mapping: instance → node in current level > * On each iteration: > ** Each worker: For each node in level, compute (best feature to split, info > gain). > ** Reduce (P x M) values to M values to find best split for each node. > ** Workers who have features used in best splits communicate left/right for > relevant instances. Gather total of N bits to master, then broadcast. 
> * Total communication: > ** Depth D iterations > ** On each iteration, reduce to M values (~8 bytes each), broadcast N values > (1 bit each). > ** Estimate: D * (M * 8 + N) > h2. Partitioning Instances > Algorithm sketch: > * Train one group of nodes at a time. > * Invariants: > * Each worker stores a mapping: instance → node > * On each iteration: > ** Each worker: For each instance, add to aggregate statistics. > ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) > *** (“# classes” is for classification. 3 for regression) > ** Reduce aggregate. > ** Master chooses best split for each node in group and broadcasts. > * Local training: Once all instances for a node fit on one machine, it can be > best to shuffle data and training subtrees locally. This can mean shuffling > the entire dataset for each tree trained. > * Summing over all iterations, reduce to total of: > ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) > ** Estimate: 2^D * M * B * C * 8 > h2. Comparing Partitioning Methods > Partitioning features cost < partitioning instances cost when: > * D * (M * 8 + N) < 2^D * M * B * C * 8 > * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the > right hand side) > * N < [ 2^D * M * B * C * 8 ] / D > Example: many instances: > * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = > 5) > * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 > * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3717: - Assignee: (was: Joseph K. Bradley) > DecisionTree, RandomForest: Partition by feature > > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley > > h1. Summary > Currently, data are partitioned by row/instance for DecisionTree and > RandomForest. This JIRA argues for partitioning by feature for training deep > trees. This is especially relevant for random forests, which are often > trained to be deeper than single decision trees. > h1. Details > Dataset dimensions and the depth of the tree to be trained are the main > problem parameters determining whether it is better to partition features or > instances. For random forests (training many deep trees), partitioning > features could be much better. > Notation: > * P = # workers > * N = # instances > * M = # features > * D = depth of tree > h2. Partitioning Features > Algorithm sketch: > * Each worker stores: > ** a subset of columns (i.e., a subset of features). If a worker stores > feature j, then the worker stores the feature value for all instances (i.e., > the whole column). > ** all labels > * Train one level at a time. > * Invariants: > ** Each worker stores a mapping: instance → node in current level > * On each iteration: > ** Each worker: For each node in level, compute (best feature to split, info > gain). > ** Reduce (P x M) values to M values to find best split for each node. > ** Workers who have features used in best splits communicate left/right for > relevant instances. Gather total of N bits to master, then broadcast. > * Total communication: > ** Depth D iterations > ** On each iteration, reduce to M values (~8 bytes each), broadcast N values > (1 bit each). > ** Estimate: D * (M * 8 + N) > h2. 
Partitioning Instances > Algorithm sketch: > * Train one group of nodes at a time. > * Invariants: > * Each worker stores a mapping: instance → node > * On each iteration: > ** Each worker: For each instance, add to aggregate statistics. > ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) > *** (“# classes” is for classification. 3 for regression) > ** Reduce aggregate. > ** Master chooses best split for each node in group and broadcasts. > * Local training: Once all instances for a node fit on one machine, it can be > best to shuffle data and training subtrees locally. This can mean shuffling > the entire dataset for each tree trained. > * Summing over all iterations, reduce to total of: > ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) > ** Estimate: 2^D * M * B * C * 8 > h2. Comparing Partitioning Methods > Partitioning features cost < partitioning instances cost when: > * D * (M * 8 + N) < 2^D * M * B * C * 8 > * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the > right hand side) > * N < [ 2^D * M * B * C * 8 ] / D > Example: many instances: > * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = > 5) > * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 > * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3155) Support DecisionTree pruning
[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3155: - Priority: Minor (was: Major) > Support DecisionTree pruning > > > Key: SPARK-3155 > URL: https://issues.apache.org/jira/browse/SPARK-3155 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Improvement: accuracy, computation > Summary: Pruning is a common method for preventing overfitting with decision > trees. A smart implementation can prune the tree during training in order to > avoid training parts of the tree which would be pruned eventually anyways. > DecisionTree does not currently support pruning. > Pruning: A “pruning” of a tree is a subtree with the same root node, but > with zero or more branches removed. > A naive implementation prunes as follows: > (1) Train a depth K tree using a training set. > (2) Compute the optimal prediction at each node (including internal nodes) > based on the training set. > (3) Take a held-out validation set, and use the tree to make predictions for > each validation example. This allows one to compute the validation error > made at each node in the tree (based on the predictions computed in step (2).) > (4) For each pair of leafs with the same parent, compare the total error on > the validation set made by the leafs’ predictions with the error made by the > parent’s predictions. Remove the leafs if the parent has lower error. > A smarter implementation prunes during training, computing the error on the > validation set made by each node as it is trained. Whenever two children > increase the validation error, they are pruned, and no more training is > required on that branch. > It is common to use about 1/3 of the data for pruning. Note that pruning is > important when using a tree directly for prediction. It is less important > when combining trees via ensemble methods. 
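The naive pruning procedure in the description can be sketched directly. The toy tree below is illustrative: `err` holds the number of validation errors a node's own prediction would make on the examples reaching it (steps (2)–(3)), and step (4) collapses a leaf pair whenever the parent does no worse than its children combined:

```python
class Node:
    def __init__(self, err, left=None, right=None):
        self.err = err      # validation errors of this node's own prediction
        self.left = left
        self.right = right

def prune(node):
    """Reduced-error pruning: remove a pair of leaves when the parent's
    prediction makes no more validation errors than the two leaves."""
    if node.left is None:            # leaf: nothing to prune
        return node
    node.left = prune(node.left)     # prune bottom-up so newly created
    node.right = prune(node.right)   # leaves can be pruned in turn
    if (node.left.left is None and node.right.left is None
            and node.err <= node.left.err + node.right.err):
        node.left = node.right = None    # step (4): drop the leaf pair
    return node

# Parent makes 2 errors itself; its leaves make 1 + 2 = 3 -> prune them.
tree = Node(err=2, left=Node(err=1), right=Node(err=2))
prune(tree)
print(tree.left is None)  # -> True
```

The "smarter" variant in the description performs the same comparison during training, so a branch whose children already increase validation error is never grown further.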
[jira] [Commented] (SPARK-16714) Fail to create a decimal array with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392772#comment-15392772 ] Yin Huai commented on SPARK-16714: -- Sure. Thank you! > Fail to create a decimal arrays with literals having different inferred > precessions and scales > -- > > Key: SPARK-16714 > URL: https://issues.apache.org/jira/browse/SPARK-16714 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Yin Huai > > In Spark 2.0, we will parse float literals as decimals. However, it > introduces a side-effect, which is described below. > > {code} > select array(0.001, 0.02) > {code} > causes > {code} > org.apache.spark.sql.AnalysisException: cannot resolve 'array(CAST(0.001 AS > DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))' due to data type mismatch: input > to function array should all be the same type, but it's [decimal(3,3), > decimal(2,2)]; line 1 pos 7 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3728) RandomForest: Learn models too large to store in memory
[ https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392769#comment-15392769 ]

Joseph K. Bradley commented on SPARK-3728:
------------------------------------------

I was actually thinking of closing this issue. I originally made it since Sequoia Forests support this feature, but I have not heard of real use cases for it. If you have use cases, it'd be good to hear about them. Otherwise, I think we should focus on improvements to in-memory use cases.

> RandomForest: Learn models too large to store in memory
> --------------------------------------------------------
>
>                 Key: SPARK-3728
>                 URL: https://issues.apache.org/jira/browse/SPARK-3728
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> Proposal: Write trees to disk as they are learned.
> RandomForest currently uses a FIFO queue, which means training all trees at once via breadth-first search. Using a FILO queue would encourage the code to finish one tree before moving on to new ones. This would allow the code to write trees to disk as they are learned.
> Note: It would also be possible to write nodes to disk as they are learned using a FIFO queue, once the example--node mapping is cached [JIRA]. The [Sequoia Forest package]() does this. However, it could be useful to learn trees progressively, so that future functionality such as early stopping (training fewer trees than expected) could be supported.
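The FIFO-vs-FILO point in the proposal can be shown with a small simulation of node-training order (plain Python, purely illustrative; this is not RandomForest code): with a LIFO stack, the first tree popped is trained to completion before most other work, so it could be flushed to disk early, while a FIFO queue interleaves all trees level by level so no tree finishes until near the end.

```python
from collections import deque

def training_order(num_trees, depth, lifo):
    """Simulate the order in which (tree, node) tasks are processed.
    Each tree is a full binary tree of the given depth; nodes use
    heap numbering 1..2**(depth+1)-1."""
    queue = deque((t, 1) for t in range(num_trees))  # roots of all trees
    order = []
    while queue:
        tree, node = queue.pop() if lifo else queue.popleft()
        order.append((tree, node))
        if node < 2 ** depth:  # internal node: enqueue its children
            queue.append((tree, 2 * node))
            queue.append((tree, 2 * node + 1))
    return order

def first_tree_finished_at(order, num_trees):
    """Index in the processing order at which some tree is fully trained
    (i.e. could be written to disk and evicted from memory)."""
    per_tree = len(order) // num_trees
    seen = {t: 0 for t in range(num_trees)}
    for i, (tree, _) in enumerate(order):
        seen[tree] += 1
        if seen[tree] == per_tree:
            return i
    return len(order)
```

With 3 trees of depth 2 (7 nodes each), LIFO finishes its first tree after 7 steps, while FIFO needs 13 steps before any tree is complete.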
[jira] [Updated] (SPARK-3728) RandomForest: Learn models too large to store in memory
[ https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-3728:
-------------------------------------
    Priority: Minor  (was: Major)

> RandomForest: Learn models too large to store in memory
> --------------------------------------------------------
>
>                 Key: SPARK-3728
>                 URL: https://issues.apache.org/jira/browse/SPARK-3728
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Proposal: Write trees to disk as they are learned.
> RandomForest currently uses a FIFO queue, which means training all trees at once via breadth-first search. Using a FILO queue would encourage the code to finish one tree before moving on to new ones. This would allow the code to write trees to disk as they are learned.
> Note: It would also be possible to write nodes to disk as they are learned using a FIFO queue, once the example--node mapping is cached [JIRA]. The [Sequoia Forest package]() does this. However, it could be useful to learn trees progressively, so that future functionality such as early stopping (training fewer trees than expected) could be supported.
[jira] [Commented] (SPARK-16714) Fail to create decimal arrays with literals having different inferred precisions and scales
[ https://issues.apache.org/jira/browse/SPARK-16714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392765#comment-15392765 ]

Dongjoon Hyun commented on SPARK-16714:
---------------------------------------

Hi, [~yhuai]. May I create a PR for this?

> Fail to create decimal arrays with literals having different inferred precisions and scales
> --------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16714
>                 URL: https://issues.apache.org/jira/browse/SPARK-16714
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>            Reporter: Yin Huai
>
> In Spark 2.0, we will parse float literals as decimals. However, this introduces a side effect, which is described below.
> {code}
> select array(0.001, 0.02)
> {code}
> causes
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 'array(CAST(0.001 AS DECIMAL(3,3)), CAST(0.02 AS DECIMAL(2,2)))' due to data type mismatch: input to function array should all be the same type, but it's [decimal(3,3), decimal(2,2)]; line 1 pos 7
> {code}
[jira] [Commented] (SPARK-16628) OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files
[ https://issues.apache.org/jira/browse/SPARK-16628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392751#comment-15392751 ]

Nic Eggert commented on SPARK-16628:
------------------------------------

Yeah, I attempted to fix this myself by having it just take the schema from the metastore instead of the file, but that doesn't work, because you're then trying to read the file using the wrong schema. I think you'd probably need to make some sort of translation map. That's about the point where I realized I was in over my head.

> OrcConversions should not convert an ORC table represented by MetastoreRelation to HadoopFsRelation if metastore schema does not match schema stored in ORC files
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-16628
>                 URL: https://issues.apache.org/jira/browse/SPARK-16628
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>
> When {{spark.sql.hive.convertMetastoreOrc}} is enabled, we will convert an ORC table represented by a MetastoreRelation to a HadoopFsRelation that uses Spark's OrcFileFormat internally. This conversion aims to make table scanning faster, since at runtime the code path for scanning a HadoopFsRelation performs better. However, OrcFileFormat's implementation is based on the assumption that ORC files store their schema with correct column names, and before Hive 2.0, an ORC table created by Hive does not store column names correctly in the ORC files (HIVE-4243). So, for this kind of ORC dataset, we cannot really convert the code path.
> Right now, if ORC tables are created by Hive 1.x or 0.x, enabling {{spark.sql.hive.convertMetastoreOrc}} will introduce a runtime exception for non-partitioned ORC tables and drop the metastore schema for partitioned ORC tables.
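The "translation map" Nic mentions could, for instance, map by position: ORC files written by pre-2.0 Hive store placeholder column names (_col0, _col1, ...; see HIVE-4243), and only position lines them up with the metastore schema. A rough Python sketch of that idea (illustrative only; not how Spark's Scala code is structured):

```python
import re

# Pre-Hive-2.0 ORC files name columns _col0, _col1, ... instead of the real names.
_PLACEHOLDER = re.compile(r"_col(\d+)$")

def positional_schema_map(file_columns, metastore_columns):
    """Map ORC-file column names to metastore names by position, but only
    when the file schema consists entirely of Hive's placeholder names.
    Returns None when the file already stores real names, so the caller
    can match by name as usual."""
    if len(file_columns) != len(metastore_columns):
        raise ValueError("schema length mismatch between ORC file and metastore")
    if not all(_PLACEHOLDER.match(c) for c in file_columns):
        return None  # real names present: match by name instead
    return dict(zip(file_columns, metastore_columns))
```

This avoids the failure mode described in the comment: the file is still read with its own (placeholder) schema, and the translation is applied afterwards, rather than forcing the metastore schema onto the reader.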
[jira] [Updated] (SPARK-16166) Correctly honor off heap memory usage in web ui and log display
[ https://issues.apache.org/jira/browse/SPARK-16166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-16166:
-------------------------------
    Assignee: Saisai Shao

> Correctly honor off heap memory usage in web ui and log display
> ----------------------------------------------------------------
>
>                 Key: SPARK-16166
>                 URL: https://issues.apache.org/jira/browse/SPARK-16166
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>            Priority: Minor
>             Fix For: 2.1.0
>
> Currently in the log and UI display, only on-heap storage memory is calculated and displayed. Since SPARK-13992, off-heap memory is also supported for data persistence, so this change makes the display honor off-heap storage memory as well.
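A sketch of what "also honor off-heap storage memory" means for the display: sum the on-heap and off-heap figures before formatting the used/total line. Plain Python with hypothetical names; the actual change lives in Spark's Scala UI and logging code.

```python
def format_storage_memory(on_heap_used, on_heap_total,
                          off_heap_used=0, off_heap_total=0):
    """Format a 'used / total' storage-memory line from byte counts.
    Before this change, only the on-heap pair was shown; here the
    off-heap figures are added in as well."""
    used = on_heap_used + off_heap_used
    total = on_heap_total + off_heap_total

    def _fmt(n):
        # Scale bytes to a human-readable unit.
        for unit in ("B", "KiB", "MiB"):
            if n < 1024:
                return f"{n:.0f} {unit}" if unit == "B" else f"{n:.1f} {unit}"
            n /= 1024
        return f"{n:.1f} GiB"

    return f"{_fmt(used)} / {_fmt(total)}"
```

For example, 512 MiB on-heap plus 512 MiB off-heap in use reports 1.0 GiB used, instead of silently under-reporting by the off-heap half.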