[jira] [Updated] (SPARK-17517) Improve generated Code for BroadcastHashJoinExec
[ https://issues.apache.org/jira/browse/SPARK-17517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-17517: - Target Version/s: (was: 2.1.0) External issue ID: (was: 12795) Fix Version/s: (was: 2.1.0) Description: For the current `BroadcastHashJoinExec`, when the join key is not unique we generate join code like this: {code:title=processNext.java|borderStyle=solid} while (matches.hasNext) { matched = matches.next check and read stream side row fields check and read build side row fields reset result row write stream side row fields to result row write build side row fields to result row append(result row) } {code} For some cases, we don't need to check/read/write the stream side repeatedly in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && all left side fields are fixed length`, and so on. We may generate the code as below: {code:title=processNext.java|borderStyle=solid} check and read stream side row fields reset result row write stream side row fields to result row while (matches.hasNext) { matched = matches.next check and read build side row fields write build side row fields to result row append(result row) } {code} was: For the current `BroadcastHashJoinExec`, when the join key is not unique we generate join code like this: {code:title=Bar.java|borderStyle=solid} while (matches.hasNext) { matched = matches.next check and read stream side row fields check and read build side row fields reset result row write stream side row fields to result row write build side row fields to result row append(result row) } {code} For some cases, we don't need to check/read/write the stream side repeatedly in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && all left side fields are fixed length`, and so on.
We may generate the code as below: ```java check and read stream side row fields reset result row write stream side row fields to result row while (matches.hasNext) { matched = matches.next check and read build side row fields write build side row fields to result row append(result row) } ``` > Improve generated Code for BroadcastHashJoinExec > > > Key: SPARK-17517 > URL: https://issues.apache.org/jira/browse/SPARK-17517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kent Yao > > For the current `BroadcastHashJoinExec`, when the join key is not > unique we generate join code like this: > {code:title=processNext.java|borderStyle=solid} > while (matches.hasNext) { > matched = matches.next > check and read stream side row fields > check and read build side row fields > reset result row > write stream side row fields to result row > write build side row fields to result row > append(result row) > } > {code} > For some cases, we don't need to check/read/write the stream side repeatedly > in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && > all left side fields are fixed length`, and so on. We may generate the code as > below: > {code:title=processNext.java|borderStyle=solid} > check and read stream side row fields > reset result row > write stream side row fields to result row > while (matches.hasNext) > { > matched = matches.next > check and read build side row fields > write build side row fields to result row > append(result row) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
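The loop restructuring proposed above can be sketched in plain Java (a minimal sketch with invented names, not Spark's actual generated code): for join variants where the stream-side columns cannot change per match, the stream-side read/write is loop-invariant for each stream row and can be hoisted out of the match loop, leaving only the build-side work inside it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch of the proposed codegen shape (not Spark's real output):
// the stream-side write happens once per stream row; only the build side is
// read and written per match.
public class HoistedJoinSketch {

    // Stand-in for the generated result-row buffer.
    static final class ResultRow {
        long streamField;
        long buildField;

        ResultRow copyOf() {
            ResultRow r = new ResultRow();
            r.streamField = streamField;
            r.buildField = buildField;
            return r;
        }
    }

    // Emits one joined row per build-side match for a single stream row.
    static List<ResultRow> joinOneStreamRow(long streamField, Iterator<Long> matches) {
        List<ResultRow> out = new ArrayList<>();
        ResultRow result = new ResultRow();
        // Hoisted: "check and read stream side row fields" and
        // "write stream side row fields to result row" run once.
        result.streamField = streamField;
        while (matches.hasNext()) {
            long matched = matches.next();
            // Per match: only the build side changes.
            result.buildField = matched;
            out.add(result.copyOf()); // append(result row)
        }
        return out;
    }
}
```

The same output is produced either way; the point is purely to avoid repeating the loop-invariant stream-side work on every match.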
[jira] [Assigned] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile
[ https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17123: Assignee: (was: Apache Spark) > Performing set operations that combine string and date / timestamp columns > may result in generated projection code which doesn't compile > > > Key: SPARK-17123 > URL: https://issues.apache.org/jira/browse/SPARK-17123 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Priority: Minor > > The following example program causes SpecificSafeProjection code generation > to produce Java code which doesn't compile: > {code} > import org.apache.spark.sql.types._ > spark.sql("set spark.sql.codegen.fallback=false") > val dateDF = spark.createDataFrame(sc.parallelize(Seq(Row(new > java.sql.Date(0)))), StructType(StructField("value", DateType) :: Nil)) > val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF > dateDF.union(longDF).collect() > {code} > This fails at runtime with the following error: > {code} > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 28, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates > are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private org.apache.spark.sql.types.StructType schema; > /* 011 */ > /* 012 */ > /* 013 */ public SpecificSafeProjection(Object[] references) { > /* 014 */ this.references = references; > /*
015 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 016 */ > /* 017 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 018 */ } > /* 019 */ > /* 020 */ public java.lang.Object apply(java.lang.Object _i) { > /* 021 */ InternalRow i = (InternalRow) _i; > /* 022 */ > /* 023 */ values = new Object[1]; > /* 024 */ > /* 025 */ boolean isNull2 = i.isNullAt(0); > /* 026 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 027 */ boolean isNull1 = isNull2; > /* 028 */ final java.sql.Date value1 = isNull1 ? null : > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2); > /* 029 */ isNull1 = value1 == null; > /* 030 */ if (isNull1) { > /* 031 */ values[0] = null; > /* 032 */ } else { > /* 033 */ values[0] = value1; > /* 034 */ } > /* 035 */ > /* 036 */ final org.apache.spark.sql.Row value = new > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, > schema); > /* 037 */ if (false) { > /* 038 */ mutableRow.setNullAt(0); > /* 039 */ } else { > /* 040 */ > /* 041 */ mutableRow.update(0, value); > /* 042 */ } > /* 043 */ > /* 044 */ return mutableRow; > /* 045 */ } > /* 046 */ } > {code} > Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the > generated code tries to call it with a UTF8String while the method expects an > int instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
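For context on the type mismatch reported above: Catalyst stores DATE values internally as an int counting days since the Unix epoch, which is why toJavaDate takes an int. A simplified stand-in (ignoring the time-zone handling the real DateTimeUtils performs) makes clear why passing a UTF8String can never typecheck:

```java
import java.sql.Date;
import java.util.concurrent.TimeUnit;

// Simplified stand-in for DateTimeUtils.toJavaDate(int): the conversion is
// arithmetic on a days-since-epoch int, not a string parse, so the generated
// call with a UTF8String argument cannot compile.
public class ToJavaDateSketch {
    static Date toJavaDate(int daysSinceEpoch) {
        // java.sql.Date takes milliseconds since the epoch.
        return new Date(TimeUnit.DAYS.toMillis(daysSinceEpoch));
    }
}
```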
[jira] [Commented] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile
[ https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486354#comment-15486354 ] Apache Spark commented on SPARK-17123: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/15072 > Performing set operations that combine string and date / timestamp columns > may result in generated projection code which doesn't compile > > > Key: SPARK-17123 > URL: https://issues.apache.org/jira/browse/SPARK-17123 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Priority: Minor > > The following example program causes SpecificSafeProjection code generation > to produce Java code which doesn't compile: > {code} > import org.apache.spark.sql.types._ > spark.sql("set spark.sql.codegen.fallback=false") > val dateDF = spark.createDataFrame(sc.parallelize(Seq(Row(new > java.sql.Date(0, StructType(StructField("value", DateType) :: Nil)) > val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF > dateDF.union(longDF).collect() > {code} > This fails at runtime with the following error: > {code} > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 28, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates > are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private org.apache.spark.sql.types.StructType schema; > /* 011 */ > /* 012 */ > 
/* 013 */ public SpecificSafeProjection(Object[] references) { > /* 014 */ this.references = references; > /* 015 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 016 */ > /* 017 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 018 */ } > /* 019 */ > /* 020 */ public java.lang.Object apply(java.lang.Object _i) { > /* 021 */ InternalRow i = (InternalRow) _i; > /* 022 */ > /* 023 */ values = new Object[1]; > /* 024 */ > /* 025 */ boolean isNull2 = i.isNullAt(0); > /* 026 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 027 */ boolean isNull1 = isNull2; > /* 028 */ final java.sql.Date value1 = isNull1 ? null : > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2); > /* 029 */ isNull1 = value1 == null; > /* 030 */ if (isNull1) { > /* 031 */ values[0] = null; > /* 032 */ } else { > /* 033 */ values[0] = value1; > /* 034 */ } > /* 035 */ > /* 036 */ final org.apache.spark.sql.Row value = new > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, > schema); > /* 037 */ if (false) { > /* 038 */ mutableRow.setNullAt(0); > /* 039 */ } else { > /* 040 */ > /* 041 */ mutableRow.update(0, value); > /* 042 */ } > /* 043 */ > /* 044 */ return mutableRow; > /* 045 */ } > /* 046 */ } > {code} > Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the > generated code tries to call it with a UTF8String while the method expects an > int instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile
[ https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17123: Assignee: Apache Spark > Performing set operations that combine string and date / timestamp columns > may result in generated projection code which doesn't compile > > > Key: SPARK-17123 > URL: https://issues.apache.org/jira/browse/SPARK-17123 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Minor > > The following example program causes SpecificSafeProjection code generation > to produce Java code which doesn't compile: > {code} > import org.apache.spark.sql.types._ > spark.sql("set spark.sql.codegen.fallback=false") > val dateDF = spark.createDataFrame(sc.parallelize(Seq(Row(new > java.sql.Date(0, StructType(StructField("value", DateType) :: Nil)) > val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF > dateDF.union(longDF).collect() > {code} > This fails at runtime with the following error: > {code} > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 28, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates > are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private org.apache.spark.sql.types.StructType schema; > /* 011 */ > /* 012 */ > /* 013 */ public SpecificSafeProjection(Object[] references) { > /* 014 */ this.references = 
references; > /* 015 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 016 */ > /* 017 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 018 */ } > /* 019 */ > /* 020 */ public java.lang.Object apply(java.lang.Object _i) { > /* 021 */ InternalRow i = (InternalRow) _i; > /* 022 */ > /* 023 */ values = new Object[1]; > /* 024 */ > /* 025 */ boolean isNull2 = i.isNullAt(0); > /* 026 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 027 */ boolean isNull1 = isNull2; > /* 028 */ final java.sql.Date value1 = isNull1 ? null : > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2); > /* 029 */ isNull1 = value1 == null; > /* 030 */ if (isNull1) { > /* 031 */ values[0] = null; > /* 032 */ } else { > /* 033 */ values[0] = value1; > /* 034 */ } > /* 035 */ > /* 036 */ final org.apache.spark.sql.Row value = new > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, > schema); > /* 037 */ if (false) { > /* 038 */ mutableRow.setNullAt(0); > /* 039 */ } else { > /* 040 */ > /* 041 */ mutableRow.update(0, value); > /* 042 */ } > /* 043 */ > /* 044 */ return mutableRow; > /* 045 */ } > /* 046 */ } > {code} > Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the > generated code tries to call it with a UTF8String while the method expects an > int instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17517) Improve generated Code for BroadcastHashJoinExec
[ https://issues.apache.org/jira/browse/SPARK-17517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-17517: - Description: For the current `BroadcastHashJoinExec`, when the join key is not unique we generate join code like this: {code:title=Bar.java|borderStyle=solid} while (matches.hasNext) { matched = matches.next check and read stream side row fields check and read build side row fields reset result row write stream side row fields to result row write build side row fields to result row append(result row) } {code} For some cases, we don't need to check/read/write the stream side repeatedly in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && all left side fields are fixed length`, and so on. We may generate the code as below: ```java check and read stream side row fields reset result row write stream side row fields to result row while (matches.hasNext) { matched = matches.next check and read build side row fields write build side row fields to result row append(result row) } ``` was: For the current `BroadcastHashJoinExec`, when the join key is not unique we generate join code like this: ```java while (matches.hasNext) matched = matches.next check and read stream side row fields check and read build side row fields reset result row write stream side row fields to result row write build side row fields to result row ``` For some cases, we don't need to check/read/write the stream side repeatedly in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && all left side fields are fixed length`, and so on.
We may generate the code as below: ```java check and read stream side row fields reset result row write stream side row fields to result row while (matches.hasNext) matched = matches.next check and read build side row fields write build side row fields to result row ``` > Improve generated Code for BroadcastHashJoinExec > > > Key: SPARK-17517 > URL: https://issues.apache.org/jira/browse/SPARK-17517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kent Yao > Fix For: 2.1.0 > > > For the current `BroadcastHashJoinExec`, when the join key is not > unique we generate join code like this: > {code:title=Bar.java|borderStyle=solid} > while (matches.hasNext) { > matched = matches.next > check and read stream side row fields > check and read build side row fields > reset result row > write stream side row fields to result row > write build side row fields to result row > append(result row) > } > {code} > For some cases, we don't need to check/read/write the stream side repeatedly > in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && > all left side fields are fixed length`, and so on. We may generate the code as > below: > ```java > check and read stream side row fields > reset result row > write stream side row fields to result row > while (matches.hasNext) > { > matched = matches.next > check and read build side row fields > write build side row fields to result row > append(result row) > } > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17517) Improve generated Code for BroadcastHashJoinExec
[ https://issues.apache.org/jira/browse/SPARK-17517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486335#comment-15486335 ] Apache Spark commented on SPARK-17517: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/15071 > Improve generated Code for BroadcastHashJoinExec > > > Key: SPARK-17517 > URL: https://issues.apache.org/jira/browse/SPARK-17517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kent Yao > Fix For: 2.1.0 > > > For current `BroadcastHashJoinExec`, we generate join code for key is not > unique like this: > ```java > while (matches.hasnext) > matched = matches.next > check and read stream side row fields > check and read build side row fieldes > reset result row > write stream side row fields to result row > write stream side row fields to result row > ``` > For some cases, we don't need to check/read/write the steam side repeatedly > in such while circle, e.g. `Inner Join with BuildRight`, or `BuildLeft && > all left side fields are fixed length` and so on. we may generate the code as > below: > ```java > check and read stream side row fields > reset result row > write stream side row fields to result row > while (matches.hasnext) > matched = matches.next > check and read build side row fieldes > write stream side row fields to result row > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17517) Improve generated Code for BroadcastHashJoinExec
[ https://issues.apache.org/jira/browse/SPARK-17517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17517: Assignee: (was: Apache Spark) > Improve generated Code for BroadcastHashJoinExec > > > Key: SPARK-17517 > URL: https://issues.apache.org/jira/browse/SPARK-17517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kent Yao > Fix For: 2.1.0 > > > For current `BroadcastHashJoinExec`, we generate join code for key is not > unique like this: > ```java > while (matches.hasnext) > matched = matches.next > check and read stream side row fields > check and read build side row fieldes > reset result row > write stream side row fields to result row > write stream side row fields to result row > ``` > For some cases, we don't need to check/read/write the steam side repeatedly > in such while circle, e.g. `Inner Join with BuildRight`, or `BuildLeft && > all left side fields are fixed length` and so on. we may generate the code as > below: > ```java > check and read stream side row fields > reset result row > write stream side row fields to result row > while (matches.hasnext) > matched = matches.next > check and read build side row fieldes > write stream side row fields to result row > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17517) Improve generated Code for BroadcastHashJoinExec
[ https://issues.apache.org/jira/browse/SPARK-17517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17517: Assignee: Apache Spark > Improve generated Code for BroadcastHashJoinExec > > > Key: SPARK-17517 > URL: https://issues.apache.org/jira/browse/SPARK-17517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kent Yao >Assignee: Apache Spark > Fix For: 2.1.0 > > > For current `BroadcastHashJoinExec`, we generate join code for key is not > unique like this: > ```java > while (matches.hasnext) > matched = matches.next > check and read stream side row fields > check and read build side row fieldes > reset result row > write stream side row fields to result row > write stream side row fields to result row > ``` > For some cases, we don't need to check/read/write the steam side repeatedly > in such while circle, e.g. `Inner Join with BuildRight`, or `BuildLeft && > all left side fields are fixed length` and so on. we may generate the code as > below: > ```java > check and read stream side row fields > reset result row > write stream side row fields to result row > while (matches.hasnext) > matched = matches.next > check and read build side row fieldes > write stream side row fields to result row > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17517) Improve generated Code for BroadcastHashJoinExec
Kent Yao created SPARK-17517: Summary: Improve generated Code for BroadcastHashJoinExec Key: SPARK-17517 URL: https://issues.apache.org/jira/browse/SPARK-17517 Project: Spark Issue Type: Improvement Components: SQL Reporter: Kent Yao Fix For: 2.1.0 For the current `BroadcastHashJoinExec`, when the join key is not unique we generate join code like this: ```java while (matches.hasNext) matched = matches.next check and read stream side row fields check and read build side row fields reset result row write stream side row fields to result row write build side row fields to result row ``` For some cases, we don't need to check/read/write the stream side repeatedly in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && all left side fields are fixed length`, and so on. We may generate the code as below: ```java check and read stream side row fields reset result row write stream side row fields to result row while (matches.hasNext) matched = matches.next check and read build side row fields write build side row fields to result row ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17502) Multiple Bugs in DDL Statements on Temporary Views
[ https://issues.apache.org/jira/browse/SPARK-17502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17502: Description: - When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for partition-related ALTER TABLE commands. However, it always reports a confusing error message. For example, {noformat} Partition spec is invalid. The spec (a, b) must match the partition spec () defined in table '`testview`'; {noformat} - When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for `ALTER TABLE ... UNSET TBLPROPERTIES`. However, it reports a missing table property instead. For example, {noformat} Attempted to unset non-existent property 'p' in table '`testView`'; {noformat} - When `ANALYZE TABLE` is called on a view or a temporary view, we should issue an error message. However, it reports a strange error: {noformat} ANALYZE TABLE is not supported for Project {noformat} was: - When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for partition-related ALTER TABLE commands. However, it always reports a confusing error message. For example, {noformat} Partition spec is invalid. The spec (a, b) must match the partition spec () defined in table '`testview`'; {noformat} - When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for `ALTER TABLE ... UNSET TBLPROPERTIES`. However, it reports a missing table property instead.
For example, {noformat} Attempted to unset non-existent property 'p' in table '`testView`'; {noformat} > Multiple Bugs in DDL Statements on Temporary Views > --- > > Key: SPARK-17502 > URL: https://issues.apache.org/jira/browse/SPARK-17502 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Xiao Li > > - When the permanent tables/views do not exist but the temporary view exists, > the expected error should be `NoSuchTableException` for partition-related > ALTER TABLE commands. However, it always reports a confusing error message. > For example, > {noformat} > Partition spec is invalid. The spec (a, b) must match the partition spec () > defined in table '`testview`'; > {noformat} > - When the permanent tables/views do not exist but the temporary view exists, > the expected error should be `NoSuchTableException` for `ALTER TABLE ... > UNSET TBLPROPERTIES`. However, it reports a missing table property instead. > For example, > {noformat} > Attempted to unset non-existent property 'p' in table '`testView`'; > {noformat} > - When `ANALYZE TABLE` is called on a view or a temporary view, we should > issue an error message. However, it reports a strange error: > {noformat} > ANALYZE TABLE is not supported for Project > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17516) Current user info is not checked on STS in DML queries
[ https://issues.apache.org/jira/browse/SPARK-17516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486192#comment-15486192 ] Tao Li commented on SPARK-17516: cc [~bikassaha], [~thejas] > Current user info is not checked on STS in DML queries > -- > > Key: SPARK-17516 > URL: https://issues.apache.org/jira/browse/SPARK-17516 > Project: Spark > Issue Type: Bug >Reporter: Tao Li >Priority: Critical > > I have captured some issues related to doAs support from STS. I am using a > non-secure cluster as my test environment. Simply speaking, the end user info > is not being passed when STS talks to the metastore, so the impersonation is not > happening on the metastore. > STS is using a ClientWrapper instance (which is wrapped in HiveContext) for > each session. However, by design all ClientWrapper instances share the > same Hive instance, which is responsible for talking to the metastore. A > singleton IsolatedClientLoader instance is initialized when STS starts up and > it contains the cachedHive instance. The cachedHive is associated with the “hive” UGI, > since no session has been set up yet, so the current user is “hive”. Then each session > creates a ClientWrapper instance which is associated with the same cachedHive > instance. > When we make queries after a session is established, the code path to retrieve > the Hive instance is different for DML and DDL operations. It looks like the DML > operation related code has less dependency on the hive-exec module. > For the DML operations (e.g. “select *”), STS calls into ClientWrapper code > and talks to the metastore through the singleton Hive instance directly. There is > no code involved to check the current user. That’s why doAs is not being > respected, even though the current user is already switched to the end user in > the thread context. > For DDL operations (e.g. “ALTER TABLE”), STS eventually calls into Hive > driver code (e.g. BaseSemanticAnalyzer). 
From there Hive.get() is called to > get the thread-local Hive instance and refresh it if necessary. If the > current user has changed, we refresh the Hive instance by recreating the > metastore connection with the current user info. So even though all thread > locals are actually referencing the singleton Hive instance, calling > Hive.get() plays an important role here by taking any UGI change into > account. That’s why the DDL operations respect doAs. > The fix should be to call Hive.get() for the DML operations as well, as the Hive > driver code invoked by DDL operations does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17516) Current user info is not checked on STS in DML queries
[ https://issues.apache.org/jira/browse/SPARK-17516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486177#comment-15486177 ] Tao Li commented on SPARK-17516: I have verified that switching to Hive.get() to obtain the Hive instance makes impersonation work. However, there is a performance hit. If multiple sessions are set up by various end users, the singleton Hive instance keeps being refreshed and we keep re-connecting from STS to the metastore. Hive.get() compares the user associated with the singleton Hive with the current user in the thread-local context, and resets the connection if the users differ. The threads running for a session always have the end user's UGI associated. So when threads from different sessions access the singleton Hive instance in an interleaved manner, the connection keeps being reset. That's the perf hit. There is also a potential thread safety issue when the singleton Hive instance is used concurrently by multiple sessions. We could add a lock to synchronize, but that's also a perf hit. So the more ideal fix is to have thread-local Hive instances. > Current user info is not checked on STS in DML queries > -- > > Key: SPARK-17516 > URL: https://issues.apache.org/jira/browse/SPARK-17516 > Project: Spark > Issue Type: Bug >Reporter: Tao Li >Priority: Critical > > I have captured some issues related to doAs support from STS. I am using a > non-secure cluster as my test environment. Simply speaking, the end user info > is not being passed when STS talks to the metastore, so the impersonation is not > happening on the metastore. > STS is using a ClientWrapper instance (which is wrapped in HiveContext) for > each session. However, by design all ClientWrapper instances share the > same Hive instance, which is responsible for talking to the metastore. A > singleton IsolatedClientLoader instance is initialized when STS starts up and > it contains the cachedHive instance. 
The cachedHive is associated with the "hive" UGI, > since no session has been set up yet, so the current user is "hive". Then each session > creates a ClientWrapper instance which is associated with the same cachedHive > instance. > When we make queries after a session is established, the code path to retrieve > the Hive instance differs for DML and DDL operations. It looks like the DML-related > code has less dependency on the hive-exec module. > For DML operations (e.g. "select *"), STS calls into ClientWrapper code > and talks to the metastore through the singleton Hive instance directly. There is > no code involved to check the current user. That's why doAs is not being > respected, even though the current user has already been switched to the end user in > the thread context. > For DDL operations (e.g. "ALTER TABLE"), STS eventually calls into Hive > driver code (e.g. BaseSemanticAnalyzer). From there Hive.get() is called to > get the thread-local Hive instance and refresh it if necessary. If the > current user has changed, the Hive instance is refreshed by recreating the > metastore connection with the current user info. So even though all thread > locals actually reference the singleton Hive instance, calling > Hive.get() plays an important role here by taking any UGI change into > account. That's why DDL operations respect doAs. > The fix should be to call Hive.get() for DML operations, like the Hive > driver code called for DDL operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486131#comment-15486131 ] Cody Koeninger commented on SPARK-15406: I've got a minimal working Source and SourceProvider, at least for topics that are String key and value only, at https://github.com/apache/spark/compare/master...koeninger:SPARK-15406 If you haven't already attempted an implementation, I'd suggest at least looking at that before writing up a design doc that may or may not address some of the pragmatic issues. The big thing I'm running into, and maybe I'm just not understanding the intention behind the SourceProvider interface, is that putting all configuration through a Map[String, String] makes it super awkward to configure types, or classes, or collections of offsets, or... anything really. Another significant issue is that I have no idea how rate limiting is supposed to work. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
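Cody's complaint about pushing everything through a `Map[String, String]` can be illustrated with a small sketch: typed structures such as per-partition offsets have to be smuggled through a string value. The `startingOffsets` key and the JSON encoding below are invented for illustration, not the actual SourceProvider contract.

```python
import json

def parse_offsets(options):
    """Decode per-partition starting offsets from a flat string-to-string map.

    options is the Map[String, String]-style dict the provider receives;
    the typed structure has to be re-parsed out of a string value.
    """
    raw = options.get("startingOffsets", "{}")
    decoded = json.loads(raw)  # e.g. '{"topic-a": {"0": 23, "1": 45}}'
    return {
        (topic, int(partition)): int(offset)
        for topic, partitions in decoded.items()
        for partition, offset in partitions.items()
    }
```

Every consumer of such a map has to agree on an ad-hoc encoding like this one, which is the awkwardness being described.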
[jira] [Created] (SPARK-17516) Current user info is not checked on STS in DML queries
Tao Li created SPARK-17516: -- Summary: Current user info is not checked on STS in DML queries Key: SPARK-17516 URL: https://issues.apache.org/jira/browse/SPARK-17516 Project: Spark Issue Type: Bug Reporter: Tao Li Priority: Critical I have captured some issues related to doAs support from STS. I am using a non-secure cluster as my test environment. Simply speaking, the end user info is not being passed when STS talks to metastore, so the impersonation is not happening on metastore. STS is using a ClientWarpper instance (which is wrapped in HiveContext) for each session. However by design all ClientWarpper instances are sharing the same Hive instance, which is responsible for talking to Metastore. A singleton IsolatedClientLoader instance is initialized when STS starts up and it contains the cachedHive instance. The cachedHive is associated “hive” UGI, since no session has been set up so current user is “hive". Then each session creates a ClientWarpper instance which is associated with the same cachedHive instance. When we make queries after session is established, the code path to retrieve the Hive instance is different for DML and DDL operation. Looks like DML operation related code has less dependency on hive-exec module. For the DML operations (e.g. “select *”), STS calls into ClientWarpper code and talks to metastore through the singleton Hive instance directly. There is no code involved to check the current user. That’s why doAs is not being respected, even though current user is already switched to the end user in the thread context. For DDL operations (e.g. “ALTER table”), STS eventually calls into hive driver code (e.g. BaseSemanticAnalyzer). From there Hive.get() is called to get the thread local Hive instance and refresh it if necessary. If the current user has changed, we refresh the Hive instance by recreating the metastore connection with the current user info. 
So even though all thread locals are actually referencing the singleton Hive instance, calling Hive.get() is playing an important role here to take any UGI change into account. That’s why the DDL operations respects doAs . The fix should be calling Hive.get() for the DML operations, like the hive driver code called from DDL operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17515) CollectLimit.execute() should perform per-partition limits
[ https://issues.apache.org/jira/browse/SPARK-17515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485986#comment-15485986 ] Apache Spark commented on SPARK-17515: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15070 > CollectLimit.execute() should perform per-partition limits > -- > > Key: SPARK-17515 > URL: https://issues.apache.org/jira/browse/SPARK-17515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > {{CollectLimit.execute()}} incorrectly omits per-partition limits, leading to > performance regressions when this case is hit (which should not happen in > normal operation, but can occur in some pathological corner cases). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17515) CollectLimit.execute() should perform per-partition limits
[ https://issues.apache.org/jira/browse/SPARK-17515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17515: Assignee: Apache Spark (was: Josh Rosen) > CollectLimit.execute() should perform per-partition limits > -- > > Key: SPARK-17515 > URL: https://issues.apache.org/jira/browse/SPARK-17515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > {{CollectLimit.execute()}} incorrectly omits per-partition limits, leading to > performance regressions when this case is hit (which should not happen in > normal operation, but can occur in some pathological corner cases). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17515) CollectLimit.execute() should perform per-partition limits
[ https://issues.apache.org/jira/browse/SPARK-17515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17515: Assignee: Josh Rosen (was: Apache Spark) > CollectLimit.execute() should perform per-partition limits > -- > > Key: SPARK-17515 > URL: https://issues.apache.org/jira/browse/SPARK-17515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > {{CollectLimit.execute()}} incorrectly omits per-partition limits, leading to > performance regressions when this case is hit (which should not happen in > normal operation, but can occur in some pathological corner cases). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17515) CollectLimit.execute() should perform per-partition limits
Josh Rosen created SPARK-17515: -- Summary: CollectLimit.execute() should perform per-partition limits Key: SPARK-17515 URL: https://issues.apache.org/jira/browse/SPARK-17515 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Josh Rosen Assignee: Josh Rosen {{CollectLimit.execute()}} incorrectly omits per-partition limits, leading to performance regressions when this case is hit (which should not happen in normal operation, but can occur in some pathological corner cases). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
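The effect of the missing per-partition limit can be modeled with plain lists. This is a toy sketch of the intended behavior, not Spark's implementation: capping each partition at n first bounds the data moved to the driver, and the final cap enforces the global limit.

```python
def collect_limit(partitions, n):
    """Take at most n rows from a list of partitions.

    The inner cap is the per-partition limit that CollectLimit.execute()
    was omitting: without it, every row of every partition would be
    collected before the global limit is applied.
    """
    capped = [part[:n] for part in partitions]  # per-partition limit
    out = []
    for part in capped:
        for row in part:
            if len(out) == n:
                return out
            out.append(row)
    return out
```

With the per-partition cap, a limit of 10 over a million-row partition ships at most 10 rows from that partition instead of all of them.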
[jira] [Updated] (SPARK-17511) Dynamic allocation race condition: Containers getting marked failed while releasing
[ https://issues.apache.org/jira/browse/SPARK-17511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kishor Patil updated SPARK-17511: - Description: While trying to launch multiple containers in a pool, if the running executor count reaches or goes beyond the target number of running executors, the container is released and marked failed. This can cause many jobs to be marked failed, causing overall job failure. I will have a patch up soon after completing testing. {panel:title=Typical Exception found in Driver marking the container to Failed} {code} java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489) at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} {panel} was: While trying to reach launch multiple containers in pool, if running executors count reaches or goes beyond the target running executors, the container is released and marked failed. This can cause many jobs to be marked failed causing overall job failure. I will have a patch up soon after completing testing. > Dynamic allocation race condition: Containers getting marked failed while > releasing > --- > > Key: SPARK-17511 > URL: https://issues.apache.org/jira/browse/SPARK-17511 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Kishor Patil > > While trying to launch > multiple containers in a pool, if the running > executor count reaches or goes beyond the target number of running executors, the > container is released and marked failed.
This can cause many jobs to be > marked failed causing overall job failure. > I will have a patch up soon after completing testing. > {panel:title=Typical Exception found in Driver marking the container to > Failed} > {code} > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:156) > at > org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489) > at > org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
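A minimal sketch of the kind of guard the fix presumably needs; the `allocate` helper and its names are hypothetical, not YarnAllocator's API. The point is that containers arriving after the target is already met get released cleanly instead of being counted as failed executors.

```python
def allocate(running, target, new_containers):
    """Decide which freshly allocated containers to launch vs. release.

    running: executors already running; target: desired executor count.
    Surplus containers are released without being recorded as failures,
    which is the behavior the race condition above violates.
    """
    to_launch, to_release = [], []
    for c in new_containers:
        if running + len(to_launch) < target:
            to_launch.append(c)
        else:
            to_release.append(c)  # released cleanly, not marked failed
    return to_launch, to_release
```

Under dynamic allocation the target can drop between the allocation request and the containers arriving, so this surplus case is expected rather than exceptional.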
[jira] [Assigned] (SPARK-17511) Dynamic allocation race condition: Containers getting marked failed while releasing
[ https://issues.apache.org/jira/browse/SPARK-17511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17511: Assignee: (was: Apache Spark) > Dynamic allocation race condition: Containers getting marked failed while > releasing > --- > > Key: SPARK-17511 > URL: https://issues.apache.org/jira/browse/SPARK-17511 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Kishor Patil > > While trying to launch multiple containers in a pool, if the running > executor count reaches or goes beyond the target number of running executors, the > container is released and marked failed. This can cause many jobs to be > marked failed, causing overall job failure. > I will have a patch up soon after completing testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17511) Dynamic allocation race condition: Containers getting marked failed while releasing
[ https://issues.apache.org/jira/browse/SPARK-17511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17511: Assignee: Apache Spark > Dynamic allocation race condition: Containers getting marked failed while > releasing > --- > > Key: SPARK-17511 > URL: https://issues.apache.org/jira/browse/SPARK-17511 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Kishor Patil >Assignee: Apache Spark > > While trying to launch multiple containers in a pool, if the running > executor count reaches or goes beyond the target number of running executors, the > container is released and marked failed. This can cause many jobs to be > marked failed, causing overall job failure. > I will have a patch up soon after completing testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17511) Dynamic allocation race condition: Containers getting marked failed while releasing
[ https://issues.apache.org/jira/browse/SPARK-17511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485941#comment-15485941 ] Apache Spark commented on SPARK-17511: -- User 'kishorvpatil' has created a pull request for this issue: https://github.com/apache/spark/pull/15069 > Dynamic allocation race condition: Containers getting marked failed while > releasing > --- > > Key: SPARK-17511 > URL: https://issues.apache.org/jira/browse/SPARK-17511 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Kishor Patil > > While trying to launch multiple containers in a pool, if the running > executor count reaches or goes beyond the target number of running executors, the > container is released and marked failed. This can cause many jobs to be > marked failed, causing overall job failure. > I will have a patch up soon after completing testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16728) migrate internal API for MLlib trees from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16728: -- Description: Currently, spark.ml trees rely on spark.mllib implementations. There are three issues with this: 1. Spark.ML's GBT TreeBoost algorithm requires storing additional information (the previous ensemble's prediction, for instance) inside the TreePoints (this is necessary to have loss-based splits for complex loss functions). 2. The old impurity API only lets you use summary statistics up to the 2nd order. These are useless for several impurity measures and inadequate for others (e.g., absolute loss or huber loss). It needs some renovation. 3. We should probably coalesce the ImpurityAggregator, ImpurityCalculator, and Impurity into a single class (and use virtual calls rather than case statements when toggling over impurity types). was: Currently, spark.ml trees rely on spark.mllib implementations. There are two issues with this: 1. Spark.ML's GBT TreeBoost algorithm requires storing additional information (the previous ensemble's prediction, for instance) inside the TreePoints (this is necessary to have loss-based splits for complex loss functions). 2. The old impurity API only lets you use summary statistics up to the 2nd order. These are useless for several impurity measures and inadequate for others (e.g., absolute loss or huber loss). It needs some renovation. > migrate internal API for MLlib trees from spark.mllib to spark.ml > - > > Key: SPARK-16728 > URL: https://issues.apache.org/jira/browse/SPARK-16728 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Vladimir Feinberg > > Currently, spark.ml trees rely on spark.mllib implementations. There are three > issues with this: > 1.
Spark.ML's GBT TreeBoost algorithm requires storing additional information > (the previous ensemble's prediction, for instance) inside the TreePoints > (this is necessary to have loss-based splits for complex loss functions). > 2. The old impurity API only lets you use summary statistics up to the 2nd > order. These are useless for several impurity measures and inadequate for > others (e.g., absolute loss or huber loss). It needs some renovation. > 3. We should probably coalesce the ImpurityAggregator, ImpurityCalculator, > and Impurity into a single class (and use virtual calls rather than case > statements when toggling over impurity types). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17514) df.take(1) and df.limit(1).collect() perform differently in Python
[ https://issues.apache.org/jira/browse/SPARK-17514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485821#comment-15485821 ] Apache Spark commented on SPARK-17514: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15068 > df.take(1) and df.limit(1).collect() perform differently in Python > -- > > Key: SPARK-17514 > URL: https://issues.apache.org/jira/browse/SPARK-17514 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > In PySpark, {{df.take(1)}} ends up running a single-stage job which computes > only one partition of {{df}}, while {{df.limit(1).collect()}} ends up > computing all partitions of {{df}} and runs a two-stage job. This difference > in performance is confusing, so I think that we should generalize the fix > from SPARK-10731 so that {{Dataset.collect()}} can be implemented efficiently > in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17514) df.take(1) and df.limit(1).collect() perform differently in Python
[ https://issues.apache.org/jira/browse/SPARK-17514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17514: Assignee: Josh Rosen (was: Apache Spark) > df.take(1) and df.limit(1).collect() perform differently in Python > -- > > Key: SPARK-17514 > URL: https://issues.apache.org/jira/browse/SPARK-17514 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > In PySpark, {{df.take(1)}} ends up running a single-stage job which computes > only one partition of {{df}}, while {{df.limit(1).collect()}} ends up > computing all partitions of {{df}} and runs a two-stage job. This difference > in performance is confusing, so I think that we should generalize the fix > from SPARK-10731 so that {{Dataset.collect()}} can be implemented efficiently > in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17514) df.take(1) and df.limit(1).collect() perform differently in Python
[ https://issues.apache.org/jira/browse/SPARK-17514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17514: Assignee: Apache Spark (was: Josh Rosen) > df.take(1) and df.limit(1).collect() perform differently in Python > -- > > Key: SPARK-17514 > URL: https://issues.apache.org/jira/browse/SPARK-17514 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Josh Rosen >Assignee: Apache Spark > > In PySpark, {{df.take(1)}} ends up running a single-stage job which computes > only one partition of {{df}}, while {{df.limit(1).collect()}} ends up > computing all partitions of {{df}} and runs a two-stage job. This difference > in performance is confusing, so I think that we should generalize the fix > from SPARK-10731 so that {{Dataset.collect()}} can be implemented efficiently > in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485782#comment-15485782 ] Evan Zamir commented on SPARK-17508: [~bryanc] Oh, that helps a lot! I've been writing very light wrappers around Spark functions and it wasn't clear to me whether I could keep weightCol as an optional parameter. At least now I can reason about how to do it better. I guess this isn't so much a bug then, as it is a feature request. So if someone wants to close the issue or reclassify, that would make sense. I can only imagine I'm not the only Spark user who has been miffed by this. > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
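Until the library accepts `weightCol=None` directly, the wrapper pattern Evan describes can simply omit the parameter when it is unset. This is a hedged sketch of that pattern; `build_lr_params` is a made-up helper, not anything in PySpark, and only the keyword names mirror `LogisticRegression`'s.

```python
def build_lr_params(max_iter=5, reg_param=0.0, weight_col=None):
    """Build a keyword dict for an estimator, omitting weightCol when unset.

    Passing the key with a None value is what triggers the
    NullPointerException above; leaving the key out means the estimator
    falls back to its default of treating every weight as 1.0.
    """
    params = {"maxIter": max_iter, "regParam": reg_param}
    if weight_col is not None:  # omit the key instead of sending None
        params["weightCol"] = weight_col
    return params
```

A wrapper would then call something like `LogisticRegression(**build_lr_params(weight_col=user_choice))`, keeping weightCol optional at the wrapper boundary.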
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485780#comment-15485780 ] Tathagata Das commented on SPARK-15406: --- Hey all, I am working on the design doc right now. I will upload it here soon. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17514) df.take(1) and df.limit(1).collect() perform differently in Python
Josh Rosen created SPARK-17514: -- Summary: df.take(1) and df.limit(1).collect() perform differently in Python Key: SPARK-17514 URL: https://issues.apache.org/jira/browse/SPARK-17514 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Josh Rosen Assignee: Josh Rosen In PySpark, {{df.take(1)}} ends up running a single-stage job which computes only one partition of {{df}}, while {{df.limit(1).collect()}} ends up computing all partitions of {{df}} and runs a two-stage job. This difference in performance is confusing, so I think that we should generalize the fix from SPARK-10731 so that {{Dataset.collect()}} can be implemented efficiently in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
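The performance difference can be modeled with plain lists. This toy sketch (not PySpark code) shows why an incremental `take` can stop after touching one partition while an eager `limit(n).collect()` evaluates them all first.

```python
def take(partitions, n):
    """Evaluate partitions one at a time, stopping once n rows are found.

    Returns the rows plus how many partitions were scanned; in Spark each
    scanned partition corresponds to work in a launched job.
    """
    out, scanned = [], 0
    for part in partitions:
        scanned += 1
        for row in part:
            out.append(row)
            if len(out) == n:
                return out, scanned
    return out, scanned

def limit_collect(partitions, n):
    """Eager variant: compute every partition, then truncate (all scanned)."""
    rows = [row for part in partitions for row in part]
    return rows[:n], len(partitions)
```

For n=1 over three partitions, the incremental version scans one partition where the eager version scans three, which is the single-stage vs. two-stage gap described above.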
[jira] [Assigned] (SPARK-17513) StreamExecution should discard unneeded metadata
[ https://issues.apache.org/jira/browse/SPARK-17513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17513: Assignee: Apache Spark > StreamExecution should discard unneeded metadata > > > Key: SPARK-17513 > URL: https://issues.apache.org/jira/browse/SPARK-17513 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Frederick Reiss >Assignee: Apache Spark > > The StreamExecution maintains a write-ahead log of batch metadata in order to > allow repeating previously in-flight batches if the driver is restarted. > StreamExecution does not garbage-collect or compact this log in any way. > Since the log is implemented with HDFSMetadataLog, these files will consume > memory on the HDFS NameNode. Specifically, each log file will consume about > 300 bytes of NameNode memory (150 bytes for the inode and 150 bytes for the > block of file contents; see > [https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html]. > An application with a 100 msec batch interval will increase the NameNode's > heap usage by about 250MB per day. > There is also the matter of recovery. StreamExecution reads its entire log > when restarting. This read operation will be very expensive if the log > contains millions of entries spread over millions of files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17513) StreamExecution should discard unneeded metadata
[ https://issues.apache.org/jira/browse/SPARK-17513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17513: Assignee: (was: Apache Spark) > StreamExecution should discard unneeded metadata > > > Key: SPARK-17513 > URL: https://issues.apache.org/jira/browse/SPARK-17513 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Frederick Reiss > > The StreamExecution maintains a write-ahead log of batch metadata in order to > allow repeating previously in-flight batches if the driver is restarted. > StreamExecution does not garbage-collect or compact this log in any way. > Since the log is implemented with HDFSMetadataLog, these files will consume > memory on the HDFS NameNode. Specifically, each log file will consume about > 300 bytes of NameNode memory (150 bytes for the inode and 150 bytes for the > block of file contents; see > [https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html]. > An application with a 100 msec batch interval will increase the NameNode's > heap usage by about 250MB per day. > There is also the matter of recovery. StreamExecution reads its entire log > when restarting. This read operation will be very expensive if the log > contains millions of entries spread over millions of files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17513) StreamExecution should discard unneeded metadata
[ https://issues.apache.org/jira/browse/SPARK-17513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485735#comment-15485735 ] Apache Spark commented on SPARK-17513: -- User 'frreiss' has created a pull request for this issue: https://github.com/apache/spark/pull/15067 > StreamExecution should discard unneeded metadata > > > Key: SPARK-17513 > URL: https://issues.apache.org/jira/browse/SPARK-17513 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Frederick Reiss > > The StreamExecution maintains a write-ahead log of batch metadata in order to > allow repeating previously in-flight batches if the driver is restarted. > StreamExecution does not garbage-collect or compact this log in any way. > Since the log is implemented with HDFSMetadataLog, these files will consume > memory on the HDFS NameNode. Specifically, each log file will consume about > 300 bytes of NameNode memory (150 bytes for the inode and 150 bytes for the > block of file contents; see > [https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html]. > An application with a 100 msec batch interval will increase the NameNode's > heap usage by about 250MB per day. > There is also the matter of recovery. StreamExecution reads its entire log > when restarting. This read operation will be very expensive if the log > contains millions of entries spread over millions of files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17513) StreamExecution should discard unneeded metadata
Frederick Reiss created SPARK-17513: --- Summary: StreamExecution should discard unneeded metadata Key: SPARK-17513 URL: https://issues.apache.org/jira/browse/SPARK-17513 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Frederick Reiss The StreamExecution maintains a write-ahead log of batch metadata in order to allow repeating previously in-flight batches if the driver is restarted. StreamExecution does not garbage-collect or compact this log in any way. Since the log is implemented with HDFSMetadataLog, these files will consume memory on the HDFS NameNode. Specifically, each log file will consume about 300 bytes of NameNode memory (150 bytes for the inode and 150 bytes for the block of file contents; see [https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html]). An application with a 100 msec batch interval will increase the NameNode's heap usage by about 250MB per day. There is also the matter of recovery. StreamExecution reads its entire log when restarting. This read operation will be very expensive if the log contains millions of entries spread over millions of files.
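The ~250MB/day estimate in the description follows from simple arithmetic, assuming the ~300 bytes per log file from the Cloudera guidance cited above:

```python
# Back-of-the-envelope check of the NameNode heap growth described above.
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
BATCH_INTERVAL_SEC = 0.1            # 100 msec batch interval
BYTES_PER_LOG_FILE = 300            # ~150 B inode + ~150 B block entry

batches_per_day = SECONDS_PER_DAY / BATCH_INTERVAL_SEC   # 864,000 log files/day
heap_growth_bytes = batches_per_day * BYTES_PER_LOG_FILE

print(f"{heap_growth_bytes / 1e6:.0f} MB of NameNode heap per day")  # about 259 MB
```

864,000 files at 300 bytes each is roughly 259 MB, consistent with the "about 250MB per day" figure in the report.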
[jira] [Created] (SPARK-17512) Specifying remote files for Python based Spark jobs in Yarn cluster mode not working
Udit Mehrotra created SPARK-17512: - Summary: Specifying remote files for Python based Spark jobs in Yarn cluster mode not working Key: SPARK-17512 URL: https://issues.apache.org/jira/browse/SPARK-17512 Project: Spark Issue Type: Bug Components: PySpark, Spark Submit Affects Versions: 2.0.0 Reporter: Udit Mehrotra When I run a python application, and specify a remote path for the extra files to be included in the PYTHON_PATH using the '--py-files' or 'spark.submit.pyFiles' configuration option in YARN Cluster mode I get the following error: Exception in thread "main" java.lang.IllegalArgumentException: Launching Python applications through spark-submit is currently only supported for local files: s3:///app.py at org.apache.spark.deploy.PythonRunner$.formatPath(PythonRunner.scala:104) at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136) at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.deploy.PythonRunner$.formatPaths(PythonRunner.scala:136) at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:636) at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:634) at scala.Option.foreach(Option.scala:257) at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:634) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) 
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Here are sample commands which would throw this error in Spark 2.0 (sparkApp.py requires app.py): spark-submit --deploy-mode cluster --py-files s3:///app.py s3:///sparkApp.py (works fine in 1.6) spark-submit --deploy-mode cluster --conf spark.submit.pyFiles=s3:///app.py s3:///sparkApp1.py (not working in 1.6) This would work fine if app.py is downloaded locally and specified. This was working correctly using the '--py-files' option in earlier versions of Spark, but not using the 'spark.submit.pyFiles' configuration option. But now, it does not work either way. The following diff shows the comment which states that it should work with 'non-local' paths for the YARN cluster mode, and we are specifically doing separate validation to fail if YARN client mode is used with remote paths: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L309 And then this code gets triggered at the end of each run, irrespective of whether we are using Client or Cluster mode, and internally validates that the paths should be local: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L634 The above validation was not getting triggered in earlier versions of Spark using the '--py-files' option because we were not storing the arguments passed to '--py-files' in the 'spark.submit.pyFiles' configuration for YARN.
However, the following code was newly added in 2.0 which now stores it, and hence this validation gets triggered even if we specify files through the '--py-files' option: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L545 Also, the logic in the YARN client was changed to read values directly from the 'spark.submit.pyFiles' configuration instead of from '--py-files' (as before): https://github.com/apache/spark/commit/8ba2b7f28fee39c4839e5ea125bd25f5091a3a1e#diff-b050df3f55b82065803d6e83453b9706R543 So now it's broken whether we use '--py-files' or 'spark.submit.pyFiles', as the validation gets triggered in both cases irrespective of whether we use Client or Cluster mode with YARN.
[jira] [Commented] (SPARK-16750) ML GaussianMixture training failed due to feature column type mistake
[ https://issues.apache.org/jira/browse/SPARK-16750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485718#comment-15485718 ] Pramit Choudhary commented on SPARK-16750: -- It seems the release branch was cut on 19th July and this change made it post that. Any work around guys ? > ML GaussianMixture training failed due to feature column type mistake > - > > Key: SPARK-16750 > URL: https://issues.apache.org/jira/browse/SPARK-16750 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 2.0.1, 2.1.0 > > > ML GaussianMixture training failed due to feature column type mistake. The > feature column type should be {{ml.linalg.VectorUDT}} but got > {{mllib.linalg.VectorUDT}} by mistake. > This bug is easy to reproduce by the following code: > {code} > val df = spark.createDataFrame( > Seq( > (1, Vectors.dense(0.0, 1.0, 4.0)), > (2, Vectors.dense(1.0, 0.0, 4.0)), > (3, Vectors.dense(1.0, 0.0, 5.0)), > (4, Vectors.dense(0.0, 0.0, 5.0))) > ).toDF("id", "features") > val scaler = new MinMaxScaler() > .setInputCol("features") > .setOutputCol("features_scaled") > .setMin(0.0) > .setMax(5.0) > val gmm = new GaussianMixture() > .setFeaturesCol("features_scaled") > .setK(2) > val pipeline = new Pipeline().setStages(Array(scaler, gmm)) > pipeline.fit(df) > requirement failed: Column features_scaled must be of type > org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7. > java.lang.IllegalArgumentException: requirement failed: Column > features_scaled must be of type > org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7. 
> at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) > at > org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64) > at > org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275) > at > org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) > at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70) > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132) > {code} > Why the unit tests did not complain this errors? Because some > estimators/transformers missed calling {{transformSchema(dataset.schema)}} > firstly during {{fit}} or {{transform}}. I will also add this function to all > estimators/transformers who missed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485711#comment-15485711 ] Ashwin Shankar commented on SPARK-16441: Hey [~Dhruve Ashar], we hit the same issue at Netflix when dynamic allocation is enabled and have thousands of executors running. Do you have any insights or resolution for this ticket? > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > > spark application are waiting for rpc response all the time and spark > listener are blocked by dynamic allocation. Executors can not connect to > driver and lost. > "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > 
at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at
[jira] [Resolved] (SPARK-17474) Python UDF does not work between Sort and Limit
[ https://issues.apache.org/jira/browse/SPARK-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17474. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 15030 [https://github.com/apache/spark/pull/15030] > Python UDF does not work between Sort and Limit > --- > > Key: SPARK-17474 > URL: https://issues.apache.org/jira/browse/SPARK-17474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.1, 2.1.0 > > > Because of this bug, Python UDF will not work with ORDER BY and LIMIT.
[jira] [Commented] (SPARK-16750) ML GaussianMixture training failed due to feature column type mistake
[ https://issues.apache.org/jira/browse/SPARK-16750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485531#comment-15485531 ] Pramit Choudhary commented on SPARK-16750: -- Did this fix make it to the release version 2.0.0. If not, is there a work around for the same ? > ML GaussianMixture training failed due to feature column type mistake > - > > Key: SPARK-16750 > URL: https://issues.apache.org/jira/browse/SPARK-16750 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 2.0.1, 2.1.0 > > > ML GaussianMixture training failed due to feature column type mistake. The > feature column type should be {{ml.linalg.VectorUDT}} but got > {{mllib.linalg.VectorUDT}} by mistake. > This bug is easy to reproduce by the following code: > {code} > val df = spark.createDataFrame( > Seq( > (1, Vectors.dense(0.0, 1.0, 4.0)), > (2, Vectors.dense(1.0, 0.0, 4.0)), > (3, Vectors.dense(1.0, 0.0, 5.0)), > (4, Vectors.dense(0.0, 0.0, 5.0))) > ).toDF("id", "features") > val scaler = new MinMaxScaler() > .setInputCol("features") > .setOutputCol("features_scaled") > .setMin(0.0) > .setMax(5.0) > val gmm = new GaussianMixture() > .setFeaturesCol("features_scaled") > .setK(2) > val pipeline = new Pipeline().setStages(Array(scaler, gmm)) > pipeline.fit(df) > requirement failed: Column features_scaled must be of type > org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7. > java.lang.IllegalArgumentException: requirement failed: Column > features_scaled must be of type > org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7. 
> at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) > at > org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64) > at > org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275) > at > org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) > at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70) > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132) > {code} > Why the unit tests did not complain this errors? Because some > estimators/transformers missed calling {{transformSchema(dataset.schema)}} > firstly during {{fit}} or {{transform}}. I will also add this function to all > estimators/transformers who missed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485530#comment-15485530 ] Bryan Cutler commented on SPARK-17508: -- [~zamir.e...@gmail.com] This is a gripe I have with ML Params in PySpark. The API makes it seem like setting params such as {{weightCol}} to {{None}} is supported, however this actually assigns a value of {{null}} to the param on the JVM side, which is not the same thing as "not set or empty" from the pydocs. So the supported ways to initialize this are {noformat} LogisticRegression(maxIter=5, regParam=0.0, weightCol="") # as you pointed out, or LogisticRegression(maxIter=5, regParam=0.0)# just omit the param {noformat} > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
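The unset-versus-null distinction described in Bryan Cutler's comment above can be illustrated with a plain-Python sketch. The `Params` class below is hypothetical, for illustration only, and is not PySpark's actual param machinery:

```python
# Plain-Python sketch of why weightCol=None differs from leaving it unset.
# (Hypothetical Params class; not PySpark's implementation.)

_UNSET = object()  # sentinel distinguishing "never set" from "set to None"

class Params:
    def __init__(self, weight_col=_UNSET):
        self._weight_col = weight_col

    def weights_enabled(self):
        # Unset or empty string: treat every row's weight as 1.0.
        if self._weight_col is _UNSET or self._weight_col == "":
            return False
        if self._weight_col is None:
            # Mirrors the reported bug: None crosses the Py4J boundary as a
            # JVM null column name and later surfaces as a NullPointerException.
            raise TypeError("weightCol was explicitly set to null")
        return True

assert Params().weights_enabled() is False               # param omitted: OK
assert Params(weight_col="").weights_enabled() is False  # empty string: OK
```

Constructing `Params(weight_col=None)` and calling `weights_enabled()` raises, just as `fit()` does in the traceback above, while omitting the param or passing `""` behaves as documented.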
[jira] [Created] (SPARK-17511) Dynamic allocation race condition: Containers getting marked failed while releasing
Kishor Patil created SPARK-17511: Summary: Dynamic allocation race condition: Containers getting marked failed while releasing Key: SPARK-17511 URL: https://issues.apache.org/jira/browse/SPARK-17511 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.0.0, 2.0.1, 2.1.0 Reporter: Kishor Patil While trying to launch multiple containers from the pool, if the running executor count reaches or exceeds the target number of running executors, the container is released and marked failed. This can cause many jobs to be marked failed, causing overall job failure. I will have a patch up soon after completing testing.
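A sketch of the kind of guard such a patch might add (illustrative Python, not the actual YARN allocator code): when the running executor count has already reached the target, a surplus container is released without being counted as a failure.

```python
# Illustrative sketch of the fix direction: a surplus container allocated
# after the target is reached should be released quietly, not marked failed.

def handle_allocated_container(running, target, failed_count):
    """Return (running, failed_count, action) after one container allocation."""
    if running >= target:
        # Surplus container: release it without recording a failure, so the
        # race between allocation and target updates cannot fail the job.
        return running, failed_count, "released"
    return running + 1, failed_count, "launched"

running, failed, action = handle_allocated_container(running=5, target=5, failed_count=0)
assert action == "released" and failed == 0   # no spurious failure recorded
```

The essential point is only that the "released" path leaves the failure counter untouched; in the buggy behavior described above, each surplus release increments it until the application's failure threshold is crossed.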
[jira] [Resolved] (SPARK-17485) Failed remote cached block reads can lead to whole job failure
[ https://issues.apache.org/jira/browse/SPARK-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17485. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 15037 [https://github.com/apache/spark/pull/15037] > Failed remote cached block reads can lead to whole job failure > -- > > Key: SPARK-17485 > URL: https://issues.apache.org/jira/browse/SPARK-17485 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 1.6.2, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > In Spark's RDD.getOrCompute we first try to read a local copy of a cached > block, then a remote copy, and only fall back to recomputing the block if no > cached copy (local or remote) can be read. This logic works correctly in the > case where no remote copies of the block exist, but if there _are_ remote > copies but reads of those copies fail (due to network issues or internal > Spark bugs) then the BlockManager will throw a {{BlockFetchException}} error > that fails the entire job. > In the case of torrent broadcast we really _do_ want to fail the entire job > in case no remote blocks can be fetched, but this logic is inappropriate for > cached blocks because those can/should be recomputed. > Therefore, I think that this exception should be thrown higher up the call > stack by the BlockManager client code and not the block manager itself.
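The behavior the ticket proposes can be sketched in miniature (plain Python, not Spark's BlockManager): a failed remote read of a cached block falls through to recomputation instead of failing the job.

```python
# Simplified sketch of the desired getOrCompute behavior: local read first,
# then remote read, and a remote-fetch failure for a *cached* block falls
# back to recomputation rather than propagating and failing the whole job.

class RemoteFetchError(Exception):
    pass

def get_or_compute(read_local, read_remote, recompute):
    value = read_local()
    if value is not None:
        return value
    try:
        value = read_remote()      # may fail: network issues, lost executor
        if value is not None:
            return value
    except RemoteFetchError:
        pass                       # don't fail the job; fall through
    return recompute()             # cached blocks can always be recomputed

def broken_remote():
    raise RemoteFetchError("remote read failed")

result = get_or_compute(lambda: None, broken_remote, lambda: "recomputed")
assert result == "recomputed"
```

For torrent broadcast blocks, by contrast, the description argues the failure should still propagate, which is why the ticket wants the exception handled higher up by the caller that knows what kind of block it is, not inside the block manager itself.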
[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485505#comment-15485505 ] Hyukjin Kwon commented on SPARK-17477: -- I left a related commnet https://github.com/apache/spark/pull/15035#issuecomment-246516653 > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as its type while hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When using SparkSession to read a Hive table which is stored as parquet > files. If there has been a schema evolution from int to long of a column. > There are some old parquet files use int for the column while some new > parquet files use long. In Hive metastore, the type is long (bigint). > Therefore when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) > at > 
org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) > at > org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) > at >
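The ClassCastException above comes from the reader materializing values with the file's int type while the table schema declares long. A toy illustration (plain Python, not the Parquet reader) of the widening direction that would reconcile the two:

```python
# Toy illustration of schema reconciliation for int -> long evolution:
# values from old files stored as 32-bit ints are widened losslessly to the
# table schema's 64-bit long type at read time.

def read_with_schema(file_values, file_type, table_type):
    if file_type == table_type:
        return list(file_values)
    if file_type == "int32" and table_type == "int64":
        return [int(v) for v in file_values]   # lossless widening
    raise TypeError(f"unsupported evolution {file_type} -> {table_type}")

old_file = read_with_schema([1, 2, 3], "int32", "int64")   # old files widen
new_file = read_with_schema([4, 5], "int64", "int64")      # new files pass through
assert old_file + new_file == [1, 2, 3, 4, 5]
```

In the reported failure there is no such widening step: the row holder is built for the Hive metastore's long type, the converter for the old file's int type, and the mismatch only surfaces as a cast error at read time.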
[jira] [Resolved] (SPARK-13406) NPE in LazilyGeneratedOrdering
[ https://issues.apache.org/jira/browse/SPARK-13406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13406. Resolution: Duplicate > NPE in LazilyGeneratedOrdering > -- > > Key: SPARK-13406 > URL: https://issues.apache.org/jira/browse/SPARK-13406 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Josh Rosen > > {code} > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", > line ?, in pyspark.sql.dataframe.DataFrameStatFunctions.sampleBy > Failed example: > sampled.groupBy("key").count().orderBy("key").show() > Exception raised: > Traceback (most recent call last): > File "//anaconda/lib/python2.7/doctest.py", line 1315, in __run > compileflags, 1) in test.globs > File " pyspark.sql.dataframe.DataFrameStatFunctions.sampleBy[3]>", line 1, in > > sampled.groupBy("key").count().orderBy("key").show() > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", > line 217, in show > print(self._jdf.showString(n, truncate)) > File > "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py", > line 835, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line > 45, in deco > return f(*a, **kw) > File > "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py", > line 310, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o681.showString. 
> : org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1782) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:937) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:919) > at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1318) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) > at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1305) > at > org.apache.spark.sql.execution.TakeOrderedAndProject.executeCollect(limit.scala:94) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:157) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) > at > 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1769) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1519) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1526) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1396) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1395) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:1782) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1395) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1477) > at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:167) > at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at
[jira] [Updated] (SPARK-13406) NPE in LazilyGeneratedOrdering
[ https://issues.apache.org/jira/browse/SPARK-13406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-13406: --- Assignee: (was: Josh Rosen) > NPE in LazilyGeneratedOrdering > -- > > Key: SPARK-13406 > URL: https://issues.apache.org/jira/browse/SPARK-13406 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > {code} > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", > line ?, in pyspark.sql.dataframe.DataFrameStatFunctions.sampleBy > Failed example: > sampled.groupBy("key").count().orderBy("key").show() > Exception raised: > Traceback (most recent call last): > File "//anaconda/lib/python2.7/doctest.py", line 1315, in __run > compileflags, 1) in test.globs > File " pyspark.sql.dataframe.DataFrameStatFunctions.sampleBy[3]>", line 1, in > > sampled.groupBy("key").count().orderBy("key").show() > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", > line 217, in show > print(self._jdf.showString(n, truncate)) > File > "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py", > line 835, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line > 45, in deco > return f(*a, **kw) > File > "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py", > line 310, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o681.showString. 
> : org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1782) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:937) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:919) > at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1318) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) > at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1305) > at > org.apache.spark.sql.execution.TakeOrderedAndProject.executeCollect(limit.scala:94) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:157) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) > at > 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1769) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1519) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1526) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1396) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1395) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:1782) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1395) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1477) > at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:167) > at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at
[jira] [Resolved] (SPARK-2424) ApplicationState.MAX_NUM_RETRY should be configurable
[ https://issues.apache.org/jira/browse/SPARK-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2424. --- Resolution: Duplicate Fix Version/s: 2.1.0 2.0.1 1.6.3 This was finally done in SPARK-16956 > ApplicationState.MAX_NUM_RETRY should be configurable > - > > Key: SPARK-2424 > URL: https://issues.apache.org/jira/browse/SPARK-2424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Mark Hamstra >Assignee: Josh Rosen > Fix For: 1.6.3, 2.0.1, 2.1.0 > > > ApplicationState.MAX_NUM_RETRY, controlling the number of times standalone > Executors can exit unsuccessfully before Master will remove the Application > that the Executors are trying to run, is currently hard-coded to 10. There's > no reason why this should be a single, fixed value for all standalone > clusters (e.g., it should probably scale with the number of Executors), so it > should be SparkConf-able. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
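The change the resolution points to — replacing the hard-coded limit of 10 with a configurable setting — follows a simple pattern. A minimal Python sketch (the setting name matches the one SPARK-16956 introduced, `spark.deploy.maxExecutorRetries`, but the conf-dict shape and helper here are illustrative, not the actual Master code):

```python
DEFAULT_MAX_NUM_RETRY = 10

def max_num_retry(conf):
    """Read the retry limit from a conf mapping, falling back to the default."""
    value = int(conf.get("spark.deploy.maxExecutorRetries", DEFAULT_MAX_NUM_RETRY))
    if value < -1:  # by convention, -1 means "never remove the application"
        raise ValueError("maxExecutorRetries must be >= -1")
    return value
```

With no override the default of 10 is preserved, so existing clusters behave unchanged while larger deployments can scale the limit with their executor count.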
[jira] [Resolved] (SPARK-14818) Move sketch and mllibLocal out from mima exclusion
[ https://issues.apache.org/jira/browse/SPARK-14818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-14818. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 15061 [https://github.com/apache/spark/pull/15061] > Move sketch and mllibLocal out from mima exclusion > -- > > Key: SPARK-14818 > URL: https://issues.apache.org/jira/browse/SPARK-14818 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Blocker > Fix For: 2.0.1, 2.1.0 > > > In SparkBuild.scala, we exclude sketch and mllibLocal from mima check (see > the definition of mimaProjects). After we release 2.0, we should move them > out from this list. So, we can check binary compatibility for them -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485420#comment-15485420 ] Apache Spark commented on SPARK-17463: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15065 > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at 
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at
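The root cause in the trace above — an `ArrayList` being mutated while `ObjectOutputStream` walks it — is the classic snapshot-before-serialize problem. A minimal Python sketch of the defensive-copy fix (class and method names are illustrative, not the actual Executor heartbeat API):

```python
import pickle
import threading

class HeartbeatReporter:
    """Snapshot a mutable list under a lock before serializing it, so a
    concurrent append cannot corrupt the serialized form (the analogue of
    the ArrayList changing under ObjectOutputStream.writeObject)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._updates = []

    def add_update(self, value):
        with self._lock:
            self._updates.append(value)

    def serialize_heartbeat(self):
        with self._lock:
            snapshot = list(self._updates)  # copy while holding the lock
        return pickle.dumps(snapshot)       # serialize outside the lock
```

Only the copy happens under the lock; the (potentially slow) serialization runs on the immutable snapshot, so the heartbeat thread never races the task threads.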
[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-5575: Description: *Goal:* Implement various types of artificial neural networks *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) Having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use-cases, and Spark ML might want to cover them. Eventually, it is hard to explain, why do we have PCA in ML but don't provide Autoencoder. To summarize this, Spark should have at least the most widely used deep learning models, such as fully connected artificial neural network, convolutional network and autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. These 3 will provide a comprehensive deep learning set for Spark ML. We might also include recurrent networks as well. *Requirements:* # Extensible API compatible with Spark ML. Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused. Define the Spark ML API for deep learning. This interface is similar to the other analytics tools in Spark and supports ML pipelines. This makes deep learning easy to use and plug in into analytics workloads for Spark users. # Efficiency. The current implementation of multilayer perceptron in Spark is less than 2x slower than Caffe, both measured on CPU. The main overhead sources are JVM and Spark's communication layer. For more details, please refer to https://github.com/avulanov/ann-benchmark. Having said that, the efficient implementation of deep learning in Spark should be only few times slower than in specialized tool. 
This is very reasonable for the platform that does much more than deep learning and I believe it is understood by the community. # Scalability. Implement efficient distributed training. It relies heavily on the efficient communication and scheduling mechanisms. The default implementation is based on Spark. More efficient implementations might include some external libraries but use the same interface defined. *Main features:* # Multilayer perceptron classifier (MLP) # Autoencoder # Convolutional neural networks for computer vision. The interface has to provide few architectures for deep learning that are widely used in practice, such as AlexNet *Additional features:* # Other architectures, such as Recurrent neural network (RNN), Long-short term memory (LSTM), Restricted boltzmann machine (RBM), deep belief network (DBN), MLP multivariate regression # Regularizers, such as L1, L2, drop-out # Normalizers # Network customization. The internal API of Spark ANN is designed to be flexible and can handle different types of layers. However, only a part of the API is made public. We have to limit the number of public classes in order to make it simpler to support other languages. This forces us to use (String or Number) parameters instead of introducing of new public classes. One of the options to specify the architecture of ANN is to use text configuration with layer-wise description. We have considered using Caffe format for this. It gives the benefit of compatibility with well known deep learning tool and simplifies the support of other languages in Spark. Implementation of a parser for the subset of Caffe format might be the first step towards the support of general ANN architectures in Spark. # Hardware specific optimization. One can wrap other deep learning implementations with this interface allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the default one. 
The interface has to provide few architectures for deep learning that are widely used in practice, such as AlexNet. The main motivation for using specialized libraries for deep learning would be to fully take advantage of the hardware where Spark runs, in particular GPUs. Having the default interface in Spark, we will need to wrap only a subset of functions from a given specialized library. It does require an effort, however it is not the same as wrapping all functions. Wrappers can be provided as packages without the need to pull new dependencies into Spark. *Completed (merged to the main Spark branch):* * Requirements: https://issues.apache.org/jira/browse/SPARK-9471 ** API https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/ ** Efficiency & Scalability: https://github.com/avulanov/ann-benchmark * Features: ** Multilayer perceptron classifier https://issues.apache.org/jira/browse/SPARK-9471 *In progress (pull request):* * Features:
[jira] [Assigned] (SPARK-17509) When wrapping catalyst datatype to Hive data type avoid pattern matching
[ https://issues.apache.org/jira/browse/SPARK-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17509: Assignee: (was: Apache Spark) > When wrapping catalyst datatype to Hive data type avoid pattern matching > > > Key: SPARK-17509 > URL: https://issues.apache.org/jira/browse/SPARK-17509 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Profiling a job, we saw that pattern matching in the wrap function of > HiveInspector is consuming around 10% of the time, which can be avoided. A > similar change in the unwrap function was made in SPARK-15956. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17509) When wrapping catalyst datatype to Hive data type avoid pattern matching
[ https://issues.apache.org/jira/browse/SPARK-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485327#comment-15485327 ] Apache Spark commented on SPARK-17509: -- User 'sitalkedia' has created a pull request for this issue: https://github.com/apache/spark/pull/15064 > When wrapping catalyst datatype to Hive data type avoid pattern matching > > > Key: SPARK-17509 > URL: https://issues.apache.org/jira/browse/SPARK-17509 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Profiling a job, we saw that pattern matching in the wrap function of > HiveInspector is consuming around 10% of the time, which can be avoided. A > similar change in the unwrap function was made in SPARK-15956. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17509) When wrapping catalyst datatype to Hive data type avoid pattern matching
[ https://issues.apache.org/jira/browse/SPARK-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17509: Assignee: Apache Spark > When wrapping catalyst datatype to Hive data type avoid pattern matching > > > Key: SPARK-17509 > URL: https://issues.apache.org/jira/browse/SPARK-17509 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Apache Spark > > Profiling a job, we saw that pattern matching in the wrap function of > HiveInspector is consuming around 10% of the time, which can be avoided. A > similar change in the unwrap function was made in SPARK-15956. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15621) BatchEvalPythonExec fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-15621: -- Assignee: Davies Liu > BatchEvalPythonExec fails with OOM > -- > > Key: SPARK-15621 > URL: https://issues.apache.org/jira/browse/SPARK-15621 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Krisztian Szucs >Assignee: Davies Liu > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40 > No matter what, the queue grows unboundedly and fails with OOM, even with > the identity `lambda x: x` udf function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
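The unbounded-queue failure mode described above is typically fixed by bounding the buffer so the producer blocks instead of growing memory without limit. A minimal Python sketch of that backpressure pattern (illustrative only, not the actual BatchEvalPythonExec code):

```python
import queue
import threading

# With maxsize set, put() blocks once the queue is full, applying
# backpressure to the producer instead of letting memory grow unboundedly.
buf = queue.Queue(maxsize=4)

def producer(n):
    for i in range(n):
        buf.put(i)    # blocks whenever 4 items are already buffered
    buf.put(None)     # sentinel: no more items

def consume_all():
    out = []
    while (item := buf.get()) is not None:
        out.append(item)
    return out

t = threading.Thread(target=producer, args=(10,))
t.start()
result = consume_all()
t.join()
```

At no point do more than five items (four buffered plus one in flight) exist at once, regardless of how many rows the producer emits.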
[jira] [Resolved] (SPARK-17483) Minor refactoring and cleanup in BlockManager block status reporting and block removal
[ https://issues.apache.org/jira/browse/SPARK-17483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17483. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15036 [https://github.com/apache/spark/pull/15036] > Minor refactoring and cleanup in BlockManager block status reporting and > block removal > -- > > Key: SPARK-17483 > URL: https://issues.apache.org/jira/browse/SPARK-17483 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.1.0 > > > As a precursor to fixing a block fetch bug, I'd like to split a few small > refactorings in BlockManager into their own patch (hence this JIRA). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485130#comment-15485130 ] Alessio commented on SPARK-2352: Pretty strange that this post with such hype is still "In progress" after 1 year. If Apache Spark does not (want to?) include your ANNs, can you consider releasing it as an independent toolbox? > [MLLIB] Add Artificial Neural Network (ANN) to Spark > > > Key: SPARK-2352 > URL: https://issues.apache.org/jira/browse/SPARK-2352 > Project: Spark > Issue Type: New Feature > Components: MLlib > Environment: MLLIB code >Reporter: Bert Greevenbosch >Assignee: Bert Greevenbosch > > It would be good if the Machine Learning Library contained Artificial Neural > Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485131#comment-15485131 ] Alessio commented on SPARK-5575: Pretty strange that this post with such hype is still "In progress" after 1 year. If Apache Spark does not (want to?) include your ANNs, can you consider releasing it as an independent toolbox? > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > *Goal:* Implement various types of artificial neural networks > *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) > Having deep learning within Spark's ML library is a question of convenience. > Spark has broad analytic capabilities and it is useful to have deep learning > as one of these tools at hand. Deep learning is a model of choice for several > important modern use-cases, and Spark ML might want to cover them. > Eventually, it is hard to explain, why do we have PCA in ML but don't provide > Autoencoder. To summarize this, Spark should have at least the most widely > used deep learning models, such as fully connected artificial neural network, > convolutional network and autoencoder. Advanced and experimental deep > learning features might reside within packages or as pluggable external > tools. These 3 will provide a comprehensive deep learning set for Spark ML. > We might also include recurrent networks as well. > *Requirements:* > # Extensible API compatible with Spark ML. Basic abstractions such as Neuron, > Layer, Error, Regularization, Forward and Backpropagation etc. should be > implemented as traits or interfaces, so they can be easily extended or > reused. Define the Spark ML API for deep learning. This interface is similar > to the other analytics tools in Spark and supports ML pipelines. 
This makes > deep learning easy to use and plug in into analytics workloads for Spark > users. > # Efficiency. The current implementation of multilayer perceptron in Spark is > less than 2x slower than Caffe, both measured on CPU. The main overhead > sources are JVM and Spark's communication layer. For more details, please > refer to https://github.com/avulanov/ann-benchmark. Having said that, the > efficient implementation of deep learning in Spark should be only few times > slower than in specialized tool. This is very reasonable for the platform > that does much more than deep learning and I believe it is understood by the > community. > # Scalability. Implement efficient distributed training. It relies heavily on > the efficient communication and scheduling mechanisms. The default > implementation is based on Spark. More efficient implementations might > include some external libraries but use the same interface defined. > *Main features:* > # Multilayer perceptron classifier (MLP) > # Autoencoder > # Convolutional neural networks for computer vision. The interface has to > provide few architectures for deep learning that are widely used in practice, > such as AlexNet > *Additional features:* > # Other architectures, such as Recurrent neural network (RNN), Long-short > term memory (LSTM), Restricted boltzmann machine (RBM), deep belief network > (DBN), MLP multivariate regression > # Regularizers, such as L1, L2, drop-out > # Normalizers > # Network customization. The internal API of Spark ANN is designed to be > flexible and can handle different types of layers. However, only a part of > the API is made public. We have to limit the number of public classes in > order to make it simpler to support other languages. This forces us to use > (String or Number) parameters instead of introducing of new public classes. > One of the options to specify the architecture of ANN is to use text > configuration with layer-wise description. 
We have considered using Caffe > format for this. It gives the benefit of compatibility with well known deep > learning tool and simplifies the support of other languages in Spark. > Implementation of a parser for the subset of Caffe format might be the first > step towards the support of general ANN architectures in Spark. > # Hardware specific optimization. One can wrap other deep learning > implementations with this interface allowing users to pick a particular > back-end, e.g. Caffe or TensorFlow, along with the default one. The interface > has to provide few architectures for deep learning that are widely used in > practice, such as AlexNet. The main motivation for using specialized > libraries for deep learning would be to fully take advantage of the hardware > where Spark runs, in particular GPUs. Having the default
[jira] [Created] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams
Jeff Nadler created SPARK-17510: --- Summary: Set Streaming MaxRate Independently For Multiple Streams Key: SPARK-17510 URL: https://issues.apache.org/jira/browse/SPARK-17510 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 2.0.0 Reporter: Jeff Nadler We use multiple DStreams coming from different Kafka topics in a Streaming application. Some settings like maxrate and backpressure enabled/disabled would be better passed as config to KafkaUtils.createStream and KafkaUtils.createDirectStream, instead of setting them in SparkConf. Being able to set a different maxrate for different streams is an important requirement for us; we currently work-around the problem by using one receiver-based stream and one direct stream. We would like to be able to turn on backpressure for only one of the streams as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
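The requested layering — per-stream settings that override global defaults — could be sketched as follows (key names and the helper are hypothetical, not an actual Spark API):

```python
# Hypothetical global defaults, as would come from SparkConf today.
GLOBAL_DEFAULTS = {
    "maxRatePerPartition": 1000,
    "backpressure.enabled": False,
}

def stream_conf(overrides=None):
    """Build an effective per-stream config: global defaults plus overrides."""
    conf = dict(GLOBAL_DEFAULTS)
    conf.update(overrides or {})
    return conf

# One stream throttled hard, the other with backpressure enabled.
receiver_conf = stream_conf({"maxRatePerPartition": 100})
direct_conf = stream_conf({"backpressure.enabled": True})
```

Passing such a mapping to `createStream`/`createDirectStream` would let each DStream tune its own rate limit, which is exactly what the global SparkConf settings prevent today.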
[jira] [Updated] (SPARK-17409) Query in CTAS is Optimized Twice
[ https://issues.apache.org/jira/browse/SPARK-17409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17409: --- Labels: correctness (was: ) > Query in CTAS is Optimized Twice > > > Key: SPARK-17409 > URL: https://issues.apache.org/jira/browse/SPARK-17409 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > Labels: correctness > > The query in CTAS is optimized twice, as reported in the PR: > https://github.com/apache/spark/pull/14797 > {quote} > Some analyzer rules have assumptions on logical plans, optimizer may break > these assumption, we should not pass an optimized query plan into > QueryExecution (will be analyzed again), otherwise we may some weird bugs. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17494) Floor function rounds up during join
[ https://issues.apache.org/jira/browse/SPARK-17494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484952#comment-15484952 ] Josh Rosen commented on SPARK-17494: This also seems to affect Spark 2.0, except there it always returns {{11}} when selecting a floor of {{num}}, while taking the floor of a casted literal seems to work as expected. Specifically, {code} select floor(cast(10.5 as decimal(15,6))), num, floor(num) from a {code} returns {{10, 10.5, 11}}. > Floor function rounds up during join > > > Key: SPARK-17494 > URL: https://issues.apache.org/jira/browse/SPARK-17494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Gokhan Civan > Labels: correctness > > If you create tables as follows: > create table a as select 'A' as str, cast(10.5 as decimal(15,6)) as num; > create table b as select 'A' as str; > Then > select floor(num) from a; > returns 10 > but > select floor(num) from a join b on a.str = b.str; > returns 11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
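As a reference for the semantics both reports expect: `FLOOR` rounds toward negative infinity, so `floor(10.5)` must be 10 in every code path, join or not. A quick Python check of that behavior on exact decimals:

```python
from decimal import Decimal
import math

def sql_floor(x: Decimal) -> int:
    """Floor rounds toward negative infinity; it is not truncation."""
    return math.floor(x)

assert sql_floor(Decimal("10.5")) == 10
assert sql_floor(Decimal("-10.5")) == -11  # floor, not round-half-up
```

The joined query returning 11 therefore behaves like a round-half-up, not a floor, which is why the issue carries the correctness label.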
[jira] [Updated] (SPARK-17509) When wrapping catalyst datatype to Hive data type avoid pattern matching
[ https://issues.apache.org/jira/browse/SPARK-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-17509: Affects Version/s: 2.0.0 Description: Profiling a job, we saw that pattern matching in the wrap function of HiveInspector is consuming around 10% of the time, which can be avoided. A similar change in the unwrap function was made in SPARK-15956. Component/s: SQL Issue Type: Improvement (was: Bug) > When wrapping catalyst datatype to Hive data type avoid pattern matching > > > Key: SPARK-17509 > URL: https://issues.apache.org/jira/browse/SPARK-17509 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Profiling a job, we saw that pattern matching in the wrap function of > HiveInspector is consuming around 10% of the time, which can be avoided. A > similar change in the unwrap function was made in SPARK-15956. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
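The optimization this issue asks for — resolving the wrapper once per column instead of pattern-matching on the data type for every row — can be sketched in Python (names are illustrative; the actual code lives in Spark's `HiveInspectors`):

```python
# Dispatch table built once, instead of an isinstance/match chain that
# runs in the per-row hot path (the ~10% cost seen in profiling).
WRAPPERS = {
    "int": int,
    "string": str,
    "double": float,
}

def wrapper_for(data_type):
    """Resolve the wrapper function once, outside the per-row loop."""
    return WRAPPERS[data_type]

wrap_int = wrapper_for("int")          # resolved once per column
wrapped = [wrap_int(v) for v in ("1", "2", "3")]  # no per-row dispatch
```

This mirrors the SPARK-15956 change on the unwrap side: the type decision moves from per-value to per-column, which is what removes the pattern-matching overhead from the profile.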
[jira] [Updated] (SPARK-17494) Floor function rounds up during join
[ https://issues.apache.org/jira/browse/SPARK-17494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17494: --- Affects Version/s: 2.0.0 > Floor function rounds up during join > > > Key: SPARK-17494 > URL: https://issues.apache.org/jira/browse/SPARK-17494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Gokhan Civan > Labels: correctness > > If you create tables as follows: > create table a as select 'A' as str, cast(10.5 as decimal(15,6)) as num; > create table b as select 'A' as str; > Then > select floor(num) from a; > returns 10 > but > select floor(num) from a join b on a.str = b.str; > returns 11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17494) Floor function rounds up during join
[ https://issues.apache.org/jira/browse/SPARK-17494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17494: --- Labels: correctness (was: ) > Floor function rounds up during join > > > Key: SPARK-17494 > URL: https://issues.apache.org/jira/browse/SPARK-17494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Gokhan Civan > Labels: correctness > > If you create tables as follows: > create table a as select 'A' as str, cast(10.5 as decimal(15,6)) as num; > create table b as select 'A' as str; > Then > select floor(num) from a; > returns 10 > but > select floor(num) from a join b on a.str = b.str; > returns 11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17509) When wrapping catalyst datatype to Hive data type avoid pattern matching
Sital Kedia created SPARK-17509: --- Summary: When wrapping catalyst datatype to Hive data type avoid pattern matching Key: SPARK-17509 URL: https://issues.apache.org/jira/browse/SPARK-17509 Project: Spark Issue Type: Bug Reporter: Sital Kedia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17503. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Fixed for 2.0.1 / 2.1.0 by Sean's PR. > Memory leak in Memory store when unable to cache the whole RDD in memory > > > Key: SPARK-17503 > URL: https://issues.apache.org/jira/browse/SPARK-17503 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Sean Zhong >Assignee: Sean Zhong > Fix For: 2.0.1, 2.1.0 > > Attachments: Screen Shot 2016-09-12 at 4.16.15 PM.png, Screen Shot > 2016-09-12 at 4.34.19 PM.png > > > h2.Problem description: > The following query triggers out of memory error. > {code} > sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > {code} > This is not expected, we should fallback to use disk instead if there is not > enough memory for cache. > Stacktrace: > {code} > scala> sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > [Stage 0:> (0 + 5) / > 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed > 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in > memory! (computed 947.3 MB so far) > 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed > 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in > memory! (computed 1423.7 MB so far) > 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed > java.lang.OutOfMemoryError: Java heap space > Dumping heap to java_pid26528.hprof ... 
> Heap dump file created [6551021666 bytes in 9.876 secs] > 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) > 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, > 55360))] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at
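The expected behavior the ticket describes is: when a partition does not fit in the memory store, release the partially unrolled buffer (keeping it is the leak) and fall back to disk, rather than failing with OutOfMemoryError. A simplified, hypothetical sketch of that fallback shape (a toy store, not Spark's MemoryStore):

```python
class BlockStore:
    """Toy block store with a bounded memory budget and a disk fallback."""

    def __init__(self, memory_budget):
        self.memory_budget = memory_budget
        self.memory = {}  # block_id -> values held in memory
        self.disk = {}    # block_id -> values spilled to "disk"
        self.used = 0

    def put(self, block_id, values):
        size = sum(len(v) for v in values)  # assume each value is a sized chunk
        if self.used + size <= self.memory_budget:
            self.memory[block_id] = values
            self.used += size
        else:
            # Not enough memory: do NOT keep the partially unrolled buffer
            # around (that is the leak); spill the block to disk instead.
            self.disk[block_id] = values

store = BlockStore(memory_budget=10)
store.put("rdd_1_0", ["aaaa", "bbbb"])  # 8 units -> cached in memory
store.put("rdd_1_1", ["cccccccc"])      # 8 more -> exceeds budget, spills
assert "rdd_1_0" in store.memory and "rdd_1_1" in store.disk
```

In the reported bug the equivalent of the partial buffer was retained on failure, so repeated failed puts accumulated until the heap was exhausted.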
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17503: --- Assignee: Sean Zhong > Memory leak in Memory store when unable to cache the whole RDD in memory > > > Key: SPARK-17503 > URL: https://issues.apache.org/jira/browse/SPARK-17503 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Sean Zhong >Assignee: Sean Zhong > Attachments: Screen Shot 2016-09-12 at 4.16.15 PM.png, Screen Shot > 2016-09-12 at 4.34.19 PM.png > > > h2.Problem description: > The following query triggers out of memory error. > {code} > sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > {code} > This is not expected, we should fallback to use disk instead if there is not > enough memory for cache. > Stacktrace: > {code} > scala> sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > [Stage 0:> (0 + 5) / > 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed > 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in > memory! (computed 947.3 MB so far) > 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed > 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in > memory! (computed 1423.7 MB so far) > 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed > java.lang.OutOfMemoryError: Java heap space > Dumping heap to java_pid26528.hprof ... 
> Heap dump file created [6551021666 bytes in 9.876 secs] > 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) > 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, > 55360))] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) > ... 14 more > 16/09/11 17:28:15 ERROR
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484861#comment-15484861 ] Evan Zamir commented on SPARK-17508: Just ran the same snippet of code setting weightCol="" and that runs without error. It's only when I set weightCol=None that I get an error. > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
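The reporter's expectation is that a Python-side {{None}} should mean "parameter unset" (every weight is 1.0), instead of being passed through the Py4J bridge as a null and tripping the NullPointerException in the Scala trainer. A hedged sketch of the defensive normalization a wrapper could perform (illustrative only, not PySpark's actual implementation):

```python
def resolve_weight_col(weight_col):
    """Treat None and empty string as 'no weight column'.

    Returns the column name to use, or None meaning uniform weights of 1.0.
    """
    if weight_col is None or weight_col == "":
        return None
    return weight_col

def row_weight(row, weight_col):
    resolved = resolve_weight_col(weight_col)
    return 1.0 if resolved is None else row[resolved]

row = {"label": 1.0, "weight": 2.5}
assert row_weight(row, "weight") == 2.5
assert row_weight(row, None) == 1.0  # None no longer raises; weight defaults to 1.0
assert row_weight(row, "") == 1.0    # matches the observed weightCol="" behavior
```

This also matches the comment above: {{weightCol=""}} already behaves as "no weights", so normalizing {{None}} to the same path would make the two spellings consistent.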
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484850#comment-15484850 ] Evan Zamir commented on SPARK-17508: Yep, I'm running 2.0.0. You can see in the error messages above that it's running 2.0.0. Can you try running the same code snippet and see if it works for you? > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17503: --- Target Version/s: 2.0.1, 2.1.0 (was: 2.1.0) > Memory leak in Memory store when unable to cache the whole RDD in memory > > > Key: SPARK-17503 > URL: https://issues.apache.org/jira/browse/SPARK-17503 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Sean Zhong > Attachments: Screen Shot 2016-09-12 at 4.16.15 PM.png, Screen Shot > 2016-09-12 at 4.34.19 PM.png > > > h2.Problem description: > The following query triggers out of memory error. > {code} > sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > {code} > This is not expected, we should fallback to use disk instead if there is not > enough memory for cache. > Stacktrace: > {code} > scala> sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > [Stage 0:> (0 + 5) / > 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed > 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in > memory! (computed 947.3 MB so far) > 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed > 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in > memory! (computed 1423.7 MB so far) > 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed > java.lang.OutOfMemoryError: Java heap space > Dumping heap to java_pid26528.hprof ... 
> Heap dump file created [6551021666 bytes in 9.876 secs] > 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) > 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, > 55360))] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) > ... 14 more > 16/09/11 17:28:15 ERROR Executor:
[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class
[ https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484775#comment-15484775 ] Seth Hendrickson commented on SPARK-17471: -- [~yanboliang] Do you have any updates on this? We need to make implementation of the {{compressed}} method for matrices high priority. I can look into implementing it, but I don't want to overlap work. Thanks! > Add compressed method for Matrix class > -- > > Key: SPARK-17471 > URL: https://issues.apache.org/jira/browse/SPARK-17471 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Seth Hendrickson > > Vectors in Spark have a {{compressed}} method which selects either sparse or > dense representation by minimizing storage requirements. Matrices should also > have this method, which is now explicitly needed in {{LogisticRegression}} > since we have implemented multiclass regression. > The compressed method should also give the option to store row major or > column major, and if nothing is specified should select the lower storage > representation (for sparse). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
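For vectors, {{compressed}} compares the storage cost of the dense and sparse encodings and keeps the smaller; the request is the same decision for matrices, plus a row-major/column-major choice for the sparse case. A rough Python sketch of that size comparison (the byte-cost formulas are simplified assumptions, not Spark's exact accounting):

```python
def dense_storage(n_rows, n_cols, bytes_per_value=8):
    # Dense: one 8-byte double per entry.
    return n_rows * n_cols * bytes_per_value

def sparse_storage(n_nonzero, n_major, bytes_per_value=8, bytes_per_index=4):
    # Simplified CSC/CSR cost: values + minor indices + major-axis pointers.
    return n_nonzero * (bytes_per_value + bytes_per_index) + (n_major + 1) * bytes_per_index

def compressed_kind(n_rows, n_cols, n_nonzero):
    """Pick the cheaper representation; for sparse, take the cheaper major order."""
    sparse = min(sparse_storage(n_nonzero, n_cols),   # column-major (CSC)
                 sparse_storage(n_nonzero, n_rows))   # row-major (CSR)
    return "dense" if dense_storage(n_rows, n_cols) <= sparse else "sparse"

assert compressed_kind(1000, 1000, 50) == "sparse"  # 50 nonzeros out of 1M entries
assert compressed_kind(10, 10, 100) == "dense"      # fully populated matrix
```

The last clause of the ticket (default to the lower-storage representation when no major order is specified) corresponds to the inner {{min}} over the two sparse layouts.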
[jira] [Assigned] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17463: Assignee: Shixiong Zhu (was: Apache Spark) > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at 
org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at >
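The {{ConcurrentModificationException}} arises because the heartbeat thread serializes accumulator-update collections while task threads are still mutating them. One standard fix shape is to snapshot the collection under a lock and serialize the snapshot. A small Python analogy of that snapshot-before-serialize pattern ({{pickle}} plus a {{threading.Lock}} stand in for Java serialization; this is not Spark's actual fix):

```python
import pickle
import threading

class AccumulatorUpdates:
    """Mutable update buffer shared between task threads and a heartbeat thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._updates = []

    def add(self, value):
        with self._lock:
            self._updates.append(value)

    def serialize(self):
        # Copy under the lock, then serialize the immutable snapshot; the live
        # list can keep changing without corrupting the serialized bytes.
        with self._lock:
            snapshot = list(self._updates)
        return pickle.dumps(snapshot)

acc = AccumulatorUpdates()
acc.add(1)
acc.add(2)
payload = acc.serialize()
assert pickle.loads(payload) == [1, 2]
```

Serializing {{self._updates}} directly, as the stack trace shows Java doing with the live {{ArrayList}}, races with concurrent {{add}} calls; the snapshot bounds the critical section to a cheap copy.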
[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484769#comment-15484769 ] Apache Spark commented on SPARK-17463: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15063 > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at 
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at
[jira] [Assigned] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17463: Assignee: Apache Spark (was: Shixiong Zhu) > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at 
org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at >
[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484748#comment-15484748 ] Thomas Graves commented on SPARK-17321: --- Not sure I follow this comment. So you are using NM recovery with the recovery path specified? And you saw an error in the Spark shuffle service creating or writing to the DB, but the NM stayed up OK writing its recovery data to the same disk? > YARN shuffle service should use good disk from yarn.nodemanager.local-dirs > -- > > Key: SPARK-17321 > URL: https://issues.apache.org/jira/browse/SPARK-17321 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.2, 2.0.0 >Reporter: yunjiong zhao > > We run Spark on YARN; after enabling Spark dynamic allocation, we noticed some > Spark applications failed randomly due to YarnShuffleService. > From the log I found > {quote} > 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: > Error while initializing Netty pipeline > java.lang.NullPointerException > at > org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) > at > org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) > at > org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) > at > io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) > at > 
io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) > at > io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {quote} > This was caused by the first disk in yarn.nodemanager.local-dirs being broken. > If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose > hundreds of nodes, which is unacceptable. > We have 12 disks in yarn.nodemanager.local-dirs, so why not use other good > disks if the first one is broken? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
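The fallback the reporter asks for can be sketched as follows. This is a hypothetical illustration, not the YARN shuffle service's actual code: instead of always taking the first entry of yarn.nodemanager.local-dirs, scan the configured directories and use the first one that is present and writable.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: pick the first healthy directory from the
// yarn.nodemanager.local-dirs list rather than blindly using dirs[0].
public class LocalDirSelector {
    public static Optional<File> firstUsableDir(List<String> localDirs) {
        return localDirs.stream()
                .map(File::new)
                // A broken mount typically shows up as a missing or
                // unwritable directory; skip it and try the next one.
                .filter(d -> d.isDirectory() && d.canWrite())
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList(
                "/nonexistent/disk1",                    // simulated bad disk
                System.getProperty("java.io.tmpdir"));   // healthy fallback
        System.out.println(firstUsableDir(dirs).map(File::getPath).orElse("none"));
    }
}
```

With 12 configured disks, this would keep the leveldb-backed recovery state available as long as at least one disk is healthy, instead of failing the whole service when the first disk dies.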
[jira] [Assigned] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-17463: Assignee: Shixiong Zhu > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at 
org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at >
[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages
[ https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484732#comment-15484732 ] Matei Zaharia commented on SPARK-17445: --- Sounds good to me. > Reference an ASF page as the main place to find third-party packages > > > Key: SPARK-17445 > URL: https://issues.apache.org/jira/browse/SPARK-17445 > Project: Spark > Issue Type: Improvement >Reporter: Matei Zaharia > > Some comments and docs like > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151 > say to go to spark-packages.org, but since this is a package index > maintained by a third party, it would be better to reference an ASF page that > we can keep updated and own the URL for.
[jira] [Updated] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-16742: Description: We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. Mesosphere design doc: https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 Mesosphere code: https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa was: We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa
[jira] [Updated] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-16742: Description: We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa was: We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa
[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484720#comment-15484720 ] Shixiong Zhu commented on SPARK-17463: -- [~joshrosen] I think we can just leave LongAccum as it is. It's no worse than Spark 1.6. In Spark 1.6, `sum` and `count` are different accumulators and have the same inconsistent issue. We definitely should fix CollectionAccumulator and SetAccumulator. I will submit a PR to add the necessary `synchronized` for them. By the way, I didn't notice that AccumulatorV2 sends the whole object back to the driver. Do you know any special reason? I remember previously we only send the values of accumulators back to driver. > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at >
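The race behind the {{ConcurrentModificationException}} above, and the `synchronized` fix Shixiong proposes for CollectionAccumulator, can be sketched like this. This is a hypothetical illustration with made-up names, not Spark's actual accumulator code: the heartbeat thread serializes the accumulator's backing ArrayList while a task thread is still appending to it, so the safe pattern is to guard mutation and take a consistent snapshot under the same lock before handing anything to ObjectOutputStream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: guard the backing list so the heartbeat thread
// never serializes it mid-mutation.
public class SafeCollectionAccumulator {
    private final List<Long> values = new ArrayList<>();

    // Called from task threads.
    public void add(long v) {
        synchronized (values) {
            values.add(v);
        }
    }

    // Called from the heartbeat thread: copy under the same lock, then
    // serialize the copy instead of the live list.
    public List<Long> snapshotForHeartbeat() {
        synchronized (values) {
            return new ArrayList<>(values);
        }
    }
}
```

Serializing the snapshot is safe even if tasks keep accumulating, because the copy is never mutated after it is taken.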
[jira] [Commented] (SPARK-17424) Dataset job fails from unsound substitution in ScalaReflect
[ https://issues.apache.org/jira/browse/SPARK-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484718#comment-15484718 ] Ryan Blue commented on SPARK-17424: --- I'm adding the above fix in a PR. This fix works for us (the job succeeds) and doesn't change the behavior of cases where the concrete type arguments are known. > Dataset job fails from unsound substitution in ScalaReflect > --- > > Key: SPARK-17424 > URL: https://issues.apache.org/jira/browse/SPARK-17424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Ryan Blue > > I have a job that uses datasets in 1.6.1 and is failing with this error: > {code} > 16/09/02 17:02:56 ERROR Driver ApplicationMaster: User class threw exception: > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > at scala.reflect.internal.Types$SubstMap.(Types.scala:4644) > at scala.reflect.internal.Types$SubstTypeMap.(Types.scala:4761) > at scala.reflect.internal.Types$Type.subst(Types.scala:796) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:321) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:769) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:768) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:768) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:30) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:610) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames$lzycompute(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:415) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:414) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279) > at scala.collection.AbstractIterator.toMap(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.argsMap(TreeNode.scala:416) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:46) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at >
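The guard Ryan Blue describes can be illustrated schematically. This is a hypothetical, simplified sketch (types modeled as strings, not Scala reflection objects): substituting N declared type parameters with a list of actual arguments of a different length is exactly the "Unsound substitution from List(type T, type U) to List()" assertion above, so when no concrete argument is known for every parameter, skip the substitution and keep the declared types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the fix: only substitute when the formal type
// parameters and the actual type arguments line up one-to-one.
public class TypeSubstitution {
    public static List<String> substitute(List<String> declaredTypes,
                                          List<String> typeParams,
                                          List<String> typeArgs) {
        // Unsound case from the stack trace: params = [T, U], args = [].
        // Fall back to the declared types instead of asserting.
        if (typeParams.size() != typeArgs.size()) {
            return declaredTypes;
        }
        Map<String, String> subst = new HashMap<>();
        for (int i = 0; i < typeParams.size(); i++) {
            subst.put(typeParams.get(i), typeArgs.get(i));
        }
        List<String> out = new ArrayList<>();
        for (String t : declaredTypes) {
            out.add(subst.getOrDefault(t, t));
        }
        return out;
    }
}
```

This matches the behavior the comment claims for the PR: cases where the concrete type arguments are known substitute as before, while the unknown-argument case no longer fails.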
[jira] [Commented] (SPARK-17424) Dataset job fails from unsound substitution in ScalaReflect
[ https://issues.apache.org/jira/browse/SPARK-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484715#comment-15484715 ] Apache Spark commented on SPARK-17424: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/15062 > Dataset job fails from unsound substitution in ScalaReflect > --- > > Key: SPARK-17424 > URL: https://issues.apache.org/jira/browse/SPARK-17424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Ryan Blue > > I have a job that uses datasets in 1.6.1 and is failing with this error: > {code} > 16/09/02 17:02:56 ERROR Driver ApplicationMaster: User class threw exception: > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > at scala.reflect.internal.Types$SubstMap.(Types.scala:4644) > at scala.reflect.internal.Types$SubstTypeMap.(Types.scala:4761) > at scala.reflect.internal.Types$Type.subst(Types.scala:796) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:321) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:769) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:768) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:768) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:30) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:610) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames$lzycompute(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:415) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:414) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279) > at scala.collection.AbstractIterator.toMap(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.argsMap(TreeNode.scala:416) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:46) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at
[jira] [Assigned] (SPARK-17424) Dataset job fails from unsound substitution in ScalaReflect
[ https://issues.apache.org/jira/browse/SPARK-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17424: Assignee: (was: Apache Spark) > Dataset job fails from unsound substitution in ScalaReflect > --- > > Key: SPARK-17424 > URL: https://issues.apache.org/jira/browse/SPARK-17424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Ryan Blue > > I have a job that uses datasets in 1.6.1 and is failing with this error: > {code} > 16/09/02 17:02:56 ERROR Driver ApplicationMaster: User class threw exception: > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > at scala.reflect.internal.Types$SubstMap.(Types.scala:4644) > at scala.reflect.internal.Types$SubstTypeMap.(Types.scala:4761) > at scala.reflect.internal.Types$Type.subst(Types.scala:796) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:321) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:769) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:768) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:768) > at > 
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:30) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:610) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames$lzycompute(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:415) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:414) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279) > at scala.collection.AbstractIterator.toMap(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.argsMap(TreeNode.scala:416) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:46) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at
[jira] [Assigned] (SPARK-17424) Dataset job fails from unsound substitution in ScalaReflect
[ https://issues.apache.org/jira/browse/SPARK-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17424: Assignee: Apache Spark > Dataset job fails from unsound substitution in ScalaReflect > --- > > Key: SPARK-17424 > URL: https://issues.apache.org/jira/browse/SPARK-17424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Ryan Blue >Assignee: Apache Spark > > I have a job that uses datasets in 1.6.1 and is failing with this error: > {code} > 16/09/02 17:02:56 ERROR Driver ApplicationMaster: User class threw exception: > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > at scala.reflect.internal.Types$SubstMap.(Types.scala:4644) > at scala.reflect.internal.Types$SubstTypeMap.(Types.scala:4761) > at scala.reflect.internal.Types$Type.subst(Types.scala:796) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:321) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:769) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:768) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:768) > at > 
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:30) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:610) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames$lzycompute(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:415) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:414) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279) > at scala.collection.AbstractIterator.toMap(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.argsMap(TreeNode.scala:416) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:46) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at >
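The assertion in the stack trace above ("Unsound substitution from List(type T, type U) to List()") comes from Scala reflection's invariant that a type-parameter substitution must pair every parameter with an argument. As a minimal, Spark-independent sketch of that invariant (all names here are illustrative, not Scala's or Spark's actual implementation):

```python
def substitute_types(type_refs, from_params, to_args):
    """Replace occurrences of type parameters with concrete arguments.

    Mirrors, in spirit, the invariant asserted by scala.reflect's SubstMap:
    the substitution is only sound when every parameter has a corresponding
    argument. List(type T, type U) -> List() fails exactly this check.
    """
    assert len(from_params) == len(to_args), (
        f"Unsound substitution from {from_params} to {to_args}")
    mapping = dict(zip(from_params, to_args))
    # Leave anything that is not a known type parameter untouched.
    return [mapping.get(t, t) for t in type_refs]

# Sound: both type parameters are given arguments.
print(substitute_types(["T", "U", "Int"], ["T", "U"], ["String", "Long"]))
```

In the reported job, `getConstructorParameters` apparently reaches this code with an empty argument list for a parameterized type, which trips the assertion.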
[jira] [Updated] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-16742: Description: We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa was:We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484704#comment-15484704 ] Sean Owen commented on SPARK-17508: --- This looks a lot like the problem solved in SPARK-14931 / https://github.com/apache/spark/pull/12816 -- you sure it's an issue in 2.0.0? > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1
[jira] [Assigned] (SPARK-14818) Move sketch and mllibLocal out from mima exclusion
[ https://issues.apache.org/jira/browse/SPARK-14818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-14818: -- Assignee: Josh Rosen > Move sketch and mllibLocal out from mima exclusion > -- > > Key: SPARK-14818 > URL: https://issues.apache.org/jira/browse/SPARK-14818 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Blocker > > In SparkBuild.scala, we exclude sketch and mllibLocal from mima check (see > the definition of mimaProjects). After we release 2.0, we should move them > out from this list. So, we can check binary compatibility for them
[jira] [Assigned] (SPARK-14818) Move sketch and mllibLocal out from mima exclusion
[ https://issues.apache.org/jira/browse/SPARK-14818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14818: Assignee: Apache Spark (was: Josh Rosen) > Move sketch and mllibLocal out from mima exclusion > -- > > Key: SPARK-14818 > URL: https://issues.apache.org/jira/browse/SPARK-14818 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > In SparkBuild.scala, we exclude sketch and mllibLocal from mima check (see > the definition of mimaProjects). After we release 2.0, we should move them > out from this list. So, we can check binary compatibility for them
[jira] [Assigned] (SPARK-14818) Move sketch and mllibLocal out from mima exclusion
[ https://issues.apache.org/jira/browse/SPARK-14818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14818: Assignee: Josh Rosen (was: Apache Spark) > Move sketch and mllibLocal out from mima exclusion > -- > > Key: SPARK-14818 > URL: https://issues.apache.org/jira/browse/SPARK-14818 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Blocker > > In SparkBuild.scala, we exclude sketch and mllibLocal from mima check (see > the definition of mimaProjects). After we release 2.0, we should move them > out from this list. So, we can check binary compatibility for them
[jira] [Commented] (SPARK-14818) Move sketch and mllibLocal out from mima exclusion
[ https://issues.apache.org/jira/browse/SPARK-14818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484691#comment-15484691 ] Apache Spark commented on SPARK-14818: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15061 > Move sketch and mllibLocal out from mima exclusion > -- > > Key: SPARK-14818 > URL: https://issues.apache.org/jira/browse/SPARK-14818 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Blocker > > In SparkBuild.scala, we exclude sketch and mllibLocal from mima check (see > the definition of mimaProjects). After we release 2.0, we should move them > out from this list. So, we can check binary compatibility for them
[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484629#comment-15484629 ] Gang Wu commented on SPARK-17477: - [~hyukjin.kwon] I agree with you. But both issues are targeting at parquet data sources. I think it applies to all data sources. > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as its type while hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When using SparkSession to read a Hive table which is stored as parquet > files. If there has been a schema evolution from int to long of a column. > There are some old parquet files use int for the column while some new > parquet files use long. In Hive metastore, the type is long (bigint). > Therefore when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) > at > 
org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) > at > org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) > at >
[jira] [Created] (SPARK-17508) Setting weightCol to None in ML library causes an error
Evan Zamir created SPARK-17508: -- Summary: Setting weightCol to None in ML library causes an error Key: SPARK-17508 URL: https://issues.apache.org/jira/browse/SPARK-17508 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Evan Zamir The following code runs without error: {code} spark = SparkSession.builder.appName('WeightBug').getOrCreate() df = spark.createDataFrame( [ (1.0, 1.0, Vectors.dense(1.0)), (0.0, 1.0, Vectors.dense(-1.0)) ], ["label", "weight", "features"]) lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") model = lr.fit(df) {code} My expectation from reading the documentation is that setting weightCol=None should treat all weights as 1.0 (regardless of whether a column exists). However, the same code with weightCol set to None causes the following errors: Traceback (most recent call last): File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in model = lr.fit(df) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 64, in fit return self._fit(dataset) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 213, in _fit java_model = self._fit_java(dataset) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 210, in _fit_java return self._java_obj.fit(dataset._jdf) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
: java.lang.NullPointerException at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:211) at java.lang.Thread.run(Thread.java:745) Process finished with exit code 1
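The semantics the reporter expects — a `None` weight column meaning "all weights are 1.0" — can be sketched in plain Python, independent of Spark (names here are illustrative, not MLlib internals; in practice, omitting the `weightCol` argument entirely sidesteps the NullPointerException):

```python
def resolve_weights(rows, weight_col=None):
    """Return per-row weights; a None weight column means unit weights.

    Pure-Python sketch of the documented behavior of weightCol=None:
    every row is weighted 1.0 regardless of what columns exist.
    """
    if weight_col is None:
        return [1.0] * len(rows)
    return [row[weight_col] for row in rows]

rows = [{"label": 1.0, "weight": 2.0}, {"label": 0.0, "weight": 3.0}]
print(resolve_weights(rows))            # all ones: column ignored
print(resolve_weights(rows, "weight"))  # explicit column is read
```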
[jira] [Updated] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu updated SPARK-17477: Target Version/s: (was: 2.1.0) > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as its type while hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When using SparkSession to read a Hive table which is stored as parquet > files. If there has been a schema evolution from int to long of a column. > There are some old parquet files use int for the column while some new > parquet files use long. In Hive metastore, the type is long (bigint). > Therefore when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) > at > org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) > at > 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209) > ... 22 more > {quote} > But this kind of schema evolution (int =>
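The ClassCastException above boils down to decoding with the table's declared type (bigint) instead of the type the file was written with (int32). A Spark-independent sketch of the widening a reader must do, using only `struct` (the format names are illustrative):

```python
import struct

def read_stored_int(raw, stored_type):
    """Decode a value using the type it was *written* with, then widen.

    Old Parquet files store int32 while the Hive metastore declares bigint;
    a correct reader decodes with the writer's physical type and upcasts,
    rather than assuming the table's logical type.
    """
    fmt = {"int32": "<i", "int64": "<q"}[stored_type]
    (value,) = struct.unpack(fmt, raw)
    return int(value)  # Python ints are unbounded, so widening is implicit

old_file_bytes = struct.pack("<i", 42)             # written when the column was int
new_file_bytes = struct.pack("<q", 7_000_000_000)  # written after the evolution
print(read_stored_int(old_file_bytes, "int32"))
print(read_stored_int(new_file_bytes, "int64"))
```

The failure mode in the issue is the inverse: the generated converter calls `setInt` on a row slot already typed as long, so no such widening step ever runs.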
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484602#comment-15484602 ] Chris Parmer commented on SPARK-15406: -- For my team, we are just primarily interested in the SQL / table-like interface to Kafka with time indexing. Thanks for the discussion! > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
[jira] [Assigned] (SPARK-17507) check weight vector size in ANN
[ https://issues.apache.org/jira/browse/SPARK-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17507: Assignee: (was: Apache Spark) > check weight vector size in ANN > --- > > Key: SPARK-17507 > URL: https://issues.apache.org/jira/browse/SPARK-17507 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Check the size of the user-specified weight vector in ANN, > and throw an exception promptly if it is wrong.
[jira] [Assigned] (SPARK-17507) check weight vector size in ANN
[ https://issues.apache.org/jira/browse/SPARK-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17507: Assignee: Apache Spark > check weight vector size in ANN > --- > > Key: SPARK-17507 > URL: https://issues.apache.org/jira/browse/SPARK-17507 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > Check the size of the user-specified weight vector in ANN, > and throw an exception promptly if it is wrong.
[jira] [Commented] (SPARK-17507) check weight vector size in ANN
[ https://issues.apache.org/jira/browse/SPARK-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484586#comment-15484586 ] Apache Spark commented on SPARK-17507: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/15060 > check weight vector size in ANN > --- > > Key: SPARK-17507 > URL: https://issues.apache.org/jira/browse/SPARK-17507 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Check the size of the user-specified weight vector in ANN, > and throw an exception promptly if it is wrong.
[jira] [Created] (SPARK-17507) check weight vector size in ANN
Weichen Xu created SPARK-17507: -- Summary: check weight vector size in ANN Key: SPARK-17507 URL: https://issues.apache.org/jira/browse/SPARK-17507 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Weichen Xu Check the size of the user-specified weight vector in ANN, and throw an exception promptly if it is wrong.
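The proposed check can be sketched as follows — a fully connected feed-forward network with layer sizes `layers` needs `(layers[i] + 1) * layers[i+1]` parameters per layer (one weight per input plus a bias per output unit). This is an illustrative sketch, not MLlib's actual validation code:

```python
def expected_weight_count(layers):
    """Total parameter count for a fully connected feed-forward ANN."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

def validate_initial_weights(layers, weights):
    """Fail fast, before training starts, on a mismatched weight vector."""
    expected = expected_weight_count(layers)
    if len(weights) != expected:
        raise ValueError(
            f"initial weight vector has {len(weights)} values, "
            f"but topology {layers} requires {expected}")

layers = [4, 5, 2]                           # e.g. input, hidden, output sizes
validate_initial_weights(layers, [0.0] * 37)  # (4+1)*5 + (5+1)*2 = 37, OK
```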
[jira] [Commented] (SPARK-4633) Support gzip in spark.compression.io.codec
[ https://issues.apache.org/jira/browse/SPARK-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484488#comment-15484488 ] Adam Roberts commented on SPARK-4633: - Very interested in this, and I know Nasser Ebrahim is also (full disclosure: we both work for IBM). https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/ shows promising results. It would be interesting to code up a quick prototype (perhaps based on the pull request here) and see what performance difference we can gain; it looks like Takeshi has done the starting work for us > Support gzip in spark.compression.io.codec > -- > > Key: SPARK-4633 > URL: https://issues.apache.org/jira/browse/SPARK-4633 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Reporter: Takeshi Yamamuro >Priority: Trivial > > gzip is widely used in other frameworks such as Hadoop MapReduce and Tez, and > I also think that gzip is more stable than other codecs in terms of both > performance and space overheads. > I have one open question: current Spark configurations have a block size option > for each codec (spark.io.compression.[gzip|lz4|snappy].block.size). > As the number of codecs increases, the configurations gain more options, and > I think that this is somewhat complicated for non-expert users. > To mitigate this, my thought is as follows: > the three configurations are replaced with a single option for block size > (spark.io.compression.block.size). Then, 'Meaning' in the configurations > will describe "This option affects gzip, lz4, and snappy. > Block size (in bytes) used in compression, in the case when these compression > codecs are used. Lowering...".
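The consolidation Takeshi proposes could resolve a codec's block size along these lines. This is only a sketch of the proposal, not Spark's actual configuration code; the `blockSize` helper and the fallback order (per-codec key first, then the proposed unified key, then a default) are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

public class BlockSizeConf {
    // Resolve the block size for a codec: a per-codec option wins if set,
    // otherwise fall back to the proposed unified option, then a default.
    public static int blockSize(Map<String, String> conf, String codec, int dflt) {
        String perCodec = conf.get("spark.io.compression." + codec + ".block.size");
        if (perCodec != null) return Integer.parseInt(perCodec);
        String unified = conf.get("spark.io.compression.block.size");
        if (unified != null) return Integer.parseInt(unified);
        return dflt;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.io.compression.block.size", "65536");
        // One unified option now covers gzip, lz4, and snappy.
        System.out.println(blockSize(conf, "gzip", 32768));
        System.out.println(blockSize(conf, "lz4", 32768));
    }
}
```

Keeping the per-codec key as an override would let the single option simplify the common case without breaking users who already tuned an individual codec.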
[jira] [Updated] (SPARK-17506) Improve the check double values equality rule
[ https://issues.apache.org/jira/browse/SPARK-17506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17506: -- Priority: Minor (was: Critical) > Improve the check double values equality rule > - > > Key: SPARK-17506 > URL: https://issues.apache.org/jira/browse/SPARK-17506 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo >Priority: Minor > > In `ExpressionEvalHelper`, we check the equality between two double values by > comparing whether the expected value is within the range [target - tolerance, > target + tolerance], but this can cause a false negative when the compared > numerics are very large. > For example: > {code} > val1 = 1.6358558070241E306 > val2 = 1.6358558070240974E306 > ExpressionEvalHelper.compareResults(val1, val2) > false > {code} > In fact, val1 and val2 are the same value, just with different precision; we should > tolerate this case.
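The failure mode is easy to reproduce with plain doubles: an absolute tolerance that works near 1.0 is vanishingly small next to values around 1e306. The `relEqual` check below is a sketch of the usual fix (scale the tolerance by the magnitude of the operands), not the actual `ExpressionEvalHelper` change:

```java
public class DoubleCompare {
    // Absolute tolerance: |expected - actual| <= tol.
    // Rejects near-identical huge values, because even one ulp at 1e306
    // is astronomically larger than any fixed small tolerance.
    public static boolean absEqual(double expected, double actual, double tol) {
        return Math.abs(expected - actual) <= tol;
    }

    // Relative tolerance: the allowed difference scales with the
    // magnitude of the values being compared.
    public static boolean relEqual(double expected, double actual, double relTol) {
        if (expected == actual) return true; // also covers exact zeros
        double scale = Math.max(Math.abs(expected), Math.abs(actual));
        return Math.abs(expected - actual) <= relTol * scale;
    }

    public static void main(String[] args) {
        double val1 = 1.6358558070241E306;
        double val2 = 1.6358558070240974E306;
        // The absolute check reports "not equal"...
        System.out.println(absEqual(val1, val2, 1e-8)); // false
        // ...while the relative check tolerates the precision difference.
        System.out.println(relEqual(val1, val2, 1e-8)); // true
    }
}
```

Here val1 - val2 is about 2.6e291, far larger than any fixed tolerance, yet only about 1.6e-15 of the values themselves, so a relative check accepts them.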
[jira] [Assigned] (SPARK-17506) Improve the check double values equality rule
[ https://issues.apache.org/jira/browse/SPARK-17506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17506: Assignee: Apache Spark > Improve the check double values equality rule > - > > Key: SPARK-17506 > URL: https://issues.apache.org/jira/browse/SPARK-17506 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo >Assignee: Apache Spark >Priority: Critical > > In `ExpressionEvalHelper`, we check the equality between two double values by > comparing whether the expected value is within the range [target - tolerance, > target + tolerance], but this can cause a false negative when the compared > numerics are very large. > For example: > {code} > val1 = 1.6358558070241E306 > val2 = 1.6358558070240974E306 > ExpressionEvalHelper.compareResults(val1, val2) > false > {code} > In fact, val1 and val2 are the same value, just with different precision; we should > tolerate this case.
[jira] [Commented] (SPARK-17506) Improve the check double values equality rule
[ https://issues.apache.org/jira/browse/SPARK-17506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484443#comment-15484443 ] Apache Spark commented on SPARK-17506: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/15059 > Improve the check double values equality rule > - > > Key: SPARK-17506 > URL: https://issues.apache.org/jira/browse/SPARK-17506 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo >Priority: Critical > > In `ExpressionEvalHelper`, we check the equality between two double values by > comparing whether the expected value is within the range [target - tolerance, > target + tolerance], but this can cause a false negative when the compared > numerics are very large. > For example: > {code} > val1 = 1.6358558070241E306 > val2 = 1.6358558070240974E306 > ExpressionEvalHelper.compareResults(val1, val2) > false > {code} > In fact, val1 and val2 are the same value, just with different precision; we should > tolerate this case.
[jira] [Assigned] (SPARK-17506) Improve the check double values equality rule
[ https://issues.apache.org/jira/browse/SPARK-17506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17506: Assignee: (was: Apache Spark) > Improve the check double values equality rule > - > > Key: SPARK-17506 > URL: https://issues.apache.org/jira/browse/SPARK-17506 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo >Priority: Critical > > In `ExpressionEvalHelper`, we check the equality between two double values by > comparing whether the expected value is within the range [target - tolerance, > target + tolerance], but this can cause a false negative when the compared > numerics are very large. > For example: > {code} > val1 = 1.6358558070241E306 > val2 = 1.6358558070240974E306 > ExpressionEvalHelper.compareResults(val1, val2) > false > {code} > In fact, val1 and val2 are the same value, just with different precision; we should > tolerate this case.