[jira] [Updated] (SPARK-17517) Improve generated Code for BroadcastHashJoinExec
[ https://issues.apache.org/jira/browse/SPARK-17517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-17517: - Target Version/s: (was: 2.1.0) External issue ID: (was: 12795) Fix Version/s: (was: 2.1.0) Description: For the current `BroadcastHashJoinExec`, when the join key is not unique we generate join code like this: {code:title=processNext.java|borderStyle=solid} while (matches.hasNext) { matched = matches.next check and read stream side row fields check and read build side row fields reset result row write stream side row fields to result row write build side row fields to result row append(result row) } {code} For some cases, we don't need to check/read/write the stream side repeatedly in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && all left side fields are fixed length`, and so on. We may generate the code as below: {code:title=processNext.java|borderStyle=solid} check and read stream side row fields reset result row write stream side row fields to result row while (matches.hasNext) { matched = matches.next check and read build side row fields write build side row fields to result row append(result row) } {code} was: For the current `BroadcastHashJoinExec`, when the join key is not unique we generate join code like this: {code:title=Bar.java|borderStyle=solid} while (matches.hasNext) { matched = matches.next check and read stream side row fields check and read build side row fields reset result row write stream side row fields to result row write build side row fields to result row append(result row) } {code} For some cases, we don't need to check/read/write the stream side repeatedly in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && all left side fields are fixed length`, and so on.
We may generate the code as below: ```java check and read stream side row fields reset result row write stream side row fields to result row while (matches.hasNext) { matched = matches.next check and read build side row fields write build side row fields to result row append(result row) } ``` > Improve generated Code for BroadcastHashJoinExec > > > Key: SPARK-17517 > URL: https://issues.apache.org/jira/browse/SPARK-17517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kent Yao > > For the current `BroadcastHashJoinExec`, when the join key is not > unique we generate join code like this: > {code:title=processNext.java|borderStyle=solid} > while (matches.hasNext) { > matched = matches.next > check and read stream side row fields > check and read build side row fields > reset result row > write stream side row fields to result row > write build side row fields to result row > append(result row) > } > {code} > For some cases, we don't need to check/read/write the stream side repeatedly > in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && > all left side fields are fixed length`, and so on. We may generate the code as > below: > {code:title=processNext.java|borderStyle=solid} > check and read stream side row fields > reset result row > write stream side row fields to result row > while (matches.hasNext) > { > matched = matches.next > check and read build side row fields > write build side row fields to result row > append(result row) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
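The loop restructuring proposed above can be sketched in plain Java (a minimal sketch with invented names, not Spark's actual generated code): for join variants where the stream-side columns cannot change per match, the stream-side read/write is loop-invariant for each stream row and can be hoisted out of the match loop, leaving only the build-side work inside it.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch of the proposed codegen shape (not Spark's real output):
// the stream-side write happens once per stream row; only the build side is
// read and written per match.
public class HoistedJoinSketch {

    // Stand-in for the generated result-row buffer.
    static final class ResultRow {
        long streamField;
        long buildField;

        ResultRow copyOf() {
            ResultRow r = new ResultRow();
            r.streamField = streamField;
            r.buildField = buildField;
            return r;
        }
    }

    // Emits one joined row per build-side match for a single stream row.
    static List<ResultRow> joinOneStreamRow(long streamField, Iterator<Long> matches) {
        List<ResultRow> out = new ArrayList<>();
        ResultRow result = new ResultRow();
        // Hoisted: "check and read stream side row fields" and
        // "write stream side row fields to result row" run once.
        result.streamField = streamField;
        while (matches.hasNext()) {
            long matched = matches.next();
            // Per match: only the build side changes.
            result.buildField = matched;
            out.add(result.copyOf()); // append(result row)
        }
        return out;
    }
}
```

The same output is produced either way; the point is purely to avoid repeating the loop-invariant stream-side work on every match.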
[jira] [Assigned] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile
[ https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17123: Assignee: (was: Apache Spark) > Performing set operations that combine string and date / timestamp columns > may result in generated projection code which doesn't compile > > > Key: SPARK-17123 > URL: https://issues.apache.org/jira/browse/SPARK-17123 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Priority: Minor > > The following example program causes SpecificSafeProjection code generation > to produce Java code which doesn't compile: > {code} > import org.apache.spark.sql.types._ > spark.sql("set spark.sql.codegen.fallback=false") > val dateDF = spark.createDataFrame(sc.parallelize(Seq(Row(new > java.sql.Date(0)))), StructType(StructField("value", DateType) :: Nil)) > val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF > dateDF.union(longDF).collect() > {code} > This fails at runtime with the following error: > {code} > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 28, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates > are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private org.apache.spark.sql.types.StructType schema; > /* 011 */ > /* 012 */ > /* 013 */ public SpecificSafeProjection(Object[] references) { > /* 014 */ this.references = references; > /*
015 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 016 */ > /* 017 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 018 */ } > /* 019 */ > /* 020 */ public java.lang.Object apply(java.lang.Object _i) { > /* 021 */ InternalRow i = (InternalRow) _i; > /* 022 */ > /* 023 */ values = new Object[1]; > /* 024 */ > /* 025 */ boolean isNull2 = i.isNullAt(0); > /* 026 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 027 */ boolean isNull1 = isNull2; > /* 028 */ final java.sql.Date value1 = isNull1 ? null : > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2); > /* 029 */ isNull1 = value1 == null; > /* 030 */ if (isNull1) { > /* 031 */ values[0] = null; > /* 032 */ } else { > /* 033 */ values[0] = value1; > /* 034 */ } > /* 035 */ > /* 036 */ final org.apache.spark.sql.Row value = new > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, > schema); > /* 037 */ if (false) { > /* 038 */ mutableRow.setNullAt(0); > /* 039 */ } else { > /* 040 */ > /* 041 */ mutableRow.update(0, value); > /* 042 */ } > /* 043 */ > /* 044 */ return mutableRow; > /* 045 */ } > /* 046 */ } > {code} > Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the > generated code tries to call it with a UTF8String while the method expects an > int instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
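For context on the type mismatch reported above: Catalyst stores DATE values internally as an int counting days since the Unix epoch, which is why toJavaDate takes an int. A simplified stand-in (ignoring the time-zone handling the real DateTimeUtils performs) makes clear why passing a UTF8String can never typecheck:

```java
import java.sql.Date;
import java.util.concurrent.TimeUnit;

// Simplified stand-in for DateTimeUtils.toJavaDate(int): the conversion is
// arithmetic on a days-since-epoch int, not a string parse, so the generated
// call with a UTF8String argument cannot compile.
public class ToJavaDateSketch {
    static Date toJavaDate(int daysSinceEpoch) {
        // java.sql.Date takes milliseconds since the epoch.
        return new Date(TimeUnit.DAYS.toMillis(daysSinceEpoch));
    }
}
```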
[jira] [Commented] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile
[ https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486354#comment-15486354 ] Apache Spark commented on SPARK-17123: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/15072 > Performing set operations that combine string and date / timestamp columns > may result in generated projection code which doesn't compile > > > Key: SPARK-17123 > URL: https://issues.apache.org/jira/browse/SPARK-17123 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Priority: Minor > > The following example program causes SpecificSafeProjection code generation > to produce Java code which doesn't compile: > {code} > import org.apache.spark.sql.types._ > spark.sql("set spark.sql.codegen.fallback=false") > val dateDF = spark.createDataFrame(sc.parallelize(Seq(Row(new > java.sql.Date(0, StructType(StructField("value", DateType) :: Nil)) > val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF > dateDF.union(longDF).collect() > {code} > This fails at runtime with the following error: > {code} > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 28, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates > are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private org.apache.spark.sql.types.StructType schema; > /* 011 */ > /* 012 */ > 
/* 013 */ public SpecificSafeProjection(Object[] references) { > /* 014 */ this.references = references; > /* 015 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 016 */ > /* 017 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 018 */ } > /* 019 */ > /* 020 */ public java.lang.Object apply(java.lang.Object _i) { > /* 021 */ InternalRow i = (InternalRow) _i; > /* 022 */ > /* 023 */ values = new Object[1]; > /* 024 */ > /* 025 */ boolean isNull2 = i.isNullAt(0); > /* 026 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 027 */ boolean isNull1 = isNull2; > /* 028 */ final java.sql.Date value1 = isNull1 ? null : > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2); > /* 029 */ isNull1 = value1 == null; > /* 030 */ if (isNull1) { > /* 031 */ values[0] = null; > /* 032 */ } else { > /* 033 */ values[0] = value1; > /* 034 */ } > /* 035 */ > /* 036 */ final org.apache.spark.sql.Row value = new > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, > schema); > /* 037 */ if (false) { > /* 038 */ mutableRow.setNullAt(0); > /* 039 */ } else { > /* 040 */ > /* 041 */ mutableRow.update(0, value); > /* 042 */ } > /* 043 */ > /* 044 */ return mutableRow; > /* 045 */ } > /* 046 */ } > {code} > Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the > generated code tries to call it with a UTF8String while the method expects an > int instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17123) Performing set operations that combine string and date / timestamp columns may result in generated projection code which doesn't compile
[ https://issues.apache.org/jira/browse/SPARK-17123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17123: Assignee: Apache Spark > Performing set operations that combine string and date / timestamp columns > may result in generated projection code which doesn't compile > > > Key: SPARK-17123 > URL: https://issues.apache.org/jira/browse/SPARK-17123 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Minor > > The following example program causes SpecificSafeProjection code generation > to produce Java code which doesn't compile: > {code} > import org.apache.spark.sql.types._ > spark.sql("set spark.sql.codegen.fallback=false") > val dateDF = spark.createDataFrame(sc.parallelize(Seq(Row(new > java.sql.Date(0, StructType(StructField("value", DateType) :: Nil)) > val longDF = sc.parallelize(Seq(new java.sql.Date(0).toString)).toDF > dateDF.union(longDF).collect() > {code} > This fails at runtime with the following error: > {code} > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 28, Column 107: No applicable constructor/method found > for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates > are: "public static java.sql.Date > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(int)" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private org.apache.spark.sql.types.StructType schema; > /* 011 */ > /* 012 */ > /* 013 */ public SpecificSafeProjection(Object[] references) { > /* 014 */ this.references = 
references; > /* 015 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 016 */ > /* 017 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 018 */ } > /* 019 */ > /* 020 */ public java.lang.Object apply(java.lang.Object _i) { > /* 021 */ InternalRow i = (InternalRow) _i; > /* 022 */ > /* 023 */ values = new Object[1]; > /* 024 */ > /* 025 */ boolean isNull2 = i.isNullAt(0); > /* 026 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 027 */ boolean isNull1 = isNull2; > /* 028 */ final java.sql.Date value1 = isNull1 ? null : > org.apache.spark.sql.catalyst.util.DateTimeUtils.toJavaDate(value2); > /* 029 */ isNull1 = value1 == null; > /* 030 */ if (isNull1) { > /* 031 */ values[0] = null; > /* 032 */ } else { > /* 033 */ values[0] = value1; > /* 034 */ } > /* 035 */ > /* 036 */ final org.apache.spark.sql.Row value = new > org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema(values, > schema); > /* 037 */ if (false) { > /* 038 */ mutableRow.setNullAt(0); > /* 039 */ } else { > /* 040 */ > /* 041 */ mutableRow.update(0, value); > /* 042 */ } > /* 043 */ > /* 044 */ return mutableRow; > /* 045 */ } > /* 046 */ } > {code} > Here, the invocation of {{DateTimeUtils.toJavaDate}} is incorrect because the > generated code tries to call it with a UTF8String while the method expects an > int instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17517) Improve generated Code for BroadcastHashJoinExec
[ https://issues.apache.org/jira/browse/SPARK-17517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-17517: - Description: For the current `BroadcastHashJoinExec`, when the join key is not unique we generate join code like this: {code:title=Bar.java|borderStyle=solid} while (matches.hasNext) { matched = matches.next check and read stream side row fields check and read build side row fields reset result row write stream side row fields to result row write build side row fields to result row append(result row) } {code} For some cases, we don't need to check/read/write the stream side repeatedly in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && all left side fields are fixed length`, and so on. We may generate the code as below: ```java check and read stream side row fields reset result row write stream side row fields to result row while (matches.hasNext) { matched = matches.next check and read build side row fields write build side row fields to result row append(result row) } ``` was: For the current `BroadcastHashJoinExec`, when the join key is not unique we generate join code like this: ```java while (matches.hasNext) matched = matches.next check and read stream side row fields check and read build side row fields reset result row write stream side row fields to result row write build side row fields to result row ``` For some cases, we don't need to check/read/write the stream side repeatedly in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && all left side fields are fixed length`, and so on.
We may generate the code as below: ```java check and read stream side row fields reset result row write stream side row fields to result row while (matches.hasNext) matched = matches.next check and read build side row fields write build side row fields to result row ``` > Improve generated Code for BroadcastHashJoinExec > > > Key: SPARK-17517 > URL: https://issues.apache.org/jira/browse/SPARK-17517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kent Yao > Fix For: 2.1.0 > > > For the current `BroadcastHashJoinExec`, when the join key is not > unique we generate join code like this: > {code:title=Bar.java|borderStyle=solid} > while (matches.hasNext) { > matched = matches.next > check and read stream side row fields > check and read build side row fields > reset result row > write stream side row fields to result row > write build side row fields to result row > append(result row) > } > {code} > For some cases, we don't need to check/read/write the stream side repeatedly > in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && > all left side fields are fixed length`, and so on. We may generate the code as > below: > ```java > check and read stream side row fields > reset result row > write stream side row fields to result row > while (matches.hasNext) > { > matched = matches.next > check and read build side row fields > write build side row fields to result row > append(result row) > } > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17517) Improve generated Code for BroadcastHashJoinExec
[ https://issues.apache.org/jira/browse/SPARK-17517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486335#comment-15486335 ] Apache Spark commented on SPARK-17517: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/15071 > Improve generated Code for BroadcastHashJoinExec > > > Key: SPARK-17517 > URL: https://issues.apache.org/jira/browse/SPARK-17517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kent Yao > Fix For: 2.1.0 > > > For current `BroadcastHashJoinExec`, we generate join code for key is not > unique like this: > ```java > while (matches.hasnext) > matched = matches.next > check and read stream side row fields > check and read build side row fieldes > reset result row > write stream side row fields to result row > write stream side row fields to result row > ``` > For some cases, we don't need to check/read/write the steam side repeatedly > in such while circle, e.g. `Inner Join with BuildRight`, or `BuildLeft && > all left side fields are fixed length` and so on. we may generate the code as > below: > ```java > check and read stream side row fields > reset result row > write stream side row fields to result row > while (matches.hasnext) > matched = matches.next > check and read build side row fieldes > write stream side row fields to result row > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17517) Improve generated Code for BroadcastHashJoinExec
[ https://issues.apache.org/jira/browse/SPARK-17517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17517: Assignee: (was: Apache Spark) > Improve generated Code for BroadcastHashJoinExec > > > Key: SPARK-17517 > URL: https://issues.apache.org/jira/browse/SPARK-17517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kent Yao > Fix For: 2.1.0 > > > For current `BroadcastHashJoinExec`, we generate join code for key is not > unique like this: > ```java > while (matches.hasnext) > matched = matches.next > check and read stream side row fields > check and read build side row fieldes > reset result row > write stream side row fields to result row > write stream side row fields to result row > ``` > For some cases, we don't need to check/read/write the steam side repeatedly > in such while circle, e.g. `Inner Join with BuildRight`, or `BuildLeft && > all left side fields are fixed length` and so on. we may generate the code as > below: > ```java > check and read stream side row fields > reset result row > write stream side row fields to result row > while (matches.hasnext) > matched = matches.next > check and read build side row fieldes > write stream side row fields to result row > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17517) Improve generated Code for BroadcastHashJoinExec
[ https://issues.apache.org/jira/browse/SPARK-17517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17517: Assignee: Apache Spark > Improve generated Code for BroadcastHashJoinExec > > > Key: SPARK-17517 > URL: https://issues.apache.org/jira/browse/SPARK-17517 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Kent Yao >Assignee: Apache Spark > Fix For: 2.1.0 > > > For current `BroadcastHashJoinExec`, we generate join code for key is not > unique like this: > ```java > while (matches.hasnext) > matched = matches.next > check and read stream side row fields > check and read build side row fieldes > reset result row > write stream side row fields to result row > write stream side row fields to result row > ``` > For some cases, we don't need to check/read/write the steam side repeatedly > in such while circle, e.g. `Inner Join with BuildRight`, or `BuildLeft && > all left side fields are fixed length` and so on. we may generate the code as > below: > ```java > check and read stream side row fields > reset result row > write stream side row fields to result row > while (matches.hasnext) > matched = matches.next > check and read build side row fieldes > write stream side row fields to result row > ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17517) Improve generated Code for BroadcastHashJoinExec
Kent Yao created SPARK-17517: Summary: Improve generated Code for BroadcastHashJoinExec Key: SPARK-17517 URL: https://issues.apache.org/jira/browse/SPARK-17517 Project: Spark Issue Type: Improvement Components: SQL Reporter: Kent Yao Fix For: 2.1.0 For the current `BroadcastHashJoinExec`, when the join key is not unique we generate join code like this: ```java while (matches.hasNext) matched = matches.next check and read stream side row fields check and read build side row fields reset result row write stream side row fields to result row write build side row fields to result row ``` For some cases, we don't need to check/read/write the stream side repeatedly in such a while loop, e.g. `Inner Join with BuildRight`, or `BuildLeft && all left side fields are fixed length`, and so on. We may generate the code as below: ```java check and read stream side row fields reset result row write stream side row fields to result row while (matches.hasNext) matched = matches.next check and read build side row fields write build side row fields to result row ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17502) Multiple Bugs in DDL Statements on Temporary Views
[ https://issues.apache.org/jira/browse/SPARK-17502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-17502: Description: - When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for partition-related ALTER TABLE commands. However, it always reports a confusing error message. For example, {noformat} Partition spec is invalid. The spec (a, b) must match the partition spec () defined in table '`testview`'; {noformat} - When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for `ALTER TABLE ... UNSET TBLPROPERTIES`. However, it reports a missing table property instead. For example, {noformat} Attempted to unset non-existent property 'p' in table '`testView`'; {noformat} - When `ANALYZE TABLE` is called on a view or a temporary view, we should issue an error message. However, it reports a strange error: {noformat} ANALYZE TABLE is not supported for Project {noformat} was: - When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for partition-related ALTER TABLE commands. However, it always reports a confusing error message. For example, {noformat} Partition spec is invalid. The spec (a, b) must match the partition spec () defined in table '`testview`'; {noformat} - When the permanent tables/views do not exist but the temporary view exists, the expected error should be `NoSuchTableException` for `ALTER TABLE ... UNSET TBLPROPERTIES`. However, it reports a missing table property instead.
For example, {noformat} Attempted to unset non-existent property 'p' in table '`testView`'; {noformat} > Multiple Bugs in DDL Statements on Temporary Views > --- > > Key: SPARK-17502 > URL: https://issues.apache.org/jira/browse/SPARK-17502 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Xiao Li > > - When the permanent tables/views do not exist but the temporary view exists, > the expected error should be `NoSuchTableException` for partition-related > ALTER TABLE commands. However, it always reports a confusing error message. > For example, > {noformat} > Partition spec is invalid. The spec (a, b) must match the partition spec () > defined in table '`testview`'; > {noformat} > - When the permanent tables/views do not exist but the temporary view exists, > the expected error should be `NoSuchTableException` for `ALTER TABLE ... > UNSET TBLPROPERTIES`. However, it reports a missing table property instead. > For example, > {noformat} > Attempted to unset non-existent property 'p' in table '`testView`'; > {noformat} > - When `ANALYZE TABLE` is called on a view or a temporary view, we should > issue an error message. However, it reports a strange error: > {noformat} > ANALYZE TABLE is not supported for Project > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17516) Current user info is not checked on STS in DML queries
[ https://issues.apache.org/jira/browse/SPARK-17516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486192#comment-15486192 ] Tao Li commented on SPARK-17516: cc [~bikassaha], [~thejas] > Current user info is not checked on STS in DML queries > -- > > Key: SPARK-17516 > URL: https://issues.apache.org/jira/browse/SPARK-17516 > Project: Spark > Issue Type: Bug >Reporter: Tao Li >Priority: Critical > > I have captured some issues related to doAs support from STS. I am using a > non-secure cluster as my test environment. Simply speaking, the end user info > is not being passed when STS talks to the metastore, so the impersonation is not > happening on the metastore. > STS is using a ClientWrapper instance (which is wrapped in HiveContext) for > each session. However, by design all ClientWrapper instances share the > same Hive instance, which is responsible for talking to the metastore. A > singleton IsolatedClientLoader instance is initialized when STS starts up and > it contains the cachedHive instance. The cachedHive is associated with the “hive” UGI, > since no session has been set up yet, so the current user is “hive”. Then each session > creates a ClientWrapper instance which is associated with the same cachedHive > instance. > When we make queries after a session is established, the code path to retrieve > the Hive instance is different for DML and DDL operations. It looks like the DML > operation related code has less dependency on the hive-exec module. > For the DML operations (e.g. “select *”), STS calls into ClientWrapper code > and talks to the metastore through the singleton Hive instance directly. There is > no code involved to check the current user. That’s why doAs is not being > respected, even though the current user is already switched to the end user in > the thread context. > For DDL operations (e.g. “ALTER TABLE”), STS eventually calls into Hive > driver code (e.g. BaseSemanticAnalyzer). 
From there Hive.get() is called to > get the thread-local Hive instance and refresh it if necessary. If the > current user has changed, we refresh the Hive instance by recreating the > metastore connection with the current user info. So even though all thread > locals are actually referencing the singleton Hive instance, calling > Hive.get() plays an important role here by taking any UGI change into > account. That’s why the DDL operations respect doAs. > The fix should be to call Hive.get() for the DML operations as well, as the Hive > driver code invoked by DDL operations does. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17516) Current user info is not checked on STS in DML queries
[ https://issues.apache.org/jira/browse/SPARK-17516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486177#comment-15486177 ] Tao Li commented on SPARK-17516: I have verified that switching to Hive.get() to obtain the Hive instance makes impersonation work. However, there is a performance hit. If multiple sessions are set up by various end users, the singleton Hive instance keeps being refreshed and we keep re-connecting from STS to the metastore. Hive.get() compares the user associated with the singleton Hive with the current user in the thread-local context, and resets the connection if the users differ. The threads running for a session always have the end user's UGI associated. So when threads from different sessions access the singleton Hive instance in an interleaved manner, the connection keeps being reset. That's the perf hit. There is also a potential thread safety issue when the singleton Hive instance is used concurrently by multiple sessions. We could add a lock to synchronize, but that's also a perf hit. So the more ideal fix is to have thread-local Hive instances. > Current user info is not checked on STS in DML queries > -- > > Key: SPARK-17516 > URL: https://issues.apache.org/jira/browse/SPARK-17516 > Project: Spark > Issue Type: Bug >Reporter: Tao Li >Priority: Critical > > I have captured some issues related to doAs support from STS. I am using a > non-secure cluster as my test environment. Simply speaking, the end user info > is not being passed when STS talks to the metastore, so the impersonation is not > happening on the metastore. > STS is using a ClientWrapper instance (which is wrapped in HiveContext) for > each session. However, by design all ClientWrapper instances share the > same Hive instance, which is responsible for talking to the metastore. A > singleton IsolatedClientLoader instance is initialized when STS starts up and > it contains the cachedHive instance. 
The cachedHive is associated with the "hive" UGI, > since no session has been set up yet, so the current user is "hive". Then each session > creates a ClientWrapper instance which is associated with the same cachedHive > instance. > When we make queries after a session is established, the code path to retrieve > the Hive instance differs for DML and DDL operations. It looks like the DML-related > code has less dependency on the hive-exec module. > For DML operations (e.g. "select *"), STS calls into ClientWrapper code > and talks to the metastore through the singleton Hive instance directly. There is > no code involved to check the current user. That's why doAs is not being > respected, even though the current user has already been switched to the end user in > the thread context. > For DDL operations (e.g. "ALTER TABLE"), STS eventually calls into Hive > driver code (e.g. BaseSemanticAnalyzer). From there Hive.get() is called to > get the thread-local Hive instance and refresh it if necessary. If the > current user has changed, the Hive instance is refreshed by recreating the > metastore connection with the current user info. So even though all thread > locals actually reference the singleton Hive instance, calling > Hive.get() plays an important role here by taking any UGI change into > account. That's why DDL operations respect doAs. > The fix should be to call Hive.get() for DML operations, like the Hive > driver code called for DDL operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15486131#comment-15486131 ] Cody Koeninger commented on SPARK-15406: I've got a minimal working Source and SourceProvider, at least for topics that are String key and value only, at https://github.com/apache/spark/compare/master...koeninger:SPARK-15406 If you haven't already attempted an implementation, I'd suggest at least looking at that before writing up a design doc that may or may not address some of the pragmatic issues. The big thing I'm running into, and maybe I'm just not understanding the intention behind the SourceProvider interface, is that putting all configuration through a Map[String, String] makes it super awkward to configure types, or classes, or collections of offsets, or... anything really. Another significant issue is that I have no idea how rate limiting is supposed to work. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
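Cody's complaint about pushing everything through a `Map[String, String]` can be illustrated with a small sketch: typed structures such as per-partition offsets have to be smuggled through a string value. The `startingOffsets` key and the JSON encoding below are invented for illustration, not the actual SourceProvider contract.

```python
import json

def parse_offsets(options):
    """Decode per-partition starting offsets from a flat string-to-string map.

    options is the Map[String, String]-style dict the provider receives;
    the typed structure has to be re-parsed out of a string value.
    """
    raw = options.get("startingOffsets", "{}")
    decoded = json.loads(raw)  # e.g. '{"topic-a": {"0": 23, "1": 45}}'
    return {
        (topic, int(partition)): int(offset)
        for topic, partitions in decoded.items()
        for partition, offset in partitions.items()
    }
```

Every consumer of such a map has to agree on an ad-hoc encoding like this one, which is the awkwardness being described.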
[jira] [Created] (SPARK-17516) Current user info is not checked on STS in DML queries
Tao Li created SPARK-17516: -- Summary: Current user info is not checked on STS in DML queries Key: SPARK-17516 URL: https://issues.apache.org/jira/browse/SPARK-17516 Project: Spark Issue Type: Bug Reporter: Tao Li Priority: Critical I have captured some issues related to doAs support from STS. I am using a non-secure cluster as my test environment. Simply speaking, the end user info is not being passed when STS talks to metastore, so the impersonation is not happening on metastore. STS is using a ClientWarpper instance (which is wrapped in HiveContext) for each session. However by design all ClientWarpper instances are sharing the same Hive instance, which is responsible for talking to Metastore. A singleton IsolatedClientLoader instance is initialized when STS starts up and it contains the cachedHive instance. The cachedHive is associated “hive” UGI, since no session has been set up so current user is “hive". Then each session creates a ClientWarpper instance which is associated with the same cachedHive instance. When we make queries after session is established, the code path to retrieve the Hive instance is different for DML and DDL operation. Looks like DML operation related code has less dependency on hive-exec module. For the DML operations (e.g. “select *”), STS calls into ClientWarpper code and talks to metastore through the singleton Hive instance directly. There is no code involved to check the current user. That’s why doAs is not being respected, even though current user is already switched to the end user in the thread context. For DDL operations (e.g. “ALTER table”), STS eventually calls into hive driver code (e.g. BaseSemanticAnalyzer). From there Hive.get() is called to get the thread local Hive instance and refresh it if necessary. If the current user has changed, we refresh the Hive instance by recreating the metastore connection with the current user info. 
So even though all thread locals are actually referencing the singleton Hive instance, calling Hive.get() is playing an important role here to take any UGI change into account. That’s why the DDL operations respects doAs . The fix should be calling Hive.get() for the DML operations, like the hive driver code called from DDL operation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17515) CollectLimit.execute() should perform per-partition limits
[ https://issues.apache.org/jira/browse/SPARK-17515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485986#comment-15485986 ] Apache Spark commented on SPARK-17515: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15070 > CollectLimit.execute() should perform per-partition limits > -- > > Key: SPARK-17515 > URL: https://issues.apache.org/jira/browse/SPARK-17515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > {{CollectLimit.execute()}} incorrectly omits per-partition limits, leading to > performance regressions when this case is hit (which should not happen in > normal operation, but can occur in some pathological corner cases). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17515) CollectLimit.execute() should perform per-partition limits
[ https://issues.apache.org/jira/browse/SPARK-17515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17515: Assignee: Apache Spark (was: Josh Rosen) > CollectLimit.execute() should perform per-partition limits > -- > > Key: SPARK-17515 > URL: https://issues.apache.org/jira/browse/SPARK-17515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > {{CollectLimit.execute()}} incorrectly omits per-partition limits, leading to > performance regressions when this case is hit (which should not happen in > normal operation, but can occur in some pathological corner cases). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17515) CollectLimit.execute() should perform per-partition limits
[ https://issues.apache.org/jira/browse/SPARK-17515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17515: Assignee: Josh Rosen (was: Apache Spark) > CollectLimit.execute() should perform per-partition limits > -- > > Key: SPARK-17515 > URL: https://issues.apache.org/jira/browse/SPARK-17515 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > {{CollectLimit.execute()}} incorrectly omits per-partition limits, leading to > performance regressions when this case is hit (which should not happen in > normal operation, but can occur in some pathological corner cases). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17515) CollectLimit.execute() should perform per-partition limits
Josh Rosen created SPARK-17515: -- Summary: CollectLimit.execute() should perform per-partition limits Key: SPARK-17515 URL: https://issues.apache.org/jira/browse/SPARK-17515 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Josh Rosen Assignee: Josh Rosen {{CollectLimit.execute()}} incorrectly omits per-partition limits, leading to performance regressions when this case is hit (which should not happen in normal operation, but can occur in some pathological corner cases). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
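The effect of the missing per-partition limit can be modeled with plain lists. This is a toy sketch of the intended behavior, not Spark's implementation: capping each partition at n first bounds the data moved to the driver, and the final cap enforces the global limit.

```python
def collect_limit(partitions, n):
    """Take at most n rows from a list of partitions.

    The inner cap is the per-partition limit that CollectLimit.execute()
    was omitting: without it, every row of every partition would be
    collected before the global limit is applied.
    """
    capped = [part[:n] for part in partitions]  # per-partition limit
    out = []
    for part in capped:
        for row in part:
            if len(out) == n:
                return out
            out.append(row)
    return out
```

With the per-partition cap, a limit of 10 over a million-row partition ships at most 10 rows from that partition instead of all of them.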
[jira] [Updated] (SPARK-17511) Dynamic allocation race condition: Containers getting marked failed while releasing
[ https://issues.apache.org/jira/browse/SPARK-17511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kishor Patil updated SPARK-17511: - Description: While trying to launch multiple containers in a pool, if the running executor count reaches or goes beyond the target number of running executors, the container is released and marked failed. This can cause many jobs to be marked failed, causing overall job failure. I will have a patch up soon after completing testing. {panel:title=Typical Exception found in Driver marking the container to Failed} {code} java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:156) at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489) at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} {panel} was: While trying to reach launch multiple containers in pool, if running executors count reaches or goes beyond the target running executors, the container is released and marked failed. This can cause many jobs to be marked failed causing overall job failure. I will have a patch up soon after completing testing. > Dynamic allocation race condition: Containers getting marked failed while > releasing > --- > > Key: SPARK-17511 > URL: https://issues.apache.org/jira/browse/SPARK-17511 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Kishor Patil > > While trying to launch > multiple containers in a pool, if the running > executor count reaches or goes beyond the target number of running executors, the > container is released and marked failed.
This can cause many jobs to be > marked failed causing overall job failure. > I will have a patch up soon after completing testing. > {panel:title=Typical Exception found in Driver marking the container to > Failed} > {code} > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:156) > at > org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489) > at > org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
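A minimal sketch of the kind of guard the fix presumably needs; the `allocate` helper and its names are hypothetical, not YarnAllocator's API. The point is that containers arriving after the target is already met get released cleanly instead of being counted as failed executors.

```python
def allocate(running, target, new_containers):
    """Decide which freshly allocated containers to launch vs. release.

    running: executors already running; target: desired executor count.
    Surplus containers are released without being recorded as failures,
    which is the behavior the race condition above violates.
    """
    to_launch, to_release = [], []
    for c in new_containers:
        if running + len(to_launch) < target:
            to_launch.append(c)
        else:
            to_release.append(c)  # released cleanly, not marked failed
    return to_launch, to_release
```

Under dynamic allocation the target can drop between the allocation request and the containers arriving, so this surplus case is expected rather than exceptional.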
[jira] [Assigned] (SPARK-17511) Dynamic allocation race condition: Containers getting marked failed while releasing
[ https://issues.apache.org/jira/browse/SPARK-17511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17511: Assignee: (was: Apache Spark) > Dynamic allocation race condition: Containers getting marked failed while > releasing > --- > > Key: SPARK-17511 > URL: https://issues.apache.org/jira/browse/SPARK-17511 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Kishor Patil > > While trying to launch multiple containers in a pool, if the running > executor count reaches or goes beyond the target number of running executors, the > container is released and marked failed. This can cause many jobs to be > marked failed, causing overall job failure. > I will have a patch up soon after completing testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17511) Dynamic allocation race condition: Containers getting marked failed while releasing
[ https://issues.apache.org/jira/browse/SPARK-17511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17511: Assignee: Apache Spark > Dynamic allocation race condition: Containers getting marked failed while > releasing > --- > > Key: SPARK-17511 > URL: https://issues.apache.org/jira/browse/SPARK-17511 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Kishor Patil >Assignee: Apache Spark > > While trying to launch multiple containers in a pool, if the running > executor count reaches or goes beyond the target number of running executors, the > container is released and marked failed. This can cause many jobs to be > marked failed, causing overall job failure. > I will have a patch up soon after completing testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17511) Dynamic allocation race condition: Containers getting marked failed while releasing
[ https://issues.apache.org/jira/browse/SPARK-17511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485941#comment-15485941 ] Apache Spark commented on SPARK-17511: -- User 'kishorvpatil' has created a pull request for this issue: https://github.com/apache/spark/pull/15069 > Dynamic allocation race condition: Containers getting marked failed while > releasing > --- > > Key: SPARK-17511 > URL: https://issues.apache.org/jira/browse/SPARK-17511 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Kishor Patil > > While trying to launch multiple containers in a pool, if the running > executor count reaches or goes beyond the target number of running executors, the > container is released and marked failed. This can cause many jobs to be > marked failed, causing overall job failure. > I will have a patch up soon after completing testing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16728) migrate internal API for MLlib trees from spark.mllib to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-16728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vladimir Feinberg updated SPARK-16728: -- Description: Currently, spark.ml trees rely on spark.mllib implementations. There are three issues with this: 1. Spark.ML's GBT TreeBoost algorithm requires storing additional information (the previous ensemble's prediction, for instance) inside the TreePoints (this is necessary to have loss-based splits for complex loss functions). 2. The old impurity API only lets you use summary statistics up to the 2nd order. These are useless for several impurity measures and inadequate for others (e.g., absolute loss or huber loss). It needs some renovation. 3. We should probably coalesce the ImpurityAggregator, ImpurityCalculator, and Impurity into a single class (and use virtual calls rather than case statements when toggling over impurity types). was: Currently, spark.ml trees rely on spark.mllib implementations. There are two issues with this: 1. Spark.ML's GBT TreeBoost algorithm requires storing additional information (the previous ensemble's prediction, for instance) inside the TreePoints (this is necessary to have loss-based splits for complex loss functions). 2. The old impurity API only lets you use summary statistics up to the 2nd order. These are useless for several impurity measures and inadequate for others (e.g., absolute loss or huber loss). It needs some renovation. > migrate internal API for MLlib trees from spark.mllib to spark.ml > - > > Key: SPARK-16728 > URL: https://issues.apache.org/jira/browse/SPARK-16728 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Vladimir Feinberg > > Currently, spark.ml trees rely on spark.mllib implementations. There are three > issues with this: > 1.
Spark.ML's GBT TreeBoost algorithm requires storing additional information > (the previous ensemble's prediction, for instance) inside the TreePoints > (this is necessary to have loss-based splits for complex loss functions). > 2. The old impurity API only lets you use summary statistics up to the 2nd > order. These are useless for several impurity measures and inadequate for > others (e.g., absolute loss or huber loss). It needs some renovation. > 3. We should probably coalesce the ImpurityAggregator, ImpurityCalculator, > and Impurity into a single class (and use virtual calls rather than case > statements when toggling over impurity types). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17514) df.take(1) and df.limit(1).collect() perform differently in Python
[ https://issues.apache.org/jira/browse/SPARK-17514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485821#comment-15485821 ] Apache Spark commented on SPARK-17514: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15068 > df.take(1) and df.limit(1).collect() perform differently in Python > -- > > Key: SPARK-17514 > URL: https://issues.apache.org/jira/browse/SPARK-17514 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > In PySpark, {{df.take(1)}} ends up running a single-stage job which computes > only one partition of {{df}}, while {{df.limit(1).collect()}} ends up > computing all partitions of {{df}} and runs a two-stage job. This difference > in performance is confusing, so I think that we should generalize the fix > from SPARK-10731 so that {{Dataset.collect()}} can be implemented efficiently > in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17514) df.take(1) and df.limit(1).collect() perform differently in Python
[ https://issues.apache.org/jira/browse/SPARK-17514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17514: Assignee: Josh Rosen (was: Apache Spark) > df.take(1) and df.limit(1).collect() perform differently in Python > -- > > Key: SPARK-17514 > URL: https://issues.apache.org/jira/browse/SPARK-17514 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > > In PySpark, {{df.take(1)}} ends up running a single-stage job which computes > only one partition of {{df}}, while {{df.limit(1).collect()}} ends up > computing all partitions of {{df}} and runs a two-stage job. This difference > in performance is confusing, so I think that we should generalize the fix > from SPARK-10731 so that {{Dataset.collect()}} can be implemented efficiently > in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17514) df.take(1) and df.limit(1).collect() perform differently in Python
[ https://issues.apache.org/jira/browse/SPARK-17514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17514: Assignee: Apache Spark (was: Josh Rosen) > df.take(1) and df.limit(1).collect() perform differently in Python > -- > > Key: SPARK-17514 > URL: https://issues.apache.org/jira/browse/SPARK-17514 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Josh Rosen >Assignee: Apache Spark > > In PySpark, {{df.take(1)}} ends up running a single-stage job which computes > only one partition of {{df}}, while {{df.limit(1).collect()}} ends up > computing all partitions of {{df}} and runs a two-stage job. This difference > in performance is confusing, so I think that we should generalize the fix > from SPARK-10731 so that {{Dataset.collect()}} can be implemented efficiently > in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485782#comment-15485782 ] Evan Zamir commented on SPARK-17508: [~bryanc] Oh, that helps a lot! I've been writing very light wrappers around Spark functions and it wasn't clear to me whether I could keep weightCol as an optional parameter. At least now I can reason about how to do it better. I guess this isn't so much a bug then, as it is a feature request. So if someone wants to close the issue or reclassify, that would make sense. I can only imagine I'm not the only Spark user who has been miffed by this. > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
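Until the library accepts `weightCol=None` directly, the wrapper pattern Evan describes can simply omit the parameter when it is unset. This is a hedged sketch of that pattern; `build_lr_params` is a made-up helper, not anything in PySpark, and only the keyword names mirror `LogisticRegression`'s.

```python
def build_lr_params(max_iter=5, reg_param=0.0, weight_col=None):
    """Build a keyword dict for an estimator, omitting weightCol when unset.

    Passing the key with a None value is what triggers the
    NullPointerException above; leaving the key out means the estimator
    falls back to its default of treating every weight as 1.0.
    """
    params = {"maxIter": max_iter, "regParam": reg_param}
    if weight_col is not None:  # omit the key instead of sending None
        params["weightCol"] = weight_col
    return params
```

A wrapper would then call something like `LogisticRegression(**build_lr_params(weight_col=user_choice))`, keeping weightCol optional at the wrapper boundary.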
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485780#comment-15485780 ] Tathagata Das commented on SPARK-15406: --- Hey all, I am working on the design doc right now. I will upload it here soon. > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17514) df.take(1) and df.limit(1).collect() perform differently in Python
Josh Rosen created SPARK-17514: -- Summary: df.take(1) and df.limit(1).collect() perform differently in Python Key: SPARK-17514 URL: https://issues.apache.org/jira/browse/SPARK-17514 Project: Spark Issue Type: Bug Components: PySpark, SQL Reporter: Josh Rosen Assignee: Josh Rosen In PySpark, {{df.take(1)}} ends up running a single-stage job which computes only one partition of {{df}}, while {{df.limit(1).collect()}} ends up computing all partitions of {{df}} and runs a two-stage job. This difference in performance is confusing, so I think that we should generalize the fix from SPARK-10731 so that {{Dataset.collect()}} can be implemented efficiently in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
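The performance difference can be modeled with plain lists. This toy sketch (not PySpark code) shows why an incremental `take` can stop after touching one partition while an eager `limit(n).collect()` evaluates them all first.

```python
def take(partitions, n):
    """Evaluate partitions one at a time, stopping once n rows are found.

    Returns the rows plus how many partitions were scanned; in Spark each
    scanned partition corresponds to work in a launched job.
    """
    out, scanned = [], 0
    for part in partitions:
        scanned += 1
        for row in part:
            out.append(row)
            if len(out) == n:
                return out, scanned
    return out, scanned

def limit_collect(partitions, n):
    """Eager variant: compute every partition, then truncate (all scanned)."""
    rows = [row for part in partitions for row in part]
    return rows[:n], len(partitions)
```

For n=1 over three partitions, the incremental version scans one partition where the eager version scans three, which is the single-stage vs. two-stage gap described above.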
[jira] [Assigned] (SPARK-17513) StreamExecution should discard unneeded metadata
[ https://issues.apache.org/jira/browse/SPARK-17513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17513: Assignee: Apache Spark > StreamExecution should discard unneeded metadata > > > Key: SPARK-17513 > URL: https://issues.apache.org/jira/browse/SPARK-17513 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Frederick Reiss >Assignee: Apache Spark > > The StreamExecution maintains a write-ahead log of batch metadata in order to > allow repeating previously in-flight batches if the driver is restarted. > StreamExecution does not garbage-collect or compact this log in any way. > Since the log is implemented with HDFSMetadataLog, these files will consume > memory on the HDFS NameNode. Specifically, each log file will consume about > 300 bytes of NameNode memory (150 bytes for the inode and 150 bytes for the > block of file contents; see > [https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html]. > An application with a 100 msec batch interval will increase the NameNode's > heap usage by about 250MB per day. > There is also the matter of recovery. StreamExecution reads its entire log > when restarting. This read operation will be very expensive if the log > contains millions of entries spread over millions of files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17513) StreamExecution should discard unneeded metadata
[ https://issues.apache.org/jira/browse/SPARK-17513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17513: Assignee: (was: Apache Spark) > StreamExecution should discard unneeded metadata > > > Key: SPARK-17513 > URL: https://issues.apache.org/jira/browse/SPARK-17513 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Frederick Reiss > > The StreamExecution maintains a write-ahead log of batch metadata in order to > allow repeating previously in-flight batches if the driver is restarted. > StreamExecution does not garbage-collect or compact this log in any way. > Since the log is implemented with HDFSMetadataLog, these files will consume > memory on the HDFS NameNode. Specifically, each log file will consume about > 300 bytes of NameNode memory (150 bytes for the inode and 150 bytes for the > block of file contents; see > [https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html]. > An application with a 100 msec batch interval will increase the NameNode's > heap usage by about 250MB per day. > There is also the matter of recovery. StreamExecution reads its entire log > when restarting. This read operation will be very expensive if the log > contains millions of entries spread over millions of files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17513) StreamExecution should discard unneeded metadata
[ https://issues.apache.org/jira/browse/SPARK-17513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485735#comment-15485735 ] Apache Spark commented on SPARK-17513: -- User 'frreiss' has created a pull request for this issue: https://github.com/apache/spark/pull/15067 > StreamExecution should discard unneeded metadata > > > Key: SPARK-17513 > URL: https://issues.apache.org/jira/browse/SPARK-17513 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Frederick Reiss > > The StreamExecution maintains a write-ahead log of batch metadata in order to > allow repeating previously in-flight batches if the driver is restarted. > StreamExecution does not garbage-collect or compact this log in any way. > Since the log is implemented with HDFSMetadataLog, these files will consume > memory on the HDFS NameNode. Specifically, each log file will consume about > 300 bytes of NameNode memory (150 bytes for the inode and 150 bytes for the > block of file contents; see > [https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html]. > An application with a 100 msec batch interval will increase the NameNode's > heap usage by about 250MB per day. > There is also the matter of recovery. StreamExecution reads its entire log > when restarting. This read operation will be very expensive if the log > contains millions of entries spread over millions of files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17513) StreamExecution should discard unneeded metadata
Frederick Reiss created SPARK-17513: --- Summary: StreamExecution should discard unneeded metadata Key: SPARK-17513 URL: https://issues.apache.org/jira/browse/SPARK-17513 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Frederick Reiss The StreamExecution maintains a write-ahead log of batch metadata in order to allow repeating previously in-flight batches if the driver is restarted. StreamExecution does not garbage-collect or compact this log in any way. Since the log is implemented with HDFSMetadataLog, these files will consume memory on the HDFS NameNode. Specifically, each log file will consume about 300 bytes of NameNode memory (150 bytes for the inode and 150 bytes for the block of file contents; see [https://www.cloudera.com/documentation/enterprise/latest/topics/admin_nn_memory_config.html]). An application with a 100 msec batch interval will increase the NameNode's heap usage by about 250MB per day. There is also the matter of recovery. StreamExecution reads its entire log when restarting. This read operation will be very expensive if the log contains millions of entries spread over millions of files.
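The ~250MB/day estimate in the description follows from simple arithmetic, assuming the ~300 bytes per log file from the Cloudera guidance cited above:

```python
# Back-of-the-envelope check of the NameNode heap growth described above.
SECONDS_PER_DAY = 24 * 60 * 60      # 86,400
BATCH_INTERVAL_SEC = 0.1            # 100 msec batch interval
BYTES_PER_LOG_FILE = 300            # ~150 B inode + ~150 B block entry

batches_per_day = SECONDS_PER_DAY / BATCH_INTERVAL_SEC   # 864,000 log files/day
heap_growth_bytes = batches_per_day * BYTES_PER_LOG_FILE

print(f"{heap_growth_bytes / 1e6:.0f} MB of NameNode heap per day")  # about 259 MB
```

864,000 files at 300 bytes each is roughly 259 MB, consistent with the "about 250MB per day" figure in the report.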
[jira] [Created] (SPARK-17512) Specifying remote files for Python based Spark jobs in Yarn cluster mode not working
Udit Mehrotra created SPARK-17512: - Summary: Specifying remote files for Python based Spark jobs in Yarn cluster mode not working Key: SPARK-17512 URL: https://issues.apache.org/jira/browse/SPARK-17512 Project: Spark Issue Type: Bug Components: PySpark, Spark Submit Affects Versions: 2.0.0 Reporter: Udit Mehrotra When I run a python application, and specify a remote path for the extra files to be included in the PYTHON_PATH using the '--py-files' or 'spark.submit.pyFiles' configuration option in YARN Cluster mode I get the following error: Exception in thread "main" java.lang.IllegalArgumentException: Launching Python applications through spark-submit is currently only supported for local files: s3:///app.py at org.apache.spark.deploy.PythonRunner$.formatPath(PythonRunner.scala:104) at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136) at org.apache.spark.deploy.PythonRunner$$anonfun$formatPaths$3.apply(PythonRunner.scala:136) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at org.apache.spark.deploy.PythonRunner$.formatPaths(PythonRunner.scala:136) at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:636) at org.apache.spark.deploy.SparkSubmit$$anonfun$prepareSubmitEnvironment$10.apply(SparkSubmit.scala:634) at scala.Option.foreach(Option.scala:257) at org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:634) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) 
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Here are sample commands which would throw this error in Spark 2.0 (sparkApp.py requires app.py): spark-submit --deploy-mode cluster --py-files s3:///app.py s3:///sparkApp.py (works fine in 1.6) spark-submit --deploy-mode cluster --conf spark.submit.pyFiles=s3:///app.py s3:///sparkApp1.py (not working in 1.6) This would work fine if app.py is downloaded locally and specified. This was working correctly using the '--py-files' option in earlier versions of Spark, but not using the 'spark.submit.pyFiles' configuration option. But now, it does not work either way. The following diff shows the comment which states that it should work with 'non-local' paths for the YARN cluster mode, and we are specifically doing separate validation to fail if YARN client mode is used with remote paths: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L309 And then this code gets triggered at the end of each run, irrespective of whether we are using Client or Cluster mode, and internally validates that the paths should be local: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L634 The above validation was not getting triggered in earlier versions of Spark using the '--py-files' option because we were not storing the arguments passed to '--py-files' in the 'spark.submit.pyFiles' configuration for YARN.
However, the following code was newly added in 2.0 which now stores it, and hence this validation gets triggered even if we specify files through the '--py-files' option: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L545 Also, the logic in the YARN client was changed to read values directly from the 'spark.submit.pyFiles' configuration instead of from '--py-files' (as before): https://github.com/apache/spark/commit/8ba2b7f28fee39c4839e5ea125bd25f5091a3a1e#diff-b050df3f55b82065803d6e83453b9706R543 So now it's broken whether we use '--py-files' or 'spark.submit.pyFiles', as the validation gets triggered in both cases irrespective of whether we use Client or Cluster mode with YARN.
[jira] [Commented] (SPARK-16750) ML GaussianMixture training failed due to feature column type mistake
[ https://issues.apache.org/jira/browse/SPARK-16750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485718#comment-15485718 ] Pramit Choudhary commented on SPARK-16750: -- It seems the release branch was cut on 19th July and this change made it post that. Any work around guys ? > ML GaussianMixture training failed due to feature column type mistake > - > > Key: SPARK-16750 > URL: https://issues.apache.org/jira/browse/SPARK-16750 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 2.0.1, 2.1.0 > > > ML GaussianMixture training failed due to feature column type mistake. The > feature column type should be {{ml.linalg.VectorUDT}} but got > {{mllib.linalg.VectorUDT}} by mistake. > This bug is easy to reproduce by the following code: > {code} > val df = spark.createDataFrame( > Seq( > (1, Vectors.dense(0.0, 1.0, 4.0)), > (2, Vectors.dense(1.0, 0.0, 4.0)), > (3, Vectors.dense(1.0, 0.0, 5.0)), > (4, Vectors.dense(0.0, 0.0, 5.0))) > ).toDF("id", "features") > val scaler = new MinMaxScaler() > .setInputCol("features") > .setOutputCol("features_scaled") > .setMin(0.0) > .setMax(5.0) > val gmm = new GaussianMixture() > .setFeaturesCol("features_scaled") > .setK(2) > val pipeline = new Pipeline().setStages(Array(scaler, gmm)) > pipeline.fit(df) > requirement failed: Column features_scaled must be of type > org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7. > java.lang.IllegalArgumentException: requirement failed: Column > features_scaled must be of type > org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7. 
> at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) > at > org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64) > at > org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275) > at > org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) > at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70) > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132) > {code} > Why the unit tests did not complain this errors? Because some > estimators/transformers missed calling {{transformSchema(dataset.schema)}} > firstly during {{fit}} or {{transform}}. I will also add this function to all > estimators/transformers who missed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16441) Spark application hang when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-16441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485711#comment-15485711 ] Ashwin Shankar commented on SPARK-16441: Hey [~Dhruve Ashar], we hit the same issue at Netflix when dynamic allocation is enabled and have thousands of executors running. Do you have any insights or resolution for this ticket? > Spark application hang when dynamic allocation is enabled > - > > Key: SPARK-16441 > URL: https://issues.apache.org/jira/browse/SPARK-16441 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.2, 2.0.0 > Environment: hadoop 2.7.2 spark1.6.2 >Reporter: cen yuhai > > spark application are waiting for rpc response all the time and spark > listener are blocked by dynamic allocation. Executors can not connect to > driver and lost. > "spark-dynamic-executor-allocation" #239 daemon prio=5 os_prio=0 > tid=0x7fa304438000 nid=0xcec6 waiting on condition [0x7fa2b81e4000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x00070fdb94f8> (a > scala.concurrent.impl.Promise$CompletionLatch) > at > java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75) > 
at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77) > at > org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:59) > at > org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:436) > - locked <0x828a8960> (a > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend) > at > org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1438) > at > org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:359) > at > org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:310) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:264) > - locked <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:223) > "SparkListenerBus" #161 daemon prio=5 os_prio=0 tid=0x7fa3053be000 > nid=0xcec9 waiting for monitor entry [0x7fa2b3dfc000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:618) > - waiting to lock <0x880e6308> (a > org.apache.spark.ExecutorAllocationManager) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at > org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) > at
[jira] [Resolved] (SPARK-17474) Python UDF does not work between Sort and Limit
[ https://issues.apache.org/jira/browse/SPARK-17474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17474. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 15030 [https://github.com/apache/spark/pull/15030] > Python UDF does not work between Sort and Limit > --- > > Key: SPARK-17474 > URL: https://issues.apache.org/jira/browse/SPARK-17474 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 2.0.1, 2.1.0 > > > Because of this bug, Python UDF will not work with ORDER BY and LIMIT.
[jira] [Commented] (SPARK-16750) ML GaussianMixture training failed due to feature column type mistake
[ https://issues.apache.org/jira/browse/SPARK-16750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485531#comment-15485531 ] Pramit Choudhary commented on SPARK-16750: -- Did this fix make it to the release version 2.0.0. If not, is there a work around for the same ? > ML GaussianMixture training failed due to feature column type mistake > - > > Key: SPARK-16750 > URL: https://issues.apache.org/jira/browse/SPARK-16750 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang > Fix For: 2.0.1, 2.1.0 > > > ML GaussianMixture training failed due to feature column type mistake. The > feature column type should be {{ml.linalg.VectorUDT}} but got > {{mllib.linalg.VectorUDT}} by mistake. > This bug is easy to reproduce by the following code: > {code} > val df = spark.createDataFrame( > Seq( > (1, Vectors.dense(0.0, 1.0, 4.0)), > (2, Vectors.dense(1.0, 0.0, 4.0)), > (3, Vectors.dense(1.0, 0.0, 5.0)), > (4, Vectors.dense(0.0, 0.0, 5.0))) > ).toDF("id", "features") > val scaler = new MinMaxScaler() > .setInputCol("features") > .setOutputCol("features_scaled") > .setMin(0.0) > .setMax(5.0) > val gmm = new GaussianMixture() > .setFeaturesCol("features_scaled") > .setK(2) > val pipeline = new Pipeline().setStages(Array(scaler, gmm)) > pipeline.fit(df) > requirement failed: Column features_scaled must be of type > org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7. > java.lang.IllegalArgumentException: requirement failed: Column > features_scaled must be of type > org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7. 
> at scala.Predef$.require(Predef.scala:224) > at > org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42) > at > org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64) > at > org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275) > at > org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180) > at > org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) > at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70) > at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132) > {code} > Why the unit tests did not complain this errors? Because some > estimators/transformers missed calling {{transformSchema(dataset.schema)}} > firstly during {{fit}} or {{transform}}. I will also add this function to all > estimators/transformers who missed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485530#comment-15485530 ] Bryan Cutler commented on SPARK-17508: -- [~zamir.e...@gmail.com] This is a gripe I have with ML Params in PySpark. The API makes it seem like setting params such as {{weightCol}} to {{None}} is supported, however this actually assigns a value of {{null}} to the param on the JVM side, which is not the same thing as "not set or empty" from the pydocs. So the supported ways to initialize this are {noformat} LogisticRegression(maxIter=5, regParam=0.0, weightCol="") # as you pointed out, or LogisticRegression(maxIter=5, regParam=0.0)# just omit the param {noformat} > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
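The unset-versus-null distinction described in Bryan Cutler's comment above can be illustrated with a plain-Python sketch. The `Params` class below is hypothetical, for illustration only, and is not PySpark's actual param machinery:

```python
# Plain-Python sketch of why weightCol=None differs from leaving it unset.
# (Hypothetical Params class; not PySpark's implementation.)

_UNSET = object()  # sentinel distinguishing "never set" from "set to None"

class Params:
    def __init__(self, weight_col=_UNSET):
        self._weight_col = weight_col

    def weights_enabled(self):
        # Unset or empty string: treat every row's weight as 1.0.
        if self._weight_col is _UNSET or self._weight_col == "":
            return False
        if self._weight_col is None:
            # Mirrors the reported bug: None crosses the Py4J boundary as a
            # JVM null column name and later surfaces as a NullPointerException.
            raise TypeError("weightCol was explicitly set to null")
        return True

assert Params().weights_enabled() is False               # param omitted: OK
assert Params(weight_col="").weights_enabled() is False  # empty string: OK
```

Constructing `Params(weight_col=None)` and calling `weights_enabled()` raises, just as `fit()` does in the traceback above, while omitting the param or passing `""` behaves as documented.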
[jira] [Created] (SPARK-17511) Dynamic allocation race condition: Containers getting marked failed while releasing
Kishor Patil created SPARK-17511: Summary: Dynamic allocation race condition: Containers getting marked failed while releasing Key: SPARK-17511 URL: https://issues.apache.org/jira/browse/SPARK-17511 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.0.0, 2.0.1, 2.1.0 Reporter: Kishor Patil While trying to launch multiple containers from the pool, if the running executor count reaches or exceeds the target number of running executors, the container is released and marked failed. This can cause many jobs to be marked failed, causing overall job failure. I will have a patch up soon after completing testing.
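A sketch of the kind of guard such a patch might add (illustrative Python, not the actual YARN allocator code): when the running executor count has already reached the target, a surplus container is released without being counted as a failure.

```python
# Illustrative sketch of the fix direction: a surplus container allocated
# after the target is reached should be released quietly, not marked failed.

def handle_allocated_container(running, target, failed_count):
    """Return (running, failed_count, action) after one container allocation."""
    if running >= target:
        # Surplus container: release it without recording a failure, so the
        # race between allocation and target updates cannot fail the job.
        return running, failed_count, "released"
    return running + 1, failed_count, "launched"

running, failed, action = handle_allocated_container(running=5, target=5, failed_count=0)
assert action == "released" and failed == 0   # no spurious failure recorded
```

The essential point is only that the "released" path leaves the failure counter untouched; in the buggy behavior described above, each surplus release increments it until the application's failure threshold is crossed.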
[jira] [Resolved] (SPARK-17485) Failed remote cached block reads can lead to whole job failure
[ https://issues.apache.org/jira/browse/SPARK-17485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17485. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 15037 [https://github.com/apache/spark/pull/15037] > Failed remote cached block reads can lead to whole job failure > -- > > Key: SPARK-17485 > URL: https://issues.apache.org/jira/browse/SPARK-17485 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 1.6.2, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > In Spark's RDD.getOrCompute we first try to read a local copy of a cached > block, then a remote copy, and only fall back to recomputing the block if no > cached copy (local or remote) can be read. This logic works correctly in the > case where no remote copies of the block exist, but if there _are_ remote > copies but reads of those copies fail (due to network issues or internal > Spark bugs) then the BlockManager will throw a {{BlockFetchException}} error > that fails the entire job. > In the case of torrent broadcast we really _do_ want to fail the entire job > in case no remote blocks can be fetched, but this logic is inappropriate for > cached blocks because those can/should be recomputed. > Therefore, I think that this exception should be thrown higher up the call > stack by the BlockManager client code and not the block manager itself.
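The behavior the ticket proposes can be sketched in miniature (plain Python, not Spark's BlockManager): a failed remote read of a cached block falls through to recomputation instead of failing the job.

```python
# Simplified sketch of the desired getOrCompute behavior: local read first,
# then remote read, and a remote-fetch failure for a *cached* block falls
# back to recomputation rather than propagating and failing the whole job.

class RemoteFetchError(Exception):
    pass

def get_or_compute(read_local, read_remote, recompute):
    value = read_local()
    if value is not None:
        return value
    try:
        value = read_remote()      # may fail: network issues, lost executor
        if value is not None:
            return value
    except RemoteFetchError:
        pass                       # don't fail the job; fall through
    return recompute()             # cached blocks can always be recomputed

def broken_remote():
    raise RemoteFetchError("remote read failed")

result = get_or_compute(lambda: None, broken_remote, lambda: "recomputed")
assert result == "recomputed"
```

For torrent broadcast blocks, by contrast, the description argues the failure should still propagate, which is why the ticket wants the exception handled higher up by the caller that knows what kind of block it is, not inside the block manager itself.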
[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485505#comment-15485505 ] Hyukjin Kwon commented on SPARK-17477: -- I left a related commnet https://github.com/apache/spark/pull/15035#issuecomment-246516653 > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as its type while hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When using SparkSession to read a Hive table which is stored as parquet > files. If there has been a schema evolution from int to long of a column. > There are some old parquet files use int for the column while some new > parquet files use long. In Hive metastore, the type is long (bigint). > Therefore when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) > at > 
org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) > at > org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) > at >
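The ClassCastException above comes from the reader materializing values with the file's int type while the table schema declares long. A toy illustration (plain Python, not the Parquet reader) of the widening direction that would reconcile the two:

```python
# Toy illustration of schema reconciliation for int -> long evolution:
# values from old files stored as 32-bit ints are widened losslessly to the
# table schema's 64-bit long type at read time.

def read_with_schema(file_values, file_type, table_type):
    if file_type == table_type:
        return list(file_values)
    if file_type == "int32" and table_type == "int64":
        return [int(v) for v in file_values]   # lossless widening
    raise TypeError(f"unsupported evolution {file_type} -> {table_type}")

old_file = read_with_schema([1, 2, 3], "int32", "int64")   # old files widen
new_file = read_with_schema([4, 5], "int64", "int64")      # new files pass through
assert old_file + new_file == [1, 2, 3, 4, 5]
```

In the reported failure there is no such widening step: the row holder is built for the Hive metastore's long type, the converter for the old file's int type, and the mismatch only surfaces as a cast error at read time.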
[jira] [Resolved] (SPARK-13406) NPE in LazilyGeneratedOrdering
[ https://issues.apache.org/jira/browse/SPARK-13406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-13406. Resolution: Duplicate > NPE in LazilyGeneratedOrdering > -- > > Key: SPARK-13406 > URL: https://issues.apache.org/jira/browse/SPARK-13406 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Josh Rosen > > {code} > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", > line ?, in pyspark.sql.dataframe.DataFrameStatFunctions.sampleBy > Failed example: > sampled.groupBy("key").count().orderBy("key").show() > Exception raised: > Traceback (most recent call last): > File "//anaconda/lib/python2.7/doctest.py", line 1315, in __run > compileflags, 1) in test.globs > File " pyspark.sql.dataframe.DataFrameStatFunctions.sampleBy[3]>", line 1, in > > sampled.groupBy("key").count().orderBy("key").show() > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", > line 217, in show > print(self._jdf.showString(n, truncate)) > File > "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py", > line 835, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line > 45, in deco > return f(*a, **kw) > File > "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py", > line 310, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o681.showString. 
> : org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1782) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:937) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:919) > at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1318) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) > at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1305) > at > org.apache.spark.sql.execution.TakeOrderedAndProject.executeCollect(limit.scala:94) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:157) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) > at > 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1769) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1519) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1526) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1396) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1395) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:1782) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1395) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1477) > at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:167) > at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at
[jira] [Updated] (SPARK-13406) NPE in LazilyGeneratedOrdering
[ https://issues.apache.org/jira/browse/SPARK-13406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-13406: --- Assignee: (was: Josh Rosen) > NPE in LazilyGeneratedOrdering > -- > > Key: SPARK-13406 > URL: https://issues.apache.org/jira/browse/SPARK-13406 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > {code} > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", > line ?, in pyspark.sql.dataframe.DataFrameStatFunctions.sampleBy > Failed example: > sampled.groupBy("key").count().orderBy("key").show() > Exception raised: > Traceback (most recent call last): > File "//anaconda/lib/python2.7/doctest.py", line 1315, in __run > compileflags, 1) in test.globs > File " pyspark.sql.dataframe.DataFrameStatFunctions.sampleBy[3]>", line 1, in > > sampled.groupBy("key").count().orderBy("key").show() > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", > line 217, in show > print(self._jdf.showString(n, truncate)) > File > "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/java_gateway.py", > line 835, in __call__ > answer, self.gateway_client, self.target_id, self.name) > File > "/Users/davies/work/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line > 45, in deco > return f(*a, **kw) > File > "/Users/davies/work/spark/python/lib/py4j-0.9.1-src.zip/py4j/protocol.py", > line 310, in get_return_value > format(target_id, ".", name), value) > Py4JJavaError: An error occurred while calling o681.showString. 
> : org.apache.spark.SparkDriverExecutionException: Execution error > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1189) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1658) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:623) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1782) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) > at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:937) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) > at org.apache.spark.rdd.RDD.reduce(RDD.scala:919) > at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1318) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:323) > at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1305) > at > org.apache.spark.sql.execution.TakeOrderedAndProject.executeCollect(limit.scala:94) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:157) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1520) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:53) > at > 
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1769) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1519) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1526) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1396) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1395) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:1782) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1395) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1477) > at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:167) > at sun.reflect.GeneratedMethodAccessor63.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at
[jira] [Resolved] (SPARK-2424) ApplicationState.MAX_NUM_RETRY should be configurable
[ https://issues.apache.org/jira/browse/SPARK-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2424. --- Resolution: Duplicate Fix Version/s: 2.1.0 2.0.1 1.6.3 This was finally done in SPARK-16956 > ApplicationState.MAX_NUM_RETRY should be configurable > - > > Key: SPARK-2424 > URL: https://issues.apache.org/jira/browse/SPARK-2424 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Mark Hamstra >Assignee: Josh Rosen > Fix For: 1.6.3, 2.0.1, 2.1.0 > > > ApplicationState.MAX_NUM_RETRY, controlling the number of times standalone > Executors can exit unsuccessfully before Master will remove the Application > that the Executors are trying to run, is currently hard-coded to 10. There's > no reason why this should be a single, fixed value for all standalone > clusters (e.g., it should probably scale with the number of Executors), so it > should be SparkConf-able. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
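The change the resolution points to — replacing the hard-coded limit of 10 with a configurable setting — follows a simple pattern. A minimal Python sketch (the setting name matches the one SPARK-16956 introduced, `spark.deploy.maxExecutorRetries`, but the conf-dict shape and helper here are illustrative, not the actual Master code):

```python
DEFAULT_MAX_NUM_RETRY = 10

def max_num_retry(conf):
    """Read the retry limit from a conf mapping, falling back to the default."""
    value = int(conf.get("spark.deploy.maxExecutorRetries", DEFAULT_MAX_NUM_RETRY))
    if value < -1:  # by convention, -1 means "never remove the application"
        raise ValueError("maxExecutorRetries must be >= -1")
    return value
```

With no override the default of 10 is preserved, so existing clusters behave unchanged while larger deployments can scale the limit with their executor count.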
[jira] [Resolved] (SPARK-14818) Move sketch and mllibLocal out from mima exclusion
[ https://issues.apache.org/jira/browse/SPARK-14818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-14818. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 15061 [https://github.com/apache/spark/pull/15061] > Move sketch and mllibLocal out from mima exclusion > -- > > Key: SPARK-14818 > URL: https://issues.apache.org/jira/browse/SPARK-14818 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Blocker > Fix For: 2.0.1, 2.1.0 > > > In SparkBuild.scala, we exclude sketch and mllibLocal from mima check (see > the definition of mimaProjects). After we release 2.0, we should move them > out from this list. So, we can check binary compatibility for them -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485420#comment-15485420 ] Apache Spark commented on SPARK-17463: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15065 > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at 
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at
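The root cause in the trace above — an `ArrayList` being mutated while `ObjectOutputStream` walks it — is the classic snapshot-before-serialize problem. A minimal Python sketch of the defensive-copy fix (class and method names are illustrative, not the actual Executor heartbeat API):

```python
import pickle
import threading

class HeartbeatReporter:
    """Snapshot a mutable list under a lock before serializing it, so a
    concurrent append cannot corrupt the serialized form (the analogue of
    the ArrayList changing under ObjectOutputStream.writeObject)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._updates = []

    def add_update(self, value):
        with self._lock:
            self._updates.append(value)

    def serialize_heartbeat(self):
        with self._lock:
            snapshot = list(self._updates)  # copy while holding the lock
        return pickle.dumps(snapshot)       # serialize outside the lock
```

Only the copy happens under the lock; the (potentially slow) serialization runs on the immutable snapshot, so the heartbeat thread never races the task threads.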
[jira] [Updated] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Ulanov updated SPARK-5575: Description: *Goal:* Implement various types of artificial neural networks *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) Having deep learning within Spark's ML library is a question of convenience. Spark has broad analytic capabilities and it is useful to have deep learning as one of these tools at hand. Deep learning is a model of choice for several important modern use-cases, and Spark ML might want to cover them. Eventually, it is hard to explain, why do we have PCA in ML but don't provide Autoencoder. To summarize this, Spark should have at least the most widely used deep learning models, such as fully connected artificial neural network, convolutional network and autoencoder. Advanced and experimental deep learning features might reside within packages or as pluggable external tools. These 3 will provide a comprehensive deep learning set for Spark ML. We might also include recurrent networks as well. *Requirements:* # Extensible API compatible with Spark ML. Basic abstractions such as Neuron, Layer, Error, Regularization, Forward and Backpropagation etc. should be implemented as traits or interfaces, so they can be easily extended or reused. Define the Spark ML API for deep learning. This interface is similar to the other analytics tools in Spark and supports ML pipelines. This makes deep learning easy to use and plug in into analytics workloads for Spark users. # Efficiency. The current implementation of multilayer perceptron in Spark is less than 2x slower than Caffe, both measured on CPU. The main overhead sources are JVM and Spark's communication layer. For more details, please refer to https://github.com/avulanov/ann-benchmark. Having said that, the efficient implementation of deep learning in Spark should be only few times slower than in specialized tool. 
This is very reasonable for the platform that does much more than deep learning and I believe it is understood by the community. # Scalability. Implement efficient distributed training. It relies heavily on the efficient communication and scheduling mechanisms. The default implementation is based on Spark. More efficient implementations might include some external libraries but use the same interface defined. *Main features:* # Multilayer perceptron classifier (MLP) # Autoencoder # Convolutional neural networks for computer vision. The interface has to provide few architectures for deep learning that are widely used in practice, such as AlexNet *Additional features:* # Other architectures, such as Recurrent neural network (RNN), Long-short term memory (LSTM), Restricted boltzmann machine (RBM), deep belief network (DBN), MLP multivariate regression # Regularizers, such as L1, L2, drop-out # Normalizers # Network customization. The internal API of Spark ANN is designed to be flexible and can handle different types of layers. However, only a part of the API is made public. We have to limit the number of public classes in order to make it simpler to support other languages. This forces us to use (String or Number) parameters instead of introducing of new public classes. One of the options to specify the architecture of ANN is to use text configuration with layer-wise description. We have considered using Caffe format for this. It gives the benefit of compatibility with well known deep learning tool and simplifies the support of other languages in Spark. Implementation of a parser for the subset of Caffe format might be the first step towards the support of general ANN architectures in Spark. # Hardware specific optimization. One can wrap other deep learning implementations with this interface allowing users to pick a particular back-end, e.g. Caffe or TensorFlow, along with the default one. 
The interface has to provide few architectures for deep learning that are widely used in practice, such as AlexNet. The main motivation for using specialized libraries for deep learning would be to fully take advantage of the hardware where Spark runs, in particular GPUs. Having the default interface in Spark, we will need to wrap only a subset of functions from a given specialized library. It does require an effort, however it is not the same as wrapping all functions. Wrappers can be provided as packages without the need to pull new dependencies into Spark. *Completed (merged to the main Spark branch):* * Requirements: https://issues.apache.org/jira/browse/SPARK-9471 ** API https://spark-summit.org/eu-2015/events/a-scalable-implementation-of-deep-learning-on-spark/ ** Efficiency & Scalability: https://github.com/avulanov/ann-benchmark * Features: ** Multilayer perceptron classifier https://issues.apache.org/jira/browse/SPARK-9471 *In progress (pull request):* * Features:
[jira] [Assigned] (SPARK-17509) When wrapping catalyst datatype to Hive data type avoid pattern matching
[ https://issues.apache.org/jira/browse/SPARK-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17509: Assignee: (was: Apache Spark) > When wrapping catalyst datatype to Hive data type avoid pattern matching > > > Key: SPARK-17509 > URL: https://issues.apache.org/jira/browse/SPARK-17509 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Profiling a job, we saw that pattern matching in the wrap function of > HiveInspector is consuming around 10% of the time, which can be avoided. A > similar change in the unwrap function was made in SPARK-15956. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17509) When wrapping catalyst datatype to Hive data type avoid pattern matching
[ https://issues.apache.org/jira/browse/SPARK-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485327#comment-15485327 ] Apache Spark commented on SPARK-17509: -- User 'sitalkedia' has created a pull request for this issue: https://github.com/apache/spark/pull/15064 > When wrapping catalyst datatype to Hive data type avoid pattern matching > > > Key: SPARK-17509 > URL: https://issues.apache.org/jira/browse/SPARK-17509 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Profiling a job, we saw that pattern matching in the wrap function of > HiveInspector is consuming around 10% of the time, which can be avoided. A > similar change in the unwrap function was made in SPARK-15956. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17509) When wrapping catalyst datatype to Hive data type avoid pattern matching
[ https://issues.apache.org/jira/browse/SPARK-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17509: Assignee: Apache Spark > When wrapping catalyst datatype to Hive data type avoid pattern matching > > > Key: SPARK-17509 > URL: https://issues.apache.org/jira/browse/SPARK-17509 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Apache Spark > > Profiling a job, we saw that pattern matching in the wrap function of > HiveInspector is consuming around 10% of the time, which can be avoided. A > similar change in the unwrap function was made in SPARK-15956. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15621) BatchEvalPythonExec fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-15621: -- Assignee: Davies Liu > BatchEvalPythonExec fails with OOM > -- > > Key: SPARK-15621 > URL: https://issues.apache.org/jira/browse/SPARK-15621 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Krisztian Szucs >Assignee: Davies Liu > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40 > No matter what, the queue grows unboundedly and fails with OOM, even with > the identity `lambda x: x` udf function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
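The unbounded-queue failure mode described above is typically fixed by bounding the buffer so the producer blocks instead of growing memory without limit. A minimal Python sketch of that backpressure pattern (illustrative only, not the actual BatchEvalPythonExec code):

```python
import queue
import threading

# With maxsize set, put() blocks once the queue is full, applying
# backpressure to the producer instead of letting memory grow unboundedly.
buf = queue.Queue(maxsize=4)

def producer(n):
    for i in range(n):
        buf.put(i)    # blocks whenever 4 items are already buffered
    buf.put(None)     # sentinel: no more items

def consume_all():
    out = []
    while (item := buf.get()) is not None:
        out.append(item)
    return out

t = threading.Thread(target=producer, args=(10,))
t.start()
result = consume_all()
t.join()
```

At no point do more than five items (four buffered plus one in flight) exist at once, regardless of how many rows the producer emits.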
[jira] [Resolved] (SPARK-17483) Minor refactoring and cleanup in BlockManager block status reporting and block removal
[ https://issues.apache.org/jira/browse/SPARK-17483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17483. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15036 [https://github.com/apache/spark/pull/15036] > Minor refactoring and cleanup in BlockManager block status reporting and > block removal > -- > > Key: SPARK-17483 > URL: https://issues.apache.org/jira/browse/SPARK-17483 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.1.0 > > > As a precursor to fixing a block fetch bug, I'd like to split a few small > refactorings in BlockManager into their own patch (hence this JIRA). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2352) [MLLIB] Add Artificial Neural Network (ANN) to Spark
[ https://issues.apache.org/jira/browse/SPARK-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485130#comment-15485130 ] Alessio commented on SPARK-2352: Pretty strange that this post with such hype is still "In progress" after 1 year. If Apache Spark does not (want to?) include your ANNs, can you consider releasing it as an independent toolbox? > [MLLIB] Add Artificial Neural Network (ANN) to Spark > > > Key: SPARK-2352 > URL: https://issues.apache.org/jira/browse/SPARK-2352 > Project: Spark > Issue Type: New Feature > Components: MLlib > Environment: MLLIB code >Reporter: Bert Greevenbosch >Assignee: Bert Greevenbosch > > It would be good if the Machine Learning Library contained Artificial Neural > Networks (ANNs). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485131#comment-15485131 ] Alessio commented on SPARK-5575: Pretty strange that this post with such hype is still "In progress" after 1 year. If Apache Spark does not (want to?) include your ANNs, can you consider releasing it as an independent toolbox? > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > *Goal:* Implement various types of artificial neural networks > *Motivation:* (from https://issues.apache.org/jira/browse/SPARK-15581) > Having deep learning within Spark's ML library is a question of convenience. > Spark has broad analytic capabilities and it is useful to have deep learning > as one of these tools at hand. Deep learning is a model of choice for several > important modern use-cases, and Spark ML might want to cover them. > Eventually, it is hard to explain, why do we have PCA in ML but don't provide > Autoencoder. To summarize this, Spark should have at least the most widely > used deep learning models, such as fully connected artificial neural network, > convolutional network and autoencoder. Advanced and experimental deep > learning features might reside within packages or as pluggable external > tools. These 3 will provide a comprehensive deep learning set for Spark ML. > We might also include recurrent networks as well. > *Requirements:* > # Extensible API compatible with Spark ML. Basic abstractions such as Neuron, > Layer, Error, Regularization, Forward and Backpropagation etc. should be > implemented as traits or interfaces, so they can be easily extended or > reused. Define the Spark ML API for deep learning. This interface is similar > to the other analytics tools in Spark and supports ML pipelines. 
This makes > deep learning easy to use and plug in into analytics workloads for Spark > users. > # Efficiency. The current implementation of multilayer perceptron in Spark is > less than 2x slower than Caffe, both measured on CPU. The main overhead > sources are JVM and Spark's communication layer. For more details, please > refer to https://github.com/avulanov/ann-benchmark. Having said that, the > efficient implementation of deep learning in Spark should be only few times > slower than in specialized tool. This is very reasonable for the platform > that does much more than deep learning and I believe it is understood by the > community. > # Scalability. Implement efficient distributed training. It relies heavily on > the efficient communication and scheduling mechanisms. The default > implementation is based on Spark. More efficient implementations might > include some external libraries but use the same interface defined. > *Main features:* > # Multilayer perceptron classifier (MLP) > # Autoencoder > # Convolutional neural networks for computer vision. The interface has to > provide few architectures for deep learning that are widely used in practice, > such as AlexNet > *Additional features:* > # Other architectures, such as Recurrent neural network (RNN), Long-short > term memory (LSTM), Restricted boltzmann machine (RBM), deep belief network > (DBN), MLP multivariate regression > # Regularizers, such as L1, L2, drop-out > # Normalizers > # Network customization. The internal API of Spark ANN is designed to be > flexible and can handle different types of layers. However, only a part of > the API is made public. We have to limit the number of public classes in > order to make it simpler to support other languages. This forces us to use > (String or Number) parameters instead of introducing of new public classes. > One of the options to specify the architecture of ANN is to use text > configuration with layer-wise description. 
We have considered using Caffe > format for this. It gives the benefit of compatibility with well known deep > learning tool and simplifies the support of other languages in Spark. > Implementation of a parser for the subset of Caffe format might be the first > step towards the support of general ANN architectures in Spark. > # Hardware specific optimization. One can wrap other deep learning > implementations with this interface allowing users to pick a particular > back-end, e.g. Caffe or TensorFlow, along with the default one. The interface > has to provide few architectures for deep learning that are widely used in > practice, such as AlexNet. The main motivation for using specialized > libraries for deep learning would be to fully take advantage of the hardware > where Spark runs, in particular GPUs. Having the default
[jira] [Created] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams
Jeff Nadler created SPARK-17510: --- Summary: Set Streaming MaxRate Independently For Multiple Streams Key: SPARK-17510 URL: https://issues.apache.org/jira/browse/SPARK-17510 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 2.0.0 Reporter: Jeff Nadler We use multiple DStreams coming from different Kafka topics in a Streaming application. Some settings like maxrate and backpressure enabled/disabled would be better passed as config to KafkaUtils.createStream and KafkaUtils.createDirectStream, instead of setting them in SparkConf. Being able to set a different maxrate for different streams is an important requirement for us; we currently work-around the problem by using one receiver-based stream and one direct stream. We would like to be able to turn on backpressure for only one of the streams as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
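The requested layering — per-stream settings that override global defaults — could be sketched as follows (key names and the helper are hypothetical, not an actual Spark API):

```python
# Hypothetical global defaults, as would come from SparkConf today.
GLOBAL_DEFAULTS = {
    "maxRatePerPartition": 1000,
    "backpressure.enabled": False,
}

def stream_conf(overrides=None):
    """Build an effective per-stream config: global defaults plus overrides."""
    conf = dict(GLOBAL_DEFAULTS)
    conf.update(overrides or {})
    return conf

# One stream throttled hard, the other with backpressure enabled.
receiver_conf = stream_conf({"maxRatePerPartition": 100})
direct_conf = stream_conf({"backpressure.enabled": True})
```

Passing such a mapping to `createStream`/`createDirectStream` would let each DStream tune its own rate limit, which is exactly what the global SparkConf settings prevent today.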
[jira] [Updated] (SPARK-17409) Query in CTAS is Optimized Twice
[ https://issues.apache.org/jira/browse/SPARK-17409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17409: --- Labels: correctness (was: ) > Query in CTAS is Optimized Twice > > > Key: SPARK-17409 > URL: https://issues.apache.org/jira/browse/SPARK-17409 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Critical > Labels: correctness > > The query in CTAS is optimized twice, as reported in the PR: > https://github.com/apache/spark/pull/14797 > {quote} > Some analyzer rules have assumptions on logical plans, optimizer may break > these assumption, we should not pass an optimized query plan into > QueryExecution (will be analyzed again), otherwise we may some weird bugs. > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17494) Floor function rounds up during join
[ https://issues.apache.org/jira/browse/SPARK-17494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484952#comment-15484952 ] Josh Rosen commented on SPARK-17494: This also seems to affect Spark 2.0, except there it always returns {{11}} when selecting a floor of {{num}}, while taking the floor of a casted literal seems to work as expected. Specifically, {code} select floor(cast(10.5 as decimal(15,6))), num, floor(num) from a {code} returns {{10, 10.5, 11}}. > Floor function rounds up during join > > > Key: SPARK-17494 > URL: https://issues.apache.org/jira/browse/SPARK-17494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Gokhan Civan > Labels: correctness > > If you create tables as follows: > create table a as select 'A' as str, cast(10.5 as decimal(15,6)) as num; > create table b as select 'A' as str; > Then > select floor(num) from a; > returns 10 > but > select floor(num) from a join b on a.str = b.str; > returns 11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
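As a reference for the semantics both reports expect: `FLOOR` rounds toward negative infinity, so `floor(10.5)` must be 10 in every code path, join or not. A quick Python check of that behavior on exact decimals:

```python
from decimal import Decimal
import math

def sql_floor(x: Decimal) -> int:
    """Floor rounds toward negative infinity; it is not truncation."""
    return math.floor(x)

assert sql_floor(Decimal("10.5")) == 10
assert sql_floor(Decimal("-10.5")) == -11  # floor, not round-half-up
```

The joined query returning 11 therefore behaves like a round-half-up, not a floor, which is why the issue carries the correctness label.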
[jira] [Updated] (SPARK-17509) When wrapping catalyst datatype to Hive data type avoid pattern matching
[ https://issues.apache.org/jira/browse/SPARK-17509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-17509: Affects Version/s: 2.0.0 Description: Profiling a job, we saw that pattern matching in the wrap function of HiveInspector is consuming around 10% of the time, which can be avoided. A similar change in the unwrap function was made in SPARK-15956. Component/s: SQL Issue Type: Improvement (was: Bug) > When wrapping catalyst datatype to Hive data type avoid pattern matching > > > Key: SPARK-17509 > URL: https://issues.apache.org/jira/browse/SPARK-17509 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Sital Kedia > > Profiling a job, we saw that pattern matching in the wrap function of > HiveInspector is consuming around 10% of the time, which can be avoided. A > similar change in the unwrap function was made in SPARK-15956. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
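The optimization this issue asks for — resolving the wrapper once per column instead of pattern-matching on the data type for every row — can be sketched in Python (names are illustrative; the actual code lives in Spark's `HiveInspectors`):

```python
# Dispatch table built once, instead of an isinstance/match chain that
# runs in the per-row hot path (the ~10% cost seen in profiling).
WRAPPERS = {
    "int": int,
    "string": str,
    "double": float,
}

def wrapper_for(data_type):
    """Resolve the wrapper function once, outside the per-row loop."""
    return WRAPPERS[data_type]

wrap_int = wrapper_for("int")          # resolved once per column
wrapped = [wrap_int(v) for v in ("1", "2", "3")]  # no per-row dispatch
```

This mirrors the SPARK-15956 change on the unwrap side: the type decision moves from per-value to per-column, which is what removes the pattern-matching overhead from the profile.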
[jira] [Updated] (SPARK-17494) Floor function rounds up during join
[ https://issues.apache.org/jira/browse/SPARK-17494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17494: --- Affects Version/s: 2.0.0 > Floor function rounds up during join > > > Key: SPARK-17494 > URL: https://issues.apache.org/jira/browse/SPARK-17494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Gokhan Civan > Labels: correctness > > If you create tables as follows: > create table a as select 'A' as str, cast(10.5 as decimal(15,6)) as num; > create table b as select 'A' as str; > Then > select floor(num) from a; > returns 10 > but > select floor(num) from a join b on a.str = b.str; > returns 11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17494) Floor function rounds up during join
[ https://issues.apache.org/jira/browse/SPARK-17494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17494: --- Labels: correctness (was: ) > Floor function rounds up during join > > > Key: SPARK-17494 > URL: https://issues.apache.org/jira/browse/SPARK-17494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1, 2.0.0 >Reporter: Gokhan Civan > Labels: correctness > > If you create tables as follows: > create table a as select 'A' as str, cast(10.5 as decimal(15,6)) as num; > create table b as select 'A' as str; > Then > select floor(num) from a; > returns 10 > but > select floor(num) from a join b on a.str = b.str; > returns 11 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17509) When wrapping catalyst datatype to Hive data type avoid pattern matching
Sital Kedia created SPARK-17509: --- Summary: When wrapping catalyst datatype to Hive data type avoid pattern matching Key: SPARK-17509 URL: https://issues.apache.org/jira/browse/SPARK-17509 Project: Spark Issue Type: Bug Reporter: Sital Kedia -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17503. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Fixed for 2.0.1 / 2.1.0 by Sean's PR. > Memory leak in Memory store when unable to cache the whole RDD in memory > > > Key: SPARK-17503 > URL: https://issues.apache.org/jira/browse/SPARK-17503 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Sean Zhong >Assignee: Sean Zhong > Fix For: 2.0.1, 2.1.0 > > Attachments: Screen Shot 2016-09-12 at 4.16.15 PM.png, Screen Shot > 2016-09-12 at 4.34.19 PM.png > > > h2.Problem description: > The following query triggers out of memory error. > {code} > sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > {code} > This is not expected, we should fallback to use disk instead if there is not > enough memory for cache. > Stacktrace: > {code} > scala> sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > [Stage 0:> (0 + 5) / > 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed > 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in > memory! (computed 947.3 MB so far) > 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed > 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in > memory! (computed 1423.7 MB so far) > 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed > java.lang.OutOfMemoryError: Java heap space > Dumping heap to java_pid26528.hprof ... 
> Heap dump file created [6551021666 bytes in 9.876 secs] > 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) > 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, > 55360))] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at
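The expected behavior the ticket describes is: when a partition does not fit in the memory store, release the partially unrolled buffer (keeping it is the leak) and fall back to disk, rather than failing with OutOfMemoryError. A simplified, hypothetical sketch of that fallback shape (a toy store, not Spark's MemoryStore):

```python
class BlockStore:
    """Toy block store with a bounded memory budget and a disk fallback."""

    def __init__(self, memory_budget):
        self.memory_budget = memory_budget
        self.memory = {}  # block_id -> values held in memory
        self.disk = {}    # block_id -> values spilled to "disk"
        self.used = 0

    def put(self, block_id, values):
        size = sum(len(v) for v in values)  # assume each value is a sized chunk
        if self.used + size <= self.memory_budget:
            self.memory[block_id] = values
            self.used += size
        else:
            # Not enough memory: do NOT keep the partially unrolled buffer
            # around (that is the leak); spill the block to disk instead.
            self.disk[block_id] = values

store = BlockStore(memory_budget=10)
store.put("rdd_1_0", ["aaaa", "bbbb"])  # 8 units -> cached in memory
store.put("rdd_1_1", ["cccccccc"])      # 8 more -> exceeds budget, spills
assert "rdd_1_0" in store.memory and "rdd_1_1" in store.disk
```

In the reported bug the equivalent of the partial buffer was retained on failure, so repeated failed puts accumulated until the heap was exhausted.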
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17503: --- Assignee: Sean Zhong > Memory leak in Memory store when unable to cache the whole RDD in memory > > > Key: SPARK-17503 > URL: https://issues.apache.org/jira/browse/SPARK-17503 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Sean Zhong >Assignee: Sean Zhong > Attachments: Screen Shot 2016-09-12 at 4.16.15 PM.png, Screen Shot > 2016-09-12 at 4.34.19 PM.png > > > h2.Problem description: > The following query triggers out of memory error. > {code} > sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > {code} > This is not expected, we should fallback to use disk instead if there is not > enough memory for cache. > Stacktrace: > {code} > scala> sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > [Stage 0:> (0 + 5) / > 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed > 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in > memory! (computed 947.3 MB so far) > 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed > 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in > memory! (computed 1423.7 MB so far) > 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed > java.lang.OutOfMemoryError: Java heap space > Dumping heap to java_pid26528.hprof ... 
> Heap dump file created [6551021666 bytes in 9.876 secs] > 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) > 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, > 55360))] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) > ... 14 more > 16/09/11 17:28:15 ERROR
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484861#comment-15484861 ] Evan Zamir commented on SPARK-17508: Just ran the same snippet of code setting weightCol="" and that runs without error. It's only when I set weightCol=None that I get an error. > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
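The reporter's expectation is that a Python-side {{None}} should mean "parameter unset" (every weight is 1.0), instead of being passed through the Py4J bridge as a null and tripping the NullPointerException in the Scala trainer. A hedged sketch of the defensive normalization a wrapper could perform (illustrative only, not PySpark's actual implementation):

```python
def resolve_weight_col(weight_col):
    """Treat None and empty string as 'no weight column'.

    Returns the column name to use, or None meaning uniform weights of 1.0.
    """
    if weight_col is None or weight_col == "":
        return None
    return weight_col

def row_weight(row, weight_col):
    resolved = resolve_weight_col(weight_col)
    return 1.0 if resolved is None else row[resolved]

row = {"label": 1.0, "weight": 2.5}
assert row_weight(row, "weight") == 2.5
assert row_weight(row, None) == 1.0  # None no longer raises; weight defaults to 1.0
assert row_weight(row, "") == 1.0    # matches the observed weightCol="" behavior
```

This also matches the comment above: {{weightCol=""}} already behaves as "no weights", so normalizing {{None}} to the same path would make the two spellings consistent.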
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484850#comment-15484850 ] Evan Zamir commented on SPARK-17508: Yep, I'm running 2.0.0. You can see in the error messages above that it's running 2.0.0. Can you try running the same code snippet and see if it works for you? > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17503) Memory leak in Memory store when unable to cache the whole RDD in memory
[ https://issues.apache.org/jira/browse/SPARK-17503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17503: --- Target Version/s: 2.0.1, 2.1.0 (was: 2.1.0) > Memory leak in Memory store when unable to cache the whole RDD in memory > > > Key: SPARK-17503 > URL: https://issues.apache.org/jira/browse/SPARK-17503 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Sean Zhong > Attachments: Screen Shot 2016-09-12 at 4.16.15 PM.png, Screen Shot > 2016-09-12 at 4.34.19 PM.png > > > h2.Problem description: > The following query triggers out of memory error. > {code} > sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > {code} > This is not expected, we should fallback to use disk instead if there is not > enough memory for cache. > Stacktrace: > {code} > scala> sc.parallelize(1 to 10, 100).map(x => new > Array[Long](1000)).cache().count() > [Stage 0:> (0 + 5) / > 5]16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_4 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN MemoryStore: Not enough space to cache rdd_1_0 in > memory! (computed 631.5 MB so far) > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_0 failed > 16/09/11 17:27:20 WARN BlockManager: Putting block rdd_1_4 failed > 16/09/11 17:27:21 WARN MemoryStore: Not enough space to cache rdd_1_1 in > memory! (computed 947.3 MB so far) > 16/09/11 17:27:21 WARN BlockManager: Putting block rdd_1_1 failed > 16/09/11 17:27:22 WARN MemoryStore: Not enough space to cache rdd_1_3 in > memory! (computed 1423.7 MB so far) > 16/09/11 17:27:22 WARN BlockManager: Putting block rdd_1_3 failed > java.lang.OutOfMemoryError: Java heap space > Dumping heap to java_pid26528.hprof ... 
> Heap dump file created [6551021666 bytes in 9.876 secs] > 16/09/11 17:28:15 WARN NettyRpcEnv: Ignored message: HeartbeatResponse(false) > 16/09/11 17:28:15 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(driver,[Lscala.Tuple2;@46c9ce96,BlockManagerId(driver, 127.0.0.1, > 55360))] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. This timeout is controlled by spark.executor.heartbeatInterval > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:523) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:552) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1857) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:552) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 > seconds] > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > at > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:190) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:81) > ... 14 more > 16/09/11 17:28:15 ERROR Executor:
[jira] [Commented] (SPARK-17471) Add compressed method for Matrix class
[ https://issues.apache.org/jira/browse/SPARK-17471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484775#comment-15484775 ] Seth Hendrickson commented on SPARK-17471: -- [~yanboliang] Do you have any updates on this? We need to make implementation of the {{compressed}} method for matrices high priority. I can look into implementing it, but I don't want to overlap work. Thanks! > Add compressed method for Matrix class > -- > > Key: SPARK-17471 > URL: https://issues.apache.org/jira/browse/SPARK-17471 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Seth Hendrickson > > Vectors in Spark have a {{compressed}} method which selects either sparse or > dense representation by minimizing storage requirements. Matrices should also > have this method, which is now explicitly needed in {{LogisticRegression}} > since we have implemented multiclass regression. > The compressed method should also give the option to store row major or > column major, and if nothing is specified should select the lower storage > representation (for sparse). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
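For vectors, {{compressed}} compares the storage cost of the dense and sparse encodings and keeps the smaller; the request is the same decision for matrices, plus a row-major/column-major choice for the sparse case. A rough Python sketch of that size comparison (the byte-cost formulas are simplified assumptions, not Spark's exact accounting):

```python
def dense_storage(n_rows, n_cols, bytes_per_value=8):
    # Dense: one 8-byte double per entry.
    return n_rows * n_cols * bytes_per_value

def sparse_storage(n_nonzero, n_major, bytes_per_value=8, bytes_per_index=4):
    # Simplified CSC/CSR cost: values + minor indices + major-axis pointers.
    return n_nonzero * (bytes_per_value + bytes_per_index) + (n_major + 1) * bytes_per_index

def compressed_kind(n_rows, n_cols, n_nonzero):
    """Pick the cheaper representation; for sparse, take the cheaper major order."""
    sparse = min(sparse_storage(n_nonzero, n_cols),   # column-major (CSC)
                 sparse_storage(n_nonzero, n_rows))   # row-major (CSR)
    return "dense" if dense_storage(n_rows, n_cols) <= sparse else "sparse"

assert compressed_kind(1000, 1000, 50) == "sparse"  # 50 nonzeros out of 1M entries
assert compressed_kind(10, 10, 100) == "dense"      # fully populated matrix
```

The last clause of the ticket (default to the lower-storage representation when no major order is specified) corresponds to the inner {{min}} over the two sparse layouts.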
[jira] [Assigned] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17463: Assignee: Shixiong Zhu (was: Apache Spark) > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at 
org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at >
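The {{ConcurrentModificationException}} arises because the heartbeat thread serializes accumulator-update collections while task threads are still mutating them. One standard fix shape is to snapshot the collection under a lock and serialize the snapshot. A small Python analogy of that snapshot-before-serialize pattern ({{pickle}} plus a {{threading.Lock}} stand in for Java serialization; this is not Spark's actual fix):

```python
import pickle
import threading

class AccumulatorUpdates:
    """Mutable update buffer shared between task threads and a heartbeat thread."""

    def __init__(self):
        self._lock = threading.Lock()
        self._updates = []

    def add(self, value):
        with self._lock:
            self._updates.append(value)

    def serialize(self):
        # Copy under the lock, then serialize the immutable snapshot; the live
        # list can keep changing without corrupting the serialized bytes.
        with self._lock:
            snapshot = list(self._updates)
        return pickle.dumps(snapshot)

acc = AccumulatorUpdates()
acc.add(1)
acc.add(2)
payload = acc.serialize()
assert pickle.loads(payload) == [1, 2]
```

Serializing {{self._updates}} directly, as the stack trace shows Java doing with the live {{ArrayList}}, races with concurrent {{add}} calls; the snapshot bounds the critical section to a cheap copy.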
[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484769#comment-15484769 ] Apache Spark commented on SPARK-17463: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15063 > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at 
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at
[jira] [Assigned] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17463: Assignee: Apache Spark (was: Shixiong Zhu) > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at 
org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at >
[jira] [Commented] (SPARK-17321) YARN shuffle service should use good disk from yarn.nodemanager.local-dirs
[ https://issues.apache.org/jira/browse/SPARK-17321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484748#comment-15484748 ] Thomas Graves commented on SPARK-17321: --- Not sure I follow this comment. So you are using NM recovery with the recovery path specified? And you saw an error in the Spark shuffle service creating or writing to the DB, but the NM stayed up OK writing its recovery data to the same disk? > YARN shuffle service should use good disk from yarn.nodemanager.local-dirs > -- > > Key: SPARK-17321 > URL: https://issues.apache.org/jira/browse/SPARK-17321 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.2, 2.0.0 >Reporter: yunjiong zhao > > We run Spark on YARN; after enabling Spark dynamic allocation, we noticed some > Spark applications failed randomly due to YarnShuffleService. > From the log I found > {quote} > 2016-08-29 11:33:03,450 ERROR org.apache.spark.network.TransportContext: > Error while initializing Netty pipeline > java.lang.NullPointerException > at > org.apache.spark.network.server.TransportRequestHandler.(TransportRequestHandler.java:77) > at > org.apache.spark.network.TransportContext.createChannelHandler(TransportContext.java:159) > at > org.apache.spark.network.TransportContext.initializePipeline(TransportContext.java:135) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:123) > at > org.apache.spark.network.server.TransportServer$1.initChannel(TransportServer.java:116) > at > io.netty.channel.ChannelInitializer.channelRegistered(ChannelInitializer.java:69) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRegistered(AbstractChannelHandlerContext.java:133) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRegistered(AbstractChannelHandlerContext.java:119) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRegistered(DefaultChannelPipeline.java:733) > at > 
io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:450) > at > io.netty.channel.AbstractChannel$AbstractUnsafe.access$100(AbstractChannel.java:378) > at > io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:424) > at > io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745) > {quote} > This was caused by the first disk in yarn.nodemanager.local-dirs being broken. > If we enabled spark.yarn.shuffle.stopOnFailure (SPARK-16505) we might lose > hundreds of nodes, which is unacceptable. > We have 12 disks in yarn.nodemanager.local-dirs, so why not use other good > disks if the first one is broken? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
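The fallback the reporter asks for can be sketched as follows. This is a hypothetical illustration, not the YARN shuffle service's actual code: instead of always taking the first entry of yarn.nodemanager.local-dirs, scan the configured directories and use the first one that is present and writable.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper: pick the first healthy directory from the
// yarn.nodemanager.local-dirs list rather than blindly using dirs[0].
public class LocalDirSelector {
    public static Optional<File> firstUsableDir(List<String> localDirs) {
        return localDirs.stream()
                .map(File::new)
                // A broken mount typically shows up as a missing or
                // unwritable directory; skip it and try the next one.
                .filter(d -> d.isDirectory() && d.canWrite())
                .findFirst();
    }

    public static void main(String[] args) {
        List<String> dirs = Arrays.asList(
                "/nonexistent/disk1",                    // simulated bad disk
                System.getProperty("java.io.tmpdir"));   // healthy fallback
        System.out.println(firstUsableDir(dirs).map(File::getPath).orElse("none"));
    }
}
```

With 12 configured disks, this would keep the leveldb-backed recovery state available as long as at least one disk is healthy, instead of failing the whole service when the first disk dies.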
[jira] [Assigned] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-17463: Assignee: Shixiong Zhu > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at 
org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at >
[jira] [Commented] (SPARK-17445) Reference an ASF page as the main place to find third-party packages
[ https://issues.apache.org/jira/browse/SPARK-17445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484732#comment-15484732 ] Matei Zaharia commented on SPARK-17445: --- Sounds good to me. > Reference an ASF page as the main place to find third-party packages > > > Key: SPARK-17445 > URL: https://issues.apache.org/jira/browse/SPARK-17445 > Project: Spark > Issue Type: Improvement >Reporter: Matei Zaharia > > Some comments and docs like > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L148-L151 > say to go to spark-packages.org, but since this is a package index > maintained by a third party, it would be better to reference an ASF page that > we can keep updated and own the URL for.
[jira] [Updated] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-16742: Description: We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. Mesosphere design doc: https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 Mesosphere code: https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa was: We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > Mesosphere design doc: > https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6 > Mesosphere code: > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa
[jira] [Updated] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-16742: Description: We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa was: We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa
[jira] [Commented] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484720#comment-15484720 ] Shixiong Zhu commented on SPARK-17463: -- [~joshrosen] I think we can just leave LongAccum as it is. It's no worse than Spark 1.6. In Spark 1.6, `sum` and `count` are different accumulators and have the same inconsistent issue. We definitely should fix CollectionAccumulator and SetAccumulator. I will submit a PR to add the necessary `synchronized` for them. By the way, I didn't notice that AccumulatorV2 sends the whole object back to the driver. Do you know any special reason? I remember previously we only send the values of accumulators back to driver. > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Priority: Critical > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > 
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at >
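The race behind the {{ConcurrentModificationException}} above, and the `synchronized` fix Shixiong proposes for CollectionAccumulator, can be sketched like this. This is a hypothetical illustration with made-up names, not Spark's actual accumulator code: the heartbeat thread serializes the accumulator's backing ArrayList while a task thread is still appending to it, so the safe pattern is to guard mutation and take a consistent snapshot under the same lock before handing anything to ObjectOutputStream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: guard the backing list so the heartbeat thread
// never serializes it mid-mutation.
public class SafeCollectionAccumulator {
    private final List<Long> values = new ArrayList<>();

    // Called from task threads.
    public void add(long v) {
        synchronized (values) {
            values.add(v);
        }
    }

    // Called from the heartbeat thread: copy under the same lock, then
    // serialize the copy instead of the live list.
    public List<Long> snapshotForHeartbeat() {
        synchronized (values) {
            return new ArrayList<>(values);
        }
    }
}
```

Serializing the snapshot is safe even if tasks keep accumulating, because the copy is never mutated after it is taken.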
[jira] [Commented] (SPARK-17424) Dataset job fails from unsound substitution in ScalaReflect
[ https://issues.apache.org/jira/browse/SPARK-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484718#comment-15484718 ] Ryan Blue commented on SPARK-17424: --- I'm adding the above fix in a PR. This fix works for us (the job succeeds) and doesn't change the behavior of cases where the concrete type arguments are known. > Dataset job fails from unsound substitution in ScalaReflect > --- > > Key: SPARK-17424 > URL: https://issues.apache.org/jira/browse/SPARK-17424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Ryan Blue > > I have a job that uses datasets in 1.6.1 and is failing with this error: > {code} > 16/09/02 17:02:56 ERROR Driver ApplicationMaster: User class threw exception: > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > at scala.reflect.internal.Types$SubstMap.(Types.scala:4644) > at scala.reflect.internal.Types$SubstTypeMap.(Types.scala:4761) > at scala.reflect.internal.Types$Type.subst(Types.scala:796) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:321) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:769) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:768) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:768) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:30) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:610) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames$lzycompute(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:415) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:414) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279) > at scala.collection.AbstractIterator.toMap(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.argsMap(TreeNode.scala:416) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:46) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at >
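The guard Ryan Blue describes can be illustrated schematically. This is a hypothetical, simplified sketch (types modeled as strings, not Scala reflection objects): substituting N declared type parameters with a list of actual arguments of a different length is exactly the "Unsound substitution from List(type T, type U) to List()" assertion above, so when no concrete argument is known for every parameter, skip the substitution and keep the declared types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the fix: only substitute when the formal type
// parameters and the actual type arguments line up one-to-one.
public class TypeSubstitution {
    public static List<String> substitute(List<String> declaredTypes,
                                          List<String> typeParams,
                                          List<String> typeArgs) {
        // Unsound case from the stack trace: params = [T, U], args = [].
        // Fall back to the declared types instead of asserting.
        if (typeParams.size() != typeArgs.size()) {
            return declaredTypes;
        }
        Map<String, String> subst = new HashMap<>();
        for (int i = 0; i < typeParams.size(); i++) {
            subst.put(typeParams.get(i), typeArgs.get(i));
        }
        List<String> out = new ArrayList<>();
        for (String t : declaredTypes) {
            out.add(subst.getOrDefault(t, t));
        }
        return out;
    }
}
```

This matches the behavior the comment claims for the PR: cases where the concrete type arguments are known substitute as before, while the unknown-argument case no longer fails.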
[jira] [Commented] (SPARK-17424) Dataset job fails from unsound substitution in ScalaReflect
[ https://issues.apache.org/jira/browse/SPARK-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484715#comment-15484715 ] Apache Spark commented on SPARK-17424: -- User 'rdblue' has created a pull request for this issue: https://github.com/apache/spark/pull/15062 > Dataset job fails from unsound substitution in ScalaReflect > --- > > Key: SPARK-17424 > URL: https://issues.apache.org/jira/browse/SPARK-17424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Ryan Blue > > I have a job that uses datasets in 1.6.1 and is failing with this error: > {code} > 16/09/02 17:02:56 ERROR Driver ApplicationMaster: User class threw exception: > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > at scala.reflect.internal.Types$SubstMap.(Types.scala:4644) > at scala.reflect.internal.Types$SubstTypeMap.(Types.scala:4761) > at scala.reflect.internal.Types$Type.subst(Types.scala:796) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:321) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:769) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:768) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > 
org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:768) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:30) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:610) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames$lzycompute(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:415) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:414) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279) > at scala.collection.AbstractIterator.toMap(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.argsMap(TreeNode.scala:416) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:46) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at
[jira] [Assigned] (SPARK-17424) Dataset job fails from unsound substitution in ScalaReflect
[ https://issues.apache.org/jira/browse/SPARK-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17424: Assignee: (was: Apache Spark) > Dataset job fails from unsound substitution in ScalaReflect > --- > > Key: SPARK-17424 > URL: https://issues.apache.org/jira/browse/SPARK-17424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Ryan Blue > > I have a job that uses datasets in 1.6.1 and is failing with this error: > {code} > 16/09/02 17:02:56 ERROR Driver ApplicationMaster: User class threw exception: > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > at scala.reflect.internal.Types$SubstMap.(Types.scala:4644) > at scala.reflect.internal.Types$SubstTypeMap.(Types.scala:4761) > at scala.reflect.internal.Types$Type.subst(Types.scala:796) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:321) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:769) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:768) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:768) > at > 
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:30) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:610) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames$lzycompute(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:415) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:414) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279) > at scala.collection.AbstractIterator.toMap(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.argsMap(TreeNode.scala:416) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:46) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at
[jira] [Assigned] (SPARK-17424) Dataset job fails from unsound substitution in ScalaReflect
[ https://issues.apache.org/jira/browse/SPARK-17424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17424: Assignee: Apache Spark > Dataset job fails from unsound substitution in ScalaReflect > --- > > Key: SPARK-17424 > URL: https://issues.apache.org/jira/browse/SPARK-17424 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 2.0.0 >Reporter: Ryan Blue >Assignee: Apache Spark > > I have a job that uses datasets in 1.6.1 and is failing with this error: > {code} > 16/09/02 17:02:56 ERROR Driver ApplicationMaster: User class threw exception: > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > java.lang.AssertionError: assertion failed: Unsound substitution from > List(type T, type U) to List() > at scala.reflect.internal.Types$SubstMap.(Types.scala:4644) > at scala.reflect.internal.Types$SubstTypeMap.(Types.scala:4761) > at scala.reflect.internal.Types$Type.subst(Types.scala:796) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:321) > at > scala.reflect.internal.Types$TypeApiImpl.substituteTypes(Types.scala:298) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:769) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$getConstructorParameters$1.apply(ScalaReflection.scala:768) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.getConstructorParameters(ScalaReflection.scala:768) > at > 
org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:30) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:610) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames$lzycompute(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode.org$apache$spark$sql$catalyst$trees$TreeNode$$argNames(TreeNode.scala:418) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:415) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$argsMap$1.apply(TreeNode.scala:414) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > at scala.collection.TraversableOnce$class.toMap(TraversableOnce.scala:279) > at scala.collection.AbstractIterator.toMap(Iterator.scala:1157) > at > org.apache.spark.sql.catalyst.trees.TreeNode.argsMap(TreeNode.scala:416) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:46) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > 
org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.execution.SparkPlanInfo$.fromSparkPlan(SparkPlanInfo.scala:44) > at > org.apache.spark.sql.execution.SparkPlanInfo$$anonfun$2.apply(SparkPlanInfo.scala:44) > at >
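The assertion in the stack trace above ("Unsound substitution from List(type T, type U) to List()") comes from Scala reflection's invariant that a type-parameter substitution must pair every parameter with an argument. As a minimal, Spark-independent sketch of that invariant (all names here are illustrative, not Scala's or Spark's actual implementation):

```python
def substitute_types(type_refs, from_params, to_args):
    """Replace occurrences of type parameters with concrete arguments.

    Mirrors, in spirit, the invariant asserted by scala.reflect's SubstMap:
    the substitution is only sound when every parameter has a corresponding
    argument. List(type T, type U) -> List() fails exactly this check.
    """
    assert len(from_params) == len(to_args), (
        f"Unsound substitution from {from_params} to {to_args}")
    mapping = dict(zip(from_params, to_args))
    # Leave anything that is not a known type parameter untouched.
    return [mapping.get(t, t) for t in type_refs]

# Sound: both type parameters are given arguments.
print(substitute_types(["T", "U", "Int"], ["T", "U"], ["String", "Long"]))
```

In the reported job, `getConstructorParameters` apparently reaches this code with an empty argument list for a parameterized type, which trips the assertion.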
[jira] [Updated] (SPARK-16742) Kerberos support for Spark on Mesos
[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Gummelt updated SPARK-16742: Description: We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa was:We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be contributing it to Apache Spark soon. > Kerberos support for Spark on Mesos > --- > > Key: SPARK-16742 > URL: https://issues.apache.org/jira/browse/SPARK-16742 > Project: Spark > Issue Type: New Feature > Components: Mesos >Reporter: Michael Gummelt > > We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be > contributing it to Apache Spark soon. > https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484704#comment-15484704 ] Sean Owen commented on SPARK-17508: --- This looks a lot like the problem solved in SPARK-14931 / https://github.com/apache/spark/pull/12816 -- you sure it's an issue in 2.0.0? > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1
[jira] [Assigned] (SPARK-14818) Move sketch and mllibLocal out from mima exclusion
[ https://issues.apache.org/jira/browse/SPARK-14818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-14818: -- Assignee: Josh Rosen > Move sketch and mllibLocal out from mima exclusion > -- > > Key: SPARK-14818 > URL: https://issues.apache.org/jira/browse/SPARK-14818 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Blocker > > In SparkBuild.scala, we exclude sketch and mllibLocal from mima check (see > the definition of mimaProjects). After we release 2.0, we should move them > out from this list. So, we can check binary compatibility for them
[jira] [Assigned] (SPARK-14818) Move sketch and mllibLocal out from mima exclusion
[ https://issues.apache.org/jira/browse/SPARK-14818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14818: Assignee: Apache Spark (was: Josh Rosen) > Move sketch and mllibLocal out from mima exclusion > -- > > Key: SPARK-14818 > URL: https://issues.apache.org/jira/browse/SPARK-14818 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > In SparkBuild.scala, we exclude sketch and mllibLocal from mima check (see > the definition of mimaProjects). After we release 2.0, we should move them > out from this list. So, we can check binary compatibility for them
[jira] [Assigned] (SPARK-14818) Move sketch and mllibLocal out from mima exclusion
[ https://issues.apache.org/jira/browse/SPARK-14818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14818: Assignee: Josh Rosen (was: Apache Spark) > Move sketch and mllibLocal out from mima exclusion > -- > > Key: SPARK-14818 > URL: https://issues.apache.org/jira/browse/SPARK-14818 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Blocker > > In SparkBuild.scala, we exclude sketch and mllibLocal from mima check (see > the definition of mimaProjects). After we release 2.0, we should move them > out from this list. So, we can check binary compatibility for them
[jira] [Commented] (SPARK-14818) Move sketch and mllibLocal out from mima exclusion
[ https://issues.apache.org/jira/browse/SPARK-14818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484691#comment-15484691 ] Apache Spark commented on SPARK-14818: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15061 > Move sketch and mllibLocal out from mima exclusion > -- > > Key: SPARK-14818 > URL: https://issues.apache.org/jira/browse/SPARK-14818 > Project: Spark > Issue Type: Bug > Components: Build >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Blocker > > In SparkBuild.scala, we exclude sketch and mllibLocal from mima check (see > the definition of mimaProjects). After we release 2.0, we should move them > out from this list. So, we can check binary compatibility for them
[jira] [Commented] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484629#comment-15484629 ] Gang Wu commented on SPARK-17477: - [~hyukjin.kwon] I agree with you. But both issues are targeting at parquet data sources. I think it applies to all data sources. > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as its type while hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When using SparkSession to read a Hive table which is stored as parquet > files. If there has been a schema evolution from int to long of a column. > There are some old parquet files use int for the column while some new > parquet files use long. In Hive metastore, the type is long (bigint). > Therefore when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) > at > 
org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) > at > org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) > at >
[jira] [Created] (SPARK-17508) Setting weightCol to None in ML library causes an error
Evan Zamir created SPARK-17508: -- Summary: Setting weightCol to None in ML library causes an error Key: SPARK-17508 URL: https://issues.apache.org/jira/browse/SPARK-17508 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.0.0 Reporter: Evan Zamir The following code runs without error: {code} spark = SparkSession.builder.appName('WeightBug').getOrCreate() df = spark.createDataFrame( [ (1.0, 1.0, Vectors.dense(1.0)), (0.0, 1.0, Vectors.dense(-1.0)) ], ["label", "weight", "features"]) lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") model = lr.fit(df) {code} My expectation from reading the documentation is that setting weightCol=None should treat all weights as 1.0 (regardless of whether a column exists). However, the same code with weightCol set to None causes the following errors: Traceback (most recent call last): File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in model = lr.fit(df) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 64, in fit return self._fit(dataset) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 213, in _fit java_model = self._fit_java(dataset) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", line 210, in _fit_java return self._java_obj.fit(dataset._jdf) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
: java.lang.NullPointerException at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) at org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:211) at java.lang.Thread.run(Thread.java:745) Process finished with exit code 1
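The semantics the reporter expects — a `None` weight column meaning "all weights are 1.0" — can be sketched in plain Python, independent of Spark (names here are illustrative, not MLlib internals; in practice, omitting the `weightCol` argument entirely sidesteps the NullPointerException):

```python
def resolve_weights(rows, weight_col=None):
    """Return per-row weights; a None weight column means unit weights.

    Pure-Python sketch of the documented behavior of weightCol=None:
    every row is weighted 1.0 regardless of what columns exist.
    """
    if weight_col is None:
        return [1.0] * len(rows)
    return [row[weight_col] for row in rows]

rows = [{"label": 1.0, "weight": 2.0}, {"label": 0.0, "weight": 3.0}]
print(resolve_weights(rows))            # all ones: column ignored
print(resolve_weights(rows, "weight"))  # explicit column is read
```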
[jira] [Updated] (SPARK-17477) SparkSQL cannot handle schema evolution from Int -> Long when parquet files have Int as its type while hive metastore has Long as its type
[ https://issues.apache.org/jira/browse/SPARK-17477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Wu updated SPARK-17477: Target Version/s: (was: 2.1.0) > SparkSQL cannot handle schema evolution from Int -> Long when parquet files > have Int as its type while hive metastore has Long as its type > -- > > Key: SPARK-17477 > URL: https://issues.apache.org/jira/browse/SPARK-17477 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Gang Wu > > When using SparkSession to read a Hive table which is stored as parquet > files. If there has been a schema evolution from int to long of a column. > There are some old parquet files use int for the column while some new > parquet files use long. In Hive metastore, the type is long (bigint). > Therefore when I use the following: > {quote} > sparkSession.sql("select * from table").show() > {quote} > I got the following exception: > {quote} > 16/08/29 17:50:20 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 > (TID 91, XXX): org.apache.parquet.io.ParquetDecodingException: Can not read > value at 0 in block 0 in file > hdfs://path/to/parquet/1-part-r-0-d8e4f5aa-b6b9-4cad-8432-a7ae7a590a93.gz.parquet > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:228) > at > org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:36) > at > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:128) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:91) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to > org.apache.spark.sql.catalyst.expressions.MutableInt > at > org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:246) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RowUpdater.setInt(ParquetRowConverter.scala:161) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter.addInt(ParquetRowConverter.scala:85) > at > org.apache.parquet.column.impl.ColumnReaderImpl$2$3.writeValue(ColumnReaderImpl.java:249) > at > 
org.apache.parquet.column.impl.ColumnReaderImpl.writeCurrentValueToConverter(ColumnReaderImpl.java:365) > at > org.apache.parquet.io.RecordReaderImplementation.read(RecordReaderImplementation.java:405) > at > org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:209) > ... 22 more > {quote} > But this kind of schema evolution (int =>
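The ClassCastException above boils down to decoding with the table's declared type (bigint) instead of the type the file was written with (int32). A Spark-independent sketch of the widening a reader must do, using only `struct` (the format names are illustrative):

```python
import struct

def read_stored_int(raw, stored_type):
    """Decode a value using the type it was *written* with, then widen.

    Old Parquet files store int32 while the Hive metastore declares bigint;
    a correct reader decodes with the writer's physical type and upcasts,
    rather than assuming the table's logical type.
    """
    fmt = {"int32": "<i", "int64": "<q"}[stored_type]
    (value,) = struct.unpack(fmt, raw)
    return int(value)  # Python ints are unbounded, so widening is implicit

old_file_bytes = struct.pack("<i", 42)             # written when the column was int
new_file_bytes = struct.pack("<q", 7_000_000_000)  # written after the evolution
print(read_stored_int(old_file_bytes, "int32"))
print(read_stored_int(new_file_bytes, "int64"))
```

The failure mode in the issue is the inverse: the generated converter calls `setInt` on a row slot already typed as long, so no such widening step ever runs.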
[jira] [Commented] (SPARK-15406) Structured streaming support for consuming from Kafka
[ https://issues.apache.org/jira/browse/SPARK-15406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484602#comment-15484602 ] Chris Parmer commented on SPARK-15406: -- For my team, we are just primarily interested in the SQL / table-like interface to Kafka with time indexing. Thanks for the discussion! > Structured streaming support for consuming from Kafka > - > > Key: SPARK-15406 > URL: https://issues.apache.org/jira/browse/SPARK-15406 > Project: Spark > Issue Type: New Feature >Reporter: Cody Koeninger > > Structured streaming doesn't have support for kafka yet. I personally feel > like time based indexing would make for a much better interface, but it's > been pushed back to kafka 0.10.1 > https://cwiki.apache.org/confluence/display/KAFKA/KIP-33+-+Add+a+time+based+log+index
[jira] [Assigned] (SPARK-17507) check weight vector size in ANN
[ https://issues.apache.org/jira/browse/SPARK-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17507: Assignee: (was: Apache Spark) > check weight vector size in ANN > --- > > Key: SPARK-17507 > URL: https://issues.apache.org/jira/browse/SPARK-17507 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Check the size of the user-specified weight vector in ANN, > and throw an exception promptly if it is wrong.
[jira] [Assigned] (SPARK-17507) check weight vector size in ANN
[ https://issues.apache.org/jira/browse/SPARK-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17507: Assignee: Apache Spark > check weight vector size in ANN > --- > > Key: SPARK-17507 > URL: https://issues.apache.org/jira/browse/SPARK-17507 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > Check the size of the user-specified weight vector in ANN, > and throw an exception promptly if it is wrong.
[jira] [Commented] (SPARK-17507) check weight vector size in ANN
[ https://issues.apache.org/jira/browse/SPARK-17507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484586#comment-15484586 ] Apache Spark commented on SPARK-17507: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/15060 > check weight vector size in ANN > --- > > Key: SPARK-17507 > URL: https://issues.apache.org/jira/browse/SPARK-17507 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > Check the size of the user-specified weight vector in ANN, > and throw an exception promptly if it is wrong.
[jira] [Created] (SPARK-17507) check weight vector size in ANN
Weichen Xu created SPARK-17507: -- Summary: check weight vector size in ANN Key: SPARK-17507 URL: https://issues.apache.org/jira/browse/SPARK-17507 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Weichen Xu Check the size of the user-specified weight vector in ANN, and throw an exception promptly if it is wrong.
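The proposed check can be sketched as follows — a fully connected feed-forward network with layer sizes `layers` needs `(layers[i] + 1) * layers[i+1]` parameters per layer (one weight per input plus a bias per output unit). This is an illustrative sketch, not MLlib's actual validation code:

```python
def expected_weight_count(layers):
    """Total parameter count for a fully connected feed-forward ANN."""
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layers, layers[1:]))

def validate_initial_weights(layers, weights):
    """Fail fast, before training starts, on a mismatched weight vector."""
    expected = expected_weight_count(layers)
    if len(weights) != expected:
        raise ValueError(
            f"initial weight vector has {len(weights)} values, "
            f"but topology {layers} requires {expected}")

layers = [4, 5, 2]                           # e.g. input, hidden, output sizes
validate_initial_weights(layers, [0.0] * 37)  # (4+1)*5 + (5+1)*2 = 37, OK
```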
[jira] [Commented] (SPARK-4633) Support gzip in spark.compression.io.codec
[ https://issues.apache.org/jira/browse/SPARK-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484488#comment-15484488 ] Adam Roberts commented on SPARK-4633: - Very interested in this, and I know Nasser Ebrahim is also (full disclosure: we both work for IBM). https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/ shows promising results. It would be interesting to code up a quick prototype (perhaps based on the pull request here) and see what performance difference we can gain; it looks like Takeshi has done the starting work for us > Support gzip in spark.compression.io.codec > -- > > Key: SPARK-4633 > URL: https://issues.apache.org/jira/browse/SPARK-4633 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Reporter: Takeshi Yamamuro >Priority: Trivial > > gzip is widely used in other frameworks such as Hadoop MapReduce and Tez, and > I also think that gzip is more stable than other codecs in terms of both > performance and space overheads. > I have one open question: current Spark configurations have a block size option > for each codec (spark.io.compression.[gzip|lz4|snappy].block.size). > As the number of codecs increases, the configurations gain more options, and > I think that this is somewhat complicated for non-expert users. > To mitigate this, my thought is as follows: > the three configurations are replaced with a single option for block size > (spark.io.compression.block.size). Then, 'Meaning' in the configurations > will describe "This option affects gzip, lz4, and snappy. > Block size (in bytes) used in compression, in the case when these compression > codecs are used. Lowering...".
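The consolidation Takeshi proposes could resolve a codec's block size along these lines. This is only a sketch of the proposal, not Spark's actual configuration code; the `blockSize` helper and the fallback order (per-codec key first, then the proposed unified key, then a default) are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

public class BlockSizeConf {
    // Resolve the block size for a codec: a per-codec option wins if set,
    // otherwise fall back to the proposed unified option, then a default.
    public static int blockSize(Map<String, String> conf, String codec, int dflt) {
        String perCodec = conf.get("spark.io.compression." + codec + ".block.size");
        if (perCodec != null) return Integer.parseInt(perCodec);
        String unified = conf.get("spark.io.compression.block.size");
        if (unified != null) return Integer.parseInt(unified);
        return dflt;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.io.compression.block.size", "65536");
        // One unified option now covers gzip, lz4, and snappy.
        System.out.println(blockSize(conf, "gzip", 32768));
        System.out.println(blockSize(conf, "lz4", 32768));
    }
}
```

Keeping the per-codec key as an override would let the single option simplify the common case without breaking users who already tuned an individual codec.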
[jira] [Updated] (SPARK-17506) Improve the check double values equality rule
[ https://issues.apache.org/jira/browse/SPARK-17506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17506: -- Priority: Minor (was: Critical) > Improve the check double values equality rule > - > > Key: SPARK-17506 > URL: https://issues.apache.org/jira/browse/SPARK-17506 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo >Priority: Minor > > In `ExpressionEvalHelper`, we check the equality between two double values by > comparing whether the expected value is within the range [target - tolerance, > target + tolerance], but this can cause a false negative when the compared > numerics are very large. > For example: > {code} > val1 = 1.6358558070241E306 > val2 = 1.6358558070240974E306 > ExpressionEvalHelper.compareResults(val1, val2) > false > {code} > In fact, val1 and val2 are the same value, just with different precision; we should > tolerate this case.
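The failure mode is easy to reproduce with plain doubles: an absolute tolerance that works near 1.0 is vanishingly small next to values around 1e306. The `relEqual` check below is a sketch of the usual fix (scale the tolerance by the magnitude of the operands), not the actual `ExpressionEvalHelper` change:

```java
public class DoubleCompare {
    // Absolute tolerance: |expected - actual| <= tol.
    // Rejects near-identical huge values, because even one ulp at 1e306
    // is astronomically larger than any fixed small tolerance.
    public static boolean absEqual(double expected, double actual, double tol) {
        return Math.abs(expected - actual) <= tol;
    }

    // Relative tolerance: the allowed difference scales with the
    // magnitude of the values being compared.
    public static boolean relEqual(double expected, double actual, double relTol) {
        if (expected == actual) return true; // also covers exact zeros
        double scale = Math.max(Math.abs(expected), Math.abs(actual));
        return Math.abs(expected - actual) <= relTol * scale;
    }

    public static void main(String[] args) {
        double val1 = 1.6358558070241E306;
        double val2 = 1.6358558070240974E306;
        // The absolute check reports "not equal"...
        System.out.println(absEqual(val1, val2, 1e-8)); // false
        // ...while the relative check tolerates the precision difference.
        System.out.println(relEqual(val1, val2, 1e-8)); // true
    }
}
```

Here val1 - val2 is about 2.6e291, far larger than any fixed tolerance, yet only about 1.6e-15 of the values themselves, so a relative check accepts them.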
[jira] [Assigned] (SPARK-17506) Improve the check double values equality rule
[ https://issues.apache.org/jira/browse/SPARK-17506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17506: Assignee: Apache Spark > Improve the check double values equality rule > - > > Key: SPARK-17506 > URL: https://issues.apache.org/jira/browse/SPARK-17506 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo >Assignee: Apache Spark >Priority: Critical > > In `ExpressionEvalHelper`, we check the equality between two double values by > comparing whether the expected value is within the range [target - tolerance, > target + tolerance], but this can cause a false negative when the compared > numerics are very large. > For example: > {code} > val1 = 1.6358558070241E306 > val2 = 1.6358558070240974E306 > ExpressionEvalHelper.compareResults(val1, val2) > false > {code} > In fact, val1 and val2 are the same value, just with different precision; we should > tolerate this case.
[jira] [Commented] (SPARK-17506) Improve the check double values equality rule
[ https://issues.apache.org/jira/browse/SPARK-17506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15484443#comment-15484443 ] Apache Spark commented on SPARK-17506: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/15059 > Improve the check double values equality rule > - > > Key: SPARK-17506 > URL: https://issues.apache.org/jira/browse/SPARK-17506 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo >Priority: Critical > > In `ExpressionEvalHelper`, we check the equality between two double values by > comparing whether the expected value is within the range [target - tolerance, > target + tolerance], but this can cause a false negative when the compared > numerics are very large. > For example: > {code} > val1 = 1.6358558070241E306 > val2 = 1.6358558070240974E306 > ExpressionEvalHelper.compareResults(val1, val2) > false > {code} > In fact, val1 and val2 are the same value, just with different precision; we should > tolerate this case.
[jira] [Assigned] (SPARK-17506) Improve the check double values equality rule
[ https://issues.apache.org/jira/browse/SPARK-17506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17506: Assignee: (was: Apache Spark) > Improve the check double values equality rule > - > > Key: SPARK-17506 > URL: https://issues.apache.org/jira/browse/SPARK-17506 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Jiang Xingbo >Priority: Critical > > In `ExpressionEvalHelper`, we check the equality between two double values by > comparing whether the expected value is within the range [target - tolerance, > target + tolerance], but this can cause a false negative when the compared > numerics are very large. > For example: > {code} > val1 = 1.6358558070241E306 > val2 = 1.6358558070240974E306 > ExpressionEvalHelper.compareResults(val1, val2) > false > {code} > In fact, val1 and val2 are the same value, just with different precision; we should > tolerate this case.