[jira] [Commented] (SPARK-20209) Execute next trigger immediately if previous batch took longer than trigger interval
[ https://issues.apache.org/jira/browse/SPARK-20209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954665#comment-15954665 ] Apache Spark commented on SPARK-20209: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/17525 > Execute next trigger immediately if previous batch took longer than trigger > interval > > > Key: SPARK-20209 > URL: https://issues.apache.org/jira/browse/SPARK-20209 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das > > For large trigger intervals (e.g. 10 minutes), if a batch takes 11 minutes, > then it will wait 9 minutes before starting the next batch. This does not > make sense. The processing-time-based trigger policy should process > batches as fast as possible, but no faster than once per trigger interval. > If batches are already taking longer than the trigger interval, there is no point > waiting an extra trigger interval. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20209) Execute next trigger immediately if previous batch took longer than trigger interval
[ https://issues.apache.org/jira/browse/SPARK-20209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20209: Assignee: Tathagata Das (was: Apache Spark) > Execute next trigger immediately if previous batch took longer than trigger > interval > > > Key: SPARK-20209 > URL: https://issues.apache.org/jira/browse/SPARK-20209 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tathagata Das >Assignee: Tathagata Das
[jira] [Assigned] (SPARK-20209) Execute next trigger immediately if previous batch took longer than trigger interval
[ https://issues.apache.org/jira/browse/SPARK-20209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20209: Assignee: Apache Spark (was: Tathagata Das) > Execute next trigger immediately if previous batch took longer than trigger > interval > > > Key: SPARK-20209 > URL: https://issues.apache.org/jira/browse/SPARK-20209 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tathagata Das >Assignee: Apache Spark
[jira] [Created] (SPARK-20209) Execute next trigger immediately if previous batch took longer than trigger interval
Tathagata Das created SPARK-20209: - Summary: Execute next trigger immediately if previous batch took longer than trigger interval Key: SPARK-20209 URL: https://issues.apache.org/jira/browse/SPARK-20209 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Tathagata Das Assignee: Tathagata Das For large trigger intervals (e.g. 10 minutes), if a batch takes 11 minutes, it will wait 9 minutes before starting the next batch. This does not make sense. The processing-time-based trigger policy should process batches as fast as possible, but no faster than once per trigger interval. If batches are already taking longer than the trigger interval, there is no point waiting an extra trigger interval.
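The proposed policy boils down to a small timing rule, sketched below as a pure function. This is an illustrative helper only, with hypothetical names; it is not Spark's actual trigger-executor code.

```python
def next_trigger_time(batch_start, batch_end, interval):
    """When should the next batch start, given the last batch's timing?

    Proposed policy: run no more than one batch per trigger interval,
    but if a batch overran the interval, start the next one immediately
    instead of waiting out another full interval.
    """
    if batch_end - batch_start >= interval:
        return batch_end  # batch overran: trigger the next one immediately
    return batch_start + interval  # otherwise wait for the interval boundary

# 10-minute interval (600s): an 11-minute batch (660s) triggers the next
# batch at t=660 rather than t=1200; a 5-minute batch still waits to t=600.
print(next_trigger_time(0, 660, 600))
print(next_trigger_time(0, 300, 600))
```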
[jira] [Created] (SPARK-20208) Document R fpGrowth support in vignettes, programming guide and code example
Felix Cheung created SPARK-20208: Summary: Document R fpGrowth support in vignettes, programming guide and code example Key: SPARK-20208 URL: https://issues.apache.org/jira/browse/SPARK-20208 Project: Spark Issue Type: Bug Components: Documentation, SparkR Affects Versions: 2.2.0 Reporter: Felix Cheung
[jira] [Resolved] (SPARK-20067) Unify and Clean Up Desc Commands Using Catalog Interface
[ https://issues.apache.org/jira/browse/SPARK-20067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20067. - Resolution: Fixed Fix Version/s: 2.2.0 > Unify and Clean Up Desc Commands Using Catalog Interface > > > Key: SPARK-20067 > URL: https://issues.apache.org/jira/browse/SPARK-20067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0 > > > We should unify and clean up the outputs of `DESC EXTENDED/FORMATTED` and > `SHOW TABLE EXTENDED` by moving the logic into the Catalog interface. The > output formats are improved. We also add the missing attributes. It impacts > the DDL commands like `SHOW TABLE EXTENDED`, `DESC EXTENDED` and `DESC > FORMATTED`. > In addition, by following what we did in the Dataset API `printSchema`, we can > use `treeString` to show the schema in a more readable way.
[jira] [Updated] (SPARK-20067) Unify and Clean Up Desc Commands Using Catalog Interface
[ https://issues.apache.org/jira/browse/SPARK-20067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20067: Summary: Unify and Clean Up Desc Commands Using Catalog Interface (was: Use treeString to print out the table schema for CatalogTable) > Unify and Clean Up Desc Commands Using Catalog Interface > > > Key: SPARK-20067 > URL: https://issues.apache.org/jira/browse/SPARK-20067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0
[jira] [Updated] (SPARK-20067) Use treeString to print out the table schema for CatalogTable
[ https://issues.apache.org/jira/browse/SPARK-20067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-20067: Description: We should unify and clean up the outputs of `DESC EXTENDED/FORMATTED` and `SHOW TABLE EXTENDED` by moving the logic into the Catalog interface. The output formats are improved. We also add the missing attributes. It impacts the DDL commands like `SHOW TABLE EXTENDED`, `DESC EXTENDED` and `DESC FORMATTED`. In addition, by following what we did in the Dataset API `printSchema`, we can use `treeString` to show the schema in a more readable way. was: Currently, we are using {{sql}} to print the schema. To make the schema more readable, we should use {{treeString}}, like what we did in the Dataset API {{printSchema}}. Below is the current way: {noformat} Schema: STRUCT<`a`: STRING (nullable = true), `b`: INT (nullable = true), `c`: STRING (nullable = true), `d`: STRING (nullable = true)> {noformat} After the change, it should look like: {noformat} Schema: root |-- a: string (nullable = true) |-- b: integer (nullable = true) |-- c: string (nullable = true) |-- d: string (nullable = true) {noformat} > Use treeString to print out the table schema for CatalogTable > - > > Key: SPARK-20067 > URL: https://issues.apache.org/jira/browse/SPARK-20067 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li
[jira] [Commented] (SPARK-20026) Document R GLM Tweedie family support in programming guide and code example
[ https://issues.apache.org/jira/browse/SPARK-20026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954626#comment-15954626 ] Felix Cheung commented on SPARK-20026: -- [~actuaryzhang] would you like to work on this for the 2.2 release? > Document R GLM Tweedie family support in programming guide and code example > --- > > Key: SPARK-20026 > URL: https://issues.apache.org/jira/browse/SPARK-20026 > Project: Spark > Issue Type: Bug > Components: Documentation, SparkR >Affects Versions: 2.2.0 >Reporter: Felix Cheung >
[jira] [Commented] (SPARK-19235) Enable Test Cases in DDLSuite with Hive Metastore
[ https://issues.apache.org/jira/browse/SPARK-19235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954617#comment-15954617 ] Apache Spark commented on SPARK-19235: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/17524 > Enable Test Cases in DDLSuite with Hive Metastore > - > > Key: SPARK-19235 > URL: https://issues.apache.org/jira/browse/SPARK-19235 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0 > > > So far, the test cases in DDLSuite only verify the behaviors of > InMemoryCatalog. That means, they do not cover the scenarios using > HiveExternalCatalog. Thus, we need to improve the existing test suite to run > these cases using Hive metastore.
[jira] [Commented] (SPARK-20193) Selecting empty struct causes ExpressionEncoder error.
[ https://issues.apache.org/jira/browse/SPARK-20193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954584#comment-15954584 ] Liang-Chi Hsieh commented on SPARK-20193: - Actually I am not sure what {{struct()}} represents. If you want a null for this struct, you can write: {code} spark.range(3).select(col("id"), lit(null).cast(new StructType())) {code} > Selecting empty struct causes ExpressionEncoder error. > -- > > Key: SPARK-20193 > URL: https://issues.apache.org/jira/browse/SPARK-20193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu > Labels: struct > > {{def struct(cols: Column*): Column}} > Given the above signature and the lack of any note in the docs saying that a > struct with no columns is not supported, I would expect the following to work: > {{spark.range(3).select(col("id"), struct().as("empty_struct")).collect}} > However, this results in: > {quote} > java.lang.AssertionError: assertion failed: each serializer expression should > contains at least one `BoundReference` > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:240) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:238) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.(ExpressionEncoder.scala:238) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:63) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at > 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2837) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1131) > ... 39 elided > {quote}
[jira] [Created] (SPARK-20207) Add ability to exclude current row in WindowSpec
Mathew Wicks created SPARK-20207: Summary: Add ability to exclude current row in WindowSpec Key: SPARK-20207 URL: https://issues.apache.org/jira/browse/SPARK-20207 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Mathew Wicks Priority: Minor It would be useful if we could implement a way to exclude the current row in WindowSpec. (We can currently only select ranges of rows/time.) Currently, users have to resort to ridiculous measures to exclude the current row from windowing aggregations, as seen here: http://stackoverflow.com/questions/43180723/spark-sql-excluding-the-current-row-in-partition-by-windowing-functions/43198839#43198839
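The workaround linked above amounts to aggregating over the full window and then subtracting the current row's contribution, e.g. (sum - x) / (count - 1) for a mean. A minimal sketch of that arithmetic in plain Python (a hypothetical helper, not a Spark API):

```python
def mean_excluding_current(values):
    """Per element, the mean of the other elements in its partition.

    Mimics the window workaround: aggregate sum and count over the whole
    window, then remove the current row: (sum - x) / (count - 1).
    """
    total, count = sum(values), len(values)
    return [(total - x) / (count - 1) for x in values]

# total = 10.0, so excluding 1.0 gives 9/3 = 3.0, excluding 4.0 gives 6/3 = 2.0
print(mean_excluding_current([1.0, 2.0, 3.0, 4.0]))
```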
[jira] [Commented] (SPARK-20144) spark.read.parquet no longer maintains ordering of the data
[ https://issues.apache.org/jira/browse/SPARK-20144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954559#comment-15954559 ] Liang-Chi Hsieh commented on SPARK-20144: - I don't think the API guarantees anything about data ordering. The difference between 1.6.3 and 2.0.2 is just due to a change in the internal implementation. I checked the current FileSourceScanExec; it still reorders the partition files. When you save sorted data into Parquet, only the data within an individual Parquet file maintains its ordering. We shouldn't expect a particular ordering of the whole data read back if the API doesn't explicitly guarantee it. > spark.read.parquet no longer maintains ordering of the data > - > > Key: SPARK-20144 > URL: https://issues.apache.org/jira/browse/SPARK-20144 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Li Jin > > Hi, we are trying to upgrade Spark from 1.6.3 to 2.0.2. One issue we found is > that when we read parquet files in 2.0.2, the ordering of rows in the resulting > dataframe is not the same as the ordering of rows in the dataframe from which the > parquet file was produced. > This is because FileSourceStrategy.scala combines the parquet files into > fewer partitions and also reorders them. This breaks our workflows because > they assume the ordering of the data. > Is this considered a bug? Also, FileSourceStrategy and FileSourceScanExec > changed quite a bit from 2.0.2 to 2.1, so we are not sure if this is an issue with > 2.1.
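Since the file-source read offers no ordering guarantee, the robust fix on the workflow side is to carry an explicit ordering column and sort on it after reading. The effect is simulated below with plain tuples rather than the Spark API; the helper name is hypothetical.

```python
def restore_order(rows):
    """Rows read back from a file source may arrive with file partitions
    combined and reordered; sorting on an explicit index column carried
    in the data recovers the original write order.
    """
    return sorted(rows, key=lambda r: r[0])

# written in order 0..3, but read back with partitions shuffled
read_back = [(2, "c"), (0, "a"), (3, "d"), (1, "b")]
print(restore_order(read_back))
```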
[jira] [Updated] (SPARK-20079) Re registration of AM hangs spark cluster in yarn-client mode
[ https://issues.apache.org/jira/browse/SPARK-20079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-20079: Description: The ExecutorAllocationManager.reset method is called when the AM is re-registered, which sets the ExecutorAllocationManager.initializing field to true. While this field is true, the driver does not start new executors in response to AM requests. Two cases set the field back to false: 1. An executor is idle for some time. 2. New stages are submitted. After a stage has been submitted and the AM is killed and restarted, neither case can occur: 1. When the AM is killed, YARN kills all running containers, so all executors are lost and no executor can become idle. 2. With no surviving executors, the current stage can never complete, so the DAG scheduler will not submit a new stage. Reproduction steps: 1. Start cluster {noformat} echo -e "sc.parallelize(1 to 2000).foreach(_ => Thread.sleep(1000))" | ./bin/spark-shell --master yarn-client --executor-cores 1 --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=2 {noformat} 2. Kill the AM process when a stage is scheduled. was: 1. Start cluster echo -e "sc.parallelize(1 to 2000).foreach(_ => Thread.sleep(1000))" | ./bin/spark-shell --master yarn-client --executor-cores 1 --conf spark.shuffle.service.enabled=true --conf spark.dynamicAllocation.enabled=true --conf spark.dynamicAllocation.maxExecutors=2 2. Kill the AM process when a stage is scheduled. > Re registration of AM hangs spark cluster in yarn-client mode > - > > Key: SPARK-20079 > URL: https://issues.apache.org/jira/browse/SPARK-20079 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Guoqiang Li
[jira] [Commented] (SPARK-11421) Add the ability to add a jar to the current class loader
[ https://issues.apache.org/jira/browse/SPARK-11421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954533#comment-15954533 ] Daniel Erenrich commented on SPARK-11421: - Is this not basically a duplicate of the much older https://issues.apache.org/jira/browse/SPARK-5377? > Add the ability to add a jar to the current class loader > > > Key: SPARK-11421 > URL: https://issues.apache.org/jira/browse/SPARK-11421 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: holdenk >Priority: Minor > > addJar adds jars for future operations, but it could also add to the current > class loader; this would be really useful in Python and R, where > some included Python code may wish to add some jars.
[jira] [Commented] (SPARK-20176) Spark Dataframe UDAF issue
[ https://issues.apache.org/jira/browse/SPARK-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954527#comment-15954527 ] Dinesh Man Amatya commented on SPARK-20176: --- Thanks Kazuaki for the effort. I was able to resolve the issue by upgrading the spark and scala version as follows, scala.version : 2.11.5 scala.compat.version : 2.11 spark.version : 2.1.0 > Spark Dataframe UDAF issue > -- > > Key: SPARK-20176 > URL: https://issues.apache.org/jira/browse/SPARK-20176 > Project: Spark > Issue Type: IT Help > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Dinesh Man Amatya > > Getting following error in custom UDAF > Error while decoding: java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 33: Incompatible expression types "boolean" and "java.lang.Boolean" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private Object[] values1; > /* 011 */ private org.apache.spark.sql.types.StructType schema; > /* 012 */ private org.apache.spark.sql.types.StructType schema1; > /* 013 */ > /* 014 */ > /* 015 */ public SpecificSafeProjection(Object[] references) { > /* 016 */ this.references = references; > /* 017 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 018 */ > /* 019 */ > /* 020 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) > references[1]; > /* 022 */ } > /* 023 */ > /* 024 */ public java.lang.Object apply(java.lang.Object 
_i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ values = new Object[2]; > /* 028 */ > /* 029 */ boolean isNull2 = i.isNullAt(0); > /* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 031 */ > /* 032 */ boolean isNull1 = isNull2; > /* 033 */ final java.lang.String value1 = isNull1 ? null : > (java.lang.String) value2.toString(); > /* 034 */ isNull1 = value1 == null; > /* 035 */ if (isNull1) { > /* 036 */ values[0] = null; > /* 037 */ } else { > /* 038 */ values[0] = value1; > /* 039 */ } > /* 040 */ > /* 041 */ boolean isNull5 = i.isNullAt(1); > /* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2)); > /* 043 */ boolean isNull3 = false; > /* 044 */ org.apache.spark.sql.Row value3 = null; > /* 045 */ if (!false && isNull5) { > /* 046 */ > /* 047 */ final org.apache.spark.sql.Row value6 = null; > /* 048 */ isNull3 = true; > /* 049 */ value3 = value6; > /* 050 */ } else { > /* 051 */ > /* 052 */ values1 = new Object[2]; > /* 053 */ > /* 054 */ boolean isNull10 = i.isNullAt(1); > /* 055 */ InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2)); > /* 056 */ > /* 057 */ boolean isNull9 = isNull10 || false; > /* 058 */ final boolean value9 = isNull9 ? false : (Boolean) > value10.isNullAt(0); > /* 059 */ boolean isNull8 = false; > /* 060 */ double value8 = -1.0; > /* 061 */ if (!isNull9 && value9) { > /* 062 */ > /* 063 */ final double value12 = -1.0; > /* 064 */ isNull8 = true; > /* 065 */ value8 = value12; > /* 066 */ } else { > /* 067 */ > /* 068 */ boolean isNull14 = i.isNullAt(1); > /* 069 */ InternalRow value14 = isNull14 ? 
null : (i.getStruct(1, 2)); > /* 070 */ boolean isNull13 = isNull14; > /* 071 */ double value13 = -1.0; > /* 072 */ > /* 073 */ if (!isNull14) { > /* 074 */ > /* 075 */ if (value14.isNullAt(0)) { > /* 076 */ isNull13 = true; > /* 077 */ } else { > /* 078 */ value13 = value14.getDouble(0); > /* 079 */ } > /* 080 */ > /* 081 */ } > /* 082 */ isNull8 = isNull13; > /* 083 */ value8 = value13; > /* 084 */ } > /* 085 */ if (isNull8) { > /* 086 */ values1[0] = null; > /* 087 */ } else { > /* 088 */ values1[0] = value8; > /* 089 */ } > /* 090 */ > /* 091 */ boolean isNull17 = i.isNullAt(1); > /* 092 */ InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2)); > /* 093 */ > /* 094 */ boolean isNull16 = isNull17 || false; > /* 095 */ final boolean value16 = isNull16 ? false : (Boolean) > value17.isNullAt(1); > /* 096 */
[jira] [Updated] (SPARK-20206) spark.ui.killEnabled=false property doesn't reflect on task/stages
[ https://issues.apache.org/jira/browse/SPARK-20206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] srinivasan updated SPARK-20206: --- Priority: Minor (was: Major) > spark.ui.killEnabled=false property doesn't reflect on task/stages > -- > > Key: SPARK-20206 > URL: https://issues.apache.org/jira/browse/SPARK-20206 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.1.0 >Reporter: srinivasan >Priority: Minor > > The spark.ui.killEnabled=false property does not take effect for active tasks and > stages; the kill hyperlink is still enabled on active tasks and stages.
[jira] [Created] (SPARK-20206) spark.ui.killEnabled=false property doesn't reflect on task/stages
srinivasan created SPARK-20206: -- Summary: spark.ui.killEnabled=false property doesn't reflect on task/stages Key: SPARK-20206 URL: https://issues.apache.org/jira/browse/SPARK-20206 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.1.0 Reporter: srinivasan The spark.ui.killEnabled=false property does not take effect for active tasks and stages; the kill hyperlink is still enabled on active tasks and stages.
[jira] [Comment Edited] (SPARK-14726) Support for sampling when inferring schema in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954447#comment-15954447 ] Hyukjin Kwon edited comment on SPARK-14726 at 4/4/17 1:47 AM: -- Actually, after re-thinking, it seems we would not need this for now if not many users request this. Now we can do a workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema spark.read.schema(sampledSchema).csv(ds) {code} Actually, this will allow more dynamic options, e.g., with replacement or without replacement or filtering or even just limit 100. I will keep eyes on similar issues and reopen if it seems many users want this. Please reopen this if you strongly feel this should be supported as an option or anyone feels so. was (Author: hyukjin.kwon): Actually, after re-thinking, it seems we would not need this for now if not many users request this. Workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema spark.read.schema(sampledSchema).csv(ds) {code} Actually, this will allow more dynamic options, e.g., with replacement or without replacement or filtering or even just limit 100. I will keep eyes on similar issues and reopen if it seems many users want this. Please reopen this if you strongly feel this should be supported as an option or anyone feels so. > Support for sampling when inferring schema in CSV data source > - > > Key: SPARK-14726 > URL: https://issues.apache.org/jira/browse/SPARK-14726 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Bomi Kim > > Currently, I am using CSV data source and trying to get used to Spark 2.0 > because it has built-in CSV data source. > I realized that CSV data source infers schema with all the data. 
JSON data > source supports sampling ratio option. > It would be great if CSV data source has this option too (or is this > supported already?).
[jira] [Comment Edited] (SPARK-14726) Support for sampling when inferring schema in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954447#comment-15954447 ] Hyukjin Kwon edited comment on SPARK-14726 at 4/4/17 1:40 AM: -- Actually, after re-thinking, it seems we would not need this for now if not many users request this. Workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema spark.read.schema(sampledSchema).csv(ds) {code} Actually, this will allow more dynamic options, e.g., with replacement or without replacement or filtering or even just limit 100. I will keep eyes on similar issues and reopen if it seems many users want this. Please reopen this if you strongly feel this should be supported as an option or anyone feels so. was (Author: hyukjin.kwon): Actually, after re-thinking, it seems we would not need this for now if not many users request this. Workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS.sample(false, 0.7) val sampledSchema = spark.read.option("inferSchema", true).csv(ds).schema spark.read.schema(sampledSchema).csv("/tmp/path") {code} Actually, this will allow more dynamic options, e.g., with replacement or without replacement or filtering or even just limit 100. I will keep eyes on similar issues and reopen if it seems many users want this. Please reopen this if you strongly feel this should be supported as an option or anyone feels so. > Support for sampling when inferring schema in CSV data source > - > > Key: SPARK-14726 > URL: https://issues.apache.org/jira/browse/SPARK-14726 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Bomi Kim > > Currently, I am using CSV data source and trying to get used to Spark 2.0 > because it has built-in CSV data source. > I realized that CSV data source infers schema with all the data. JSON data > source supports sampling ratio option. 
> It would be great if CSV data source has this option too (or is this > supported already?). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14726) Support for sampling when inferring schema in CSV data source
[ https://issues.apache.org/jira/browse/SPARK-14726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-14726. -- Resolution: Won't Fix Actually, after rethinking, it seems we do not need this for now unless many users request it. Workaround as below: {code} val ds = Seq("a", "b", "c", "d").toDS val sampledSchema = spark.read.option("inferSchema", true).csv(ds.sample(false, 0.7)).schema spark.read.schema(sampledSchema).csv(ds) {code} This also allows more dynamic options, e.g., sampling with or without replacement, filtering, or even just limit 100. I will keep an eye on similar issues and reopen this if it seems many users want it. Please reopen this if you strongly feel this should be supported as an option. > Support for sampling when inferring schema in CSV data source > - > > Key: SPARK-14726 > URL: https://issues.apache.org/jira/browse/SPARK-14726 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Bomi Kim > > Currently, I am using the CSV data source and trying to get used to Spark 2.0 > because it has a built-in CSV data source. > I realized that the CSV data source infers the schema from all the data, while the JSON data > source supports a sampling ratio option. > It would be great if the CSV data source had this option too (or is this > supported already?). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19186) Hash symbol in middle of Sybase database table name causes Spark Exception
[ https://issues.apache.org/jira/browse/SPARK-19186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19186. -- Resolution: Not A Problem ^ I agree with this. Also, up to my knowledge, we can deal with the dialect in favour of SPARK-17614, assuming the exception came from https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L60-L62 within Spark. I am resolving this per the issue described in this JIRA. Please reopen this if I misunderstood. > Hash symbol in middle of Sybase database table name causes Spark Exception > -- > > Key: SPARK-19186 > URL: https://issues.apache.org/jira/browse/SPARK-19186 > Project: Spark > Issue Type: Bug >Affects Versions: 2.1.0 >Reporter: Adrian Schulewitz >Priority: Minor > > If I use a table name without a '#' symbol in the middle then no exception > occurs but with one an exception is thrown. According to Sybase 15 > documentation a '#' is a legal character. > val testSql = "SELECT * FROM CTP#ADR_TYPE_DBF" > val conf = new SparkConf().setAppName("MUREX DMart Simple Reader via > SQL").setMaster("local[2]") > val sess = SparkSession > .builder() > .appName("MUREX DMart Simple SQL Reader") > .config(conf) > .getOrCreate() > import sess.implicits._ > val df = sess.read > .format("jdbc") > .option("url", > "jdbc:jtds:sybase://auq7064s.unix.anz:4020/mxdmart56") > .option("driver", "net.sourceforge.jtds.jdbc.Driver") > .option("dbtable", "CTP#ADR_TYPE_DBF") > .option("UDT_DEALCRD_REP", "mxdmart56") > .option("user", "INSTAL") > .option("password", "INSTALL") > .load() > df.createOrReplaceTempView("trades") > val resultsDF = sess.sql(testSql) > resultsDF.show() > 17/01/12 14:30:01 INFO SharedState: Warehouse path is > 'file:/C:/DEVELOPMENT/Projects/MUREX/trunk/murex-eom-reporting/spark-warehouse/'. 
> 17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: trades > 17/01/12 14:30:04 INFO SparkSqlParser: Parsing command: SELECT * FROM > CTP#ADR_TYPE_DBF > Exception in thread "main" > org.apache.spark.sql.catalyst.parser.ParseException: > extraneous input '#' expecting {, ',', 'SELECT', 'FROM', 'ADD', 'AS', > 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', > 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', > 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', > 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', > 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'RIGHT', 'FULL', 'NATURAL', > 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', > 'PRECEDING', 'FOLLOWING', 'CURRENT', 'FIRST', 'LAST', 'ROW', 'WITH', > 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', > 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', > 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', > 'UNION', 'EXCEPT', 'MINUS', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', > 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', > 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', 'DIV', > 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', > 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', > 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', > 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', > 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', > 'GLOBAL', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', > 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', > 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', > 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', > 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 
'DFS', 'TRUNCATE', > 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', > 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', > 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', > 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', > 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', IDENTIFIER, > BACKQUOTED_IDENTIFIER}(line 1, pos 17) > == SQL == > SELECT * FROM CTP#ADR_TYPE_DBF > -^^^ > at > org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197) > at > org.apache
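Since this was resolved as Not A Problem, the fix is on the user side: query the temp view that the reporter's own code already registers, instead of passing the raw Sybase table name (with its {{#}}) through Spark's SQL parser. A minimal sketch based on the snippet above ({{sess}} and {{df}} are the values built in the reporter's code):

```scala
// The JDBC load already handled the '#' via the "dbtable" option; only the
// Spark SQL parser rejects the bare identifier. So register a view with a
// parser-friendly name and query that instead.
df.createOrReplaceTempView("trades")
val resultsDF = sess.sql("SELECT * FROM trades")  // no '#' reaches the parser
resultsDF.show()
```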
[jira] [Resolved] (SPARK-10364) Support Parquet logical type TIMESTAMP_MILLIS
[ https://issues.apache.org/jira/browse/SPARK-10364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin resolved SPARK-10364. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 15332 [https://github.com/apache/spark/pull/15332] > Support Parquet logical type TIMESTAMP_MILLIS > - > > Key: SPARK-10364 > URL: https://issues.apache.org/jira/browse/SPARK-10364 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian > Fix For: 2.2.0 > > > The {{TimestampType}} in Spark SQL is of microsecond precision. Ideally, we > should convert Spark SQL timestamp values into Parquet {{TIMESTAMP_MICROS}}. > But unfortunately parquet-mr hasn't supported it yet. > For the read path, we should be able to read {{TIMESTAMP_MILLIS}} Parquet > values and pad a 0 microsecond part to read values. > For the write path, currently we are writing timestamps as {{INT96}}, similar > to Impala and Hive. One alternative is that, we can have a separate SQL > option to let users be able to write Spark SQL timestamp values as > {{TIMESTAMP_MILLIS}}. Of course, in this way the microsecond part will be > truncated. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
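For the write path discussed above, the 2.2 work adds a SQL option to emit {{TIMESTAMP_MILLIS}} instead of {{INT96}}. A sketch of how it would be used; the option name here is an assumption from the 2.2-era configuration and should be checked against that release's SQLConf:

```scala
// Assumed option name (verify against the Spark 2.2 SQLConf). When enabled,
// timestamps are written as Parquet TIMESTAMP_MILLIS, truncating the
// microsecond part of Spark's microsecond-precision TimestampType.
spark.conf.set("spark.sql.parquet.int64AsTimestampMillis", "true")
df.write.parquet("/tmp/ts_millis")
```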
[jira] [Updated] (SPARK-19408) cardinality estimation involving two columns of the same table
[ https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19408: Description: In SPARK-17075, we estimate cardinality of predicate expression "column (op) literal", where op is =, <, <=, >, >= or <=>. In SQL queries, we also see predicate expressions involving two columns such as "column-1 (op) column-2" where column-1 and column-2 belong to same table. Note that, if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work. In this jira, we want to estimate the filter factor of predicate expressions involving two columns of same table. For example, multiple tpc-h queries have this kind of predicate "WHERE l_commitdate < l_receiptdate". was: In SPARK-17075, we estimate cardinality of predicate expression "column (op) literal", where op is =, <, <=, >, or >=. In SQL queries, we also see predicate expressions involving two columns such as "column-1 (op) column-2" where column-1 and column-2 belong to same table. Note that, if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work. In this jira, we want to estimate the filter factor of predicate expressions involving two columns of same table. For example, multiple tpc-h queries have this kind of predicate "WHERE l_commitdate < l_receiptdate". > cardinality estimation involving two columns of the same table > -- > > Key: SPARK-19408 > URL: https://issues.apache.org/jira/browse/SPARK-19408 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Ron Hu > Fix For: 2.2.0 > > > In SPARK-17075, we estimate cardinality of predicate expression "column (op) > literal", where op is =, <, <=, >, >= or <=>. In SQL queries, we also see > predicate expressions involving two columns such as "column-1 (op) column-2" > where column-1 and column-2 belong to same table. 
Note that, if column-1 and > column-2 belong to different tables, then it is a join operator's work, NOT a > filter operator's work. > In this jira, we want to estimate the filter factor of predicate expressions > involving two columns of same table. For example, multiple tpc-h queries > have this kind of predicate "WHERE l_commitdate < l_receiptdate". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
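As an illustration of the kind of rule this JIRA targets, the filter factor of a predicate like {{l_commitdate < l_receiptdate}} can be bounded using only the two columns' min/max statistics. The case split below is an illustrative sketch, not necessarily the exact formula the implementation uses:

```latex
% Illustrative selectivity estimate for the predicate c1 < c2,
% given column ranges [min_1, max_1] and [min_2, max_2] from statistics.
\mathrm{sel}(c_1 < c_2) =
\begin{cases}
1 & \text{if } \max_1 < \min_2 \quad \text{(ranges disjoint: always true)} \\
0 & \text{if } \min_1 \geq \max_2 \quad \text{(ranges disjoint: always false)} \\
1/3 & \text{otherwise (overlapping ranges: a common default guess)}
\end{cases}
```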
[jira] [Resolved] (SPARK-19408) cardinality estimation involving two columns of the same table
[ https://issues.apache.org/jira/browse/SPARK-19408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19408. - Resolution: Fixed Assignee: Ron Hu Fix Version/s: 2.2.0 > cardinality estimation involving two columns of the same table > -- > > Key: SPARK-19408 > URL: https://issues.apache.org/jira/browse/SPARK-19408 > Project: Spark > Issue Type: Sub-task > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Ron Hu >Assignee: Ron Hu > Fix For: 2.2.0 > > > In SPARK-17075, we estimate cardinality of predicate expression "column (op) > literal", where op is =, <, <=, >, >= or <=>. In SQL queries, we also see > predicate expressions involving two columns such as "column-1 (op) column-2" > where column-1 and column-2 belong to same table. Note that, if column-1 and > column-2 belong to different tables, then it is a join operator's work, NOT a > filter operator's work. > In this jira, we want to estimate the filter factor of predicate expressions > involving two columns of same table. For example, multiple tpc-h queries > have this kind of predicate "WHERE l_commitdate < l_receiptdate". -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20145) "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't
[ https://issues.apache.org/jira/browse/SPARK-20145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-20145. - Resolution: Fixed Assignee: sam elamin Fix Version/s: 2.2.0 > "SELECT * FROM range(1)" works, but "SELECT * FROM RANGE(1)" doesn't > > > Key: SPARK-20145 > URL: https://issues.apache.org/jira/browse/SPARK-20145 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Juliusz Sompolski >Assignee: sam elamin > Fix For: 2.2.0 > > > Executed at clean tip of the master branch, with all default settings: > scala> spark.sql("SELECT * FROM range(1)") > res1: org.apache.spark.sql.DataFrame = [id: bigint] > scala> spark.sql("SELECT * FROM RANGE(1)") > org.apache.spark.sql.AnalysisException: could not resolve `RANGE` to a > table-valued function; line 1 pos 14 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:126) > at > org.apache.spark.sql.catalyst.analysis.ResolveTableValuedFunctions$$anonfun$apply$1.applyOrElse(ResolveTableValuedFunctions.scala:106) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:62) > ... > I believe it should be case insensitive? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954387#comment-15954387 ] Mridul Muralidharan edited comment on SPARK-20205 at 4/4/17 12:15 AM: -- For the history server that will fail - good point. At least for custom listeners, users can work around this until the next release by using the current time (in their code, when the field submissionTime is None). Thanks for clarifying [~vanzin]! was (Author: mridulm80): For the history server that will fail - good point. At least for custom listeners, users can work around this until the next release by using the current time. Thanks for clarifying [~vanzin]! > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954387#comment-15954387 ] Mridul Muralidharan commented on SPARK-20205: - For the history server that will fail - good point. At least for custom listeners, users can work around this until the next release by using the current time. Thanks for clarifying [~vanzin]! > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
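The workaround discussed above, for live listeners on current releases, can be sketched as a listener that falls back to the current time when {{submissionTime}} is still unset; note this is only an approximation of the real submission time:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted}

// Sketch of the workaround: substitute "now" when the scheduler has not
// yet filled in StageInfo.submissionTime (see the race described above).
class SubmissionTimePatchingListener extends SparkListener {
  override def onStageSubmitted(event: SparkListenerStageSubmitted): Unit = {
    val submittedAt: Long =
      event.stageInfo.submissionTime.getOrElse(System.currentTimeMillis())
    // ... record submittedAt instead of reading submissionTime.get ...
  }
}
```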
[jira] [Resolved] (SPARK-18893) Not support "alter table .. add columns .."
[ https://issues.apache.org/jira/browse/SPARK-18893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18893. - Resolution: Fixed Fix Version/s: 2.2.0 > Not support "alter table .. add columns .." > > > Key: SPARK-18893 > URL: https://issues.apache.org/jira/browse/SPARK-18893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: zuotingbing > Fix For: 2.2.0 > > > When we updated Spark from version 1.5.2 to 2.0.1, all of our cases that > alter a table using "alter table add columns" failed, even though the official document > (http://spark.apache.org/docs/latest/sql-programming-guide.html) says "All > Hive DDL Functions, including: alter table" are supported. > Is there any plan to support SQL "alter table .. add/replace columns"? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18893) Not support "alter table .. add columns .."
[ https://issues.apache.org/jira/browse/SPARK-18893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954375#comment-15954375 ] Wenchen Fan commented on SPARK-18893: - https://issues.apache.org/jira/browse/SPARK-19261 > Not support "alter table .. add columns .." > > > Key: SPARK-18893 > URL: https://issues.apache.org/jira/browse/SPARK-18893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: zuotingbing > > When we updated Spark from version 1.5.2 to 2.0.1, all of our cases that > alter a table using "alter table add columns" failed, even though the official document > (http://spark.apache.org/docs/latest/sql-programming-guide.html) says "All > Hive DDL Functions, including: alter table" are supported. > Is there any plan to support SQL "alter table .. add/replace columns"? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954357#comment-15954357 ] Marcelo Vanzin commented on SPARK-20205: bq. I was referring to the case where we are persisting to event log or consuming events to externally persist them. I see. In that case I believe it will always be unset. For live listeners, current time is a good enough approximation, but for the history server, for example, that's not an option (since {{SparkListenerStageSubmitted}} does not have a {{time}} field). > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954348#comment-15954348 ] Mridul Muralidharan commented on SPARK-20205: - bq. I wouldn't say incorrect; at worst it's gonna be slightly inaccurate. I was referring to the case where we are persisting to event log or consuming events to externally persist them. In this context, will we always have unspecified submissionTime or is there case where submissionTime is pointing to some incorrect/spurious value (if this is always in the codepath after makeNewStageAttempt; then it should be fine). Essentially, is the workaround for existing spark versions to simply set submissionTime to current time if it is None for SparkListenerStageSubmitted sufficient ? Will it miss some corner case ? (value is set but is incorrect ?) > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954340#comment-15954340 ] Marcelo Vanzin commented on SPARK-20205: bq. This is nasty ! This means submissionTime will always be unset ? Well, it's a little more complicated than that. The UI code currently "self heals", because it just keeps a pointer to the {{StageInfo}} object which is modified by the scheduler later. So eventually the UI sees the value. But the event log, for example, might not have the submission time. bq. Btw, is it possible for submissionTime to be set - but to an incorrect value ? I wouldn't say incorrect; at worst it's gonna be slightly inaccurate. > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
[ https://issues.apache.org/jira/browse/SPARK-20205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954333#comment-15954333 ] Mridul Muralidharan commented on SPARK-20205: - This is nasty ! This means submissionTime will always be unset ? Btw, is it possible for submissionTime to be set - but to an incorrect value ? > DAGScheduler posts SparkListenerStageSubmitted before updating stage > > > Key: SPARK-20205 > URL: https://issues.apache.org/jira/browse/SPARK-20205 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin > > Probably affects other versions, haven't checked. > The code that submits the event to the bus is around line 991: > {code} > stage.makeNewStageAttempt(partitionsToCompute.size, > taskIdToLocations.values.toSeq) > listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, > properties)) > {code} > Later in the same method, the stage information is updated (around line 1057): > {code} > if (tasks.size > 0) { > logInfo(s"Submitting ${tasks.size} missing tasks from $stage > (${stage.rdd}) (first 15 " + > s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") > taskScheduler.submitTasks(new TaskSet( > tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, > properties)) > stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) > {code} > That means an event handler might get a stage submitted event with an unset > submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints
[ https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954312#comment-15954312 ] Kamal Gurala commented on SPARK-4899: - Some performance related concerns https://github.com/apache/spark/pull/60#r16817226 > Support Mesos features: roles and checkpoints > - > > Key: SPARK-4899 > URL: https://issues.apache.org/jira/browse/SPARK-4899 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 1.2.0 >Reporter: Andrew Ash > > Inspired by https://github.com/apache/spark/pull/60 > Mesos has two features that would be nice for Spark to take advantage of: > 1. Roles -- a way to specify ACLs and priorities for users > 2. Checkpoints -- a way to restart a failed Mesos slave without losing all > the work that was happening on the box > Some of these may require a Mesos upgrade past our current 0.18.1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application
[ https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954233#comment-15954233 ] Steve Loughran edited comment on SPARK-20153 at 4/3/17 10:13 PM: - This is fixed in Hadoop 2.8 with [per-bucket configuration|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]; HADOOP-13336. I would *really* advise against trying to re-implement this in Spark: having one consistent model for configuring s3a bindings everywhere matters, as there are a lot more options than just credentials; the S3 endpoint is a critical one when trying to work with V4 auth endpoints. As a temporary workaround, one which will leak your secrets to logs, know that you can go s3a://key:secret@bucket, URL-encoding the secret, and so get access. Once you use this, consider all logs sensitive data. was (Author: ste...@apache.org): This is fixed in Hadoop 2.8 with [per-bucket configuration|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]; HADOOP-13336. I would *really* advise against trying to re-implement this in spark as having one consistent model for configuring s3a bindings everywhere will be the only way to debug what's going on, especially given that for security reasons you can't log what's going on. As a temporary workaround, one which will leak your secrets to logs, know that you can go s3a://key:secret@bucket, URL encoding the secret. 
> Support Multiple aws credentials in order to access multiple Hive on S3 table > in spark application > --- > > Key: SPARK-20153 > URL: https://issues.apache.org/jira/browse/SPARK-20153 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0 >Reporter: Franck Tago >Priority: Minor > > I need to access multiple Hive tables in my Spark application, where each Hive > table is > 1- an external table with data sitting on S3 > 2- owned by a different AWS user, so I need to provide different > AWS credentials. > I am familiar with setting the AWS credentials in the Hadoop configuration > object, but that does not really help me because I can only set one pair of > (fs.s3a.awsAccessKeyId, fs.s3a.awsSecretAccessKey). > From my research, there is no easy or elegant way to do this in Spark. > Why is that? > How do I address this use case? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20153) Support Multiple aws credentials in order to access multiple Hive on S3 table in spark application
[ https://issues.apache.org/jira/browse/SPARK-20153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954233#comment-15954233 ] Steve Loughran commented on SPARK-20153: This is fixed in Hadoop 2.8 with [per-bucket configuration|http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets]; HADOOP-13336. I would *really* advise against trying to re-implement this in Spark, as having one consistent model for configuring s3a bindings everywhere will be the only way to debug what's going on, especially given that for security reasons you can't log what's going on. As a temporary workaround, one which will leak your secrets to logs, know that you can go s3a://key:secret@bucket, URL-encoding the secret. > Support Multiple aws credentials in order to access multiple Hive on S3 table > in spark application > --- > > Key: SPARK-20153 > URL: https://issues.apache.org/jira/browse/SPARK-20153 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.1, 2.1.0 >Reporter: Franck Tago >Priority: Minor > > I need to access multiple Hive tables in my Spark application, where each Hive > table is > 1- an external table with data sitting on S3 > 2- owned by a different AWS user, so I need to provide different > AWS credentials. > I am familiar with setting the AWS credentials in the Hadoop configuration > object, but that does not really help me because I can only set one pair of > (fs.s3a.awsAccessKeyId, fs.s3a.awsSecretAccessKey). > From my research, there is no easy or elegant way to do this in Spark. > Why is that? > How do I address this use case? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
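The per-bucket configuration referenced above keys each s3a option by bucket name, so each Hive table's bucket can carry its own credentials. A sketch for {{spark-defaults.conf}}; the bucket names and all key values are placeholders:

```
# Hadoop 2.8+ per-bucket s3a options, forwarded via the spark.hadoop. prefix.
# "bucket-a" / "bucket-b" and all key values below are placeholders.
spark.hadoop.fs.s3a.bucket.bucket-a.access.key   <access-key-for-bucket-a>
spark.hadoop.fs.s3a.bucket.bucket-a.secret.key   <secret-key-for-bucket-a>
spark.hadoop.fs.s3a.bucket.bucket-b.access.key   <access-key-for-bucket-b>
spark.hadoop.fs.s3a.bucket.bucket-b.secret.key   <secret-key-for-bucket-b>
# The endpoint can also be set per bucket, e.g. for V4-auth-only regions.
spark.hadoop.fs.s3a.bucket.bucket-b.endpoint     s3.eu-central-1.amazonaws.com
```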
[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints
[ https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954212#comment-15954212 ] Charles Allen commented on SPARK-4899: -- It was discussed on the mailing list with [~timchen] that checkpointing might just need a timeout setting available to the other schedulers. > Support Mesos features: roles and checkpoints > - > > Key: SPARK-4899 > URL: https://issues.apache.org/jira/browse/SPARK-4899 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 1.2.0 >Reporter: Andrew Ash > > Inspired by https://github.com/apache/spark/pull/60 > Mesos has two features that would be nice for Spark to take advantage of: > 1. Roles -- a way to specify ACLs and priorities for users > 2. Checkpoints -- a way to restart a failed Mesos slave without losing all > the work that was happening on the box > Some of these may require a Mesos upgrade past our current 0.18.1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20064) Bump the PySpark version number to 2.2
[ https://issues.apache.org/jira/browse/SPARK-20064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20064: Assignee: (was: Apache Spark) > Bump the PySpark version number to 2.2 > -- > > Key: SPARK-20064 > URL: https://issues.apache.org/jira/browse/SPARK-20064 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: holdenk >Priority: Minor > Labels: starter > > The version.py should be updated for the new version. Note: this isn't > critical, since for any releases made with make-distribution the version > number is read from the xml, but if anyone builds from source and manually > looks at the version # it would be good to have it match. This is a good > starter issue, but something we should do quickly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20064) Bump the PySpark version number to 2.2
[ https://issues.apache.org/jira/browse/SPARK-20064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20064: Assignee: Apache Spark > Bump the PySpark version number to 2.2 > -- > > Key: SPARK-20064 > URL: https://issues.apache.org/jira/browse/SPARK-20064 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: holdenk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > The version.py should be updated for the new version. Note: this isn't > critical, since for any releases made with make-distribution the version > number is read from the xml, but if anyone builds from source and manually > looks at the version # it would be good to have it match. This is a good > starter issue, but something we should do quickly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20064) Bump the PySpark version number to 2.2
[ https://issues.apache.org/jira/browse/SPARK-20064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954186#comment-15954186 ] Apache Spark commented on SPARK-20064: -- User 'setjet' has created a pull request for this issue: https://github.com/apache/spark/pull/17523 > Bump the PySpark version number to 2.2 > -- > > Key: SPARK-20064 > URL: https://issues.apache.org/jira/browse/SPARK-20064 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.0 >Reporter: holdenk >Priority: Minor > Labels: starter > > The version.py should be updated for the new version. Note: this isn't > critical, since for any releases made with make-distribution the version > number is read from the xml, but if anyone builds from source and manually > looks at the version # it would be good to have it match. This is a good > starter issue, but something we should do quickly. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4899) Support Mesos features: roles and checkpoints
[ https://issues.apache.org/jira/browse/SPARK-4899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954170#comment-15954170 ] Charles Allen commented on SPARK-4899: -- {{org.apache.spark.scheduler.cluster.mesos.MesosSchedulerUtils#createSchedulerDriver}} seems to allow checkpointing, which only {{org.apache.spark.scheduler.cluster.mesos.MesosClusterScheduler}} uses. Neither the fine-grained nor the coarse-grained scheduler uses it; is there a reason for that? > Support Mesos features: roles and checkpoints > - > > Key: SPARK-4899 > URL: https://issues.apache.org/jira/browse/SPARK-4899 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 1.2.0 >Reporter: Andrew Ash > > Inspired by https://github.com/apache/spark/pull/60 > Mesos has two features that would be nice for Spark to take advantage of: > 1. Roles -- a way to specify ACLs and priorities for users > 2. Checkpoints -- a way to restart a failed Mesos slave without losing all > the work that was happening on the box > Some of these may require a Mesos upgrade past our current 0.18.1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20205) DAGScheduler posts SparkListenerStageSubmitted before updating stage
Marcelo Vanzin created SPARK-20205: -- Summary: DAGScheduler posts SparkListenerStageSubmitted before updating stage Key: SPARK-20205 URL: https://issues.apache.org/jira/browse/SPARK-20205 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.0 Reporter: Marcelo Vanzin Probably affects other versions, haven't checked. The code that submits the event to the bus is around line 991: {code} stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq) listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties)) {code} Later in the same method, the stage information is updated (around line 1057): {code} if (tasks.size > 0) { logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " + s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})") taskScheduler.submitTasks(new TaskSet( tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties)) stage.latestInfo.submissionTime = Some(clock.getTimeMillis()) {code} That means an event handler might get a stage submitted event with an unset submission time. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
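The hazard Marcelo describes can be shown with a minimal, self-contained sketch, assuming (as the report implies) that the posted event carries a reference to the mutable stage info; the names here are hypothetical, not the actual DAGScheduler code:

```scala
// Hypothetical simplification of the ordering bug described above: the
// "event" observes the mutable stage info before submissionTime is set.
case class FakeStageInfo(var submissionTime: Option[Long] = None)

object SubmitOrderDemo {
  // Returns what a synchronous listener would see at event-post time.
  def submitStage(): Option[Long] = {
    val info = FakeStageInfo()
    // Step 1: the event is "posted" -- a listener reads the info now...
    val observedAtPostTime = info.submissionTime
    // Step 2: ...and only later is the submission time filled in.
    info.submissionTime = Some(System.currentTimeMillis())
    observedAtPostTime
  }

  def main(args: Array[String]): Unit =
    println(s"listener saw submissionTime = ${submitStage()}") // None
}
```

Reordering step 2 before step 1 (or posting an immutable snapshot of the info) would close the window.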
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954132#comment-15954132 ] Apache Spark commented on SPARK-18278: -- User 'foxish' has created a pull request for this issue: https://github.com/apache/spark/pull/17522 > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executor lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20176) Spark Dataframe UDAF issue
[ https://issues.apache.org/jira/browse/SPARK-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954093#comment-15954093 ] Kazuaki Ishizaki commented on SPARK-20176: -- Thanks. The code seem to work for the master. I am investigating which change fixes the issue. > Spark Dataframe UDAF issue > -- > > Key: SPARK-20176 > URL: https://issues.apache.org/jira/browse/SPARK-20176 > Project: Spark > Issue Type: IT Help > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Dinesh Man Amatya > > Getting following error in custom UDAF > Error while decoding: java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 33: Incompatible expression types "boolean" and "java.lang.Boolean" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private Object[] values1; > /* 011 */ private org.apache.spark.sql.types.StructType schema; > /* 012 */ private org.apache.spark.sql.types.StructType schema1; > /* 013 */ > /* 014 */ > /* 015 */ public SpecificSafeProjection(Object[] references) { > /* 016 */ this.references = references; > /* 017 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 018 */ > /* 019 */ > /* 020 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) > references[1]; > /* 022 */ } > /* 023 */ > /* 024 */ public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ values = new 
Object[2]; > /* 028 */ > /* 029 */ boolean isNull2 = i.isNullAt(0); > /* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 031 */ > /* 032 */ boolean isNull1 = isNull2; > /* 033 */ final java.lang.String value1 = isNull1 ? null : > (java.lang.String) value2.toString(); > /* 034 */ isNull1 = value1 == null; > /* 035 */ if (isNull1) { > /* 036 */ values[0] = null; > /* 037 */ } else { > /* 038 */ values[0] = value1; > /* 039 */ } > /* 040 */ > /* 041 */ boolean isNull5 = i.isNullAt(1); > /* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2)); > /* 043 */ boolean isNull3 = false; > /* 044 */ org.apache.spark.sql.Row value3 = null; > /* 045 */ if (!false && isNull5) { > /* 046 */ > /* 047 */ final org.apache.spark.sql.Row value6 = null; > /* 048 */ isNull3 = true; > /* 049 */ value3 = value6; > /* 050 */ } else { > /* 051 */ > /* 052 */ values1 = new Object[2]; > /* 053 */ > /* 054 */ boolean isNull10 = i.isNullAt(1); > /* 055 */ InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2)); > /* 056 */ > /* 057 */ boolean isNull9 = isNull10 || false; > /* 058 */ final boolean value9 = isNull9 ? false : (Boolean) > value10.isNullAt(0); > /* 059 */ boolean isNull8 = false; > /* 060 */ double value8 = -1.0; > /* 061 */ if (!isNull9 && value9) { > /* 062 */ > /* 063 */ final double value12 = -1.0; > /* 064 */ isNull8 = true; > /* 065 */ value8 = value12; > /* 066 */ } else { > /* 067 */ > /* 068 */ boolean isNull14 = i.isNullAt(1); > /* 069 */ InternalRow value14 = isNull14 ? 
null : (i.getStruct(1, 2)); > /* 070 */ boolean isNull13 = isNull14; > /* 071 */ double value13 = -1.0; > /* 072 */ > /* 073 */ if (!isNull14) { > /* 074 */ > /* 075 */ if (value14.isNullAt(0)) { > /* 076 */ isNull13 = true; > /* 077 */ } else { > /* 078 */ value13 = value14.getDouble(0); > /* 079 */ } > /* 080 */ > /* 081 */ } > /* 082 */ isNull8 = isNull13; > /* 083 */ value8 = value13; > /* 084 */ } > /* 085 */ if (isNull8) { > /* 086 */ values1[0] = null; > /* 087 */ } else { > /* 088 */ values1[0] = value8; > /* 089 */ } > /* 090 */ > /* 091 */ boolean isNull17 = i.isNullAt(1); > /* 092 */ InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2)); > /* 093 */ > /* 094 */ boolean isNull16 = isNull17 || false; > /* 095 */ final boolean value16 = isNull16 ? false : (Boolean) > value17.isNullAt(1); > /* 096 */ boolean isNull15 = false; > /* 097 */ double value15 = -1.0; > /* 098 */ if (!isNull16
[jira] [Comment Edited] (SPARK-20176) Spark Dataframe UDAF issue
[ https://issues.apache.org/jira/browse/SPARK-20176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954093#comment-15954093 ] Kazuaki Ishizaki edited comment on SPARK-20176 at 4/3/17 8:13 PM: -- Thanks. The code seem to work for the master. I am investigating which change fixed the issue. was (Author: kiszk): Thanks. The code seem to work for the master. I am investigating which change fixes the issue. > Spark Dataframe UDAF issue > -- > > Key: SPARK-20176 > URL: https://issues.apache.org/jira/browse/SPARK-20176 > Project: Spark > Issue Type: IT Help > Components: Spark Core >Affects Versions: 2.0.2 >Reporter: Dinesh Man Amatya > > Getting following error in custom UDAF > Error while decoding: java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 58, Column 33: Incompatible expression types "boolean" and "java.lang.Boolean" > /* 001 */ public java.lang.Object generate(Object[] references) { > /* 002 */ return new SpecificSafeProjection(references); > /* 003 */ } > /* 004 */ > /* 005 */ class SpecificSafeProjection extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection { > /* 006 */ > /* 007 */ private Object[] references; > /* 008 */ private MutableRow mutableRow; > /* 009 */ private Object[] values; > /* 010 */ private Object[] values1; > /* 011 */ private org.apache.spark.sql.types.StructType schema; > /* 012 */ private org.apache.spark.sql.types.StructType schema1; > /* 013 */ > /* 014 */ > /* 015 */ public SpecificSafeProjection(Object[] references) { > /* 016 */ this.references = references; > /* 017 */ mutableRow = (MutableRow) references[references.length - 1]; > /* 018 */ > /* 019 */ > /* 020 */ this.schema = (org.apache.spark.sql.types.StructType) > references[0]; > /* 021 */ this.schema1 = (org.apache.spark.sql.types.StructType) > references[1]; > /* 022 */ } > /* 023 */ > /* 024 */ 
public java.lang.Object apply(java.lang.Object _i) { > /* 025 */ InternalRow i = (InternalRow) _i; > /* 026 */ > /* 027 */ values = new Object[2]; > /* 028 */ > /* 029 */ boolean isNull2 = i.isNullAt(0); > /* 030 */ UTF8String value2 = isNull2 ? null : (i.getUTF8String(0)); > /* 031 */ > /* 032 */ boolean isNull1 = isNull2; > /* 033 */ final java.lang.String value1 = isNull1 ? null : > (java.lang.String) value2.toString(); > /* 034 */ isNull1 = value1 == null; > /* 035 */ if (isNull1) { > /* 036 */ values[0] = null; > /* 037 */ } else { > /* 038 */ values[0] = value1; > /* 039 */ } > /* 040 */ > /* 041 */ boolean isNull5 = i.isNullAt(1); > /* 042 */ InternalRow value5 = isNull5 ? null : (i.getStruct(1, 2)); > /* 043 */ boolean isNull3 = false; > /* 044 */ org.apache.spark.sql.Row value3 = null; > /* 045 */ if (!false && isNull5) { > /* 046 */ > /* 047 */ final org.apache.spark.sql.Row value6 = null; > /* 048 */ isNull3 = true; > /* 049 */ value3 = value6; > /* 050 */ } else { > /* 051 */ > /* 052 */ values1 = new Object[2]; > /* 053 */ > /* 054 */ boolean isNull10 = i.isNullAt(1); > /* 055 */ InternalRow value10 = isNull10 ? null : (i.getStruct(1, 2)); > /* 056 */ > /* 057 */ boolean isNull9 = isNull10 || false; > /* 058 */ final boolean value9 = isNull9 ? false : (Boolean) > value10.isNullAt(0); > /* 059 */ boolean isNull8 = false; > /* 060 */ double value8 = -1.0; > /* 061 */ if (!isNull9 && value9) { > /* 062 */ > /* 063 */ final double value12 = -1.0; > /* 064 */ isNull8 = true; > /* 065 */ value8 = value12; > /* 066 */ } else { > /* 067 */ > /* 068 */ boolean isNull14 = i.isNullAt(1); > /* 069 */ InternalRow value14 = isNull14 ? 
null : (i.getStruct(1, 2)); > /* 070 */ boolean isNull13 = isNull14; > /* 071 */ double value13 = -1.0; > /* 072 */ > /* 073 */ if (!isNull14) { > /* 074 */ > /* 075 */ if (value14.isNullAt(0)) { > /* 076 */ isNull13 = true; > /* 077 */ } else { > /* 078 */ value13 = value14.getDouble(0); > /* 079 */ } > /* 080 */ > /* 081 */ } > /* 082 */ isNull8 = isNull13; > /* 083 */ value8 = value13; > /* 084 */ } > /* 085 */ if (isNull8) { > /* 086 */ values1[0] = null; > /* 087 */ } else { > /* 088 */ values1[0] = value8; > /* 089 */ } > /* 090 */ > /* 091 */ boolean isNull17 = i.isNullAt(1); > /* 092 */ InternalRow value17 = isNull17 ? null : (i.getStruct(1, 2)); > /* 093 */ > /* 094 */ boolean isNull16 = isNull17 || false; > /* 095 */ final boolean value16 = isNul
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953968#comment-15953968 ] Wenchen Fan commented on SPARK-19659: - What's the smallest unit of fetching remote shuffle blocks? If the unit is a block, I think it's really hard to avoid OOM entirely: if the estimated block size is wrong, fetching this block may cause OOM and we can do nothing about it. (I guess that's why you added {{spark.reducer.maxBytesShuffleToMemory}} in your PR.) If the unit can be smaller, like a byte buffer, and we can fully track and control the shuffle fetch memory usage, then I think we can solve the OOM problem pretty well without introducing a new config for users. Is it possible to do this with some advanced Netty API? > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf > > > Currently the whole block is fetched into memory (off-heap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId), so it can > be large in skew situations. If OOM happens during shuffle read, the job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more > memory can resolve the OOM. However this approach is not perfectly suitable > for production environments, especially for data warehouses. > Using Spark SQL as the data engine in a warehouse, users hope to have a unified > parameter (e.g. memory) but less resource wasted (resource that is allocated but not > used). > It's not always easy to predict skew situations; when they happen, it makes sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. 
This approach is mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
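The spill-over idea under discussion can be sketched generically (this is not Spark's implementation; the threshold parameter plays the role of the {{spark.reducer.maxBytesShuffleToMemory}} setting mentioned above):

```scala
import java.io.{File, FileOutputStream, InputStream}

// Generic sketch of threshold-based shuffle fetch: keep small blocks in
// memory, stream blocks above the threshold to a temp file on disk.
object FetchSketch {
  sealed trait FetchedBlock
  case class InMemory(bytes: Array[Byte]) extends FetchedBlock
  case class OnDisk(file: File) extends FetchedBlock

  def fetch(estimatedSize: Long, data: InputStream, maxInMemory: Long): FetchedBlock =
    if (estimatedSize <= maxInMemory) {
      // Small block: buffer fully in memory, as today.
      InMemory(data.readAllBytes())
    } else {
      // Large (possibly skewed) block: stream to disk instead of risking OOM.
      val f = File.createTempFile("shuffle-block", ".tmp")
      val out = new FileOutputStream(f)
      try data.transferTo(out) finally out.close()
      OnDisk(f)
    }
}
```

Wenchen's point is that this per-block decision still trusts the size estimate; a byte-buffer-granularity fetch with tracked memory would not need the threshold at all.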
[jira] [Commented] (SPARK-20204) separate SQLConf into catalyst confs and sql confs
[ https://issues.apache.org/jira/browse/SPARK-20204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953930#comment-15953930 ] Apache Spark commented on SPARK-20204: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/17521 > separate SQLConf into catalyst confs and sql confs > -- > > Key: SPARK-20204 > URL: https://issues.apache.org/jira/browse/SPARK-20204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20204) separate SQLConf into catalyst confs and sql confs
[ https://issues.apache.org/jira/browse/SPARK-20204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20204: Assignee: Apache Spark (was: Wenchen Fan) > separate SQLConf into catalyst confs and sql confs > -- > > Key: SPARK-20204 > URL: https://issues.apache.org/jira/browse/SPARK-20204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20204) separate SQLConf into catalyst confs and sql confs
[ https://issues.apache.org/jira/browse/SPARK-20204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-20204: Assignee: Wenchen Fan (was: Apache Spark) > separate SQLConf into catalyst confs and sql confs > -- > > Key: SPARK-20204 > URL: https://issues.apache.org/jira/browse/SPARK-20204 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20204) separate SQLConf into catalyst confs and sql confs
Wenchen Fan created SPARK-20204: --- Summary: separate SQLConf into catalyst confs and sql confs Key: SPARK-20204 URL: https://issues.apache.org/jira/browse/SPARK-20204 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19979) [MLLIB] Multiple Estimators/Pipelines In CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-19979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953820#comment-15953820 ] Bryan Cutler commented on SPARK-19979: -- From the discussion in the PR {noformat} val tokenizer = new Tokenizer() .setInputCol("text") .setOutputCol("words") val hashingTF = new HashingTF() .setInputCol(tokenizer.getOutputCol) .setOutputCol("features") val lr = new LogisticRegression() .setMaxIter(10) val dt = new DecisionTreeClassifier() .setMaxDepth(5) val pipeline = new Pipeline() val pipeline1: Array[PipelineStage] = Array(tokenizer, hashingTF, lr) val pipeline2: Array[PipelineStage] = Array(tokenizer, hashingTF, dt) val pipeline1_grid = new ParamGridBuilder() .baseOn(pipeline.stages -> pipeline1) .addGrid(hashingTF.numFeatures, Array(10, 100, 1000)) .addGrid(lr.regParam, Array(0.1, 0.01)) .build() val pipeline2_grid = new ParamGridBuilder() .baseOn(pipeline.stages -> pipeline2) .addGrid(hashingTF.numFeatures, Array(10, 100, 1000)) .build() val paramGrid = pipeline1_grid ++ pipeline2_grid val cv = new CrossValidator() .setEstimator(pipeline) .setEvaluator(new BinaryClassificationEvaluator) .setEstimatorParamMaps(paramGrid) .setNumFolds(2) // Use 3+ in practice {noformat} [~josephkb] [~mlnick] would this be good to add to the documentation? > [MLLIB] Multiple Estimators/Pipelines In CrossValidator > --- > > Key: SPARK-19979 > URL: https://issues.apache.org/jira/browse/SPARK-19979 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: David Leifker > > Update CrossValidator and TrainValidationSplit to be able to accept multiple > pipelines and grid parameters for testing different algorithms and/or being > able to better control tuning combinations. Maintains backwards compatible > API and reads legacy serialized objects. > The same could be done using an external iterative approach. 
Build different > pipelines, throwing each into a CrossValidator, and then taking the best > model from each of those CrossValidators. Then finally picking the best from > those. This is the initial approach I explored. It resulted in a lot of > boiler plate code that felt like it shouldn't need to exist if the api simply > allowed for arrays of estimators and their parameters. > A couple advantages to this implementation to consider come from keeping the > functional interface to the CrossValidator. > 1. The caching of the folds is better utilized. An external iterative > approach creates a new set of k folds for each CrossValidator fit and the > folds are discarded after each CrossValidator run. In this implementation a > single set of k folds is created and cached for all of the pipelines. > 2. A potential advantage of using this implementation is for future > parallelization of the pipelines within the CrossValdiator. It is of course > possible to handle the parallelization outside of the CrossValidator here > too, however I believe there is already work in progress to parallelize the > grid parameters and that could be extended to multiple pipelines. > Both of those behind-the-scene optimizations are possible because of > providing the CrossValidator with the data and the complete set of > pipelines/estimators to evaluate up front allowing one to abstract away the > implementation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953793#comment-15953793 ] Apache Spark commented on SPARK-19712: -- User 'nsyca' has created a pull request for this issue: https://github.com/apache/spark/pull/17520 > EXISTS and Left Semi join do not produce the same plan > -- > > Key: SPARK-19712 > URL: https://issues.apache.org/jira/browse/SPARK-19712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong > > This problem was found during the development of SPARK-18874. > The EXISTS form in the following query: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 > from t3 where t1.t1b=t3.t3b)")}} > gives the optimized plan below: > {code} > == Optimized Logical Plan == > Join Inner, (t1a#7 = t2a#25) > :- Join LeftSemi, (t1b#8 = t3b#58) > : :- Filter isnotnull(t1a#7) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Project [1 AS 1#271, t3b#58] > : +- Relation[t3a#57,t3b#58,t3c#59] parquet > +- Filter isnotnull(t2a#25) >+- Relation[t2a#25,t2b#26,t2c#27] parquet > {code} > whereas a semantically equivalent Left Semi join query below: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on > t1.t1b=t3.t3b")}} > gives the following optimized plan: > {code} > == Optimized Logical Plan == > Join LeftSemi, (t1b#8 = t3b#58) > :- Join Inner, (t1a#7 = t2a#25) > : :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7)) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Filter isnotnull(t2a#25) > : +- Relation[t2a#25,t2b#26,t2c#27] parquet > +- Project [t3b#58] >+- Relation[t3a#57,t3b#58,t3c#59] parquet > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19712: Assignee: Apache Spark > EXISTS and Left Semi join do not produce the same plan > -- > > Key: SPARK-19712 > URL: https://issues.apache.org/jira/browse/SPARK-19712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong >Assignee: Apache Spark > > This problem was found during the development of SPARK-18874. > The EXISTS form in the following query: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 > from t3 where t1.t1b=t3.t3b)")}} > gives the optimized plan below: > {code} > == Optimized Logical Plan == > Join Inner, (t1a#7 = t2a#25) > :- Join LeftSemi, (t1b#8 = t3b#58) > : :- Filter isnotnull(t1a#7) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Project [1 AS 1#271, t3b#58] > : +- Relation[t3a#57,t3b#58,t3c#59] parquet > +- Filter isnotnull(t2a#25) >+- Relation[t2a#25,t2b#26,t2c#27] parquet > {code} > whereas a semantically equivalent Left Semi join query below: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on > t1.t1b=t3.t3b")}} > gives the following optimized plan: > {code} > == Optimized Logical Plan == > Join LeftSemi, (t1b#8 = t3b#58) > :- Join Inner, (t1a#7 = t2a#25) > : :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7)) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Filter isnotnull(t2a#25) > : +- Relation[t2a#25,t2b#26,t2c#27] parquet > +- Project [t3b#58] >+- Relation[t3a#57,t3b#58,t3c#59] parquet > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan
[ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19712: Assignee: (was: Apache Spark) > EXISTS and Left Semi join do not produce the same plan > -- > > Key: SPARK-19712 > URL: https://issues.apache.org/jira/browse/SPARK-19712 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nattavut Sutyanyong > > This problem was found during the development of SPARK-18874. > The EXISTS form in the following query: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 > from t3 where t1.t1b=t3.t3b)")}} > gives the optimized plan below: > {code} > == Optimized Logical Plan == > Join Inner, (t1a#7 = t2a#25) > :- Join LeftSemi, (t1b#8 = t3b#58) > : :- Filter isnotnull(t1a#7) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Project [1 AS 1#271, t3b#58] > : +- Relation[t3a#57,t3b#58,t3c#59] parquet > +- Filter isnotnull(t2a#25) >+- Relation[t2a#25,t2b#26,t2c#27] parquet > {code} > whereas a semantically equivalent Left Semi join query below: > {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on > t1.t1b=t3.t3b")}} > gives the following optimized plan: > {code} > == Optimized Logical Plan == > Join LeftSemi, (t1b#8 = t3b#58) > :- Join Inner, (t1a#7 = t2a#25) > : :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7)) > : : +- Relation[t1a#7,t1b#8,t1c#9] parquet > : +- Filter isnotnull(t2a#25) > : +- Relation[t2a#25,t2b#26,t2c#27] parquet > +- Project [t3b#58] >+- Relation[t3a#57,t3b#58,t3c#59] parquet > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20047) Constrained Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953728#comment-15953728 ] DB Tsai commented on SPARK-20047: - I changed the target to 2.3.0. Thanks. > Constrained Logistic Regression > --- > > Key: SPARK-20047 > URL: https://issues.apache.org/jira/browse/SPARK-20047 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.2.0 >Reporter: DB Tsai >Assignee: Yanbo Liang > > For certain applications, such as stacked regressions, it is important to put > non-negative constraints on the regression coefficients. Also, if the ranges > of the coefficients are known, it makes sense to constrain the coefficient search > space. > Fitting generalized constrained regression models subject to Cβ ≤ b, where C ∈ > R^\{m×p\} and b ∈ R^\{m\} are predefined matrices and vectors that place a > set of m linear constraints on the coefficients, is very challenging, as > discussed widely in the literature. > However, for box constraints on the coefficients, the optimization is well > solved. For gradient descent, one can use projected gradient descent in the > primal by zeroing the negative weights at each step. For LBFGS, an extended > version of it, LBFGS-B, can handle large-scale box-constrained optimization efficiently. > Unfortunately, for OWLQN, there is no efficient way to do optimization > with box constraints. > As a result, in this work, we only implement constrained LR with box > constraints, without L1 regularization. > Note that since we standardize the data in the training phase, the > coefficients seen in the optimization subroutine are in the scaled space; as > a result, we need to convert the box constraints into the scaled space. > Users will be able to set the lower / upper bounds of each coefficient and > intercept.
[jira] [Updated] (SPARK-20047) Constrained Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-20047: Affects Version/s: (was: 2.1.0) 2.2.0 Target Version/s: 2.3.0 (was: 2.2.0)
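The projected-gradient idea described in the ticket can be sketched in a few lines. This is an illustrative toy, not Spark's implementation (which relies on LBFGS-B): a single-feature, no-intercept logistic regression whose weight is clipped back into a user-supplied [lower, upper] box after every gradient step. All names and data below are made up for the example:

```python
import math

def train_box_constrained_lr(xs, ys, lower, upper, lr=0.1, iters=500):
    # Projected gradient descent: take a plain gradient step on the
    # logistic loss, then project the weight back into [lower, upper].
    w = 0.0
    for _ in range(iters):
        grad = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-w * x))
            grad += (p - y) * x
        w -= lr * grad / len(xs)
        w = min(max(w, lower), upper)  # projection onto the box
    return w

xs = [1.0, 2.0, 3.0, -1.0, -2.0]
ys = [1, 1, 1, 0, 0]
# The data is separable, so the unconstrained weight would grow without
# bound; the box [0.0, 0.5] makes the solution stick to the upper bound.
w = train_box_constrained_lr(xs, ys, lower=0.0, upper=0.5)
```

Note the caveat from the description still applies to the real implementation: because training standardizes the features, user-supplied bounds have to be rescaled before being used inside the optimizer.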
[jira] [Commented] (SPARK-20193) Selecting empty struct causes ExpressionEncoder error.
[ https://issues.apache.org/jira/browse/SPARK-20193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953704#comment-15953704 ] Adrian Ionescu commented on SPARK-20193: cc [~hvanhovell] > Selecting empty struct causes ExpressionEncoder error. > -- > > Key: SPARK-20193 > URL: https://issues.apache.org/jira/browse/SPARK-20193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu > Labels: struct > > {{def struct(cols: Column*): Column}} > Given the above signature and the lack of any note in the docs saying that a > struct with no columns is not supported, I would expect the following to work: > {{spark.range(3).select(col("id"), struct().as("empty_struct")).collect}} > However, this results in: > {quote} > java.lang.AssertionError: assertion failed: each serializer expression should > contains at least one `BoundReference` > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:240) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$11.apply(ExpressionEncoder.scala:238) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.(ExpressionEncoder.scala:238) > at > org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:63) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2837) > at org.apache.spark.sql.Dataset.select(Dataset.scala:1131) > ... 
39 elided > {quote}
[jira] [Resolved] (SPARK-20194) Support partition pruning for InMemoryCatalog
[ https://issues.apache.org/jira/browse/SPARK-20194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-20194. - Resolution: Fixed Assignee: Adrian Ionescu Fix Version/s: 2.2.0 > Support partition pruning for InMemoryCatalog > - > > Key: SPARK-20194 > URL: https://issues.apache.org/jira/browse/SPARK-20194 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.1.0 >Reporter: Adrian Ionescu >Assignee: Adrian Ionescu > Fix For: 2.2.0 > > > {{listPartitionsByFilter()}} is not yet implemented for {{InMemoryCatalog}}: > {quote} > // TODO: Provide an implementation > throw new UnsupportedOperationException( > "listPartitionsByFilter is not implemented for InMemoryCatalog") > {quote} > Because of this, there is a hack in {{FindDataSourceTable}} that avoids > passing along the {{CatalogTable}} to the {{DataSource}} it creates when the > catalog implementation is not "hive", so that, when the latter is resolved, > an {{InMemoryFileIndex}} is created instead of a {{CatalogFileIndex}} which > the {{PruneFileSourcePartitions}} rule matches for. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arush Kharbanda updated SPARK-20199: Comment: was deleted (was: I will work on this issue.) > GradientBoostedTreesModel doesn't have Column Sampling Rate Parameter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar >Priority: Minor > > Spark's GradientBoostedTreesModel doesn't have a column sampling rate parameter. > This parameter is available in H2O and XGBoost. > Sample from H2O.ai: gbmParams._col_sample_rate > Please provide the parameter.
[jira] [Commented] (SPARK-11783) When deployed against remote Hive metastore, HiveContext.executionHive points to wrong metastore
[ https://issues.apache.org/jira/browse/SPARK-11783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953647#comment-15953647 ] Jonathan Maron commented on SPARK-11783: I am running a spark job and, when instantiating a HiveContext, I see that the client creates a local derby-based metastore. Is this the intent for client processes? I don't understand the necessity for a client process to create a metastore instance rather than leverage the remote metastore server. > When deployed against remote Hive metastore, HiveContext.executionHive points > to wrong metastore > > > Key: SPARK-11783 > URL: https://issues.apache.org/jira/browse/SPARK-11783 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.6.0 > > > When using remote metastore, execution Hive client somehow is initialized to > point to the actual remote metastore instead of the dummy local Derby > metastore. > To reproduce this issue: > # Configuring {{conf/hive-site.xml}} to point to a remote Hive 1.2.1 > metastore. > # Set {{hive.metastore.uris}} to {{thrift://localhost:9083}}. > # Start metastore service using {{$HIVE_HOME/bin/hive --service metastore}} > # Start Thrift server with remote debugging options > # Attach the debugger to the Thrift server driver process, we can verify that > {{executionHive}} points to the remote metastore rather than the local > execution Derby metastore. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
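For reference, the reproduction steps above point the client at a remote metastore via hive.metastore.uris. A minimal client-side conf/hive-site.xml for step 2 might look like the following (host and port are the example values from the ticket, not a recommendation):

```xml
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
  </property>
</configuration>
```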
[jira] [Commented] (SPARK-9272) Persist information of individual partitions when persisting partitioned data source tables to metastore
[ https://issues.apache.org/jira/browse/SPARK-9272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953582#comment-15953582 ] Daniel Tomes commented on SPARK-9272: - BUMP This is an important issue. Let's get this resolved. > Persist information of individual partitions when persisting partitioned data > source tables to metastore > > > Key: SPARK-9272 > URL: https://issues.apache.org/jira/browse/SPARK-9272 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0 >Reporter: Cheng Lian > > Currently, when a partitioned data source table is persisted to the Hive > metastore, we only persist its partition columns. Information about > individual partitions is not persisted. This forces us to do partition > discovery before reading a persisted partitioned table, which hurts > performance. > To fix this issue, we may persist partition information into the metastore. > Specifically, the format should be compatible with Hive to ensure > interoperability. > One approach to collecting partition values and partition directory paths > for dynamically partitioned tables is to use accumulators to collect the expected > information during the write job.
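The accumulator idea in the description can be sketched in miniature: during a dynamically partitioned write, record each (partition values → directory) pair so it can be registered in the metastore afterwards instead of being rediscovered on every read. Everything below is a hypothetical single-machine sketch, not Spark's API:

```python
from collections import defaultdict

def write_partitioned(rows, partition_cols, base_path):
    # `discovered` plays the role of the accumulator: it collects the
    # (partition values -> directory) pairs produced by the write job,
    # ready to be registered in the metastore in one pass afterwards.
    discovered = {}
    buckets = defaultdict(list)
    for row in rows:
        buckets[tuple(row[c] for c in partition_cols)].append(row)
    for key, part_rows in buckets.items():
        path = base_path + "/" + "/".join(
            "%s=%s" % (c, v) for c, v in zip(partition_cols, key))
        # ... write part_rows as data files under `path` ...
        discovered[key] = path
    return discovered

parts = write_partitioned(
    [{"year": 2017, "id": 1}, {"year": 2016, "id": 2}, {"year": 2017, "id": 3}],
    ["year"],
    "/warehouse/t",
)
# parts maps (2016,) and (2017,) to their Hive-style year=... directories.
```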
[jira] [Commented] (SPARK-20047) Constrained Logistic Regression
[ https://issues.apache.org/jira/browse/SPARK-20047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953551#comment-15953551 ] Nick Pentreath commented on SPARK-20047: Is this really targeted for 2.2.0?
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953489#comment-15953489 ] Sean Owen commented on SPARK-20202: --- Alrighty, you can leave the status for now, but generally committers set Blocker. I'm not entirely clear this blocks a release, not yet. You're absolutely right, but, the hive fork with binaries and source is part of this project. At least, that's the idea. For example, this is notionally voted on and released with each Spark release, but the binary/source of this fork project isn't separately, explicitly, voted on and separately released. I think that should occur for avoidance of doubt, that this is a blessed artifact of the Spark project. Would this answer your process and policy concerns about the release? It's not pretty but I think that's within the law. Of course, it's no answer in the long term. The goal is to not have to use the fork at all. If Hive packaging changes are already in place to make it unnecessary, great (is that all there is to it, everyone?) I don't know if that presents a solution for earlier versions of Hive. This fork thing may persist in existing branches, but it has to at least be released and used in a proper way. This may need fixes right now. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Blocker > > Spark can't continue to depend on their fork of Hive and must move to > standard Hive versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated SPARK-20202: -- Priority: Blocker (was: Critical) It is against Apache policy to release binaries that aren't part of your project.
[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953386#comment-15953386 ] Hyukjin Kwon commented on SPARK-19809: -- Shouldn't it contain a footer and schema information, or a magic number at least? I am not sure we can say a 0-byte file is an ORC file. > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2 >Reporter: Michał Dawid > > When reading from a Hive ORC table, if there are some 0 byte files we get a > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at
org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > 
org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1456) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAcc
[jira] [Commented] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953341#comment-15953341 ] Michał Dawid commented on SPARK-19809: -- Those empty files have been created while processing with Pig scripts. {code}-rw-rw-rw- 3 etl hdfs 14103 2017-04-03 01:26 part-v001-o000-r-0_a_2 -rw-rw-rw- 3 etl hdfs 0 2017-04-03 01:26 part-v001-o000-r-0_a_3 -rw-rw-rw- 3 etl hdfs 10125 2017-04-03 01:27 part-v001-o000-r-0_a_4 {code}
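Until the underlying NPE is fixed, one possible workaround is to drop zero-byte part files before handing the file list to the reader. The sketch below uses local files and a hypothetical helper name, not a Spark or Hadoop API; on HDFS the same size filter would be applied to FileSystem.listStatus results:

```python
import os
import tempfile

def nonempty_part_files(table_dir):
    # Hypothetical helper: skip 0-byte part files so ORC split
    # generation never sees them.
    return [
        os.path.join(table_dir, name)
        for name in sorted(os.listdir(table_dir))
        if os.path.getsize(os.path.join(table_dir, name)) > 0
    ]

d = tempfile.mkdtemp()
with open(os.path.join(d, "part-v001-o000-r-0_a_2"), "wb") as f:
    f.write(b"x" * 14103)  # stand-in for a real, non-empty ORC part file
open(os.path.join(d, "part-v001-o000-r-0_a_3"), "wb").close()  # the 0-byte file
paths = nonempty_part_files(d)  # only the non-empty file survives
```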
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953328#comment-15953328 ] Cyril de Vogelaere commented on SPARK-20203: Oh, I thought we were talking about the performance implication of adding an if that would be tested often. For the issue you just pointed out, I agree it would be a major negative consequence of that change. Sorry, I didn't understand that that was what you were talking about. Well, then I suppose we should resolve this thread as "Won't Fix", unless you think the potential user-friendliness can outweigh that major drawback. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more user > friendly, at least for new users. > Personally, when I run an algorithm, I expect it to find all solutions by > default, and a limited number of them when I set the parameters to do so. > The current implementation limits the length of solution patterns to 10, > thus preventing all solutions from being printed when running slightly large > datasets. > I feel that should be changed, but since this would change the default > behavior of PrefixSpan, I think asking for the community's opinion should > come first. So, what do you think?
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953319#comment-15953319 ] Sean Owen commented on SPARK-20203: --- How can this not have performance implications? you generate more frequent patterns, potentially a lot more. You can see this even in the comments and error messages about collecting too many elements to the driver.
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953318#comment-15953318 ] Cyril de Vogelaere commented on SPARK-20203: I'm not splitting it; I deleted the other thread. I did agree adding the zero special value might have a tiny negative effect on performance, without adding new functionality, so I closed it, following that line of thought. This post is just about changing the default value, which, you agreed, can be discussed. That's a new context of discussion, so I created a new thread. This should make more sense, no?
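To make the trade-off concrete, here is a minimal single-machine PrefixSpan sketch (sequences of single items, no itemsets, and not Spark's distributed implementation) showing how a maxPatternLength cap silently truncates the result set, which is what the default of 10 does on larger datasets, while an effectively unlimited cap enumerates every frequent pattern:

```python
import sys

def prefix_span(sequences, min_support, max_pattern_length):
    # Toy depth-first PrefixSpan over lists of single items.
    results = []

    def project(db, item):
        # Suffixes of each sequence after the first occurrence of `item`.
        return [seq[seq.index(item) + 1:] for seq in db if item in seq]

    def grow(prefix, db):
        if len(prefix) >= max_pattern_length:
            return  # patterns longer than the cap are never emitted
        for item in sorted({i for seq in db for i in seq}):
            support = sum(1 for seq in db if item in seq)
            if support >= min_support:
                pattern = prefix + [item]
                results.append((pattern, support))
                grow(pattern, project(db, item))

    grow([], sequences)
    return results

db = [list("abcde"), list("abcde"), list("abcde")]
limited = prefix_span(db, min_support=3, max_pattern_length=3)
unlimited = prefix_span(db, min_support=3, max_pattern_length=sys.maxsize)
# With the cap, nothing longer than 3 items is reported even though the
# frequent pattern a,b,c,d,e (length 5) exists in every sequence.
```

This is also the flip side Sean raises: removing the cap means the search space, and the set collected back to the driver, can grow dramatically.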
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953315#comment-15953315 ] Owen O'Malley commented on SPARK-20202: --- I should also say here that the Hive community is willing to help. We are in the process of rolling releases so if Spark needs a change, we can work together to get this done.
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953304#comment-15953304 ] Sean Owen commented on SPARK-20202: --- Agree. I think the logic was that Spark had released its own source/binary version of Hive, and then used that in Spark. I don't think anybody believes that's a good solution in the long term; it was a work-around for hive-exec's packaging, IIRC. Once whatever that is is resolved, this can go away, but I defer to those who know the issue better on the details. What I'm not clear on is whether the current org.spark-project.hive situation is stretching the source/binary policy so far that it breaks, enough that no more releases can happen without it. Best to make it go away ASAP anyway. But I don't know if changes in Hive 2.5 help integration with Hive 1.x. It may require either temporarily blessing the fork, or more jar surgery to un-uberize the hive-exec jar, or something. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Critical > > Spark can't continue to depend on its fork of Hive and must move to > standard Hive versions.
[jira] [Comment Edited] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953299#comment-15953299 ] Cyril de Vogelaere edited comment on SPARK-20203 at 4/3/17 11:18 AM: - This cannot have performance implications; we are not changing anything but the default value. It does change the number of solutions we are searching for, so of course it will take longer, since the search space is bigger. But on a dataset where it already found everything, it should still do so, and not be slower at all. Now, it would just find everything by default. Which, I agree, should be debated, to know whether that's really what we want the default behavior of the program to be. was (Author: syrux): This cannot have performance implications; we are not changing anything but the default value. It does change the number of solutions we are searching for, so of course it will take longer, since the search space is bigger. But on a dataset where it already found everything, it should still do so. Now, it would just find everything by default. Which, I agree, should be debated, to know whether that's really what we want the default behavior of the program to be. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more > user-friendly, at least for new users. > Personally, when I run an algorithm, I expect it to find all solutions by > default, and a limited number of them when I set the parameters to do so. > The current implementation limits the length of solution patterns to 10, > thus preventing all solutions from being printed when running on slightly > large datasets. > I feel that should be changed, but since this would change the default > behavior of PrefixSpan, I think asking for the community's opinion should > come first. So, what do you think?
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953300#comment-15953300 ] Sean Owen commented on SPARK-20203: --- There's no value in splitting the conversation since it's about exactly the same question. This retained no context about the performance question, for example, which is the central issue. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more > user-friendly, at least for new users. > Personally, when I run an algorithm, I expect it to find all solutions by > default, and a limited number of them when I set the parameters to do so. > The current implementation limits the length of solution patterns to 10, > thus preventing all solutions from being printed when running on slightly > large datasets. > I feel that should be changed, but since this would change the default > behavior of PrefixSpan, I think asking for the community's opinion should > come first. So, what do you think?
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953299#comment-15953299 ] Cyril de Vogelaere commented on SPARK-20203: This cannot have performance implications; we are not changing anything but the default value. It does change the number of solutions we are searching for, so of course it will take longer, since the search space is bigger. But on a dataset where it already found everything, it should still do so. Now, it would just find everything by default. Which, I agree, should be debated, to know whether that's really what we want the default behavior of the program to be. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more > user-friendly, at least for new users. > Personally, when I run an algorithm, I expect it to find all solutions by > default, and a limited number of them when I set the parameters to do so. > The current implementation limits the length of solution patterns to 10, > thus preventing all solutions from being printed when running on slightly > large datasets. > I feel that should be changed, but since this would change the default > behavior of PrefixSpan, I think asking for the community's opinion should > come first. So, what do you think?
[jira] [Commented] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953298#comment-15953298 ] Owen O'Malley commented on SPARK-20202: --- As an Apache member, the Spark project can't release binary artifacts that aren't made from its Apache code base. So either the Spark project needs to use Hive's release artifacts, or it needs to formally fork Hive, move the fork into its git repository at Apache, and rename it away from org.apache.hive to org.apache.spark. The current path is not allowed. Hive is in the middle of rolling releases, and thus this is a good time to make requests. The old uber jar (hive-exec) is already released separately with the classifier "core." It looks like we are using the same protobuf (2.5.0) and kryo (3.0.3) versions. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Critical > > Spark can't continue to depend on its fork of Hive and must move to > standard Hive versions.
[jira] [Comment Edited] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953298#comment-15953298 ] Owen O'Malley edited comment on SPARK-20202 at 4/3/17 11:16 AM: As an Apache member, the Spark project can't release binary artifacts that aren't made from its Apache code base. So either the Spark project needs to use Hive's release artifacts, or it needs to formally fork Hive, move the fork into its git repository at Apache, and rename it away from org.apache.hive to org.apache.spark. The current path is not allowed. Hive is in the middle of rolling releases, and thus this is a good time to make requests. The old uber jar (hive-exec) is already released separately with the classifier "core." It looks like we are using the same protobuf (2.5.0) and kryo (3.0.3) versions. was (Author: owen.omalley): As an Apache member, the Spark project can't release binary artifacts that aren't made from its Apache code base. So either, the Spark project needs to use Hive's release artifacts or it formally fork Hive and move the fork into its git repository at Apache and rename it away from org.apache.hive to org.apache.spark. The current path is not allowed. Hive is in the middle of rolling releases and thus this is a good time to make requests. The old uber jar (hive-exec) is already released separately with the classifier "core." It looks like we are using the same protobuf (2.5.0) and kryo (3.0.3) versions. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Critical > > Spark can't continue to depend on its fork of Hive and must move to > standard Hive versions.
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953297#comment-15953297 ] Cyril de Vogelaere commented on SPARK-20203: SPARK-20180 was about adding a special value (0) to find all patterns no matter their length, and making it the default value. You pointed out it might lower performance without adding more functionality, so I closed that thread. This one is just about changing the default value, with no other changes in the code. You said it needed discussion, since it is a change in default behavior. But the amount of comments on the last thread would discourage discussion, so I felt a new thread would be more appropriate. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more > user-friendly, at least for new users. > Personally, when I run an algorithm, I expect it to find all solutions by > default, and a limited number of them when I set the parameters to do so. > The current implementation limits the length of solution patterns to 10, > thus preventing all solutions from being printed when running on slightly > large datasets. > I feel that should be changed, but since this would change the default > behavior of PrefixSpan, I think asking for the community's opinion should > come first. So, what do you think?
[jira] [Closed] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere closed SPARK-20180. -- Resolution: Won't Fix > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use the .setMaxPatternLength() method to > specify the maximum pattern length of a sequence. Any pattern longer than > that won't be output. > The current default maxPatternLength value is 10. > This should be changed so that, with input 0, all patterns of any length would > be output. Additionally, the default value should be changed to 0, so that > a new user could find all patterns in his dataset without looking at this > parameter.
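For context, the parameter under discussion caps the length of the patterns the miner will report. A toy Python sketch (not PrefixSpan itself; it enumerates subsequences naively, and all names in it are illustrative) shows how such a cap, and a 0-means-unlimited special value like the one proposed here, changes what gets returned:

```python
from collections import Counter
from itertools import combinations

def frequent_patterns(sequences, min_support, max_pattern_length):
    # 0 plays the role of the proposed "unlimited" special value.
    counts = Counter()
    for seq in sequences:
        cap = len(seq) if max_pattern_length == 0 else min(max_pattern_length, len(seq))
        seen = set()
        for length in range(1, cap + 1):
            for idxs in combinations(range(len(seq)), length):
                seen.add(tuple(seq[i] for i in idxs))
        counts.update(seen)  # count each pattern at most once per sequence
    return {p for p, c in counts.items() if c >= min_support}

seqs = [list("abcd"), list("abd"), list("acd")]
capped = frequent_patterns(seqs, min_support=2, max_pattern_length=2)
unlimited = frequent_patterns(seqs, min_support=2, max_pattern_length=0)
# ("a", "b", "d") occurs in two sequences, so it is frequent, but it is
# longer than the cap of 2: only the unlimited run reports it.
```

This is the behavior difference the issue is about: with a finite default cap, frequent patterns longer than the cap silently never appear in the output.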
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953289#comment-15953289 ] Sean Owen commented on SPARK-20203: --- This is again not addressing the point that doing so has, or could have, performance implications. That has to be established. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more > user-friendly, at least for new users. > Personally, when I run an algorithm, I expect it to find all solutions by > default, and a limited number of them when I set the parameters to do so. > The current implementation limits the length of solution patterns to 10, > thus preventing all solutions from being printed when running on slightly > large datasets. > I feel that should be changed, but since this would change the default > behavior of PrefixSpan, I think asking for the community's opinion should > come first. So, what do you think?
[jira] [Updated] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere updated SPARK-20203: --- Description: I think changing the default value to Int.MaxValue would be more user-friendly, at least for new users. Personally, when I run an algorithm, I expect it to find all solutions by default, and a limited number of them when I set the parameters to do so. The current implementation limits the length of solution patterns to 10, thus preventing all solutions from being printed when running on slightly large datasets. I feel that should be changed, but since this would change the default behavior of PrefixSpan, I think asking for the community's opinion should come first. So, what do you think? was: I think changing the default value to Int.MaxValue would be more user friendly. At least for new user. Personally, when I run an algorithm, I expect it to find all solution by default. And a limited number of them, when I set the parameters so. The current implementation limit the length of solution patterns to 10. Thus preventing all solution to be printed when running slightly large datasets. > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would be more > user-friendly, at least for new users. > Personally, when I run an algorithm, I expect it to find all solutions by > default, and a limited number of them when I set the parameters to do so. > The current implementation limits the length of solution patterns to 10, > thus preventing all solutions from being printed when running on slightly > large datasets. > I feel that should be changed, but since this would change the default > behavior of PrefixSpan, I think asking for the community's opinion should > come first. So, what do you think?
[jira] [Commented] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953282#comment-15953282 ] Cyril de Vogelaere commented on SPARK-20180: Fine, I thought a TODO left in the code would reflect the wish of the community, at least a little. I will close this thread and open a new one on changing the default value to Int.MaxValue, since I personally think it would be more friendly to new users. Link to new thread: https://issues.apache.org/jira/browse/SPARK-20203 Tomorrow, I will create a new thread with another improvement I want to add to Spark. I need to run a performance test on just that change first, to prove it will be useful. I hope you will follow it too. > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use the .setMaxPatternLength() method to > specify the maximum pattern length of a sequence. Any pattern longer than > that won't be output. > The current default maxPatternLength value is 10. > This should be changed so that, with input 0, all patterns of any length would > be output. Additionally, the default value should be changed to 0, so that > a new user could find all patterns in his dataset without looking at this > parameter.
[jira] [Commented] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953280#comment-15953280 ] Sean Owen commented on SPARK-20203: --- I don't understand; this is the same as SPARK-20180? > Change default maxPatternLength value to Int.MaxValue in PrefixSpan > --- > > Key: SPARK-20203 > URL: https://issues.apache.org/jira/browse/SPARK-20203 > Project: Spark > Issue Type: Wish > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Trivial > Original Estimate: 0h > Remaining Estimate: 0h > > I think changing the default value to Int.MaxValue would > be more user-friendly, at least for new users. > Personally, when I run an algorithm, I expect it to find all solutions by > default, and a limited number of them when I set the parameters to do so. > The current implementation limits the length of solution patterns to 10, > thus preventing all solutions from being printed when running on slightly > large datasets.
[jira] [Created] (SPARK-20203) Change default maxPatternLength value to Int.MaxValue in PrefixSpan
Cyril de Vogelaere created SPARK-20203: -- Summary: Change default maxPatternLength value to Int.MaxValue in PrefixSpan Key: SPARK-20203 URL: https://issues.apache.org/jira/browse/SPARK-20203 Project: Spark Issue Type: Wish Components: MLlib Affects Versions: 2.1.0 Reporter: Cyril de Vogelaere Priority: Trivial I think changing the default value to Int.MaxValue would be more user-friendly, at least for new users. Personally, when I run an algorithm, I expect it to find all solutions by default, and a limited number of them when I set the parameters to do so. The current implementation limits the length of solution patterns to 10, thus preventing all solutions from being printed when running on slightly large datasets.
[jira] [Updated] (SPARK-20202) Remove references to org.spark-project.hive
[ https://issues.apache.org/jira/browse/SPARK-20202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-20202: -- Priority: Critical (was: Blocker) Fix Version/s: (was: 2.1.1) (was: 1.6.4) (was: 2.0.3) I see wide agreement on that. One question I have is: is including Hive this way merely a really-not-nice-to-have, or actually not allowed? I think the question is whether sources are available, right? Because releases can't have binary-only parts. I plead ignorance; I have never myself paid much attention to this integration. If it's not allowed, then this sounds like something that has to change for releases beyond 2.1.1, and this can be targeted as a Blocker accordingly. Does this depend on refactoring or changes in Hive? IIRC the problem was hive-exec being an uber-jar, but it's been a long time since I read any of that discussion. > Remove references to org.spark-project.hive > --- > > Key: SPARK-20202 > URL: https://issues.apache.org/jira/browse/SPARK-20202 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 1.6.4, 2.0.3, 2.1.1 >Reporter: Owen O'Malley >Priority: Critical > > Spark can't continue to depend on its fork of Hive and must move to > standard Hive versions.
[jira] [Commented] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953240#comment-15953240 ] Sean Owen commented on SPARK-20180: --- Surely the impact is more than an 'if' statement. If you contemplate much larger spans, that's going to take longer to compute and return, right? I think we're not at all in agreement there, especially as you're seeing the test (?) run forever. Yes, I know there's a TODO (BTW, you can see who wrote it with 'blame'), but that doesn't mean I agree with it. It also doesn't say it should be a default. Keep in mind how much time it takes to discuss these changes relative to the value. We need to converge rapidly to decisions. The question here is performance impact on non-trivial examples. So far I just don't see much compelling reason to change a default. The functionality you want is already available. > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use the .setMaxPatternLength() method to > specify the maximum pattern length of a sequence. Any pattern longer than > that won't be output. > The current default maxPatternLength value is 10. > This should be changed so that, with input 0, all patterns of any length would > be output. Additionally, the default value should be changed to 0, so that > a new user could find all patterns in his dataset without looking at this > parameter.
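A back-of-envelope sketch supports the concern about cost, at least in the worst case. The count below is only an upper bound on the candidate space (the real PrefixSpan prunes aggressively by minimum support, so actual behavior depends on the data), and the function name is illustrative:

```python
from math import comb

def candidate_patterns(n, max_len):
    # Worst-case number of distinct index-subsequences of one length-n
    # sequence that patterns of length up to max_len could draw from.
    return sum(comb(n, k) for k in range(1, min(max_len, n) + 1))

n = 30
with_default_cap = candidate_patterns(n, 10)  # current default cap of 10
uncapped = candidate_patterns(n, n)           # Int.MaxValue amounts to this
# uncapped equals 2**n - 1: exponential in the sequence length, far more
# than the capped count, which is polynomial in n for a fixed cap.
```

Whether this worst case matters on non-trivial, realistic datasets is exactly the open question in the thread.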
[jira] [Created] (SPARK-20202) Remove references to org.spark-project.hive
Owen O'Malley created SPARK-20202: - Summary: Remove references to org.spark-project.hive Key: SPARK-20202 URL: https://issues.apache.org/jira/browse/SPARK-20202 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 1.6.4, 2.0.3, 2.1.1 Reporter: Owen O'Malley Priority: Blocker Fix For: 1.6.4, 2.0.3, 2.1.1 Spark can't continue to depend on its fork of Hive and must move to standard Hive versions.
[jira] [Resolved] (SPARK-19752) OrcGetSplits fails with 0 size files
[ https://issues.apache.org/jira/browse/SPARK-19752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19752. -- Resolution: Duplicate It sounds like a duplicate of SPARK-19809. Please reopen this if I misunderstood. > OrcGetSplits fails with 0 size files > > > Key: SPARK-19752 > URL: https://issues.apache.org/jira/browse/SPARK-19752 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.1.0 >Reporter: Nick Orka > > There is a possibility that during some SQL queries a partition may have a > 0-size file (empty file). The next time I try to read from the file by SQL > query, I get this error: > 17/02/27 10:33:11 INFO PerfLogger: start=1488191591570 end=1488191591599 duration=29 > from=org.apache.hadoop.hive.ql.io.orc.ReaderImpl> > 17/02/27 10:33:11 ERROR ApplicationMaster: User class threw exception: > java.lang.reflect.InvocationTargetException > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$JavaVanillaMethodMirror1.jinvokeraw(JavaMirrors.scala:373) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$JavaMethodMirror.jinvoke(JavaMirrors.scala:339) > at > scala.reflect.runtime.JavaMirrors$JavaMirror$JavaVanillaMethodMirror.apply(JavaMirrors.scala:355) > at com.sessionm.Datapipeline$.main(Datapipeline.scala:200) > at com.sessionm.Datapipeline.main(Datapipeline.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:627) > Caused by: java.lang.RuntimeException: serious problem > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1021) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) > at scala.Option.getOrElse(Option.scala:121) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:84) > at > scala.collection.parallel.AugmentedIterableIterator$class.map2combiner(RemainsIterator.scala:115) > at > scala.collection.parallel.immutable.ParVector$ParVectorIterator.map2combiner(ParVector.scala:62) > at > scala.collection.parallel.ParIterableLike$Map.leaf(ParIterableLike.scala:1054) > at > scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49) > at > scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48) > at > 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48) > at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51) > at > scala.collection.parallel.ParIterableLike$Map.tryLeaf(ParIterableLike.scala:1051) > at > scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.internal(Tasks.scala:169) > at > scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.internal(Tasks.scala:443) > at > scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:149) > at > scala.collection.parallel.AdaptiveWorkStealingForkJ
[jira] [Resolved] (SPARK-19809) NullPointerException on empty ORC file
[ https://issues.apache.org/jira/browse/SPARK-19809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-19809. -- Resolution: Invalid I don't think there can be a 0-byte ORC file; it should at least have the footer. Moreover, Spark's ORC data source currently does not write out empty files (see https://issues.apache.org/jira/browse/SPARK-15474). Please reopen this if I misunderstood. It would be great if there were some steps to reproduce, to help verify this issue. I am resolving this. > NullPointerException on empty ORC file > -- > > Key: SPARK-19809 > URL: https://issues.apache.org/jira/browse/SPARK-19809 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.3, 2.0.2 >Reporter: Michał Dawid > > When reading from a Hive ORC table, if there are some 0-byte files we get a > NullPointerException: > {code}java.lang.NullPointerException > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$BISplitStrategy.getSplits(OrcInputFormat.java:560) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1010) > at > org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1048) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at scala.collection.immutable.List.foreach(List.scala:318) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:242) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:240) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:240) > at > org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190) > at > org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:174) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > org.apache.spark.sql.DataFrame$$anonfun$org$apache$spark$sql$DataFrame$$execute$1$1.apply(DataFrame.scala:1499) > at > 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$execute$1(DataFrame.scala:1498) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$collect(DataFrame.scala:1505) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1375) > at > org.apache.spark.sql.DataFrame$$anonfun$head$1.apply(DataFrame.scala:1374) > at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2099) > at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1374) >
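The resolver's point is that a valid ORC file always carries a footer, so a 0-byte part-file should not occur in normal operation. As a hedged workaround sketch (plain Python standing in for the file-listing step, with hypothetical part-file names; this is not Spark's actual ORC reader), zero-byte files can be filtered out before a directory is handed to a reader:

```python
import os
import tempfile

def non_empty_files(directory):
    """Return paths of files in `directory` that are larger than 0 bytes."""
    return [
        os.path.join(directory, name)
        for name in sorted(os.listdir(directory))
        if os.path.getsize(os.path.join(directory, name)) > 0
    ]

# Demo with hypothetical part-file names.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "part-00000.orc"), "w").close()      # 0-byte file
    with open(os.path.join(d, "part-00001.orc"), "w") as f:   # non-empty file
        f.write("ORC")
    kept = [os.path.basename(p) for p in non_empty_files(d)]
    assert kept == ["part-00001.orc"]
```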
[jira] [Comment Edited] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953201#comment-15953201 ] Cyril de Vogelaere edited comment on SPARK-20180 at 4/3/17 9:57 AM: => Why not let the default be Int.MaxValue? I'm also OK with a default of Int.MaxValue, if the special value zero is really something you are against. if that's what this is about, update the title to reflect it. => I will gladly do that, if you think the current title is misleading. This is a behavior change by default, so we should think carefully about it => Yes, I agree. What are the downsides – why would someone have ever made it 10? presumably, performance. => The changed code consists of nothing more than an additional condition in an if. If you want to see a graph, I have one that tests the performance differences, but it was made on my implementation, which is optimised for single-item patterns, so it wouldn't be relevant here. If you are worried about a performance drop, I can run additional tests on the two lines I changed. If you want me to use a particular dataset, I will gladly oblige; just say the word, and you will have the results by tomorrow. So this is less about the performance impact, which would be negligible (again, I'm ready to prove that if you want me to), and more about whether the feature seems needed, which, I agree, is debatable. Also, whoever originally implemented it this way left this comment: // TODO: support unbounded pattern length when maxPatternLength = 0. You can find that line in the PrefixSpan code; it is the reason I created this JIRA thread first, among the list of improvements I want to propose. If these changes are rejected, then when I have the occasion I will remove that line, since this thread would have established that it isn't needed. You mention tests don't end and haven't established it's not due to your change. => I'm establishing that right now ... as I said. Also, they are ending, but they are really, really slow. I don't think we can proceed with this in this state, right? => I will leave the decision to you was (Author: syrux): => Why not let the default be Int.MaxValue? I'm also OK with a default of Int.MaxValue, if the special value zero is really something you are against. if that's what this is about, update the title to reflect it. => I will gladly do that, if you think the current title is misleading. This is a behavior change by default, so we should think carefully about it => Yes, I agree. What are the downsides – why would someone have ever made it 10? presumably, performance. => The changed code consists of nothing more than an additional condition in an if. If you want to see a graph, I have one that tests the performance differences, but it was made on my implementation, which is optimised for single-item patterns, so it wouldn't be relevant here. If you are worried about a performance drop, I can run additional tests on the two lines I changed. If you want me to use a particular dataset, I will gladly oblige; just say the word, and you will have the results by tomorrow. So this is less about the performance impact, which would be negligible (again, I'm ready to prove that if you want me to), and more about whether the feature seems needed, which, I agree, is debatable. Also, whoever originally implemented it this way left this comment: // TODO: support unbounded pattern length when maxPatternLength = 0. You can find that line in the PrefixSpan code; it is the reason I created this JIRA thread first, among the list of improvements I want to propose. If these changes are rejected, then when I have the occasion I will remove that line, so it would establish that it isn't needed. You mention tests don't end and haven't established it's not due to your change. => I'm establishing that right now ... as I said. Also, they are ending, but they are really, really slow. I don't think we can proceed with this in this state, right? => I will leave the decision to you > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use the .setMaxPatternLength() method to > specify the maximum pattern length of a sequence. Any pattern longer than > that won't be output. > The current default maxPatternLength value is 10. > This should be changed so that with input 0, all patterns of any length would > be output. Additionally, the default value should be changed to 0, so that > a new user can find all patterns in his dataset without looking at this > parameter.
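The "additional condition in an if" being discussed can be sketched as follows (a hypothetical guard in plain Python, assuming, per the TODO quoted in the comment, that maxPatternLength = 0 is the special value meaning unlimited; this is not the actual PrefixSpan code):

```python
def may_grow(current_length, max_pattern_length):
    """True if a pattern of current_length may still be extended.

    Assumption: max_pattern_length == 0 is the proposed special value
    meaning 'no limit', matching the TODO quoted in the comment above.
    """
    return max_pattern_length == 0 or current_length < max_pattern_length

# Old default of 10: growth stops once the pattern reaches length 10.
assert may_grow(9, 10) and not may_grow(10, 10)
# Proposed default of 0: length alone never stops growth.
assert may_grow(10, 0) and may_grow(10_000, 0)
```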
[jira] [Assigned] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19641: --- Assignee: Hyukjin Kwon > JSON schema inference in DROPMALFORMED mode produces incorrect schema > - > > Key: SPARK-19641 > URL: https://issues.apache.org/jira/browse/SPARK-19641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nathan Howell >Assignee: Hyukjin Kwon > Fix For: 2.2.0 > > > In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no > columns. This occurs when one document contains a valid JSON value (such as a > string or number) and the other documents contain objects or arrays. > When the default case in {{JsonInferSchema.compatibleRootType}} is reached > when merging a {{StringType}} and a {{StructType}} the resulting type will be > a {{StringType}}, which is then discarded because a {{StructType}} is > expected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
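The failure mode described above can be illustrated with a toy model of root-type merging (plain Python with invented names; not Spark's actual JsonInferSchema code). In the buggy default case, merging a string root with a struct root collapses to a string, which the reader then discards, leaving no columns; a fix along the lines described would keep the struct:

```python
# Toy model of JSON root-type merging; the strings "struct" and "string"
# stand in for Spark's StructType and StringType (illustrative names only).
def buggy_merge(t1, t2):
    # Default case: unequal types collapse to "string",
    # even when one side is a struct -- its columns are then lost.
    return t1 if t1 == t2 else "string"

def fixed_merge(t1, t2):
    # When a struct root meets a primitive root, prefer the struct so the
    # inferred schema keeps its columns.
    if "struct" in (t1, t2):
        return "struct"
    return t1 if t1 == t2 else "string"

assert buggy_merge("string", "struct") == "string"   # columns lost
assert fixed_merge("string", "struct") == "struct"   # columns kept
```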
[jira] [Comment Edited] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953201#comment-15953201 ] Cyril de Vogelaere edited comment on SPARK-20180 at 4/3/17 9:45 AM: => Why not let the default be Int.MaxValue? I'm also OK with a default of Int.MaxValue, if the special value zero is really something you are against. if that's what this is about, update the title to reflect it. => I will gladly do that, if you think the current title is misleading. This is a behavior change by default, so we should think carefully about it => Yes, I agree. What are the downsides – why would someone have ever made it 10? presumably, performance. => The changed code consists of nothing more than an additional condition in an if. If you want to see a graph, I have one that tests the performance differences, but it was made on my implementation, which is optimised for single-item patterns, so it wouldn't be relevant here. If you are worried about a performance drop, I can run additional tests on the two lines I changed. If you want me to use a particular dataset, I will gladly oblige; just say the word, and you will have the results by tomorrow. So this is less about the performance impact, which would be negligible (again, I'm ready to prove that if you want me to), and more about whether the feature seems needed, which, I agree, is debatable. Also, whoever originally implemented it this way left this comment: // TODO: support unbounded pattern length when maxPatternLength = 0. You can find that line in the PrefixSpan code; it is the reason I created this JIRA thread first, among the list of improvements I want to propose. If these changes are rejected, then when I have the occasion I will remove that line, so it would establish that it isn't needed. You mention tests don't end and haven't established it's not due to your change. => I'm establishing that right now ... as I said. Also, they are ending, but they are really, really slow. I don't think we can proceed with this in this state, right? => I will leave the decision to you was (Author: syrux): => Why not let the default be Int.MaxValue? I'm also OK with a default of Int.MaxValue, if the special value zero is really something you are against. if that's what this is about, update the title to reflect it. => I will gladly do that, if you think the current title is misleading. This is a behavior change by default, so we should think carefully about it => Yes, I agree. What are the downsides – why would someone have ever made it 10? presumably, performance. => The changed code consists of nothing more than an additional condition in an if. If you want to see a graph, I have one that tests the performance differences, but it was made on my implementation, which is optimised for single-item patterns, so it wouldn't be relevant here. If you are worried about a performance drop, I can run additional tests on the two lines I changed. If you want me to use a particular dataset, I will gladly oblige; just say the word, and you will have the results by tomorrow. So this is less about the performance impact, which would be negligible (again, I'm ready to prove that if you want me to), and more about whether the feature seems needed, which, I agree, is debatable. Also, whoever originally implemented it this way left this comment: // TODO: support unbounded pattern length when maxPatternLength = 0. You can find that line in the PrefixSpan code; it is the reason I created this JIRA thread first, among the list of improvements I want to propose. If these changes are rejected, then when I have the occasion I will remove that line, so it would establish that it isn't needed. You mention tests don't end and haven't established it's not due to your change. => I'm establishing that right now ... as I said. I don't think we can proceed with this in this state, right? => I will leave the decision to you > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use the .setMaxPatternLength() method to > specify the maximum pattern length of a sequence. Any pattern longer than > that won't be output. > The current default maxPatternLength value is 10. > This should be changed so that with input 0, all patterns of any length would > be output. Additionally, the default value should be changed to 0, so that > a new user can find all patterns in his dataset without looking at this > parameter.
[jira] [Resolved] (SPARK-19641) JSON schema inference in DROPMALFORMED mode produces incorrect schema
[ https://issues.apache.org/jira/browse/SPARK-19641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19641. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17492 [https://github.com/apache/spark/pull/17492] > JSON schema inference in DROPMALFORMED mode produces incorrect schema > - > > Key: SPARK-19641 > URL: https://issues.apache.org/jira/browse/SPARK-19641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Nathan Howell > Fix For: 2.2.0 > > > In {{DROPMALFORMED}} mode the inferred schema may incorrectly contain no > columns. This occurs when one document contains a valid JSON value (such as a > string or number) and the other documents contain objects or arrays. > When the default case in {{JsonInferSchema.compatibleRootType}} is reached > when merging a {{StringType}} and a {{StructType}} the resulting type will be > a {{StringType}}, which is then discarded because a {{StructType}} is > expected.
[jira] [Commented] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953201#comment-15953201 ] Cyril de Vogelaere commented on SPARK-20180: => Why not let the default be Int.MaxValue? I'm also OK with a default of Int.MaxValue, if the special value zero is really something you are against. if that's what this is about, update the title to reflect it. => I will gladly do that, if you think the current title is misleading. This is a behavior change by default, so we should think carefully about it => Yes, I agree. What are the downsides – why would someone have ever made it 10? presumably, performance. => The changed code consists of nothing more than an additional condition in an if. If you want to see a graph, I have one that tests the performance differences, but it was made on my implementation, which is optimised for single-item patterns, so it wouldn't be relevant here. If you are worried about a performance drop, I can run additional tests on the two lines I changed. If you want me to use a particular dataset, I will gladly oblige; just say the word, and you will have the results by tomorrow. So this is less about the performance impact, which would be negligible (again, I'm ready to prove that if you want me to), and more about whether the feature seems needed, which, I agree, is debatable. Also, whoever originally implemented it this way left this comment: // TODO: support unbounded pattern length when maxPatternLength = 0. You can find that line in the PrefixSpan code; it is the reason I created this JIRA thread first, among the list of improvements I want to propose. If these changes are rejected, then when I have the occasion I will remove that line, so it would establish that it isn't needed. You mention tests don't end and haven't established it's not due to your change. => I'm establishing that right now ... as I said. I don't think we can proceed with this in this state, right? => I will leave the decision to you > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use the .setMaxPatternLength() method to > specify the maximum pattern length of a sequence. Any pattern longer than > that won't be output. > The current default maxPatternLength value is 10. > This should be changed so that with input 0, all patterns of any length would > be output. Additionally, the default value should be changed to 0, so that > a new user can find all patterns in his dataset without looking at this > parameter.
[jira] [Assigned] (SPARK-19969) Doc and examples for Imputer
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19969: -- Assignee: yuhao yang > Doc and examples for Imputer > > > Key: SPARK-19969 > URL: https://issues.apache.org/jira/browse/SPARK-19969 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Assignee: yuhao yang > Fix For: 2.2.0 > >
[jira] [Resolved] (SPARK-19969) Doc and examples for Imputer
[ https://issues.apache.org/jira/browse/SPARK-19969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19969. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17324 [https://github.com/apache/spark/pull/17324] > Doc and examples for Imputer > > > Key: SPARK-19969 > URL: https://issues.apache.org/jira/browse/SPARK-19969 > Project: Spark > Issue Type: Documentation > Components: ML >Affects Versions: 2.2.0 >Reporter: Nick Pentreath > Fix For: 2.2.0 > >
[jira] [Commented] (SPARK-20090) Add StructType.fieldNames to Python API
[ https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953199#comment-15953199 ] Hyukjin Kwon commented on SPARK-20090: -- [~josephkb], gentle ping. > Add StructType.fieldNames to Python API > --- > > Key: SPARK-20090 > URL: https://issues.apache.org/jira/browse/SPARK-20090 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Joseph K. Bradley >Priority: Trivial > > The Scala/Java API for {{StructType}} has a method {{fieldNames}}. It would > be nice if the Python {{StructType}} did as well.
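A minimal sketch of what the requested Python counterpart could look like (a standalone mock-up of the classes, mirroring the Scala-side name; this is not the actual pyspark.sql.types implementation):

```python
# Standalone mock-up of StructType/StructField for illustration only;
# the real classes live in pyspark.sql.types.
class StructField:
    def __init__(self, name, data_type):
        self.name = name
        self.data_type = data_type

class StructType:
    def __init__(self, fields):
        self.fields = list(fields)

    def fieldNames(self):
        """Mirror Scala's StructType.fieldNames: the field names, in order."""
        return [f.name for f in self.fields]

schema = StructType([StructField("id", "long"), StructField("name", "string")])
assert schema.fieldNames() == ["id", "name"]
```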
[jira] [Commented] (SPARK-20108) Spark query is getting failed with exception
[ https://issues.apache.org/jira/browse/SPARK-20108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953196#comment-15953196 ] Hyukjin Kwon commented on SPARK-20108: -- It will help others like me to track down the problem and solve this. > Spark query is getting failed with exception > > > Key: SPARK-20108 > URL: https://issues.apache.org/jira/browse/SPARK-20108 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: ZS EDGE > > In our project we have implemented logic to programmatically generate > Spark queries. These queries are executed as subqueries; below is a > sample query: > sqlContext.sql("INSERT INTO TABLE > test_client_r2_r2_2_prod_db1_oz.S3_EMPDTL_Incremental_invalid SELECT > 'S3_EMPDTL_Incremental',S3_EMPDTL_Incremental.row_id,S3_EMPDTL_Incremental.SOURCE_FILE_NAME,S3_EMPDTL_Incremental.SOURCE_ROW_ID,'S3_EMPDTL_Incremental','2017-03-22 > > 20:18:59','1','Emp_id#$Emp_name#$Emp_phone#$Emp_salary_in_K#$Emp_address_id#$Date_of_Birth#$Status#$Dept_id#$Date_of_joining#$Row_Number#$Dec_check#$','test','Y','N/A','','' > FROM S3_EMPDTL_Incremental_r AS S3_EMPDTL_Incremental where row_id IN > (select row_id from s3_empdtl_incremental_r where row_id IN(42949672960))") > While executing the above code in pyspark, it throws the exception below: > FAILS>> > .spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463) > at > org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > [Stage 32:=> (10 + 5) / > 26]17/03/22 15:42:10 ERROR TaskSetManager: Task 4 in stage 32.0 > failed 4 times; aborting job > 17/03/22 15:42:10 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in > stage 32.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 32.0 (TID 857, ip-10-116-1-73.ec2.internal): > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.s
[jira] [Commented] (SPARK-20108) Spark query is getting failed with exception
[ https://issues.apache.org/jira/browse/SPARK-20108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953195#comment-15953195 ] Hyukjin Kwon commented on SPARK-20108: -- It seems almost impossible for me to reproduce. Do you mind if I ask for a self-contained reproducer? > Spark query is getting failed with exception > > > Key: SPARK-20108 > URL: https://issues.apache.org/jira/browse/SPARK-20108 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 >Reporter: ZS EDGE > > In our project we have implemented logic to programmatically generate > Spark queries. These queries are executed as subqueries; below is a > sample query: > sqlContext.sql("INSERT INTO TABLE > test_client_r2_r2_2_prod_db1_oz.S3_EMPDTL_Incremental_invalid SELECT > 'S3_EMPDTL_Incremental',S3_EMPDTL_Incremental.row_id,S3_EMPDTL_Incremental.SOURCE_FILE_NAME,S3_EMPDTL_Incremental.SOURCE_ROW_ID,'S3_EMPDTL_Incremental','2017-03-22 > > 20:18:59','1','Emp_id#$Emp_name#$Emp_phone#$Emp_salary_in_K#$Emp_address_id#$Date_of_Birth#$Status#$Dept_id#$Date_of_joining#$Row_Number#$Dec_check#$','test','Y','N/A','','' > FROM S3_EMPDTL_Incremental_r AS S3_EMPDTL_Incremental where row_id IN > (select row_id from s3_empdtl_incremental_r where row_id IN(42949672960))") > While executing the above code in pyspark, it throws the exception below: > FAILS>> > .spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at 
org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.ArrayIndexOutOfBoundsException: 1 > at > org.apache.spark.sql.execution.joins.LongToUnsafeRowMap.getValue(HashedRelation.scala:463) > at > org.apache.spark.sql.execution.joins.LongHashedRelation.getValue(HashedRelation.scala:762) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply$mcV$sp(WriterContainer.scala:253) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer$$anonfun$writeRows$1.apply(WriterContainer.scala:252) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325) > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:258) > ... 8 more > [Stage 32:=> (10 + 5) / > 26]17/03/22 15:42:10 ERROR TaskSetManager: Task 4 in stage 32.0 > failed 4 times; aborting job > 17/03/22 15:42:10 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 4 in > stage 32.0 failed 4 times, most recent failure: Lost task 4.3 in > stage 32.0 (TID 857, ip-10-116-1-73.ec2.internal): > org.apache.spark.SparkException: Task failed while writing rows > at > org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:261) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply > (InsertIntoHadoopFsRelationCommand.scala:143) > a
[jira] [Updated] (SPARK-20180) Add a special value for unlimited max pattern length in Prefix span, and set it as default.
[ https://issues.apache.org/jira/browse/SPARK-20180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cyril de Vogelaere updated SPARK-20180: --- Summary: Add a special value for unlimited max pattern length in Prefix span, and set it as default. (was: Unlimited max pattern length in Prefix span) > Add a special value for unlimited max pattern length in Prefix span, and set > it as default. > --- > > Key: SPARK-20180 > URL: https://issues.apache.org/jira/browse/SPARK-20180 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.1.0 >Reporter: Cyril de Vogelaere >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > Right now, we need to use the .setMaxPatternLength() method to > specify the maximum pattern length of a sequence. Any pattern longer than > that won't be output. > The current default maxPatternLength value is 10. > This should be changed so that with input 0, all patterns of any length would > be output. Additionally, the default value should be changed to 0, so that > a new user can find all patterns in his dataset without looking at this > parameter.