[jira] [Resolved] (SPARK-21381) SparkR: pass on setHandleInvalid for classification algorithms
[ https://issues.apache.org/jira/browse/SPARK-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-21381. -- Resolution: Fixed Assignee: Miao Wang Fix Version/s: 2.3.0 Target Version/s: 2.3.0 > SparkR: pass on setHandleInvalid for classification algorithms > -- > > Key: SPARK-21381 > URL: https://issues.apache.org/jira/browse/SPARK-21381 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.1 >Reporter: Miao Wang >Assignee: Miao Wang > Fix For: 2.3.0 > > > SPARK-20307 Added handleInvalid option to RFormula for tree-based > classification algorithms. We should add this parameter for other > classification algorithms in SparkR. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21581) Spark 2.x distinct return incorrect result
[ https://issues.apache.org/jira/browse/SPARK-21581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108324#comment-16108324 ] shengyao piao commented on SPARK-21581: --- [~maropu] Thank you very much for your detailed explanation; I understand this behavior now. > Spark 2.x distinct return incorrect result > -- > > Key: SPARK-21581 > URL: https://issues.apache.org/jira/browse/SPARK-21581 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: shengyao piao > > Hi all > I'm using Spark 2.x on CDH 5.11. > I have a JSON file as follows. > ・sample.json > {code} > {"url": "http://example.hoge/staff1", "name": "staff1", "salary":600.0} > {"url": "http://example.hoge/staff2", "name": "staff2", "salary":700} > {"url": "http://example.hoge/staff3", "name": "staff3", "salary":800} > {"url": "http://example.hoge/staff4", "name": "staff4", "salary":900} > {"url": "http://example.hoge/staff5", "name": "staff5", "salary":1000.0} > {"url": "http://example.hoge/staff6", "name": "staff6", "salary":""} > {"url": "http://example.hoge/staff7", "name": "staff7", "salary":""} > {"url": "http://example.hoge/staff8", "name": "staff8", "salary":""} > {"url": "http://example.hoge/staff9", "name": "staff9", "salary":""} > {"url": "http://example.hoge/staff10", "name": "staff10", "salary":""} > {code} > And I try to read this file and call distinct. > ・spark code > {code} > val s = spark.read.json("sample.json") > s.count > res13: Long = 10 > s.distinct.count > res14: Long = 6  // <- should be 10 > {code} > I know the cause of the incorrect result is the mixed types in the salary field. > But when I try the same code in Spark 1.6 the result is 10. > So I think it's a bug in Spark 2.x.
[jira] [Resolved] (SPARK-21581) Spark 2.x distinct return incorrect result
[ https://issues.apache.org/jira/browse/SPARK-21581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shengyao piao resolved SPARK-21581. --- Resolution: Not A Problem > Spark 2.x distinct return incorrect result > -- > > Key: SPARK-21581 > URL: https://issues.apache.org/jira/browse/SPARK-21581 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: shengyao piao
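For context on the "Not A Problem" resolution: one plausible mechanism (hedged; [~maropu]'s original explanation is not quoted in this thread) is that Spark 2.x infers `salary` as a double, so in the default PERMISSIVE parse mode every record whose salary is the string "" fails to parse and comes back with all columns null; the five all-null rows then collapse into one under distinct, giving 5 + 1 = 6. A minimal pure-Python model of that collapse (illustrative only, not Spark code):

```python
# Mimic Spark 2.x PERMISSIVE-mode JSON reading with an inferred double type
# for `salary`: a record that does not match the inferred schema is treated
# as malformed and every column is returned as None (null).
records = [
    {"url": f"http://example.hoge/staff{i}", "name": f"staff{i}",
     "salary": 600.0 + 100 * (i - 1) if i <= 5 else ""}
    for i in range(1, 11)
]

def parse(rec):
    # Malformed with respect to the inferred schema -> all columns null.
    if not isinstance(rec["salary"], float):
        return (None, None, None)
    return (rec["url"], rec["name"], rec["salary"])

rows = [parse(r) for r in records]
distinct = set(rows)
print(len(rows), len(distinct))  # 10 6
```

If this reading is right, supplying an explicit schema that declares `salary` as a string should let all ten records parse and remain distinct.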
[jira] [Resolved] (SPARK-18950) Report conflicting fields when merging two StructTypes.
[ https://issues.apache.org/jira/browse/SPARK-18950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-18950. - Resolution: Fixed Assignee: Bravo Zhang Fix Version/s: 2.3.0 > Report conflicting fields when merging two StructTypes. > --- > > Key: SPARK-18950 > URL: https://issues.apache.org/jira/browse/SPARK-18950 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Cheng Lian >Assignee: Bravo Zhang >Priority: Minor > Labels: starter > Fix For: 2.3.0 > > > Currently, {{StructType.merge()}} only reports data types of conflicting > fields when merging two incompatible schemas. It would be nice to also report > the field names for easier debugging.
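The improvement being resolved here is easy to picture: when a schema merge hits two incompatible types, name the offending field in the error rather than only printing the types. A hypothetical Python sketch of the idea (not Spark's actual `StructType.merge()` code; flat `{field: type}` schemas stand in for StructTypes):

```python
def merge_schemas(left, right):
    """Merge two flat {field: type} schemas, naming any conflicting field."""
    merged = dict(left)
    for name, dtype in right.items():
        if name in merged and merged[name] != dtype:
            # The fix: report the field name, not just the two types.
            raise ValueError(
                f"Failed to merge field '{name}': "
                f"{merged[name]} and {dtype} are incompatible")
        merged[name] = dtype
    return merged

print(merge_schemas({"a": "int"}, {"b": "string"}))
# {'a': 'int', 'b': 'string'}
```

With the field name in the message, a merge failure over a wide schema points straight at the column to inspect.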
[jira] [Created] (SPARK-21589) Add documents about unsupported functions in Hive UDF/UDTF/UDAF
Takeshi Yamamuro created SPARK-21589: Summary: Add documents about unsupported functions in Hive UDF/UDTF/UDAF Key: SPARK-21589 URL: https://issues.apache.org/jira/browse/SPARK-21589 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 2.2.0 Reporter: Takeshi Yamamuro Priority: Minor
[jira] [Commented] (SPARK-21582) DataFrame.withColumnRenamed cause huge performance overhead
[ https://issues.apache.org/jira/browse/SPARK-21582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108310#comment-16108310 ] Liang-Chi Hsieh commented on SPARK-21582: - Please call the toDF API with the renamed column names instead; it can save a lot of time. > DataFrame.withColumnRenamed cause huge performance overhead > --- > > Key: SPARK-21582 > URL: https://issues.apache.org/jira/browse/SPARK-21582 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: GuangFancui(ISCAS) > Attachments: 4654.stack > > > Table "item_feature" (DataFrame) has over 900 columns. > When I use > {code:java} > val nameSequeceExcept = Set("gid","category_name","merchant_id") > val df1 = spark.table("item_feature") > val newdf1 = df1.schema.map(_.name).filter(name => > !nameSequeceExcept.contains(name)).foldLeft(df1)((df1, name) => > df1.withColumnRenamed(name, name + "_1" )) > {code} > it took over 30 minutes. > The *PID* in the stack file is *0x126d*. > It seems that _transform_ takes too long.
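The suggested workaround can be sketched as: compute the full list of new column names up front, then rename everything in a single toDF call instead of folding one withColumnRenamed per column (each of which re-analyzes the plan). A hypothetical PySpark-flavored sketch; only the pure-Python name computation runs here, and the `df.toDF(*new_names)` line is what you would issue against the real DataFrame:

```python
# Columns to leave untouched (from the reporter's example).
name_except = {"gid", "category_name", "merchant_id"}

# Stand-in for df.columns on the 900-column "item_feature" table.
columns = ["gid", "category_name", "merchant_id", "feat_a", "feat_b"]

# Build every new name in one pass...
new_names = [c if c in name_except else c + "_1" for c in columns]

# ...then rename all columns with a single plan transformation:
#     newdf = df.toDF(*new_names)
# instead of hundreds of chained df.withColumnRenamed(...) calls.
print(new_names)
# ['gid', 'category_name', 'merchant_id', 'feat_a_1', 'feat_b_1']
```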
[jira] [Created] (SPARK-21588) SQLContext.getConf(key, null) should return null, but it throws NPE
Burak Yavuz created SPARK-21588: --- Summary: SQLContext.getConf(key, null) should return null, but it throws NPE Key: SPARK-21588 URL: https://issues.apache.org/jira/browse/SPARK-21588 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Burak Yavuz Priority: Minor SQLContext.getConf(key) for a key that is not defined in the conf and has no default value throws a NoSuchElementException. To avoid that, I passed null as the default value, which threw an NPE instead. If the default value is null, `getConfString` shouldn't try to parse it.
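The contract being asked for amounts to: when the caller supplies an explicit default, return it verbatim (even if it is null) rather than running it through the value converter. A tiny Python model of that contract (hypothetical names; not Spark's actual implementation):

```python
# A minimal model of a typed conf: stored values are strings and must be
# converted on read, but a caller-supplied default should be returned as-is.
class Conf:
    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = str(value)

    def get_conf(self, key, default=None, convert=int):
        if key in self._settings:
            return convert(self._settings[key])
        # Buggy variant would do convert(default), which blows up (the NPE
        # analogue here is a TypeError) when default is None.
        return default  # fixed: never parse a null default

conf = Conf()
conf.set("spark.x.y", 42)
print(conf.get_conf("spark.x.y"))            # 42
print(conf.get_conf("spark.missing", None))  # None
```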
[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108303#comment-16108303 ] xinzhang commented on SPARK-21067: -- Hi [~guoxiaolongzte], could you take a look at this and share some suggestions? > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the thrift > server. After that, dropping the table and re-issuing the same CTAS would > fail with the following message (sometimes it fails right away, sometimes it > works for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which states that the > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. 
As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}" > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment but since 2.0+, we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT... > Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue. 
> Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to desti > nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at org.apache.spark.sql.Dataset.(Dataset.scala:185) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at
[jira] [Created] (SPARK-21587) Filter pushdown for EventTime Watermark Operator
Jose Torres created SPARK-21587: --- Summary: Filter pushdown for EventTime Watermark Operator Key: SPARK-21587 URL: https://issues.apache.org/jira/browse/SPARK-21587 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Jose Torres If I have a streaming query that sets a watermark, then a where() that pertains to a partition column does not prune these partitions and they will all be queried, greatly reducing performance for partitioned tables.
[jira] [Commented] (SPARK-20433) Update jackson-databind to 2.6.7.1
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108117#comment-16108117 ] Andrew Ash commented on SPARK-20433: Sorry about not updating the ticket description -- the 2.6.7.1 release came after the initial batch of patches for projects still on older versions of Jackson. I've now opened PR https://github.com/apache/spark/pull/18789 where we can further discuss the risk of the version bump. > Update jackson-databind to 2.6.7.1 > -- > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash >Priority: Minor > > There was a security vulnerability recently reported to the upstream > jackson-databind project at > https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix > released. > From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the > first fixed versions in their respective 2.X branches, and versions in the > 2.6.X line and earlier remain vulnerable. UPDATE: now the 2.6.X line has a > patch as well: 2.6.7.1 as mentioned at > https://github.com/FasterXML/jackson-databind/issues/1599#issuecomment-315486340 > Right now Spark master branch is on 2.6.5: > https://github.com/apache/spark/blob/master/pom.xml#L164 > and Hadoop branch-2.7 is on 2.2.3: > https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 > and Hadoop branch-3.0.0-alpha2 is on 2.7.8: > https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 > We should bump Spark from 2.6.5 to 2.6.7.1 to get a patched version of this > library for the next Spark release.
[jira] [Updated] (SPARK-20433) Update jackson-databind to 2.6.7.1
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-20433: --- Description: There was a security vulnerability recently reported to the upstream jackson-databind project at https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix released. From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the first fixed versions in their respective 2.X branches, and versions in the 2.6.X line and earlier remain vulnerable. UPDATE: now the 2.6.X line has a patch as well: 2.6.7.1 as mentioned at https://github.com/FasterXML/jackson-databind/issues/1599#issuecomment-315486340 Right now Spark master branch is on 2.6.5: https://github.com/apache/spark/blob/master/pom.xml#L164 and Hadoop branch-2.7 is on 2.2.3: https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 and Hadoop branch-3.0.0-alpha2 is on 2.7.8: https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 We should bump Spark from 2.6.5 to 2.6.7.1 to get a patched version of this library for the next Spark release. was: There was a security vulnerability recently reported to the upstream jackson-databind project at https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix released. From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the first fixed versions in their respective 2.X branches, and versions in the 2.6.X line and earlier remain vulnerable. Right now Spark master branch is on 2.6.5: https://github.com/apache/spark/blob/master/pom.xml#L164 and Hadoop branch-2.7 is on 2.2.3: https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71 and Hadoop branch-3.0.0-alpha2 is on 2.7.8: https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74 We should try to find a way to get on a patched version of jackson-databind for the Spark 2.2.0 release. 
[jira] [Updated] (SPARK-20433) Update jackson-databind to 2.6.7.1
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash updated SPARK-20433: --- Summary: Update jackson-databind to 2.6.7.1 (was: Update jackson-databind to 2.6.7) > Update jackson-databind to 2.6.7.1 > -- > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash >Priority: Minor
[jira] [Reopened] (SPARK-20433) Update jackson-databind to 2.6.7
[ https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ash reopened SPARK-20433: > Update jackson-databind to 2.6.7 > > > Key: SPARK-20433 > URL: https://issues.apache.org/jira/browse/SPARK-20433 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Andrew Ash >Priority: Minor
[jira] [Updated] (SPARK-21571) Spark history server leaves incomplete or unreadable history files around forever.
[ https://issues.apache.org/jira/browse/SPARK-21571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Vandenberg updated SPARK-21571: Description: We have noticed that history server logs are sometimes never cleaned up. The current history server logic *ONLY* cleans up history files if they are completed since in general it doesn't make sense to clean up inprogress history files (after all, the job is presumably still running?) Note that inprogress history files would generally not be targeted for clean up any way assuming they regularly flush logs and the file system accurately updates the history log last modified time/size, while this is likely it is not guaranteed behavior. As a consequence of the current clean up logic and a combination of unclean shutdowns, various file system bugs, earlier spark bugs, etc. we have accumulated thousands of these dead history files associated with long since gone jobs. For example (with spark.history.fs.cleaner.maxAge=14d): -rw-rw 3 xx ooo 14382 2016-09-13 15:40 /user/hadoop/xx/spark/logs/qq1974_ppp-8812_11058600195_dev4384_-53982.zstandard -rw-rw 3 ooo 5933 2016-11-01 20:16 /user/hadoop/xx/spark/logs/qq2016_ppp-8812_12650700673_dev5365_-65313.lz4 -rw-rw 3 yyy ooo 0 2017-01-19 11:59 /user/hadoop/xx/spark/logs/0057_326_m-57863.lz4.inprogress -rw-rw 3 xooo 0 2017-01-19 14:17 /user/hadoop/xx/spark/logs/0063_688_m-33246.lz4.inprogress -rw-rw 3 yyy ooo 0 2017-01-20 10:56 /user/hadoop/xx/spark/logs/1030_326_m-45195.lz4.inprogress -rw-rw 3 ooo 11955 2017-01-20 17:55 /user/hadoop/xx/spark/logs/1314_54_kk-64671.lz4.inprogress -rw-rw 3 ooo 11958 2017-01-20 17:55 /user/hadoop/xx/spark/logs/1315_1667_kk-58968.lz4.inprogress -rw-rw 3 ooo 11960 2017-01-20 17:55 /user/hadoop/xx/spark/logs/1316_54_kk-48058.lz4.inprogress Based on the current logic, clean up candidates are skipped in several cases: 1. if a file has 0 bytes, it is completely ignored 2. 
if a file is in progress and not parseable/can't extract the appID, it is completely ignored 3. if a file is complete but not parseable/can't extract the appID, it is completely ignored. To address this edge case and provide a way to clean out orphaned history files I propose a new configuration option: spark.history.fs.cleaner.aggressive={true, false}, default is false. If true, the history server will more aggressively garbage collect history files in cases (1), (2) and (3). Since the default is false, existing customers won't be affected unless they explicitly opt in. Customers with similar garbage leaking over time then have the option of aggressively cleaning it up. Also note that aggressive clean up may not be appropriate for some customers if they have long running jobs that exceed the cleaner.maxAge time frame and/or have buggy file systems. Would like to get feedback on whether this seems like a reasonable solution.
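The proposed decision logic, as described, can be sketched as a small predicate (a hedged sketch of the proposal only, not actual Spark history-server code; the function and parameter names are made up):

```python
import time

MAX_AGE_SECONDS = 14 * 24 * 3600  # spark.history.fs.cleaner.maxAge=14d

def should_clean(size_bytes, in_progress, app_id, mtime,
                 aggressive, now=None):
    """Return True if a history file is a clean-up candidate."""
    now = now or time.time()
    if (now - mtime) <= MAX_AGE_SECONDS:
        return False               # too young to touch in any mode
    parseable = app_id is not None
    if not in_progress and parseable:
        return True                # the normal clean-up path today
    # The skipped cases (1)-(3): zero-byte files, unparseable in-progress
    # files, unparseable complete files.
    if size_bytes == 0 or not parseable:
        return aggressive          # only cleaned when the operator opts in
    return False                   # healthy in-progress log: leave it alone

old = time.time() - 15 * 24 * 3600
print(should_clean(0, True, None, old, aggressive=False))  # False
print(should_clean(0, True, None, old, aggressive=True))   # True
```

With `aggressive=False` the behavior matches today's cleaner, which is what makes the option safe to ship off by default.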
[jira] [Updated] (SPARK-21586) Read CSV (SQL Context) Doesnt ignore delimiters within special types of quotes, other special characters
[ https://issues.apache.org/jira/browse/SPARK-21586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Balasubramaniam Srinivasan updated SPARK-21586: --- Summary: Read CSV (SQL Context) Doesnt ignore delimiters within special types of quotes, other special characters (was: Read CSV (SQL Context) Doesnt ignore delimiters within special types of quotes) > Read CSV (SQL Context) Doesnt ignore delimiters within special types of > quotes, other special characters > > > Key: SPARK-21586 > URL: https://issues.apache.org/jira/browse/SPARK-21586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Balasubramaniam Srinivasan
[jira] [Created] (SPARK-21586) Read CSV (SQL Context) Doesnt ignore delimiters within special types of quotes
Balasubramaniam Srinivasan created SPARK-21586: -- Summary: Read CSV (SQL Context) Doesnt ignore delimiters within special types of quotes Key: SPARK-21586 URL: https://issues.apache.org/jira/browse/SPARK-21586 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Balasubramaniam Srinivasan The read csv command doesn't ignore commas in certain cases, e.g. when a dictionary is an entry in a column. Another example is embedded double quotes: if a column entry is "Here we go"""This is causing a problem, but is handled well in pandas""" There!", pandas treats it as a single entry, but the delimiter handling doesn't work correctly in Spark DataFrames.
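For comparison, a standards-compliant CSV reader treats a doubled quote inside a quoted field as an escaped quote, so commas and quotes inside the field do not split it. Python's csv module shows the expected behavior on a made-up record in the same spirit as the reporter's (in Spark one would experiment with the CSV reader's quote/escape options; whether any combination covers the reporter's exact input is not verified here):

```python
import csv
import io

# One CSV record whose second field contains commas and doubled ("") quotes.
line = 'a,"Here we go,""quoted, with commas"", There!",c\n'

# Splitting on commas breaks the quoted field into pieces...
naive = line.strip().split(",")

# ...while a real CSV parser keeps it as one field and unescapes the quotes.
proper = next(csv.reader(io.StringIO(line)))

print(len(naive))   # 6
print(proper)       # ['a', 'Here we go,"quoted, with commas", There!', 'c']
```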
[jira] [Commented] (SPARK-21565) aggregate query fails with watermark on eventTime but works with watermark on timestamp column generated by current_timestamp
[ https://issues.apache.org/jira/browse/SPARK-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108006#comment-16108006 ] Andrew Ray commented on SPARK-21565: No nothing like the limitations of microbatches. The window can be made trivially small if you want only one timestamp per group, for example {{window(eventTime, "1 microsecond")}} And yes this should probably be checked in analysis if this is the intended limitation. > aggregate query fails with watermark on eventTime but works with watermark on > timestamp column generated by current_timestamp > - > > Key: SPARK-21565 > URL: https://issues.apache.org/jira/browse/SPARK-21565 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Amit Assudani > > *Short Description: * > Aggregation query fails with eventTime as watermark column while works with > newTimeStamp column generated by running SQL with current_timestamp, > *Exception:* > Caused by: java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:204) > at > org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:172) > at > org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70) > at > org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65) > at > org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > *Code to replicate:* > package test > import java.nio.file.{Files, Path, Paths} > import java.text.SimpleDateFormat > import org.apache.spark.sql.types._ > import org.apache.spark.sql.{SparkSession} > import scala.collection.JavaConverters._ > object Test1 { > def main(args: Array[String]) { > val sparkSession = SparkSession > .builder() > .master("local[*]") > .appName("Spark SQL basic example") > .config("spark.some.config.option", "some-value") > .getOrCreate() > val sdf = new SimpleDateFormat("-MM-dd HH:mm:ss") > val checkpointPath = "target/cp1" > val newEventsPath = Paths.get("target/newEvents/").toAbsolutePath > delete(newEventsPath) > delete(Paths.get(checkpointPath).toAbsolutePath) > Files.createDirectories(newEventsPath) > val dfNewEvents= newEvents(sparkSession) > dfNewEvents.createOrReplaceTempView("dfNewEvents") > //The below works - Start > //val dfNewEvents2 = sparkSession.sql("select *,current_timestamp as > newTimeStamp from dfNewEvents ").withWatermark("newTimeStamp","2 seconds") > //dfNewEvents2.createOrReplaceTempView("dfNewEvents2") > //val groupEvents = sparkSession.sql("select symbol,newTimeStamp, > count(price) as count1 from dfNewEvents2 group by symbol,newTimeStamp") > // End > > > //The below doesn't work - Start > val dfNewEvents2 = sparkSession.sql("select * from dfNewEvents > ").withWatermark("eventTime","2 seconds") > dfNewEvents2.createOrReplaceTempView("dfNewEvents2") > val groupEvents = sparkSession.sql("select symbol,eventTime, > count(price) as count1 from dfNewEvents2 group by symbol,eventTime") > // - End > > > val query1 = groupEvents.writeStream > .outputMode("append") > .format("console") > .option("checkpointLocation", checkpointPath) > .start("./myop") > val 
newEventFile1=newEventsPath.resolve("eventNew1.json") > Files.write(newEventFile1, List( > """{"symbol": > "GOOG","price":100,"eventTime":"2017-07-25T16:00:00.000-04:00"}""", > """{"symbol": > "GOOG","price":200,"eventTime":"2017-07-25T16:00:00.000-04:00"}""" > ).toIterable.asJava) > query1.processAllAvailable() > sparkSession.streams.awaitAnyTermination(1) > } > private def newEvents(sparkSession: SparkSession) = { > val newEvents = Paths.get("target/newEvents/").toAbsolutePath >
[jira] [Commented] (SPARK-21565) aggregate query fails with watermark on eventTime but works with watermark on timestamp column generated by current_timestamp
[ https://issues.apache.org/jira/browse/SPARK-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107968#comment-16107968 ] Amit Assudani commented on SPARK-21565: --- I am not sure, does it mean we can not use watermark without windows - kind of on a microbatch level ? If that is the case, it should not work with current_timestamp and should have validation not to allow without windows. > aggregate query fails with watermark on eventTime but works with watermark on > timestamp column generated by current_timestamp > - > > Key: SPARK-21565 > URL: https://issues.apache.org/jira/browse/SPARK-21565 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Amit Assudani > > *Short Description: * > Aggregation query fails with eventTime as watermark column while works with > newTimeStamp column generated by running SQL with current_timestamp, > *Exception:* > Caused by: java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:204) > at > org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:172) > at > org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70) > at > org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65) > at > org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > *Code to replicate:* > package test > import java.nio.file.{Files, Path, Paths} > import java.text.SimpleDateFormat > import org.apache.spark.sql.types._ > import org.apache.spark.sql.{SparkSession} > import scala.collection.JavaConverters._ > object Test1 { > def main(args: Array[String]) { > val sparkSession = SparkSession > .builder() > .master("local[*]") > .appName("Spark SQL basic example") > .config("spark.some.config.option", "some-value") > .getOrCreate() > val sdf = new SimpleDateFormat("-MM-dd HH:mm:ss") > val checkpointPath = "target/cp1" > val newEventsPath = Paths.get("target/newEvents/").toAbsolutePath > delete(newEventsPath) > delete(Paths.get(checkpointPath).toAbsolutePath) > Files.createDirectories(newEventsPath) > val dfNewEvents= newEvents(sparkSession) > dfNewEvents.createOrReplaceTempView("dfNewEvents") > //The below works - Start > //val dfNewEvents2 = sparkSession.sql("select *,current_timestamp as > newTimeStamp from dfNewEvents ").withWatermark("newTimeStamp","2 seconds") > //dfNewEvents2.createOrReplaceTempView("dfNewEvents2") > //val groupEvents = sparkSession.sql("select symbol,newTimeStamp, > count(price) as count1 from dfNewEvents2 group by symbol,newTimeStamp") > // End > > > //The below doesn't work - Start > val dfNewEvents2 = sparkSession.sql("select * from dfNewEvents > ").withWatermark("eventTime","2 seconds") > dfNewEvents2.createOrReplaceTempView("dfNewEvents2") > val groupEvents = sparkSession.sql("select symbol,eventTime, > count(price) as count1 from dfNewEvents2 group by symbol,eventTime") > // - End > > > val query1 = groupEvents.writeStream > .outputMode("append") > .format("console") > .option("checkpointLocation", checkpointPath) > .start("./myop") > val newEventFile1=newEventsPath.resolve("eventNew1.json") > 
Files.write(newEventFile1, List( > """{"symbol": > "GOOG","price":100,"eventTime":"2017-07-25T16:00:00.000-04:00"}""", > """{"symbol": > "GOOG","price":200,"eventTime":"2017-07-25T16:00:00.000-04:00"}""" > ).toIterable.asJava) > query1.processAllAvailable() > sparkSession.streams.awaitAnyTermination(1) > } > private def newEvents(sparkSession: SparkSession) = { > val newEvents = Paths.get("target/newEvents/").toAbsolutePath > delete(newEvents) >
[jira] [Commented] (SPARK-21585) Application Master marking application status as Failed for Client Mode
[ https://issues.apache.org/jira/browse/SPARK-21585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107936#comment-16107936 ] Parth Gandhi commented on SPARK-21585: -- Currently working on the fix, will file a pull request as soon as it is done. > Application Master marking application status as Failed for Client Mode > --- > > Key: SPARK-21585 > URL: https://issues.apache.org/jira/browse/SPARK-21585 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.2.0 >Reporter: Parth Gandhi >Priority: Minor > > Refer https://issues.apache.org/jira/browse/SPARK-21541 for more clarity. > The fix deployed for SPARK-21541 resulted in the Application Master setting > the final status of a Spark application as Failed in client mode, as the > flag 'registered' was not being set to true for client mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21585) Application Master marking application status as Failed for Client Mode
Parth Gandhi created SPARK-21585: Summary: Application Master marking application status as Failed for Client Mode Key: SPARK-21585 URL: https://issues.apache.org/jira/browse/SPARK-21585 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.2.0 Reporter: Parth Gandhi Priority: Minor Refer https://issues.apache.org/jira/browse/SPARK-21541 for more clarity. The fix deployed for SPARK-21541 resulted in the Application Master setting the final status of a Spark application as Failed in client mode, as the flag 'registered' was not being set to true for client mode. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21565) aggregate query fails with watermark on eventTime but works with watermark on timestamp column generated by current_timestamp
[ https://issues.apache.org/jira/browse/SPARK-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107933#comment-16107933 ] Andrew Ray commented on SPARK-21565: I believe you need to use a window to group by your event time. > aggregate query fails with watermark on eventTime but works with watermark on > timestamp column generated by current_timestamp > - > > Key: SPARK-21565 > URL: https://issues.apache.org/jira/browse/SPARK-21565 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Amit Assudani > > *Short Description: * > Aggregation query fails with eventTime as watermark column while works with > newTimeStamp column generated by running SQL with current_timestamp, > *Exception:* > Caused by: java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:347) > at scala.None$.get(Option.scala:345) > at > org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:204) > at > org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:172) > at > org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70) > at > org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65) > at > org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > *Code to replicate:* > package test > import java.nio.file.{Files, Path, Paths} > import java.text.SimpleDateFormat > import org.apache.spark.sql.types._ > import org.apache.spark.sql.{SparkSession} > import scala.collection.JavaConverters._ > object Test1 { > def main(args: Array[String]) { > val sparkSession = SparkSession > .builder() > .master("local[*]") > .appName("Spark SQL basic example") > .config("spark.some.config.option", "some-value") > .getOrCreate() > val sdf = new SimpleDateFormat("-MM-dd HH:mm:ss") > val checkpointPath = "target/cp1" > val newEventsPath = Paths.get("target/newEvents/").toAbsolutePath > delete(newEventsPath) > delete(Paths.get(checkpointPath).toAbsolutePath) > Files.createDirectories(newEventsPath) > val dfNewEvents= newEvents(sparkSession) > dfNewEvents.createOrReplaceTempView("dfNewEvents") > //The below works - Start > //val dfNewEvents2 = sparkSession.sql("select *,current_timestamp as > newTimeStamp from dfNewEvents ").withWatermark("newTimeStamp","2 seconds") > //dfNewEvents2.createOrReplaceTempView("dfNewEvents2") > //val groupEvents = sparkSession.sql("select symbol,newTimeStamp, > count(price) as count1 from dfNewEvents2 group by symbol,newTimeStamp") > // End > > > //The below doesn't work - Start > val dfNewEvents2 = sparkSession.sql("select * from dfNewEvents > ").withWatermark("eventTime","2 seconds") > dfNewEvents2.createOrReplaceTempView("dfNewEvents2") > val groupEvents = sparkSession.sql("select symbol,eventTime, > count(price) as count1 from dfNewEvents2 group by symbol,eventTime") > // - End > > > val query1 = groupEvents.writeStream > .outputMode("append") > .format("console") > .option("checkpointLocation", checkpointPath) > .start("./myop") > val newEventFile1=newEventsPath.resolve("eventNew1.json") > Files.write(newEventFile1, List( > """{"symbol": > "GOOG","price":100,"eventTime":"2017-07-25T16:00:00.000-04:00"}""", > """{"symbol": > 
"GOOG","price":200,"eventTime":"2017-07-25T16:00:00.000-04:00"}""" > ).toIterable.asJava) > query1.processAllAvailable() > sparkSession.streams.awaitAnyTermination(1) > } > private def newEvents(sparkSession: SparkSession) = { > val newEvents = Paths.get("target/newEvents/").toAbsolutePath > delete(newEvents) > Files.createDirectories(newEvents) > val dfNewEvents = > sparkSession.readStream.schema(eventsSchema).json(newEvents.toString)//.withWatermark("eventTime","2 > seconds") >
[jira] [Updated] (SPARK-21584) Update R method for summary to call new implementation
[ https://issues.apache.org/jira/browse/SPARK-21584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Ray updated SPARK-21584: --- Component/s: SQL > Update R method for summary to call new implementation > -- > > Key: SPARK-21584 > URL: https://issues.apache.org/jira/browse/SPARK-21584 > Project: Spark > Issue Type: Improvement > Components: SparkR, SQL >Affects Versions: 2.3.0 >Reporter: Andrew Ray > > Follow up to SPARK-21100 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21584) Update R method for summary to call new implementation
Andrew Ray created SPARK-21584: -- Summary: Update R method for summary to call new implementation Key: SPARK-21584 URL: https://issues.apache.org/jira/browse/SPARK-21584 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 2.3.0 Reporter: Andrew Ray Follow up to SPARK-21100 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8288) ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type
[ https://issues.apache.org/jira/browse/SPARK-8288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107798#comment-16107798 ] Drew Robb commented on SPARK-8288: -- I opened a PR for this issue, not sure why the bot didn't pick it up? https://github.com/apache/spark/pull/18766 > ScalaReflection should also try apply methods defined in companion objects > when inferring schema from a Product type > > > Key: SPARK-8288 > URL: https://issues.apache.org/jira/browse/SPARK-8288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.0 >Reporter: Cheng Lian > > This ticket is derived from PARQUET-293 (which actually describes a Spark SQL > issue). > My comment on that issue quoted below: > {quote} > ... The reason of this exception is that, the Scala code Scrooge generates > is actually a trait extending {{Product}}: > {code} > trait Junk > extends ThriftStruct > with scala.Product2[Long, String] > with java.io.Serializable > {code} > while Spark expects a case class, something like: > {code} > case class Junk(junkID: Long, junkString: String) > {code} > The key difference here is that the latter case class version has a > constructor whose arguments can be transformed into fields of the DataFrame > schema. The exception was thrown because Spark can't find such a constructor > from trait {{Junk}}. > {quote} > We can make {{ScalaReflection}} try {{apply}} methods in companion objects, > so that trait types generated by Scrooge can also be used for Spark SQL > schema inference. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
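To make the shape of the problem concrete, here is a minimal pure-Scala rendition of the pattern the issue describes: a Scrooge-style trait extending `Product2` has no constructor Spark can mine for fields, but an `apply` in its companion object carries the same information. The field names follow the quoted example; the implementation is illustrative only, not the Scrooge-generated code or the proposed `ScalaReflection` change.

```scala
// Scrooge-style shape: a trait, not a case class, so there is no
// constructor whose arguments map onto DataFrame schema fields.
trait Junk extends Product2[Long, String] with Serializable {
  def junkID: Long
  def junkString: String
  def _1: Long = junkID
  def _2: String = junkString
  def canEqual(that: Any): Boolean = that.isInstanceOf[Junk]
}

object Junk {
  // An apply in the companion object plays the role of the missing
  // case-class constructor that schema inference could try instead.
  def apply(id: Long, s: String): Junk = new Junk {
    val junkID: Long = id
    val junkString: String = s
  }
}
```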
[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL
[ https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107749#comment-16107749 ] Ruslan Dautkhanov commented on SPARK-21274: --- [~viirya], you're right. I've checked now on both PostgreSQL and Oracle. Sorry for the confusion. So the more generic rule is, [from Oracle documentation|https://docs.oracle.com/database/121/SQLRF/operators006.htm#sthref913]: {quote} For example, if a particular value occurs *m* times in nested_table1 and *n* times in nested_table2, then the result would contain the element *min(m,n)* times. ALL is the default. {quote} It's interesting that Oracle doesn't support "intersect all" on simple table sets, but only on nested table sets (through "multisets"): {code} CREATE TYPE sets_test_typ AS object ( num number ); CREATE TYPE sets_test_tab_typ AS TABLE OF sets_test_typ; with tab1 as (select 1 as z from dual union all select 2 from dual union all select 2 from dual union all select 2 from dual union all select 2 from dual ) , tab2 as ( select 1 as z from dual union all select 2 from dual union all select 2 from dual union all select 2 from dual ) SELECT * FROM table( cast(multiset(select z from tab1) as sets_test_tab_typ) multiset intersect ALL cast(multiset(select z from tab2) as sets_test_tab_typ) ) ; {code} So Oracle has returned "2" three times = min(3,4). Same test case in PostgreSQL: {code} scm=> with tab1 as (select 1 as z scm(> union all select 2 scm(> union all select 2 scm(> union all select 2 scm(> union all select 2 scm(> ) scm->, tab2 as ( scm(> select 1 as z scm(> union all select 2 scm(> union all select 2 scm(> union all select 2 scm(> ) scm-> SELECT z FROM tab1 scm-> INTERSECT all scm-> SELECT z FROM tab2 scm-> ; z --- 1 2 2 2 (4 rows) {code} The bottom line is that you're right, the above approach wouldn't work as you noticed. I still believe though it might be easier to implement except all / intersect all through query rewrite. 
For example, run a group by on both sets over the target list of columns, do a full outer join, take the min of the aggregate counts, and inject rows into the final result set according to that min(n,m) count. > Implement EXCEPT ALL and INTERSECT ALL > -- > > Key: SPARK-21274 > URL: https://issues.apache.org/jira/browse/SPARK-21274 > Project: Spark > Issue Type: New Feature > Components: Optimizer, SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: Ruslan Dautkhanov > Labels: set, sql > > 1) *EXCEPT ALL* / MINUS ALL : > {code} > SELECT a,b,c FROM tab1 > EXCEPT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as the following outer join: > {code} > SELECT a,b,c > FROM tab1 t1 > LEFT OUTER JOIN > tab2 t2 > ON ( > (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c) > ) > WHERE > COALESCE(t2.a, t2.b, t2.c) IS NULL > {code} > (register this second query as a temp view under the "*t1_except_t2_df*" name > that can also be used to find INTERSECT ALL below): > 2) *INTERSECT ALL*: > {code} > SELECT a,b,c FROM tab1 > INTERSECT ALL > SELECT a,b,c FROM tab2 > {code} > can be rewritten as the following anti-join using t1_except_t2_df we defined > above: > {code} > SELECT a,b,c > FROM tab1 t1 > WHERE >NOT EXISTS >(SELECT 1 > FROM t1_except_t2_df e > WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c) >) > {code} > So the suggestion is just to use the above query rewrites to implement both > EXCEPT ALL and INTERSECT ALL sql set operations. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
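The counting rewrite Ruslan sketches can be checked with a few lines of plain Scala (collections stand in for the group-by and join; the helper names are hypothetical): a value occurring m times on the left and n times on the right appears min(m, n) times in INTERSECT ALL and max(m - n, 0) times in EXCEPT ALL.

```scala
// INTERSECT ALL semantics: each distinct value survives min(m, n) times.
def intersectAll[A](left: Seq[A], right: Seq[A]): Seq[A] = {
  val leftCounts  = left.groupBy(identity).map { case (v, occ) => v -> occ.size }
  val rightCounts = right.groupBy(identity).map { case (v, occ) => v -> occ.size }
  leftCounts.toSeq.flatMap { case (v, m) =>
    Seq.fill(math.min(m, rightCounts.getOrElse(v, 0)))(v)
  }
}

// EXCEPT ALL semantics: each distinct value survives max(m - n, 0) times.
def exceptAll[A](left: Seq[A], right: Seq[A]): Seq[A] = {
  val rightCounts = right.groupBy(identity).map { case (v, occ) => v -> occ.size }
  left.groupBy(identity).toSeq.flatMap { case (v, occ) =>
    Seq.fill(math.max(occ.size - rightCounts.getOrElse(v, 0), 0))(v)
  }
}
```

On the tab1/tab2 data from the Oracle and PostgreSQL tests above (one 1 and four 2s versus one 1 and three 2s), `intersectAll` returns the value 2 three times, matching min(3,4) = 3.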
[jira] [Commented] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107733#comment-16107733 ] shane knapp commented on SPARK-21573: - that's odd that it's only failing on -04, as it's configured identically to all the other workers, and the anaconda installation is working fine (i just confirmed). anaconda's bin directory must be on the PATH, otherwise it'll pick up the default/system python (2.6). i set it in the worker's config, but when you kick off a bash shell (to launch the python tests), it doesn't carry PATH over correctly. i'm pretty certain that my PR will address this issue. > Tests failing with run-tests.py SyntaxError occasionally in Jenkins > --- > > Key: SPARK-21573 > URL: https://issues.apache.org/jira/browse/SPARK-21573 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > It looks default {{python}} in the path at few places such as > {{./dev/run-tests}} use Python 2.6 in Jenkins and it fails to execute > {{run-tests.py}}: > {code} > python2.6 run-tests.py > File "run-tests.py", line 124 > {m: set(m.dependencies).intersection(modules_to_test) for m in > modules_to_test}, sort=True) > ^ > SyntaxError: invalid syntax > {code} > It looks there are quite some places to fix to support Python 2.6 in > {{run-tests.py}} and related Python scripts. > We might just try to set Python 2.7 in few other scripts running this if > available. > Please also see > http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
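A sketch of the Jenkins-side change being discussed, as a shell fragment: prepend Anaconda's bin directory so any bash shell spawned for the Python tests resolves `python` to the Anaconda install instead of the system 2.6. The `/home/anaconda/bin` path and `py3k` environment name come from the comments in this thread; the helper function is illustrative, not the actual Jenkins config.

```shell
# Prepend a directory to PATH exactly once, so repeated sourcing of the
# worker config does not grow PATH unboundedly.
prepend_path() {
  case ":$PATH:" in
    *":$1:"*) ;;              # already present, keep PATH stable
    *) PATH="$1:$PATH" ;;
  esac
}

prepend_path /home/anaconda/bin
export PATH

# Activate the Python 3 environment before kicking off the tests
# (commented out here since the env only exists on the Jenkins workers):
# source activate py3k
```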
[jira] [Commented] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107713#comment-16107713 ] Sean Owen commented on SPARK-21573: --- [~shaneknapp] does it matter that this and one other script are the only ones referring to python2? I wonder if that's picking up something besides the Python3 that I assume 'py3k' is providing. I am pretty sure it's specific to a machine or machines, maybe? For example this failed 3 times in a row on amp-jenkins-worker-04 https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-sbt-hadoop-2.6/ > Tests failing with run-tests.py SyntaxError occasionally in Jenkins > --- > > Key: SPARK-21573 > URL: https://issues.apache.org/jira/browse/SPARK-21573 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > It looks default {{python}} in the path at few places such as > {{./dev/run-tests}} use Python 2.6 in Jenkins and it fails to execute > {{run-tests.py}}: > {code} > python2.6 run-tests.py > File "run-tests.py", line 124 > {m: set(m.dependencies).intersection(modules_to_test) for m in > modules_to_test}, sort=True) > ^ > SyntaxError: invalid syntax > {code} > It looks there are quite some places to fix to support Python 2.6 in > {{run-tests.py}} and related Python scripts. > We might just try to set Python 2.7 in few other scripts running this if > available. > Please also see > http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107706#comment-16107706 ] shane knapp commented on SPARK-21573: - https://github.com/databricks/spark-jenkins-configurations/pull/39 > Tests failing with run-tests.py SyntaxError occasionally in Jenkins > --- > > Key: SPARK-21573 > URL: https://issues.apache.org/jira/browse/SPARK-21573 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > It looks default {{python}} in the path at few places such as > {{./dev/run-tests}} use Python 2.6 in Jenkins and it fails to execute > {{run-tests.py}}: > {code} > python2.6 run-tests.py > File "run-tests.py", line 124 > {m: set(m.dependencies).intersection(modules_to_test) for m in > modules_to_test}, sort=True) > ^ > SyntaxError: invalid syntax > {code} > It looks there are quite some places to fix to support Python 2.6 in > {{run-tests.py}} and related Python scripts. > We might just try to set Python 2.7 in few other scripts running this if > available. > Please also see > http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins
[ https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107683#comment-16107683 ] shane knapp commented on SPARK-21573: - (sorry, was off the grid for the past week) we've been off of py2.6 for a long time. our current python installation is managed by anaconda. this behavior popped up w/the ADAM builds a couple of weeks ago. the fix there was to put the anaconda bin dir in to the PATH (/home/anaconda/bin/) in the builds, and be sure to source activate properly before tests are run (the environment is 'py3k'). i really don't know why this is randomly happening... if it failed consistently it'd be easier to track down. anyways, i'll make a PR for [~joshrosen] for the jenkins build configs and add anaconda to the PATH. > Tests failing with run-tests.py SyntaxError occasionally in Jenkins > --- > > Key: SPARK-21573 > URL: https://issues.apache.org/jira/browse/SPARK-21573 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > It looks default {{python}} in the path at few places such as > {{./dev/run-tests}} use Python 2.6 in Jenkins and it fails to execute > {{run-tests.py}}: > {code} > python2.6 run-tests.py > File "run-tests.py", line 124 > {m: set(m.dependencies).intersection(modules_to_test) for m in > modules_to_test}, sort=True) > ^ > SyntaxError: invalid syntax > {code} > It looks there are quite some places to fix to support Python 2.6 in > {{run-tests.py}} and related Python scripts. > We might just try to set Python 2.7 in few other scripts running this if > available. > Please also see > http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21578) Add JavaSparkContextSuite
[ https://issues.apache.org/jira/browse/SPARK-21578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21578: -- Description: Due to SI-8479, [SPARK-1093|https://issues.apache.org/jira/browse/SPARK-21578] introduced redundant [SparkContext constructors|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L148-L181]. However, [SI-8479|https://issues.scala-lang.org/browse/SI-8479] is already fixed in Scala 2.10.5 and Scala 2.11.1. The real reason to provide this constructor is that Java code can access `SparkContext` directly. It's Scala behavior, SI-4278. So, this PR adds an explicit testsuite, `JavaSparkContextSuite` to prevent future regression, and fixes the outdated comment, too. was: Due to SI-8479, [SPARK-1093|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L148-L181] introduced redundant constructors. Since [SI-8479|https://issues.scala-lang.org/browse/SI-8479] is fixed in Scala 2.10.5 and Scala 2.11.1. We can remove them safely. > Add JavaSparkContextSuite > - > > Key: SPARK-21578 > URL: https://issues.apache.org/jira/browse/SPARK-21578 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > Due to SI-8479, > [SPARK-1093|https://issues.apache.org/jira/browse/SPARK-21578] introduced > redundant [SparkContext > constructors|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L148-L181]. > However, [SI-8479|https://issues.scala-lang.org/browse/SI-8479] is already > fixed in Scala 2.10.5 and Scala 2.11.1. > The real reason to provide this constructor is that Java code can access > `SparkContext` directly. It's Scala behavior, SI-4278. So, this PR adds an > explicit testsuite, `JavaSparkContextSuite` to prevent future regression, > and fixes the outdated comment, too. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21578) Add JavaSparkContextSuite
[ https://issues.apache.org/jira/browse/SPARK-21578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21578: -- Summary: Add JavaSparkContextSuite (was: Consolidate redundant SparkContext constructors due to SI-8479) > Add JavaSparkContextSuite > - > > Key: SPARK-21578 > URL: https://issues.apache.org/jira/browse/SPARK-21578 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > Due to SI-8479, > [SPARK-1093|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L148-L181] > introduced redundant constructors. > Since [SI-8479|https://issues.scala-lang.org/browse/SI-8479] is fixed in > Scala 2.10.5 and Scala 2.11.1. We can remove them safely. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration
[ https://issues.apache.org/jira/browse/SPARK-21583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107652#comment-16107652 ] Bryan Cutler commented on SPARK-21583: -- I have this implemented already, will submit a PR soon. > Create a ColumnarBatch with ArrowColumnVectors for row based iteration > -- > > Key: SPARK-21583 > URL: https://issues.apache.org/jira/browse/SPARK-21583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. > It would be useful to be able to create a {{ColumnarBatch}} to allow row > based iteration over multiple {{ArrowColumnVectors}}. This would avoid extra > copying to translate column elements into rows and be more efficient memory > usage while increasing performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration
[ https://issues.apache.org/jira/browse/SPARK-21583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-21583: - Description: The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. It would be useful to be able to create a {{ColumnarBatch}} to allow row based iteration over multiple {{ArrowColumnVectors}}. This would avoid extra copying to translate column elements into rows and be more efficient memory usage while increasing performance. (was: The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. It would be useful to be able to create a {{ColumnarBatch}} to allow row based iteration over multiple {{ArrowColumnVector}} s. This would avoid extra copying to translate column elements into rows and be more efficient memory usage while increasing performance.) > Create a ColumnarBatch with ArrowColumnVectors for row based iteration > -- > > Key: SPARK-21583 > URL: https://issues.apache.org/jira/browse/SPARK-21583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. > It would be useful to be able to create a {{ColumnarBatch}} to allow row > based iteration over multiple {{ArrowColumnVectors}}. This would avoid extra > copying to translate column elements into rows and be more efficient memory > usage while increasing performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration
[ https://issues.apache.org/jira/browse/SPARK-21583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-21583: - Description: The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. It would be useful to be able to create a {{ColumnarBatch}} to allow row based iteration over multiple {{ArrowColumnVector}} s. This would avoid extra copying to translate column elements into rows and be more efficient memory usage while increasing performance. (was: The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. It would be useful to be able to create a {{ColumnarBatch}} to allow row based iteration over multiple {{ArrowColumnVector}}s. This would avoid extra copying to translate column elements into rows and be more efficient memory usage while increasing performance.) > Create a ColumnarBatch with ArrowColumnVectors for row based iteration > -- > > Key: SPARK-21583 > URL: https://issues.apache.org/jira/browse/SPARK-21583 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. > It would be useful to be able to create a {{ColumnarBatch}} to allow row > based iteration over multiple {{ArrowColumnVector}} s. This would avoid > extra copying to translate column elements into rows and be more efficient > memory usage while increasing performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration
Bryan Cutler created SPARK-21583: Summary: Create a ColumnarBatch with ArrowColumnVectors for row based iteration Key: SPARK-21583 URL: https://issues.apache.org/jira/browse/SPARK-21583 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Bryan Cutler The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data. It would be useful to be able to create a {{ColumnarBatch}} to allow row based iteration over multiple {{ArrowColumnVector}}s. This would avoid extra copying to translate column elements into rows, use memory more efficiently, and improve performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
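The idea behind the proposal can be illustrated with a toy columnar model. This is a hypothetical Python sketch, not Spark's actual {{ColumnarBatch}}/{{ArrowColumnVector}} API: a batch exposes one reusable row cursor over several read-only column vectors, so iterating row by row never materializes per-row copies of the column data.

```python
# Toy model of row-based iteration over read-only column vectors
# (hypothetical illustration, not Spark's ColumnarBatch API).

class ColumnVector:
    """A read-only column of values, analogous to an ArrowColumnVector."""
    def __init__(self, values):
        self._values = values
    def get(self, row_id):
        return self._values[row_id]

class RowView:
    """A cursor over one row; reads straight from the column vectors."""
    def __init__(self, columns):
        self._columns = columns
        self.row_id = -1
    def get(self, col_id):
        return self._columns[col_id].get(self.row_id)

class ColumnarBatch:
    """Bundles several column vectors and yields row views over them."""
    def __init__(self, columns, num_rows):
        self._columns = columns
        self._num_rows = num_rows
    def row_iterator(self):
        row = RowView(self._columns)  # one reusable view, no per-row copy
        for i in range(self._num_rows):
            row.row_id = i
            yield row

names = ColumnVector(["a", "b", "c"])
ages = ColumnVector([1, 2, 3])
batch = ColumnarBatch([names, ages], num_rows=3)
pairs = [(r.get(0), r.get(1)) for r in batch.row_iterator()]
```

The row view only holds a row index and a reference to the columns, which is the memory-efficiency argument in the issue description.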
[jira] [Updated] (SPARK-21579) dropTempView has a critical BUG
[ https://issues.apache.org/jira/browse/SPARK-21579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ant_nebula updated SPARK-21579: --- Description: When I dropTempView dwd_table1 only, the dependent table dwd_table2 also disappears from http://127.0.0.1:4040/storage/. It affects versions 2.1.1 and 2.2.0; 2.1.0 does not have this problem.
{code:java}
val spark = SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), Row("p5", 40), Row("p6", 15))
val schema = new StructType().add(StructField("name", StringType)).add(StructField("age", IntegerType))
val rowRDD = spark.sparkContext.parallelize(rows, 3)
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("ods_table")
spark.sql("cache table ods_table")
spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
spark.sql("cache table dwd_table2 as select * from dwd_table1 where name='p1'")
spark.catalog.dropTempView("dwd_table1")
//spark.catalog.dropTempView("ods_table")
spark.sql("select * from dwd_table2").show()
{code}
It will keep ods_table1 in memory although it will not be used anymore. This wastes memory, especially when my job dependency graph is much more complex. !screenshot-1.png!
was: When I dropTempView dwd_table1 only, the dependent table dwd_table2 also disappears from http://127.0.0.1:4040/storage/. It affects versions 2.1.1 and 2.2.0; 2.1.0 does not have this problem.
{code:java}
val spark = SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), Row("p5", 40), Row("p6", 15))
val schema = new StructType().add(StructField("name", StringType)).add(StructField("age", IntegerType))
val rowRDD = spark.sparkContext.parallelize(rows, 3)
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("ods_table")
spark.sql("cache table ods_table")
spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
spark.sql("cache table dwd_table2 as select * from dwd_table1 where name='p1'")
spark.catalog.dropTempView("dwd_table1")
//spark.catalog.dropTempView("ods_table")
spark.sql("select * from dwd_table2").show()
{code}
> dropTempView has a critical BUG > --- > > Key: SPARK-21579 > URL: https://issues.apache.org/jira/browse/SPARK-21579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: ant_nebula >Priority: Critical > Attachments: screenshot-1.png > > > When I dropTempView dwd_table1 only, the dependent table dwd_table2 also disappears from > http://127.0.0.1:4040/storage/. > It affects versions 2.1.1 and 2.2.0; 2.1.0 does not have this problem.
> {code:java}
> val spark = SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
> val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), Row("p5", 40), Row("p6", 15))
> val schema = new StructType().add(StructField("name", StringType)).add(StructField("age", IntegerType))
> val rowRDD = spark.sparkContext.parallelize(rows, 3)
> val df = spark.createDataFrame(rowRDD, schema)
> df.createOrReplaceTempView("ods_table")
> spark.sql("cache table ods_table")
> spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
> spark.sql("cache table dwd_table2 as select * from dwd_table1 where name='p1'")
> spark.catalog.dropTempView("dwd_table1")
> //spark.catalog.dropTempView("ods_table")
> spark.sql("select * from dwd_table2").show()
> {code}
> It will keep ods_table1 in memory although it will not be used anymore. This > wastes memory, especially when my job dependency graph is much more complex. > !screenshot-1.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
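The behaviour being reported can be modelled outside Spark. The following is a hypothetical sketch in plain Python, not Spark's actual CacheManager: cached plans form a dependency graph, and (per the SPARK-19765 change) evicting one cached relation cascades to every cached plan built on top of it, while unrelated entries stay cached.

```python
# Toy model (not Spark code) of cascading cache eviction: dropping a
# cached relation also evicts every cached plan that depends on it.

class CacheManager:
    def __init__(self):
        self._deps = {}  # cached table -> set of tables it reads from

    def cache(self, name, reads_from=()):
        self._deps[name] = set(reads_from)

    def drop(self, name):
        # Evict the table itself, then recursively evict its dependents.
        self._deps.pop(name, None)
        dependents = [t for t, deps in self._deps.items() if name in deps]
        for t in dependents:
            self.drop(t)

    def cached(self):
        return set(self._deps)

cm = CacheManager()
cm.cache("ods_table")
cm.cache("dwd_table1", reads_from=["ods_table"])
cm.cache("dwd_table2", reads_from=["dwd_table1"])
cm.drop("dwd_table1")        # dwd_table2 is evicted too; ods_table stays
remaining = cm.cached()
```

This matches the symptom in the report: dropping dwd_table1 takes dwd_table2 with it, while ods_table remains cached.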
[jira] [Commented] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table
[ https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107374#comment-16107374 ] ant_nebula commented on SPARK-19765: What a bad change! It can no longer support this scenario: https://issues.apache.org/jira/browse/SPARK-21579 > UNCACHE TABLE should also un-cache all cached plans that refer to this table > > > Key: SPARK-19765 > URL: https://issues.apache.org/jira/browse/SPARK-19765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes > Fix For: 2.1.1, 2.2.0 > > > DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, > UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all > the cached plans that refer to this table -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21579) dropTempView has a critical BUG
[ https://issues.apache.org/jira/browse/SPARK-21579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107371#comment-16107371 ] ant_nebula commented on SPARK-21579: It will keep ods_table1 in memory although it will not be used anymore. This wastes memory, especially when my job dependency graph is much more complex. !screenshot-1.png! > dropTempView has a critical BUG > --- > > Key: SPARK-21579 > URL: https://issues.apache.org/jira/browse/SPARK-21579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: ant_nebula >Priority: Critical > Attachments: screenshot-1.png > > > When I dropTempView dwd_table1 only, the dependent table dwd_table2 also disappears from > http://127.0.0.1:4040/storage/. > It affects versions 2.1.1 and 2.2.0; 2.1.0 does not have this problem.
> {code:java}
> val spark = SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
> val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), Row("p5", 40), Row("p6", 15))
> val schema = new StructType().add(StructField("name", StringType)).add(StructField("age", IntegerType))
> val rowRDD = spark.sparkContext.parallelize(rows, 3)
> val df = spark.createDataFrame(rowRDD, schema)
> df.createOrReplaceTempView("ods_table")
> spark.sql("cache table ods_table")
> spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
> spark.sql("cache table dwd_table2 as select * from dwd_table1 where name='p1'")
> spark.catalog.dropTempView("dwd_table1")
> //spark.catalog.dropTempView("ods_table")
> spark.sql("select * from dwd_table2").show()
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21579) dropTempView has a critical BUG
[ https://issues.apache.org/jira/browse/SPARK-21579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ant_nebula updated SPARK-21579: --- Attachment: screenshot-1.png > dropTempView has a critical BUG > --- > > Key: SPARK-21579 > URL: https://issues.apache.org/jira/browse/SPARK-21579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: ant_nebula >Priority: Critical > Attachments: screenshot-1.png > > > When I dropTempView dwd_table1 only, the dependent table dwd_table2 also disappears from > http://127.0.0.1:4040/storage/. > It affects versions 2.1.1 and 2.2.0; 2.1.0 does not have this problem.
> {code:java}
> val spark = SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
> val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), Row("p5", 40), Row("p6", 15))
> val schema = new StructType().add(StructField("name", StringType)).add(StructField("age", IntegerType))
> val rowRDD = spark.sparkContext.parallelize(rows, 3)
> val df = spark.createDataFrame(rowRDD, schema)
> df.createOrReplaceTempView("ods_table")
> spark.sql("cache table ods_table")
> spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
> spark.sql("cache table dwd_table2 as select * from dwd_table1 where name='p1'")
> spark.catalog.dropTempView("dwd_table1")
> //spark.catalog.dropTempView("ods_table")
> spark.sql("select * from dwd_table2").show()
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21561) spark-streaming-kafka-010 DStream is not pulling anything from Kafka
[ https://issues.apache.org/jira/browse/SPARK-21561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107324#comment-16107324 ] Vlad Badelita commented on SPARK-21561: --- Sorry, I didn't know where to ask about this issue. I have since found out it was an actual bug that someone else experienced: https://issues.apache.org/jira/browse/SPARK-18779 However, it was a Kafka bug, not a spark-streaming one, and was solved here: https://issues.apache.org/jira/browse/KAFKA-4547 Changing the kafka-clients version to 0.10.2.1 fixed it for me. > spark-streaming-kafka-010 DStream is not pulling anything from Kafka > --- > > Key: SPARK-21561 > URL: https://issues.apache.org/jira/browse/SPARK-21561 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.1 >Reporter: Vlad Badelita > Labels: kafka-0.10, spark-streaming > > I am trying to use spark-streaming-kafka-0.10 to pull messages from a kafka > topic (broker version 0.10). I have checked that messages are being produced > and used a KafkaConsumer to pull them successfully. Now, when I try to use > the Spark Streaming API, I am not getting anything. If I just use > KafkaUtils.createRDD and specify some offset ranges manually, it works. But > when I try to use createDirectStream, all the RDDs are empty, and when I > check the partition offsets it simply reports that all partitions are 0. Here > is what I tried:
> {code:scala}
> val sparkConf = new SparkConf().setAppName("kafkastream")
> val ssc = new StreamingContext(sparkConf, Seconds(3))
> val topics = Array("my_topic")
> val kafkaParams = Map[String, Object](
>    "bootstrap.servers" -> "hostname:6667",
>    "key.deserializer" -> classOf[StringDeserializer],
>    "value.deserializer" -> classOf[StringDeserializer],
>    "group.id" -> "my_group",
>    "auto.offset.reset" -> "earliest",
>    "enable.auto.commit" -> (true: java.lang.Boolean)
> )
> val stream = KafkaUtils.createDirectStream[String, String](
>    ssc,
>    PreferConsistent,
>    Subscribe[String, String](topics, kafkaParams)
> )
> stream.foreachRDD { rdd =>
>    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>    rdd.foreachPartition { iter =>
>      val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
>      println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
>    }
>    val rddCount = rdd.count()
>    println(s"rdd count: $rddCount")
>    // stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
> }
> ssc.start()
> ssc.awaitTermination()
> {code}
> All partitions show offset ranges from 0 to 0 and all RDDs are empty. I would > like it to start from the beginning of a partition but also pick up > everything that is being produced to it. > I have also tried using spark-streaming-kafka-0.8 and it does work. I think > it is a 0.10 issue because everything else works fine. Thank you! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21579) dropTempView has a critical BUG
[ https://issues.apache.org/jira/browse/SPARK-21579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107296#comment-16107296 ] Takeshi Yamamuro commented on SPARK-21579: -- I think this is an expected behaviour; `cache1` is a sub-tree of `cache2`, so if you uncache `cache1`, Spark also uncaches `cache2`.
{code}
scala> Seq(("name1", 28)).toDF("name", "age").createOrReplaceTempView("base")
scala> sql("cache table cache1 as select * from base where age >= 25")
scala> sql("cache table cache2 as select * from cache1 where name = 'p1'")
scala> spark.table("cache1").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas), `cache1`
         +- *Project [_1#2 AS name#5, _2#3 AS age#6]
            +- *Filter (_2#3 >= 25)
               +- LocalTableScan [_1#2, _2#3]

scala> spark.table("cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas), `cache2`
         +- *Filter (isnotnull(name#5) && (name#5 = p1))
            +- InMemoryTableScan [name#5, age#6], [isnotnull(name#5), (name#5 = p1)]
                  +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas), `cache1`
                        +- *Project [_1#2 AS name#5, _2#3 AS age#6]
                           +- *Filter (_2#3 >= 25)
                              +- LocalTableScan [_1#2, _2#3]
{code}
Probably, you'd better do this:
{code}
scala> sql("create temporary view tmp1 as select * from base where age >= 25")
scala> sql("cache table cache1 as select * from tmp1")
scala> sql("cache table cache2 as select * from tmp1 where name = 'p1'")
scala> catalog.dropTempView("cache1")
// scala> sql("select * from cache1").explain --> Throws AnalysisException `Table or view not found`
scala> sql("select * from cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, deserialized, 1 replicas), `cache2`
         +- *Project [_1#2 AS name#5, _2#3 AS age#6]
            +- *Filter (((_2#3 >= 25) && isnotnull(_1#2)) && (_1#2 = p1))
               +- LocalTableScan [_1#2, _2#3]
{code}
> dropTempView has a critical BUG > --- > > Key: SPARK-21579 > URL: https://issues.apache.org/jira/browse/SPARK-21579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: ant_nebula >Priority: Critical > > When I dropTempView dwd_table1 only, the dependent table dwd_table2 also disappears from > http://127.0.0.1:4040/storage/. > It affects versions 2.1.1 and 2.2.0; 2.1.0 does not have this problem.
> {code:java}
> val spark = SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
> val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), Row("p5", 40), Row("p6", 15))
> val schema = new StructType().add(StructField("name", StringType)).add(StructField("age", IntegerType))
> val rowRDD = spark.sparkContext.parallelize(rows, 3)
> val df = spark.createDataFrame(rowRDD, schema)
> df.createOrReplaceTempView("ods_table")
> spark.sql("cache table ods_table")
> spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
> spark.sql("cache table dwd_table2 as select * from dwd_table1 where name='p1'")
> spark.catalog.dropTempView("dwd_table1")
> //spark.catalog.dropTempView("ods_table")
> spark.sql("select * from dwd_table2").show()
> {code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21559) Remove Mesos fine-grained mode
[ https://issues.apache.org/jira/browse/SPARK-21559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107147#comment-16107147 ] Stavros Kontopoulos commented on SPARK-21559: - PR here: https://github.com/apache/spark/pull/18784 > Remove Mesos fine-grained mode > -- > > Key: SPARK-21559 > URL: https://issues.apache.org/jira/browse/SPARK-21559 > Project: Spark > Issue Type: Task > Components: Mesos >Affects Versions: 2.2.0 >Reporter: Stavros Kontopoulos > > After discussing this with people from Mesosphere, we agreed that it is time > to remove fine-grained mode. Plans are to improve cluster mode to cover any > benefits that may have existed when using fine-grained mode. > [~susanxhuynh] > Previous status of this can be found here: > https://issues.apache.org/jira/browse/SPARK-11857 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21578) Consolidate redundant SparkContext constructors due to SI-8479
[ https://issues.apache.org/jira/browse/SPARK-21578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21578: - Assignee: Dongjoon Hyun Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) > Consolidate redundant SparkContext constructors due to SI-8479 > -- > > Key: SPARK-21578 > URL: https://issues.apache.org/jira/browse/SPARK-21578 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > > Due to SI-8479, > [SPARK-1093|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L148-L181] > introduced redundant constructors. > Since [SI-8479|https://issues.scala-lang.org/browse/SI-8479] is fixed in > Scala 2.10.5 and 2.11.1, we can remove them safely. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21544) Test jar of some module should not install or deploy twice
[ https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-21544: - Assignee: zhoukang Priority: Minor (was: Major) > Test jar of some module should not install or deploy twice > -- > > Key: SPARK-21544 > URL: https://issues.apache.org/jira/browse/SPARK-21544 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.6.1, 2.1.0 >Reporter: zhoukang >Assignee: zhoukang >Priority: Minor > > For moudle below: > common/network-common > streaming > sql/core > sql/catalyst > tests.jar will install or deploy twice.Like: > {code:java} > [INFO] Installing > /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > [DEBUG] Writing tracking file > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/_remote.repositories > [DEBUG] Installing > org.apache.spark:spark-streaming_2.11:2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata.xml > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata-local.xml > [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml > [INFO] Installing > /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > to > /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > [DEBUG] Skipped re-installing > /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar > to > 
/home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar, > seems unchanged > {code} > The reason is below: > {code:java} > [DEBUG] (f) artifact = > org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT > [DEBUG] (f) attachedArtifacts = > [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark > -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT, > org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0 > -mdh2.1.0.1-SNAPSHOT] > {code} > When executing 'mvn deploy' to nexus during a release, it will fail since the release nexus cannot be overridden. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21581) Spark 2.x distinct return incorrect result
[ https://issues.apache.org/jira/browse/SPARK-21581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107081#comment-16107081 ] Takeshi Yamamuro commented on SPARK-21581: -- This is expected behaviour in Spark 2.x; the behaviour changed in how corrupt records are handled.
{code}
// Spark-v1.6.3
scala> sqlContext.read.json("simple.json").show
+-------+------+--------------------+
|   name|salary|                 url|
+-------+------+--------------------+
| staff1| 600.0|http://example.ho...|
| staff2| 700.0|http://example.ho...|
| staff3| 800.0|http://example.ho...|
| staff4| 900.0|http://example.ho...|
| staff5|1000.0|http://example.ho...|
| staff6|  null|http://example.ho...|
| staff7|  null|http://example.ho...|
| staff8|  null|http://example.ho...|
| staff9|  null|http://example.ho...|
|staff10|  null|http://example.ho...|
+-------+------+--------------------+

// master
scala> spark.read.schema("name STRING, salary DOUBLE, url STRING, _malformed STRING").option("columnNameOfCorruptRecord", "_malformed").json("/Users/maropu/Desktop/simple.json").show
+------+------+--------------------+--------------------+
|  name|salary|                 url|          _malformed|
+------+------+--------------------+--------------------+
|staff1| 600.0|http://example.ho...|                null|
|staff2| 700.0|http://example.ho...|                null|
|staff3| 800.0|http://example.ho...|                null|
|staff4| 900.0|http://example.ho...|                null|
|staff5|1000.0|http://example.ho...|                null|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
+------+------+--------------------+--------------------+
{code}
> Spark 2.x distinct return incorrect result > -- > > Key: SPARK-21581 > URL: https://issues.apache.org/jira/browse/SPARK-21581 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: shengyao piao > > Hi all > I'm using Spark2.x on cdh5.11 > I have a json file as follows. 
> ・sample.json
> {code}
> {"url": "http://example.hoge/staff1", "name": "staff1", "salary":600.0}
> {"url": "http://example.hoge/staff2", "name": "staff2", "salary":700}
> {"url": "http://example.hoge/staff3", "name": "staff3", "salary":800}
> {"url": "http://example.hoge/staff4", "name": "staff4", "salary":900}
> {"url": "http://example.hoge/staff5", "name": "staff5", "salary":1000.0}
> {"url": "http://example.hoge/staff6", "name": "staff6", "salary":""}
> {"url": "http://example.hoge/staff7", "name": "staff7", "salary":""}
> {"url": "http://example.hoge/staff8", "name": "staff8", "salary":""}
> {"url": "http://example.hoge/staff9", "name": "staff9", "salary":""}
> {"url": "http://example.hoge/staff10", "name": "staff10", "salary":""}
> {code}
> And I try to read this file and distinct.
> ・spark code
> {code}
> val s=spark.read.json("sample.json")
> s.count
> res13: Long = 10
> s.distinct.count
> res14: Long = 6  <- It should be 10
> {code}
> I know the cause of the incorrect result is the mixed type in the salary field. > But when I try the same code in Spark 1.6 the result will be 10. > So I think it's a bug in Spark 2.x. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
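The six distinct rows follow from the mechanism Takeshi describes: without a corrupt-record column in the schema, Spark 2.x's permissive JSON parsing turns each unparseable record into a row of all nulls, and the five identical all-null rows collapse into one under distinct (5 valid rows + 1 null row = 6). A minimal pure-Python sketch of that mechanism (no Spark required; the parse function is an illustration of permissive mode, not Spark's actual code):

```python
import json

# The ten records from sample.json; the last five have salary "" which
# cannot be read as DOUBLE under the inferred schema.
lines = [
    '{"url": "http://example.hoge/staff%d", "name": "staff%d", "salary": %s}'
    % (i, i, salary)
    for i, salary in zip(range(1, 11),
                         ["600.0", "700", "800", "900", "1000.0"] + ['""'] * 5)
]

def parse(line):
    """Mimic Spark 2.x permissive mode without a corrupt-record column:
    if any field fails to convert, the whole row becomes nulls."""
    rec = json.loads(line)
    try:
        return (rec["name"], float(rec["salary"]), rec["url"])
    except (TypeError, ValueError):
        return (None, None, None)

rows = [parse(line) for line in lines]
print(len(rows))       # 10 rows read
print(len(set(rows)))  # 6 distinct: five valid rows plus one all-null row
```

Adding the `_malformed` column to the schema, as in the comment above, keeps the original text of each corrupt record and so keeps the ten rows distinct.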
[jira] [Comment Edited] (SPARK-21581) Spark 2.x distinct return incorrect result
[ https://issues.apache.org/jira/browse/SPARK-21581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107081#comment-16107081 ] Takeshi Yamamuro edited comment on SPARK-21581 at 7/31/17 10:08 AM: This is expected behaviour in Spark 2.x; the behaviour changed in how corrupt records are handled.
{code}
// Spark-v1.6.3
scala> sqlContext.read.json("simple.json").show
+-------+------+--------------------+
|   name|salary|                 url|
+-------+------+--------------------+
| staff1| 600.0|http://example.ho...|
| staff2| 700.0|http://example.ho...|
| staff3| 800.0|http://example.ho...|
| staff4| 900.0|http://example.ho...|
| staff5|1000.0|http://example.ho...|
| staff6|  null|http://example.ho...|
| staff7|  null|http://example.ho...|
| staff8|  null|http://example.ho...|
| staff9|  null|http://example.ho...|
|staff10|  null|http://example.ho...|
+-------+------+--------------------+

// master
scala> spark.read.schema("name STRING, salary DOUBLE, url STRING, _malformed STRING").option("columnNameOfCorruptRecord", "_malformed").json("simple.json").show
+------+------+--------------------+--------------------+
|  name|salary|                 url|          _malformed|
+------+------+--------------------+--------------------+
|staff1| 600.0|http://example.ho...|                null|
|staff2| 700.0|http://example.ho...|                null|
|staff3| 800.0|http://example.ho...|                null|
|staff4| 900.0|http://example.ho...|                null|
|staff5|1000.0|http://example.ho...|                null|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
+------+------+--------------------+--------------------+
{code}
was (Author: maropu): This is expected behaviour in Spark 2.x; the behaviour changed in how corrupt records are handled. 
{code}
// Spark-v1.6.3
scala> sqlContext.read.json("simple.json").show
+-------+------+--------------------+
|   name|salary|                 url|
+-------+------+--------------------+
| staff1| 600.0|http://example.ho...|
| staff2| 700.0|http://example.ho...|
| staff3| 800.0|http://example.ho...|
| staff4| 900.0|http://example.ho...|
| staff5|1000.0|http://example.ho...|
| staff6|  null|http://example.ho...|
| staff7|  null|http://example.ho...|
| staff8|  null|http://example.ho...|
| staff9|  null|http://example.ho...|
|staff10|  null|http://example.ho...|
+-------+------+--------------------+

// master
scala> spark.read.schema("name STRING, salary DOUBLE, url STRING, _malformed STRING").option("columnNameOfCorruptRecord", "_malformed").json("/Users/maropu/Desktop/simple.json").show
+------+------+--------------------+--------------------+
|  name|salary|                 url|          _malformed|
+------+------+--------------------+--------------------+
|staff1| 600.0|http://example.ho...|                null|
|staff2| 700.0|http://example.ho...|                null|
|staff3| 800.0|http://example.ho...|                null|
|staff4| 900.0|http://example.ho...|                null|
|staff5|1000.0|http://example.ho...|                null|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
|  null|  null|                null|{"url": "http://e...|
+------+------+--------------------+--------------------+
{code}
> Spark 2.x distinct return incorrect result > -- > > Key: SPARK-21581 > URL: https://issues.apache.org/jira/browse/SPARK-21581 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0, 2.1.0, 2.2.0 >Reporter: shengyao piao > > Hi all > I'm using Spark2.x on cdh5.11 > I have a json file as follows. 
> ・sample.json
> {code}
> {"url": "http://example.hoge/staff1", "name": "staff1", "salary":600.0}
> {"url": "http://example.hoge/staff2", "name": "staff2", "salary":700}
> {"url": "http://example.hoge/staff3", "name": "staff3", "salary":800}
> {"url": "http://example.hoge/staff4", "name": "staff4", "salary":900}
> {"url": "http://example.hoge/staff5", "name": "staff5", "salary":1000.0}
> {"url": "http://example.hoge/staff6", "name": "staff6", "salary":""}
> {"url": "http://example.hoge/staff7", "name": "staff7", "salary":""}
> {"url": "http://example.hoge/staff8", "name": "staff8", "salary":""}
> {"url": "http://example.hoge/staff9", "name": "staff9", "salary":""}
> {"url": "http://example.hoge/staff10", "name": "staff10", "salary":""}
> {code}
> And I try to read this file and distinct.
> ・spark code
> {code}
[jira] [Resolved] (SPARK-21099) INFO Log Message Using Incorrect Executor Idle Timeout Value
[ https://issues.apache.org/jira/browse/SPARK-21099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21099. --- Resolution: Won't Fix > INFO Log Message Using Incorrect Executor Idle Timeout Value > > > Key: SPARK-21099 > URL: https://issues.apache.org/jira/browse/SPARK-21099 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.1.0 >Reporter: Hazem Mahmoud >Priority: Trivial > > INFO log message is using the wrong idle timeout > (spark.dynamicAllocation.executorIdleTimeout) when printing the message that > the executor holding the RDD cache is being removed. > INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been > idle for 30 seconds (new desired total will be 0) > It should be using spark.dynamicAllocation.cachedExecutorIdleTimeout when the > RDD cache timeout is reached. I was able to confirm this by doing the > following: > 1. Update spark-defaults.conf to set the following: > executorIdleTimeout=30 > cachedExecutorIdleTimeout=20 > 2. Update log4j.properties to set the following: > shell.log.level=INFO > 3. Run the following in spark-shell: > scala> val textFile = sc.textFile("/user/spark/applicationHistory/app_1234") > scala> textFile.cache().count() > 4. After 30 secs you will see 2 timeout messages, both of which say 30 secs > (whereas one *should* say 20 secs) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
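The fix the report implies is to pick the timeout that actually triggered the removal when formatting the log line, rather than always reading the plain idle timeout. A hypothetical sketch of that selection (function and parameter names are illustrative, not Spark's internals):

```python
def removal_message(executor_id, has_cached_blocks,
                    idle_timeout_s=30, cached_idle_timeout_s=20):
    """Pick the timeout that actually applies to this executor when
    logging its removal (illustrative sketch, not Spark's real code).
    Defaults mirror the spark-defaults.conf values from the report."""
    timeout = cached_idle_timeout_s if has_cached_blocks else idle_timeout_s
    return (f"Removing executor {executor_id} because it has been idle "
            f"for {timeout} seconds")

# An executor holding RDD cache should report the cached timeout (20 s),
# a plain idle executor the regular one (30 s).
print(removal_message(1, has_cached_blocks=True))
print(removal_message(2, has_cached_blocks=False))
```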
[jira] [Commented] (SPARK-21582) DataFrame.withColumnRenamed cause huge performance overhead
[ https://issues.apache.org/jira/browse/SPARK-21582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107014#comment-16107014 ] Takeshi Yamamuro commented on SPARK-21582: -- I feel making ~900 dataframes is heavy... > DataFrame.withColumnRenamed cause huge performance overhead > --- > > Key: SPARK-21582 > URL: https://issues.apache.org/jira/browse/SPARK-21582 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: GuangFancui(ISCAS) > Attachments: 4654.stack > > > Table "item_feature" (DataFrame) has over 900 columns. > When I use > {code:java} > val nameSequeceExcept = Set("gid","category_name","merchant_id") > val df1 = spark.table("item_feature") > val newdf1 = df1.schema.map(_.name).filter(name => > !nameSequeceExcept.contains(name)).foldLeft(df1)((df1, name) => > df1.withColumnRenamed(name, name + "_1" )) > {code} > It took over 30 minutes. > *PID* in stack file is *0x126d* > It seems that _transform_ took too long time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20564) a lot of executor failures when the executor number is more than 2000
[ https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-20564. --- Resolution: Won't Fix > a lot of executor failures when the executor number is more than 2000 > - > > Key: SPARK-20564 > URL: https://issues.apache.org/jira/browse/SPARK-20564 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.6.2, 2.1.0 >Reporter: Hua Liu >Priority: Minor > > When we used more than 2000 executors in a Spark application, we noticed that a large number of executors could not connect to the driver and as a result were marked as failed. In some cases, the number of failed executors reached twice the requested executor count, and applications thus retried and could eventually fail. > This is because YarnAllocator requests all missing containers every spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, YarnAllocator can ask for and get over 2000 containers in one request, and then launch them almost simultaneously. These thousands of executors try to retrieve spark props and register with the driver within seconds. However, the driver handles executor registration, stop, removal and spark props retrieval in one thread, and it cannot handle such a large number of RPCs within a short period of time. As a result, some executors cannot retrieve spark props and/or register; these executors are then marked as failed, causing executor removal and aggravating the overloading of the driver, which leads to more executor failures. > This patch adds an extra configuration, spark.yarn.launchContainer.count.simultaneously, which caps the maximum number of containers that the driver can ask for in every spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of executors grows steadily, the number of executor failures is reduced, and applications can reach the desired number of executors faster. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
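The capping behaviour described above is easy to model: instead of requesting all missing containers in one heartbeat, request at most a fixed number per heartbeat so registrations arrive at a rate the driver's single RPC thread can absorb. A simplified simulation of the idea (the config name comes from the issue description; the loop is a sketch, not YarnAllocator code):

```python
def heartbeats_to_target(target, max_per_heartbeat):
    """Simulate requesting at most max_per_heartbeat containers on every
    spark.yarn.scheduler.heartbeat.interval-ms tick until the target
    executor count is reached. Illustrative only."""
    running, beats = 0, 0
    while running < target:
        running += min(max_per_heartbeat, target - running)
        beats += 1
    return beats

# Uncapped: all 2000 containers requested in one burst; capped at 200
# per heartbeat: ten heartbeats of steady, absorbable growth.
print(heartbeats_to_target(2000, 2000))  # 1
print(heartbeats_to_target(2000, 200))   # 10
```

The trade-off is explicit: a smaller cap spreads registration load over more heartbeats, reaching the desired executor count slightly later but with far fewer registration failures.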
[jira] [Commented] (SPARK-21582) DataFrame.withColumnRenamed cause huge performance overhead
[ https://issues.apache.org/jira/browse/SPARK-21582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107009#comment-16107009 ] Sean Owen commented on SPARK-21582: --- This is a really slow way to rename the columns. I think you want to just set the schema manually? or pass the col names in a call to toDF()? you have a chain of 900 operations here. > DataFrame.withColumnRenamed cause huge performance overhead > --- > > Key: SPARK-21582 > URL: https://issues.apache.org/jira/browse/SPARK-21582 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: GuangFancui(ISCAS) > Attachments: 4654.stack > > > Table "item_feature" (DataFrame) has over 900 columns. > When I use > {code:java} > val nameSequeceExcept = Set("gid","category_name","merchant_id") > val df1 = spark.table("item_feature") > val newdf1 = df1.schema.map(_.name).filter(name => > !nameSequeceExcept.contains(name)).foldLeft(df1)((df1, name) => > df1.withColumnRenamed(name, name + "_1" )) > {code} > It took over 30 minutes. > *PID* in stack file is *0x126d* > It seems that _transform_ took too long time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
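Sean's point can be quantified: each withColumnRenamed call produces a new projection over all ~900 columns (and re-analyzes the growing plan), so a chain of ~900 renames does quadratic work, while a single call that supplies the full list of new names touches the schema once. A rough pure-Python cost model of that difference (this models only the shape of the work, not Spark's actual planner):

```python
# Illustrative cost model: each chained rename rebuilds the full column
# list, like adding one projection node per rename; a single pass builds
# the renamed list once.
def chained_renames(columns, skip):
    work = 0
    cols = list(columns)
    for name in columns:
        if name in skip:
            continue
        cols = [c + "_1" if c == name else c for c in cols]  # full copy
        work += len(cols)
    return cols, work

def single_pass_rename(columns, skip):
    cols = [c if c in skip else c + "_1" for c in columns]
    return cols, len(cols)

columns = [f"col{i}" for i in range(900)]
skip = {"col0"}  # stand-in for gid/category_name/merchant_id
a, work_chained = chained_renames(columns, skip)
b, work_single = single_pass_rename(columns, skip)
assert a == b                          # same final schema
print(work_chained, work_single)       # 809100 vs 900 column copies
```

In Spark the single-pass equivalent is to pass all the new names at once, e.g. via `toDF(newNames: _*)` as suggested above, instead of folding over withColumnRenamed.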
[jira] [Commented] (SPARK-21577) Issue is handling too many aggregations
[ https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106971#comment-16106971 ] Hyukjin Kwon commented on SPARK-21577: -- No ... I mean sending it to u...@spark.apache.org as described in the link I used so that other Spark users / developers can receive your email and discuss ... > Issue is handling too many aggregations > > > Key: SPARK-21577 > URL: https://issues.apache.org/jira/browse/SPARK-21577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Cloudera CDH 1.8.3 > Spark 1.6.0 >Reporter: Kannan Subramanian > > my requirement, reading the table from hive(Size - around 1.6 TB). I have to > do more than 200 aggregation operations mostly avg, sum and std_dev. Spark > application total execution time is take more than 12 hours. To Optimize the > code I used shuffle Partitioning and memory tuning and all. But Its > nothelpful for me. Please note that same query I ran in hive on map reduce. > MR job completion time taken around only 5 hours. Kindly let me know is > there any way to optimize or efficient way of handling multiple aggregation > operations.val inputDataDF = > hiveContext.read.parquet("/inputparquetData") > inputDataDF.groupBy("seq_no","year", > "month","radius").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"), > avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),sum("sl"),sum($"PA"),sum($"DS")... > like 200 columns) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
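Wherever the discussion continues, the 200 aggregates themselves are cheap in principle: avg, sum, and std_dev can all be derived from per-group running (count, sum, sum-of-squares) accumulated in a single pass, which is essentially what Spark's partial aggregation does under one groupBy. A stdlib-only sketch of that single-pass idea (column names are borrowed from the snippet above; population std_dev is used for simplicity; this is an illustration, not Spark code):

```python
import math
from collections import defaultdict

def grouped_stats(rows, key, value_cols):
    """One pass: per group and column keep [n, sum, sum of squares],
    then derive sum, avg and population std_dev at the end."""
    acc = defaultdict(lambda: defaultdict(lambda: [0, 0.0, 0.0]))
    for row in rows:
        for col in value_cols:
            cell = acc[row[key]][col]
            v = row[col]
            cell[0] += 1
            cell[1] += v
            cell[2] += v * v
    out = {}
    for g, cols in acc.items():
        out[g] = {}
        for col, (n, s, ss) in cols.items():
            mean = s / n
            out[g][col] = {"sum": s, "avg": mean,
                           "std_dev": math.sqrt(max(ss / n - mean * mean, 0.0))}
    return out

rows = [{"seq_no": 1, "Emp": 2.0}, {"seq_no": 1, "Emp": 4.0},
        {"seq_no": 2, "Emp": 10.0}]
stats = grouped_stats(rows, "seq_no", ["Emp"])
print(stats[1]["Emp"]["avg"])      # 3.0
print(stats[1]["Emp"]["std_dev"])  # 1.0
```

Since one scan suffices for all 200 aggregates, the 12-hour runtime is more likely dominated by shuffle and input size than by the number of aggregate expressions.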
[jira] [Commented] (SPARK-21577) Issue is handling too many aggregations
[ https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106970#comment-16106970 ] Kannan Subramanian commented on SPARK-21577: Thanks Hyukjin Kwon, I have sent email about the more number of aggregations performance issue to gurwls...@gmail.com -Kannan > Issue is handling too many aggregations > > > Key: SPARK-21577 > URL: https://issues.apache.org/jira/browse/SPARK-21577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Cloudera CDH 1.8.3 > Spark 1.6.0 >Reporter: Kannan Subramanian > > my requirement, reading the table from hive(Size - around 1.6 TB). I have to > do more than 200 aggregation operations mostly avg, sum and std_dev. Spark > application total execution time is take more than 12 hours. To Optimize the > code I used shuffle Partitioning and memory tuning and all. But Its > nothelpful for me. Please note that same query I ran in hive on map reduce. > MR job completion time taken around only 5 hours. Kindly let me know is > there any way to optimize or efficient way of handling multiple aggregation > operations.val inputDataDF = > hiveContext.read.parquet("/inputparquetData") > inputDataDF.groupBy("seq_no","year", > "month","radius").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"), > avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),sum("sl"),sum($"PA"),sum($"DS")... > like 200 columns) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21582) DataFrame.withColumnRenamed cause huge performance overhead
[ https://issues.apache.org/jira/browse/SPARK-21582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] GuangFancui(ISCAS) updated SPARK-21582: --- Attachment: 4654.stack PID is 0x126d > DataFrame.withColumnRenamed cause huge performance overhead > --- > > Key: SPARK-21582 > URL: https://issues.apache.org/jira/browse/SPARK-21582 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: GuangFancui(ISCAS) > Attachments: 4654.stack > > > Table "item_feature" (DataFrame) has over 900 columns. > When I use > {code:java} > val nameSequeceExcept = Set("gid","category_name","merchant_id") > val df1 = spark.table("item_feature") > val newdf1 = df1.schema.map(_.name).filter(name => > !nameSequeceExcept.contains(name)).foldLeft(df1)((df1, name) => > df1.withColumnRenamed(name, name + "_1" )) > {code} > It took over 30 minutes. > *PID* in stack file is *0x126d* > It seems that _transform_ took too long time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21582) DataFrame.withColumnRenamed cause huge performance overhead
GuangFancui(ISCAS) created SPARK-21582: -- Summary: DataFrame.withColumnRenamed cause huge performance overhead Key: SPARK-21582 URL: https://issues.apache.org/jira/browse/SPARK-21582 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: GuangFancui(ISCAS) Table "item_feature" (DataFrame) has over 900 columns. When I use {code:java} val nameSequeceExcept = Set("gid","category_name","merchant_id") val df1 = spark.table("item_feature") val newdf1 = df1.schema.map(_.name).filter(name => !nameSequeceExcept.contains(name)).foldLeft(df1)((df1, name) => df1.withColumnRenamed(name, name + "_1" )) {code} It took over 30 minutes. *PID* in stack file is *0x126d* It seems that _transform_ took too long time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19043) Make SparkSQLSessionManager more configurable
[ https://issues.apache.org/jira/browse/SPARK-19043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19043. --- Resolution: Won't Fix > Make SparkSQLSessionManager more configurable > - > > Key: SPARK-19043 > URL: https://issues.apache.org/jira/browse/SPARK-19043 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Kent Yao >Priority: Minor > > We can only set the pool size of the HiveServer2 background operation thread > pool size with Spark Thrift Server. But in HiveServer2, we can also set > background operation thread pool size and thread keep alive time. I think we > can make it same with hive's behavior. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21567) Dataset with Tuple of type alias throws error
[ https://issues.apache.org/jira/browse/SPARK-21567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106920#comment-16106920 ] Tomasz Bartczak commented on SPARK-21567: - That's good there is a workaround. However, as I think about it, shouldn't Spark's implicits get encoders right for basic data types, case classes, and other primitives like tuples? I am not sure if it is a Spark bug or a Scala bug, but from a user perspective it is unexpected behaviour. > Dataset with Tuple of type alias throws error > - > > Key: SPARK-21567 > URL: https://issues.apache.org/jira/browse/SPARK-21567 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: verified for spark 2.1.1 and 2.2.0 in sbt build >Reporter: Tomasz Bartczak > > returning from a map a thing that is a tuple containing another tuple - defined > as a type alias - we receive an error. > minimal reproducible case: > having a structure like this: > {code} > object C { > type TwoInt = (Int,Int) > def tupleTypeAlias: TwoInt = (1,1) > } > {code} > when I do: > {code} > Seq(1).toDS().map(_ => ("",C.tupleTypeAlias)) > {code} > I get exception: > {code} > type T1 is not a class > scala.ScalaReflectionException: type T1 is not a class > at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275) > at > scala.reflect.internal.Symbols$SymbolContextApiImpl.asClass(Symbols.scala:84) > at > org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:682) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:84) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:614) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at scala.collection.immutable.List.foreach(List.scala:381) > at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.immutable.List.flatMap(List.scala:344) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607) > at > org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71) > at org.apache.spark.sql.Encoders$.product(Encoders.scala:275) > at > org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233) > at > org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33) > {code} > in spark 2.1.1 the last exception was 'head of an empty list' -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21577) Issue is handling too many aggregations
[ https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106855#comment-16106855 ] Hyukjin Kwon edited comment on SPARK-21577 at 7/31/17 6:26 AM: --- Please check out https://spark.apache.org/community.html and I guess you can start the thread. was (Author: hyukjin.kwon): Please check out https://spark.apache.org/community.html. > Issue is handling too many aggregations > > > Key: SPARK-21577 > URL: https://issues.apache.org/jira/browse/SPARK-21577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Cloudera CDH 1.8.3 > Spark 1.6.0 >Reporter: Kannan Subramanian > > my requirement, reading the table from hive(Size - around 1.6 TB). I have to > do more than 200 aggregation operations mostly avg, sum and std_dev. Spark > application total execution time is take more than 12 hours. To Optimize the > code I used shuffle Partitioning and memory tuning and all. But Its > nothelpful for me. Please note that same query I ran in hive on map reduce. > MR job completion time taken around only 5 hours. Kindly let me know is > there any way to optimize or efficient way of handling multiple aggregation > operations.val inputDataDF = > hiveContext.read.parquet("/inputparquetData") > inputDataDF.groupBy("seq_no","year", > "month","radius").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"), > avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),sum("sl"),sum($"PA"),sum($"DS")... > like 200 columns) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21577) Issue is handling too many aggregations
[ https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106855#comment-16106855 ] Hyukjin Kwon commented on SPARK-21577: -- Please check out https://spark.apache.org/community.html. > Issue is handling too many aggregations > > > Key: SPARK-21577 > URL: https://issues.apache.org/jira/browse/SPARK-21577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Cloudera CDH 1.8.3 > Spark 1.6.0 >Reporter: Kannan Subramanian > > my requirement, reading the table from hive(Size - around 1.6 TB). I have to > do more than 200 aggregation operations mostly avg, sum and std_dev. Spark > application total execution time is take more than 12 hours. To Optimize the > code I used shuffle Partitioning and memory tuning and all. But Its > nothelpful for me. Please note that same query I ran in hive on map reduce. > MR job completion time taken around only 5 hours. Kindly let me know is > there any way to optimize or efficient way of handling multiple aggregation > operations.val inputDataDF = > hiveContext.read.parquet("/inputparquetData") > inputDataDF.groupBy("seq_no","year", > "month","radius").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"), > avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),sum("sl"),sum($"PA"),sum($"DS")... > like 200 columns) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21577) Issue is handling too many aggregations
[ https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106848#comment-16106848 ] Kannan Subramanian commented on SPARK-21577: please let me know if there is any mail thread started for above issue ? Thanks > Issue is handling too many aggregations > > > Key: SPARK-21577 > URL: https://issues.apache.org/jira/browse/SPARK-21577 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Cloudera CDH 1.8.3 > Spark 1.6.0 >Reporter: Kannan Subramanian > > my requirement, reading the table from hive(Size - around 1.6 TB). I have to > do more than 200 aggregation operations mostly avg, sum and std_dev. Spark > application total execution time is take more than 12 hours. To Optimize the > code I used shuffle Partitioning and memory tuning and all. But Its > nothelpful for me. Please note that same query I ran in hive on map reduce. > MR job completion time taken around only 5 hours. Kindly let me know is > there any way to optimize or efficient way of handling multiple aggregation > operations.val inputDataDF = > hiveContext.read.parquet("/inputparquetData") > inputDataDF.groupBy("seq_no","year", > "month","radius").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"), > avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),sum("sl"),sum($"PA"),sum($"DS")... > like 200 columns) -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org