[jira] [Resolved] (SPARK-21381) SparkR: pass on setHandleInvalid for classification algorithms

2017-07-31 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-21381.
--
  Resolution: Fixed
Assignee: Miao Wang
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> SparkR: pass on setHandleInvalid for classification algorithms
> --
>
> Key: SPARK-21381
> URL: https://issues.apache.org/jira/browse/SPARK-21381
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Miao Wang
>Assignee: Miao Wang
> Fix For: 2.3.0
>
>
> SPARK-20307 added the handleInvalid option to RFormula for tree-based 
> classification algorithms. We should add this parameter for the other 
> classification algorithms in SparkR.
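>
> A minimal Scala-side sketch of the parameter to expose through the SparkR 
> wrappers, assuming the handleInvalid param that SPARK-20307 added to RFormula 
> (the formula and the chosen policy below are only illustrative):
> {code}
> import org.apache.spark.ml.feature.RFormula
>
> // handleInvalid controls rows with invalid/unseen categorical values:
> // "error" (the default) fails the job, "skip" drops the offending rows.
> val formula = new RFormula()
>   .setFormula("label ~ .")
>   .setHandleInvalid("skip")
> {code}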



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21581) Spark 2.x distinct return incorrect result

2017-07-31 Thread shengyao piao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108324#comment-16108324
 ] 

shengyao piao commented on SPARK-21581:
---

[~maropu]
Thank you very much for your detailed explanation.
I understand this behavior now.

Thank you!


> Spark 2.x distinct return incorrect result
> --
>
> Key: SPARK-21581
> URL: https://issues.apache.org/jira/browse/SPARK-21581
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: shengyao piao
>
> Hi all
> I'm using Spark 2.x on CDH 5.11.
> I have a JSON file as follows.
> ・sample.json
> {code}
> {"url": "http://example.hoge/staff1", "name": "staff1", "salary":600.0}
> {"url": "http://example.hoge/staff2", "name": "staff2", "salary":700}
> {"url": "http://example.hoge/staff3", "name": "staff3", "salary":800}
> {"url": "http://example.hoge/staff4", "name": "staff4", "salary":900}
> {"url": "http://example.hoge/staff5", "name": "staff5", "salary":1000.0}
> {"url": "http://example.hoge/staff6", "name": "staff6", "salary":""}
> {"url": "http://example.hoge/staff7", "name": "staff7", "salary":""}
> {"url": "http://example.hoge/staff8", "name": "staff8", "salary":""}
> {"url": "http://example.hoge/staff9", "name": "staff9", "salary":""}
> {"url": "http://example.hoge/staff10", "name": "staff10", "salary":""}
> {code}
> And I try to read this file and distinct.
> ・spark code
> {code}
> val s=spark.read.json("sample.json")
> s.count
> res13: Long = 10
> s.distinct.count
> res14: Long = 6   <- It should be 10
> {code}
> I know the cause of the incorrect result is the mixed types in the salary field.
> But when I try the same code in Spark 1.6 the result is 10.
> So I think it's a bug in Spark 2.x.
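>
> One possible workaround sketch (not from this thread; the schema is an 
> assumption): give the reader an explicit schema that reads salary as a string, 
> so every row parses cleanly and all 10 rows stay distinct.
> {code}
> import org.apache.spark.sql.types._
>
> // Explicit schema: salary is read as a string, so the mixed double/empty-string
> // values all parse instead of some rows degrading under schema inference.
> val schema = StructType(Seq(
>   StructField("url", StringType),
>   StructField("name", StringType),
>   StructField("salary", StringType)))
>
> val s = spark.read.schema(schema).json("sample.json")
> s.distinct.count   // expected: 10
> {code}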



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21581) Spark 2.x distinct return incorrect result

2017-07-31 Thread shengyao piao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shengyao piao resolved SPARK-21581.
---
Resolution: Not A Problem

> Spark 2.x distinct return incorrect result
> --
>
> Key: SPARK-21581
> URL: https://issues.apache.org/jira/browse/SPARK-21581
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: shengyao piao
>
> Hi all
> I'm using Spark 2.x on CDH 5.11.
> I have a JSON file as follows.
> ・sample.json
> {code}
> {"url": "http://example.hoge/staff1", "name": "staff1", "salary":600.0}
> {"url": "http://example.hoge/staff2", "name": "staff2", "salary":700}
> {"url": "http://example.hoge/staff3", "name": "staff3", "salary":800}
> {"url": "http://example.hoge/staff4", "name": "staff4", "salary":900}
> {"url": "http://example.hoge/staff5", "name": "staff5", "salary":1000.0}
> {"url": "http://example.hoge/staff6", "name": "staff6", "salary":""}
> {"url": "http://example.hoge/staff7", "name": "staff7", "salary":""}
> {"url": "http://example.hoge/staff8", "name": "staff8", "salary":""}
> {"url": "http://example.hoge/staff9", "name": "staff9", "salary":""}
> {"url": "http://example.hoge/staff10", "name": "staff10", "salary":""}
> {code}
> And I try to read this file and distinct.
> ・spark code
> {code}
> val s=spark.read.json("sample.json")
> s.count
> res13: Long = 10
> s.distinct.count
> res14: Long = 6   <- It should be 10
> {code}
> I know the cause of the incorrect result is the mixed types in the salary field.
> But when I try the same code in Spark 1.6 the result is 10.
> So I think it's a bug in Spark 2.x.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18950) Report conflicting fields when merging two StructTypes.

2017-07-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-18950.
-
   Resolution: Fixed
 Assignee: Bravo Zhang
Fix Version/s: 2.3.0

> Report conflicting fields when merging two StructTypes.
> ---
>
> Key: SPARK-18950
> URL: https://issues.apache.org/jira/browse/SPARK-18950
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Bravo Zhang
>Priority: Minor
>  Labels: starter
> Fix For: 2.3.0
>
>
> Currently, {{StructType.merge()}} only reports data types of conflicting 
> fields when merging two incompatible schemas. It would be nice to also report 
> the field names for easier debugging.
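>
> A hypothetical repro sketch (paths and values are made up) that exercises 
> {{StructType.merge()}} through Parquet schema merging; before this change the 
> error reported only the conflicting data types, not which field they belong to.
> {code}
> // Two Parquet datasets whose `value` column types conflict (int vs. string).
> spark.range(1).selectExpr("id", "CAST(1 AS INT) AS value").write.parquet("/tmp/merge/p1")
> spark.range(1).selectExpr("id", "CAST('x' AS STRING) AS value").write.parquet("/tmp/merge/p2")
>
> // Schema merging calls StructType.merge and fails on the incompatible field.
> spark.read.option("mergeSchema", "true").parquet("/tmp/merge/p1", "/tmp/merge/p2")
> {code}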



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21589) Add documents about unsupported functions in Hive UDF/UDTF/UDAF

2017-07-31 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-21589:


 Summary: Add documents about unsupported functions in Hive 
UDF/UDTF/UDAF
 Key: SPARK-21589
 URL: https://issues.apache.org/jira/browse/SPARK-21589
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 2.2.0
Reporter: Takeshi Yamamuro
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21582) DataFrame.withColumnRenamed cause huge performance overhead

2017-07-31 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108310#comment-16108310
 ] 

Liang-Chi Hsieh commented on SPARK-21582:
-

Please call the toDF API with the renamed column names instead; it should save a lot of time.
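
A sketch of that suggestion, reusing the names from the reporter's snippet: 
compute all the new column names once and rename them with a single toDF call, 
instead of folding ~900 withColumnRenamed calls, each of which builds and 
re-analyzes a new plan.
{code}
val nameSequeceExcept = Set("gid", "category_name", "merchant_id")
val df1 = spark.table("item_feature")

// Build the full list of output names in one pass ...
val newNames = df1.schema.map(_.name).map { name =>
  if (nameSequeceExcept.contains(name)) name else name + "_1"
}
// ... and rename every column in a single step.
val newdf1 = df1.toDF(newNames: _*)
{code}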

> DataFrame.withColumnRenamed cause huge performance overhead
> ---
>
> Key: SPARK-21582
> URL: https://issues.apache.org/jira/browse/SPARK-21582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: GuangFancui(ISCAS)
> Attachments: 4654.stack
>
>
> Table "item_feature" (DataFrame) has over 900 columns. 
> When I use
> {code:java}
> val nameSequeceExcept = Set("gid","category_name","merchant_id")
> val df1 = spark.table("item_feature")
> val newdf1 = df1.schema.map(_.name).filter(name => 
> !nameSequeceExcept.contains(name)).foldLeft(df1)((df1, name) => 
> df1.withColumnRenamed(name, name + "_1" ))
> {code}
> It took over 30 minutes.
> *PID* in the attached stack file is *0x126d*.
> It seems that _transform_ takes too long.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21588) SQLContext.getConf(key, null) should return null, but it throws NPE

2017-07-31 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-21588:
---

 Summary: SQLContext.getConf(key, null) should return null, but it 
throws NPE
 Key: SPARK-21588
 URL: https://issues.apache.org/jira/browse/SPARK-21588
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Burak Yavuz
Priority: Minor


SQLContext.getConf(key), for a key that is not defined in the conf and doesn't 
have a default value defined, throws a NoSuchElementException. In order to avoid 
that, I passed null as the default value, which threw an NPE instead. If the 
default is null, `getConfString` shouldn't try to parse it.
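
A minimal repro sketch of the described behavior (the key name is made up):
{code}
val key = "spark.sql.some.undefined.key"

spark.sqlContext.getConf(key)        // throws NoSuchElementException: no value, no default
spark.sqlContext.getConf(key, null)  // expected to return null, but currently throws an NPE
                                     // while handling the default inside getConfString
{code}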



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-07-31 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108303#comment-16108303
 ] 

xinzhang commented on SPARK-21067:
--

Hi [~guoxiaolongzte], could you take a look at this and share some suggestions?


> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at org.apache.spark.sql.Dataset.(Dataset.scala:185)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
> at 

[jira] [Created] (SPARK-21587) Filter pushdown for EventTime Watermark Operator

2017-07-31 Thread Jose Torres (JIRA)
Jose Torres created SPARK-21587:
---

 Summary: Filter pushdown for EventTime Watermark Operator
 Key: SPARK-21587
 URL: https://issues.apache.org/jira/browse/SPARK-21587
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Jose Torres


If I have a streaming query that sets a watermark, then a where() that pertains 
to a partition column does not prune these partitions and they will all be 
queried, greatly reducing performance for partitioned tables.
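
A sketch of the pattern described (the schema, path, and column names are 
assumptions, not from this ticket):
{code}
import spark.implicits._

val events = spark.readStream
  .schema(eventSchema)                       // hypothetical user-supplied schema
  .parquet("/data/events")                   // assumed to be partitioned by `date`
  .withWatermark("eventTime", "10 minutes")
  // The filter on the partition column sits above the EventTimeWatermark operator
  // and is not pushed through it, so no partitions are pruned.
  .where($"date" === "2017-07-31")
{code}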



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20433) Update jackson-databind to 2.6.7.1

2017-07-31 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108117#comment-16108117
 ] 

Andrew Ash commented on SPARK-20433:


Sorry about not updating the ticket description -- the 2.6.7.1 release came 
after the initial batch of patches for projects still on older versions of 
Jackson.

I've now opened PR https://github.com/apache/spark/pull/18789 where we can 
further discuss the risk of the version bump.

> Update jackson-databind to 2.6.7.1
> --
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>Priority: Minor
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.  UPDATE: now the 2.6.X line has a 
> patch as well: 2.6.7.1 as mentioned at 
> https://github.com/FasterXML/jackson-databind/issues/1599#issuecomment-315486340
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should bump Spark from 2.6.5 to 2.6.7.1 to get a patched version of this 
> library for the next Spark release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20433) Update jackson-databind to 2.6.7.1

2017-07-31 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-20433:
---
Description: 
There was a security vulnerability recently reported to the upstream 
jackson-databind project at 
https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
released.

>From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
>first fixed versions in their respective 2.X branches, and versions in the 
>2.6.X line and earlier remain vulnerable.  UPDATE: now the 2.6.X line has a 
>patch as well: 2.6.7.1 as mentioned at 
>https://github.com/FasterXML/jackson-databind/issues/1599#issuecomment-315486340

Right now Spark master branch is on 2.6.5: 
https://github.com/apache/spark/blob/master/pom.xml#L164

and Hadoop branch-2.7 is on 2.2.3: 
https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71

and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74

We should bump Spark from 2.6.5 to 2.6.7.1 to get a patched version of this 
library for the next Spark release.

  was:
There was a security vulnerability recently reported to the upstream 
jackson-databind project at 
https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
released.

>From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
>first fixed versions in their respective 2.X branches, and versions in the 
>2.6.X line and earlier remain vulnerable.

Right now Spark master branch is on 2.6.5: 
https://github.com/apache/spark/blob/master/pom.xml#L164

and Hadoop branch-2.7 is on 2.2.3: 
https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71

and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74

We should try to find a way to get on a patched version of jackson-databind 
for the Spark 2.2.0 release.


> Update jackson-databind to 2.6.7.1
> --
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>Priority: Minor
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.  UPDATE: now the 2.6.X line has a 
> patch as well: 2.6.7.1 as mentioned at 
> https://github.com/FasterXML/jackson-databind/issues/1599#issuecomment-315486340
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should bump Spark from 2.6.5 to 2.6.7.1 to get a patched version of this 
> library for the next Spark release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20433) Update jackson-databind to 2.6.7.1

2017-07-31 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash updated SPARK-20433:
---
Summary: Update jackson-databind to 2.6.7.1  (was: Update jackson-databind 
to 2.6.7)

> Update jackson-databind to 2.6.7.1
> --
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>Priority: Minor
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get on a patched version of 
> jackson-databind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-20433) Update jackson-databind to 2.6.7

2017-07-31 Thread Andrew Ash (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ash reopened SPARK-20433:


> Update jackson-databind to 2.6.7
> 
>
> Key: SPARK-20433
> URL: https://issues.apache.org/jira/browse/SPARK-20433
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Andrew Ash
>Priority: Minor
>
> There was a security vulnerability recently reported to the upstream 
> jackson-databind project at 
> https://github.com/FasterXML/jackson-databind/issues/1599 which now has a fix 
> released.
> From my reading of that, versions 2.7.9.1, 2.8.8.1, and 2.9.0.pr3 are the 
> first fixed versions in their respective 2.X branches, and versions in the 
> 2.6.X line and earlier remain vulnerable.
> Right now Spark master branch is on 2.6.5: 
> https://github.com/apache/spark/blob/master/pom.xml#L164
> and Hadoop branch-2.7 is on 2.2.3: 
> https://github.com/apache/hadoop/blob/branch-2.7/hadoop-project/pom.xml#L71
> and Hadoop branch-3.0.0-alpha2 is on 2.7.8: 
> https://github.com/apache/hadoop/blob/branch-3.0.0-alpha2/hadoop-project/pom.xml#L74
> We should try to find a way to get on a patched version of 
> jackson-databind for the Spark 2.2.0 release.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21571) Spark history server leaves incomplete or unreadable history files around forever.

2017-07-31 Thread Eric Vandenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Vandenberg updated SPARK-21571:

Description: 
We have noticed that history server logs are sometimes never cleaned up.  The 
current history server logic *ONLY* cleans up history files that are 
completed, since in general it doesn't make sense to clean up in-progress 
history files (after all, the job is presumably still running).  Note that 
in-progress history files would generally not be targeted for clean up anyway, 
assuming they regularly flush logs and the file system accurately updates the 
history log's last modified time/size; while this is likely, it is not 
guaranteed behavior.

As a consequence of the current clean up logic and a combination of unclean 
shutdowns, various file system bugs, earlier spark bugs, etc. we have 
accumulated thousands of these dead history files associated with long since 
gone jobs.

For example (with spark.history.fs.cleaner.maxAge=14d):

-rw-rw   3 xx   ooo  
14382 2016-09-13 15:40 
/user/hadoop/xx/spark/logs/qq1974_ppp-8812_11058600195_dev4384_-53982.zstandard
-rw-rw   3  ooo   
5933 2016-11-01 20:16 
/user/hadoop/xx/spark/logs/qq2016_ppp-8812_12650700673_dev5365_-65313.lz4
-rw-rw   3 yyy  ooo 
 0 2017-01-19 11:59 
/user/hadoop/xx/spark/logs/0057_326_m-57863.lz4.inprogress
-rw-rw   3 xooo 
 0 2017-01-19 14:17 
/user/hadoop/xx/spark/logs/0063_688_m-33246.lz4.inprogress
-rw-rw   3 yyy  ooo 
 0 2017-01-20 10:56 
/user/hadoop/xx/spark/logs/1030_326_m-45195.lz4.inprogress
-rw-rw   3  ooo  
11955 2017-01-20 17:55 
/user/hadoop/xx/spark/logs/1314_54_kk-64671.lz4.inprogress
-rw-rw   3  ooo  
11958 2017-01-20 17:55 
/user/hadoop/xx/spark/logs/1315_1667_kk-58968.lz4.inprogress
-rw-rw   3  ooo  
11960 2017-01-20 17:55 
/user/hadoop/xx/spark/logs/1316_54_kk-48058.lz4.inprogress

Based on the current logic, clean up candidates are skipped in several cases:
1. if a file has 0 bytes, it is completely ignored
2. if a file is in progress and not parseable (the appID can't be extracted), it 
is completely ignored
3. if a file is complete but not parseable (the appID can't be extracted), it is 
completely ignored.

To address this edge case and provide a way to clean out orphaned history files 
I propose a new configuration option:

spark.history.fs.cleaner.aggressive={true, false}, default is false.

If true, the history server will more aggressively garbage collect history 
files in cases (1), (2) and (3).  Since the default is false, existing 
customers won't be affected unless they explicitly opt-in.  If customers have 
similar leaking garbage over time they have the option of aggressively cleaning 
up in such cases.  Also note that aggressive clean up may not be appropriate 
for some customers if they have long running jobs that exceed the 
cleaner.maxAge time frame and/or have buggy file systems.

Would like to get feedback on if this seems like a reasonable solution.
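
For concreteness, how the proposal would sit next to the existing cleaner 
settings (only the aggressive flag is new; it is the proposal in this ticket, 
not an existing Spark config):
{code}
# Existing history server cleaner settings
spark.history.fs.cleaner.enabled     true
spark.history.fs.cleaner.maxAge      14d
# Proposed here: also garbage collect 0-byte and unparseable history files
spark.history.fs.cleaner.aggressive  true
{code}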


  was:
We have noticed that history server logs are sometimes never cleaned up.  The 
current history server logic *ONLY* cleans up history files that are 
completed, since in general it doesn't make sense to clean up in-progress 
history files (after all, the job is presumably still running).  Note that 
in-progress history files would generally not be targeted for clean up anyway, 
assuming they regularly flush logs and the file system accurately updates the 
history log's last modified time/size; while this is likely, it is not 
guaranteed behavior.

As a consequence of the current clean up logic and a combination of unclean 
shutdowns, various file system bugs, earlier spark bugs, etc. we have 
accumulated thousands of these dead history files associated with long since 
gone jobs.

For example (with spark.history.fs.cleaner.maxAge=14d):

-rw-rw   3 xx   ooo  
14382 2016-09-13 15:40 
/user/hadoop/xx/spark/logs/qq1974_ppp-8812_11058600195_dev4384_-53982.zstandard
-rw-rw   3  ooo   
5933 2016-11-01 20:16 

[jira] [Updated] (SPARK-21586) Read CSV (SQL Context) Doesnt ignore delimiters within special types of quotes, other special characters

2017-07-31 Thread Balasubramaniam Srinivasan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Balasubramaniam Srinivasan updated SPARK-21586:
---
Summary: Read CSV (SQL Context) Doesnt ignore delimiters within special 
types of quotes, other special characters  (was: Read CSV (SQL Context) Doesnt 
ignore delimiters within special types of quotes)

> Read CSV (SQL Context) Doesnt ignore delimiters within special types of 
> quotes, other special characters
> 
>
> Key: SPARK-21586
> URL: https://issues.apache.org/jira/browse/SPARK-21586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Balasubramaniam Srinivasan
>
> The read CSV command doesn't ignore commas in specific cases, for example when 
> a dictionary is an entry in a column. Another example is embedded double 
> quotes: if a column entry is "Here we go"""This is causing a problem, but is 
> handled well in pandas""" There!", pandas treats it as a single entry, but the 
> delimiter handling doesn't work well in Spark DataFrames.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21586) Read CSV (SQL Context) Doesnt ignore delimiters within special types of quotes

2017-07-31 Thread Balasubramaniam Srinivasan (JIRA)
Balasubramaniam Srinivasan created SPARK-21586:
--

 Summary: Read CSV (SQL Context) Doesnt ignore delimiters within 
special types of quotes
 Key: SPARK-21586
 URL: https://issues.apache.org/jira/browse/SPARK-21586
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Balasubramaniam Srinivasan


The read CSV command doesn't ignore commas in specific cases, for example when a 
dictionary is an entry in a column. Another example is embedded double quotes: 
if a column entry is "Here we go"""This is causing a problem, but is handled 
well in pandas""" There!", pandas treats it as a single entry, but the 
delimiter handling doesn't work well in Spark DataFrames.
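
A possible workaround sketch (options and path are assumptions, not verified 
against the reporter's file): tell the CSV reader that embedded quotes are 
escaped by doubling them, rather than by the default backslash.
{code}
val df = spark.read
  .option("header", "true")
  .option("quote", "\"")
  .option("escape", "\"")        // treat "" inside a quoted field as an escaped quote
  .option("multiLine", "true")   // only needed if quoted fields can contain newlines
  .csv("/path/to/file.csv")
{code}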



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21565) aggregate query fails with watermark on eventTime but works with watermark on timestamp column generated by current_timestamp

2017-07-31 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108006#comment-16108006
 ] 

Andrew Ray commented on SPARK-21565:


No, nothing like the limitations of micro-batches. The window can be made 
trivially small if you want only one timestamp per group, for example 
{{window(eventTime, "1 microsecond")}}.

And yes, this should probably be checked in analysis if this is the intended 
limitation.
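
A sketch of that workaround using the reporter's names (assumes the usual 
functions import):
{code}
import org.apache.spark.sql.functions._

// Group on a trivially small window over eventTime instead of the raw column,
// so the watermark can be applied to the aggregation.
val groupEvents = dfNewEvents2
  .groupBy(col("symbol"), window(col("eventTime"), "1 microsecond"))
  .agg(count(col("price")).as("count1"))
{code}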

> aggregate query fails with watermark on eventTime but works with watermark on 
> timestamp column generated by current_timestamp
> -
>
> Key: SPARK-21565
> URL: https://issues.apache.org/jira/browse/SPARK-21565
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Amit Assudani
>
> *Short Description:*
> Aggregation query fails with eventTime as the watermark column, but works with a 
> newTimeStamp column generated by running SQL with current_timestamp.
> *Exception:*
> Caused by: java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:204)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:172)
>   at 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)
>   at 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> *Code to replicate:*
> package test
> import java.nio.file.{Files, Path, Paths}
> import java.text.SimpleDateFormat
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.{SparkSession}
> import scala.collection.JavaConverters._
> object Test1 {
>   def main(args: Array[String]) {
> val sparkSession = SparkSession
>   .builder()
>   .master("local[*]")
>   .appName("Spark SQL basic example")
>   .config("spark.some.config.option", "some-value")
>   .getOrCreate()
> val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
> val checkpointPath = "target/cp1"
> val newEventsPath = Paths.get("target/newEvents/").toAbsolutePath
> delete(newEventsPath)
> delete(Paths.get(checkpointPath).toAbsolutePath)
> Files.createDirectories(newEventsPath)
> val dfNewEvents= newEvents(sparkSession)
> dfNewEvents.createOrReplaceTempView("dfNewEvents")
> //The below works - Start
> //val dfNewEvents2 = sparkSession.sql("select *,current_timestamp as 
> newTimeStamp from dfNewEvents ").withWatermark("newTimeStamp","2 seconds")
> //dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
> //val groupEvents = sparkSession.sql("select symbol,newTimeStamp, 
> count(price) as count1 from dfNewEvents2 group by symbol,newTimeStamp")
> // End
> 
> 
> //The below doesn't work - Start
> val dfNewEvents2 = sparkSession.sql("select * from dfNewEvents 
> ").withWatermark("eventTime","2 seconds")
>  dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
>   val groupEvents = sparkSession.sql("select symbol,eventTime, 
> count(price) as count1 from dfNewEvents2 group by symbol,eventTime")
> // - End
> 
> 
> val query1 = groupEvents.writeStream
>   .outputMode("append")
> .format("console")
>   .option("checkpointLocation", checkpointPath)
>   .start("./myop")
> val newEventFile1=newEventsPath.resolve("eventNew1.json")
> Files.write(newEventFile1, List(
>   """{"symbol": 
> "GOOG","price":100,"eventTime":"2017-07-25T16:00:00.000-04:00"}""",
>   """{"symbol": 
> "GOOG","price":200,"eventTime":"2017-07-25T16:00:00.000-04:00"}"""
> ).toIterable.asJava)
> query1.processAllAvailable()
> sparkSession.streams.awaitAnyTermination(1)
>   }
>   private def newEvents(sparkSession: SparkSession) = {
> val newEvents = Paths.get("target/newEvents/").toAbsolutePath
> 

[jira] [Commented] (SPARK-21565) aggregate query fails with watermark on eventTime but works with watermark on timestamp column generated by current_timestamp

2017-07-31 Thread Amit Assudani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107968#comment-16107968
 ] 

Amit Assudani commented on SPARK-21565:
---

I am not sure -- does this mean we cannot use a watermark without a window, i.e. 
at the micro-batch level?

If that is the case, it should not work with current_timestamp either, and there 
should be validation to disallow it without a window.

> aggregate query fails with watermark on eventTime but works with watermark on 
> timestamp column generated by current_timestamp
> -
>
> Key: SPARK-21565
> URL: https://issues.apache.org/jira/browse/SPARK-21565
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Amit Assudani
>
> *Short Description:*
> Aggregation query fails with eventTime as the watermark column, but works with a 
> newTimeStamp column generated by running SQL with current_timestamp.
> *Exception:*
> Caused by: java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:204)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:172)
>   at 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)
>   at 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> *Code to replicate:*
> package test
> import java.nio.file.{Files, Path, Paths}
> import java.text.SimpleDateFormat
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.{SparkSession}
> import scala.collection.JavaConverters._
> object Test1 {
>   def main(args: Array[String]) {
> val sparkSession = SparkSession
>   .builder()
>   .master("local[*]")
>   .appName("Spark SQL basic example")
>   .config("spark.some.config.option", "some-value")
>   .getOrCreate()
> val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
> val checkpointPath = "target/cp1"
> val newEventsPath = Paths.get("target/newEvents/").toAbsolutePath
> delete(newEventsPath)
> delete(Paths.get(checkpointPath).toAbsolutePath)
> Files.createDirectories(newEventsPath)
> val dfNewEvents= newEvents(sparkSession)
> dfNewEvents.createOrReplaceTempView("dfNewEvents")
> //The below works - Start
> //val dfNewEvents2 = sparkSession.sql("select *,current_timestamp as 
> newTimeStamp from dfNewEvents ").withWatermark("newTimeStamp","2 seconds")
> //dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
> //val groupEvents = sparkSession.sql("select symbol,newTimeStamp, 
> count(price) as count1 from dfNewEvents2 group by symbol,newTimeStamp")
> // End
> 
> 
> //The below doesn't work - Start
> val dfNewEvents2 = sparkSession.sql("select * from dfNewEvents 
> ").withWatermark("eventTime","2 seconds")
>  dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
>   val groupEvents = sparkSession.sql("select symbol,eventTime, 
> count(price) as count1 from dfNewEvents2 group by symbol,eventTime")
> // - End
> 
> 
> val query1 = groupEvents.writeStream
>   .outputMode("append")
> .format("console")
>   .option("checkpointLocation", checkpointPath)
>   .start("./myop")
> val newEventFile1=newEventsPath.resolve("eventNew1.json")
> Files.write(newEventFile1, List(
>   """{"symbol": 
> "GOOG","price":100,"eventTime":"2017-07-25T16:00:00.000-04:00"}""",
>   """{"symbol": 
> "GOOG","price":200,"eventTime":"2017-07-25T16:00:00.000-04:00"}"""
> ).toIterable.asJava)
> query1.processAllAvailable()
> sparkSession.streams.awaitAnyTermination(1)
>   }
>   private def newEvents(sparkSession: SparkSession) = {
> val newEvents = Paths.get("target/newEvents/").toAbsolutePath
> delete(newEvents)
> 

[jira] [Commented] (SPARK-21585) Application Master marking application status as Failed for Client Mode

2017-07-31 Thread Parth Gandhi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107936#comment-16107936
 ] 

Parth Gandhi commented on SPARK-21585:
--

Currently working on the fix, will file a pull request as soon as it is done.

> Application Master marking application status as Failed for Client Mode
> ---
>
> Key: SPARK-21585
> URL: https://issues.apache.org/jira/browse/SPARK-21585
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.0
>Reporter: Parth Gandhi
>Priority: Minor
>
> Refer https://issues.apache.org/jira/browse/SPARK-21541 for more clarity.
> The fix deployed for SPARK-21541 resulted in the Application Master setting 
> the final status of a Spark application as Failed in client mode, because the 
> flag 'registered' was not being set to true for client mode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21585) Application Master marking application status as Failed for Client Mode

2017-07-31 Thread Parth Gandhi (JIRA)
Parth Gandhi created SPARK-21585:


 Summary: Application Master marking application status as Failed 
for Client Mode
 Key: SPARK-21585
 URL: https://issues.apache.org/jira/browse/SPARK-21585
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.0
Reporter: Parth Gandhi
Priority: Minor


Refer https://issues.apache.org/jira/browse/SPARK-21541 for more clarity.

The fix deployed for SPARK-21541 resulted in the Application Master setting the 
final status of a Spark application as Failed in client mode, because the flag 
'registered' was not being set to true for client mode.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21565) aggregate query fails with watermark on eventTime but works with watermark on timestamp column generated by current_timestamp

2017-07-31 Thread Andrew Ray (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107933#comment-16107933
 ] 

Andrew Ray commented on SPARK-21565:


I believe you need to use a window to group by your event time.

> aggregate query fails with watermark on eventTime but works with watermark on 
> timestamp column generated by current_timestamp
> -
>
> Key: SPARK-21565
> URL: https://issues.apache.org/jira/browse/SPARK-21565
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Amit Assudani
>
> *Short Description:*
> Aggregation query fails with eventTime as the watermark column, but works with a 
> newTimeStamp column generated by running SQL with current_timestamp.
> *Exception:*
> Caused by: java.util.NoSuchElementException: None.get
>   at scala.None$.get(Option.scala:347)
>   at scala.None$.get(Option.scala:345)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:204)
>   at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3.apply(statefulOperators.scala:172)
>   at 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:70)
>   at 
> org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:65)
>   at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:64)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> *Code to replicate:*
> package test
> import java.nio.file.{Files, Path, Paths}
> import java.text.SimpleDateFormat
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.{SparkSession}
> import scala.collection.JavaConverters._
> object Test1 {
>   def main(args: Array[String]) {
> val sparkSession = SparkSession
>   .builder()
>   .master("local[*]")
>   .appName("Spark SQL basic example")
>   .config("spark.some.config.option", "some-value")
>   .getOrCreate()
> val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
> val checkpointPath = "target/cp1"
> val newEventsPath = Paths.get("target/newEvents/").toAbsolutePath
> delete(newEventsPath)
> delete(Paths.get(checkpointPath).toAbsolutePath)
> Files.createDirectories(newEventsPath)
> val dfNewEvents= newEvents(sparkSession)
> dfNewEvents.createOrReplaceTempView("dfNewEvents")
> //The below works - Start
> //val dfNewEvents2 = sparkSession.sql("select *,current_timestamp as 
> newTimeStamp from dfNewEvents ").withWatermark("newTimeStamp","2 seconds")
> //dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
> //val groupEvents = sparkSession.sql("select symbol,newTimeStamp, 
> count(price) as count1 from dfNewEvents2 group by symbol,newTimeStamp")
> // End
> 
> 
> //The below doesn't work - Start
> val dfNewEvents2 = sparkSession.sql("select * from dfNewEvents 
> ").withWatermark("eventTime","2 seconds")
>  dfNewEvents2.createOrReplaceTempView("dfNewEvents2")
>   val groupEvents = sparkSession.sql("select symbol,eventTime, 
> count(price) as count1 from dfNewEvents2 group by symbol,eventTime")
> // - End
> 
> 
> val query1 = groupEvents.writeStream
>   .outputMode("append")
> .format("console")
>   .option("checkpointLocation", checkpointPath)
>   .start("./myop")
> val newEventFile1=newEventsPath.resolve("eventNew1.json")
> Files.write(newEventFile1, List(
>   """{"symbol": 
> "GOOG","price":100,"eventTime":"2017-07-25T16:00:00.000-04:00"}""",
>   """{"symbol": 
> "GOOG","price":200,"eventTime":"2017-07-25T16:00:00.000-04:00"}"""
> ).toIterable.asJava)
> query1.processAllAvailable()
> sparkSession.streams.awaitAnyTermination(1)
>   }
>   private def newEvents(sparkSession: SparkSession) = {
> val newEvents = Paths.get("target/newEvents/").toAbsolutePath
> delete(newEvents)
> Files.createDirectories(newEvents)
> val dfNewEvents = 
> sparkSession.readStream.schema(eventsSchema).json(newEvents.toString)//.withWatermark("eventTime","2
>  seconds")
> 

[jira] [Updated] (SPARK-21584) Update R method for summary to call new implementation

2017-07-31 Thread Andrew Ray (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Ray updated SPARK-21584:
---
Component/s: SQL

> Update R method for summary to call new implementation
> --
>
> Key: SPARK-21584
> URL: https://issues.apache.org/jira/browse/SPARK-21584
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Affects Versions: 2.3.0
>Reporter: Andrew Ray
>
> Follow up to SPARK-21100



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21584) Update R method for summary to call new implementation

2017-07-31 Thread Andrew Ray (JIRA)
Andrew Ray created SPARK-21584:
--

 Summary: Update R method for summary to call new implementation
 Key: SPARK-21584
 URL: https://issues.apache.org/jira/browse/SPARK-21584
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.3.0
Reporter: Andrew Ray


Follow up to SPARK-21100



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8288) ScalaReflection should also try apply methods defined in companion objects when inferring schema from a Product type

2017-07-31 Thread Drew Robb (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107798#comment-16107798
 ] 

Drew Robb commented on SPARK-8288:
--

I opened a PR for this issue, not sure why the bot didn't pick it up? 
https://github.com/apache/spark/pull/18766

> ScalaReflection should also try apply methods defined in companion objects 
> when inferring schema from a Product type
> 
>
> Key: SPARK-8288
> URL: https://issues.apache.org/jira/browse/SPARK-8288
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>
> This ticket is derived from PARQUET-293 (which actually describes a Spark SQL 
> issue).
> My comment on that issue quoted below:
> {quote}
> ...  The reason of this exception is that, the Scala code Scrooge generates 
> is actually a trait extending {{Product}}:
> {code}
> trait Junk
>   extends ThriftStruct
>   with scala.Product2[Long, String]
>   with java.io.Serializable
> {code}
> while Spark expects a case class, something like:
> {code}
> case class Junk(junkID: Long, junkString: String)
> {code}
> The key difference here is that the latter case class version has a 
> constructor whose arguments can be transformed into fields of the DataFrame 
> schema.  The exception was thrown because Spark can't find such a constructor 
> from trait {{Junk}}.
> {quote}
> We can make {{ScalaReflection}} try {{apply}} methods in companion objects, 
> so that trait types generated by Scrooge can also be used for Spark SQL 
> schema inference.
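>
> A hypothetical, simplified sketch (not the actual Scrooge output) of the shape 
> involved and the companion {{apply}} the proposal would look for:
> {code}
> // A trait extending Product rather than a case class; ScalaReflection cannot
> // find a usable constructor here today.
> trait Junk extends Product2[Long, String] with Serializable {
>   def junkID: Long
>   def junkString: String
>   def _1: Long = junkID
>   def _2: String = junkString
>   def canEqual(that: Any): Boolean = that.isInstanceOf[Junk]
> }
>
> object Junk {
>   // The proposal: fall back to companion apply methods like this one when
>   // inferring a schema for a Product type.
>   def apply(junkID: Long, junkString: String): Junk = {
>     val (id, str) = (junkID, junkString)
>     new Junk { def junkID: Long = id; def junkString: String = str }
>   }
> }
> {code}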



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21274) Implement EXCEPT ALL and INTERSECT ALL

2017-07-31 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107749#comment-16107749
 ] 

Ruslan Dautkhanov commented on SPARK-21274:
---

[~viirya], you're right. I've checked now on both PostgreSQL and Oracle. Sorry 
for the confusion.

So the more generic rule is, [from Oracle 
documentation|https://docs.oracle.com/database/121/SQLRF/operators006.htm#sthref913]:
 
{quote}
For example, if a particular value occurs *m* times in nested_table1 and *n* 
times in nested_table2, then the result would contain the element *min(m,n)* 
times. ALL is the default.
{quote}

It's interesting that Oracle doesn't support "intersect all" on simple table 
sets, but only on nested table sets (through "multisets"):

{code}
CREATE TYPE sets_test_typ AS object ( 
num number );

CREATE TYPE sets_test_tab_typ AS
  TABLE OF sets_test_typ;

with tab1 as (select 1 as z from dual 
union all select 2 from dual
union all select 2 from dual
union all select 2 from dual
union all select 2 from dual
)
   , tab2 as (
  select 1 as z from dual 
union all select 2 from dual
union all select 2 from dual
union all select 2 from dual
)
SELECT * 
FROM table(
cast(multiset(select z from tab1) as sets_test_tab_typ)
multiset intersect ALL
cast(multiset(select z from tab2) as sets_test_tab_typ)
  )
;
{code}
So Oracle has returned "2" three times = min(3,4).

Same test case in PostgreSQL:
{code}
scm=> with tab1 as (select 1 as z
scm(> union all select 2
scm(> union all select 2
scm(> union all select 2
scm(> union all select 2
scm(> )
scm->, tab2 as (
scm(>   select 1 as z
scm(> union all select 2
scm(> union all select 2
scm(> union all select 2
scm(> )
scm-> SELECT z FROM tab1
scm->  INTERSECT all
scm-> SELECT z FROM tab2
scm-> ;
 z
---
 1
 2
 2
 2
(4 rows)

{code}

The bottom line is that you're right, the above approach wouldn't work, as you 
noticed.

I still believe, though, that it might be easier to implement EXCEPT ALL / 
INTERSECT ALL through a query rewrite.
For example: run a group-by on both sets over the target list of columns, do a 
full outer join, take the min of the two aggregate counts, and inject rows into 
the final result set according to that min(m,n) count. A rough sketch of that 
idea follows.
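
A minimal DataFrame-side sketch of that counting idea (the helper name and 
implementation are mine, not from this thread, and it ignores NULL-equality 
subtleties):
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def intersectAllByCounts(t1: DataFrame, t2: DataFrame, cols: Seq[String]): DataFrame = {
  // Small UDF that replicates a marker k times, used to emit each row min(m, n) times.
  val replicate = udf((k: Int) => Seq.fill(k)(1))
  val m = t1.groupBy(cols.map(col): _*).agg(count(lit(1)).cast("int").as("m"))
  val n = t2.groupBy(cols.map(col): _*).agg(count(lit(1)).cast("int").as("n"))
  m.join(n, cols)                                     // keep values present in both inputs
    .withColumn("k", least(col("m"), col("n")))       // each value should occur min(m, n) times
    .withColumn("dup", explode(replicate(col("k"))))  // emit the row k times
    .select(cols.map(col): _*)
}
{code}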


> Implement EXCEPT ALL and INTERSECT ALL
> --
>
> Key: SPARK-21274
> URL: https://issues.apache.org/jira/browse/SPARK-21274
> Project: Spark
>  Issue Type: New Feature
>  Components: Optimizer, SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Ruslan Dautkhanov
>  Labels: set, sql
>
> 1) *EXCEPT ALL* / MINUS ALL :
> {code}
> SELECT a,b,c FROM tab1
>  EXCEPT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following outer join:
> {code}
> SELECT a,b,c
> FROMtab1 t1
>  LEFT OUTER JOIN 
> tab2 t2
>  ON (
> (t1.a, t1.b, t1.c) = (t2.a, t2.b, t2.c)
>  )
> WHERE
> COALESCE(t2.a, t2.b, t2.c) IS NULL
> {code}
> (register as a temp.view this second query under "*t1_except_t2_df*" name 
> that can be also used to find INTERSECT ALL below):
> 2) *INTERSECT ALL*:
> {code}
> SELECT a,b,c FROM tab1
>  INTERSECT ALL 
> SELECT a,b,c FROM tab2
> {code}
> can be rewritten as following anti-join using t1_except_t2_df we defined 
> above:
> {code}
> SELECT a,b,c
> FROMtab1 t1
> WHERE 
>NOT EXISTS
>(SELECT 1
> FROMt1_except_t2_df e
> WHERE (t1.a, t1.b, t1.c) = (e.a, e.b, e.c)
>)
> {code}
> So the suggestion is just to use above query rewrites to implement both 
> EXCEPT ALL and INTERSECT ALL sql set operations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins

2017-07-31 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107733#comment-16107733
 ] 

shane knapp commented on SPARK-21573:
-

that's odd that it's only failing on -04, as it's configured identically to all 
the other workers, and the anaconda installation is working fine (i just 
confirmed).

anaconda's bin directory must be on the PATH, otherwise it'll pick up the 
default/system python (2.6).  i set it in the worker's config, but when you 
kick off a bash shell (to launch the python tests), it doesn't carry PATH over 
correctly.  i'm pretty certain that my PR will address this issue.


> Tests failing with run-tests.py SyntaxError occasionally in Jenkins
> ---
>
> Key: SPARK-21573
> URL: https://issues.apache.org/jira/browse/SPARK-21573
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It looks like the default {{python}} on the path in a few places, such as 
> {{./dev/run-tests}}, is Python 2.6 in Jenkins, and it fails to execute 
> {{run-tests.py}}:
> {code}
> python2.6 run-tests.py
>   File "run-tests.py", line 124
> {m: set(m.dependencies).intersection(modules_to_test) for m in 
> modules_to_test}, sort=True)
> ^
> SyntaxError: invalid syntax
> {code}
> It looks like there are quite a few places to fix to support Python 2.6 in 
> {{run-tests.py}} and related Python scripts.
> We might instead just try to set Python 2.7 in the few other scripts running 
> this, if available.
> Please also see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins

2017-07-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107713#comment-16107713
 ] 

Sean Owen commented on SPARK-21573:
---

[~shaneknapp] does it matter that this and one other script are the only ones 
referring to python2? I wonder if that's picking up something besides the 
Python3 that I assume 'py3k' is providing.

I am pretty sure it's specific to a machine or machines, maybe? For example 
this failed 3 times in a row on  amp-jenkins-worker-04

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.1-test-sbt-hadoop-2.6/

> Tests failing with run-tests.py SyntaxError occasionally in Jenkins
> ---
>
> Key: SPARK-21573
> URL: https://issues.apache.org/jira/browse/SPARK-21573
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It looks like the default {{python}} on the path in a few places, such as 
> {{./dev/run-tests}}, is Python 2.6 in Jenkins, and it fails to execute 
> {{run-tests.py}}:
> {code}
> python2.6 run-tests.py
>   File "run-tests.py", line 124
> {m: set(m.dependencies).intersection(modules_to_test) for m in 
> modules_to_test}, sort=True)
> ^
> SyntaxError: invalid syntax
> {code}
> It looks there are quite some places to fix to support Python 2.6 in 
> {{run-tests.py}} and related Python scripts.
> We might just try to set Python 2.7 in few other scripts running this if 
> available.
> Please also see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins

2017-07-31 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107706#comment-16107706
 ] 

shane knapp commented on SPARK-21573:
-

https://github.com/databricks/spark-jenkins-configurations/pull/39

> Tests failing with run-tests.py SyntaxError occasionally in Jenkins
> ---
>
> Key: SPARK-21573
> URL: https://issues.apache.org/jira/browse/SPARK-21573
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It looks default {{python}} in the path at few places such as 
> {{./dev/run-tests}} use Python 2.6 in Jenkins and it fails to execute 
> {{run-tests.py}}:
> {code}
> python2.6 run-tests.py
>   File "run-tests.py", line 124
> {m: set(m.dependencies).intersection(modules_to_test) for m in 
> modules_to_test}, sort=True)
> ^
> SyntaxError: invalid syntax
> {code}
> It looks there are quite some places to fix to support Python 2.6 in 
> {{run-tests.py}} and related Python scripts.
> We might just try to set Python 2.7 in few other scripts running this if 
> available.
> Please also see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21573) Tests failing with run-tests.py SyntaxError occasionally in Jenkins

2017-07-31 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107683#comment-16107683
 ] 

shane knapp commented on SPARK-21573:
-

(sorry, was off the grid for the past week)

we've been off of py2.6 for a long time.  our current python installation is 
managed by anaconda.

this behavior popped up w/the ADAM builds a couple of weeks ago.  the fix there 
was to put the anaconda bin dir into the PATH (/home/anaconda/bin/) in the 
builds, and be sure to source activate properly before tests are run (the 
environment is 'py3k').  

i really don't know why this is randomly happening...  if it failed 
consistently it'd be easier to track down.

anyways, i'll make a PR for [~joshrosen] for the jenkins build configs and add 
anaconda to the PATH.

> Tests failing with run-tests.py SyntaxError occasionally in Jenkins
> ---
>
> Key: SPARK-21573
> URL: https://issues.apache.org/jira/browse/SPARK-21573
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> It looks default {{python}} in the path at few places such as 
> {{./dev/run-tests}} use Python 2.6 in Jenkins and it fails to execute 
> {{run-tests.py}}:
> {code}
> python2.6 run-tests.py
>   File "run-tests.py", line 124
> {m: set(m.dependencies).intersection(modules_to_test) for m in 
> modules_to_test}, sort=True)
> ^
> SyntaxError: invalid syntax
> {code}
> It looks there are quite some places to fix to support Python 2.6 in 
> {{run-tests.py}} and related Python scripts.
> We might just try to set Python 2.7 in few other scripts running this if 
> available.
> Please also see 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Tests-failing-with-run-tests-py-SyntaxError-td22030.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21578) Add JavaSparkContextSuite

2017-07-31 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21578:
--
Description: 
Due to SI-8479, [SPARK-1093|https://issues.apache.org/jira/browse/SPARK-21578] 
introduced redundant [SparkContext 
constructors|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L148-L181].
 However, [SI-8479|https://issues.scala-lang.org/browse/SI-8479] is already 
fixed in Scala 2.10.5 and Scala 2.11.1. 

The real reason to provide these constructors is that Java code can access 
`SparkContext` directly; this is Scala behavior (SI-4278). So, this PR adds an 
explicit test suite, `JavaSparkContextSuite`, to prevent future regressions, and 
fixes the outdated comment, too.

  was:
Due to SI-8479, 
[SPARK-1093|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L148-L181]
 introduced redundant constructors.

Since [SI-8479|https://issues.scala-lang.org/browse/SI-8479] is fixed in Scala 
2.10.5 and Scala 2.11.1. We can remove them safely.


> Add JavaSparkContextSuite
> -
>
> Key: SPARK-21578
> URL: https://issues.apache.org/jira/browse/SPARK-21578
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Due to SI-8479, 
> [SPARK-1093|https://issues.apache.org/jira/browse/SPARK-21578] introduced 
> redundant [SparkContext 
> constructors|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L148-L181].
>  However, [SI-8479|https://issues.scala-lang.org/browse/SI-8479] is already 
> fixed in Scala 2.10.5 and Scala 2.11.1. 
> The real reason to provide this constructor is that Java code can access 
> `SparkContext` directly. It's Scala behavior, SI-4278. So, this PR adds an 
> explicit testsuite, `JavaSparkContextSuite`  to prevent future regression, 
> and fixes the outdate comment, too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21578) Add JavaSparkContextSuite

2017-07-31 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21578:
--
Summary: Add JavaSparkContextSuite  (was: Consolidate redundant 
SparkContext constructors due to SI-8479)

> Add JavaSparkContextSuite
> -
>
> Key: SPARK-21578
> URL: https://issues.apache.org/jira/browse/SPARK-21578
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Due to SI-8479, 
> [SPARK-1093|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L148-L181]
>  introduced redundant constructors.
> Since [SI-8479|https://issues.scala-lang.org/browse/SI-8479] is fixed in 
> Scala 2.10.5 and Scala 2.11.1. We can remove them safely.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration

2017-07-31 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107652#comment-16107652
 ] 

Bryan Cutler commented on SPARK-21583:
--

I have this implemented already, will submit a PR soon.

> Create a ColumnarBatch with ArrowColumnVectors for row based iteration
> --
>
> Key: SPARK-21583
> URL: https://issues.apache.org/jira/browse/SPARK-21583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data.  
> It would be useful to be able to create a {{ColumnarBatch}} to allow row 
> based iteration over multiple {{ArrowColumnVectors}}.  This would avoid extra 
> copying to translate column elements into rows and be more efficient memory 
> usage while increasing performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration

2017-07-31 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21583:
-
Description: The existing {{ArrowColumnVector}} creates a read-only vector 
of Arrow data.  It would be useful to be able to create a {{ColumnarBatch}} to 
allow row based iteration over multiple {{ArrowColumnVectors}}.  This would 
avoid extra copying to translate column elements into rows and be more 
efficient memory usage while increasing performance.  (was: The existing 
{{ArrowColumnVector}} creates a read-only vector of Arrow data.  It would be 
useful to be able to create a {{ColumnarBatch}} to allow row based iteration 
over multiple {{ArrowColumnVector}} s.  This would avoid extra copying to 
translate column elements into rows and be more efficient memory usage while 
increasing performance.)

> Create a ColumnarBatch with ArrowColumnVectors for row based iteration
> --
>
> Key: SPARK-21583
> URL: https://issues.apache.org/jira/browse/SPARK-21583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data.  
> It would be useful to be able to create a {{ColumnarBatch}} to allow row 
> based iteration over multiple {{ArrowColumnVectors}}.  This would avoid extra 
> copying to translate column elements into rows and be more efficient memory 
> usage while increasing performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration

2017-07-31 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21583:
-
Description: The existing {{ArrowColumnVector}} creates a read-only vector 
of Arrow data.  It would be useful to be able to create a {{ColumnarBatch}} to 
allow row based iteration over multiple {{ArrowColumnVector}} s.  This would 
avoid extra copying to translate column elements into rows and be more 
efficient memory usage while increasing performance.  (was: The existing 
{{ArrowColumnVector}} creates a read-only vector of Arrow data.  It would be 
useful to be able to create a {{ColumnarBatch}} to allow row based iteration 
over multiple {{ArrowColumnVector}}s.  This would avoid extra copying to 
translate column elements into rows and be more efficient memory usage while 
increasing performance.)

> Create a ColumnarBatch with ArrowColumnVectors for row based iteration
> --
>
> Key: SPARK-21583
> URL: https://issues.apache.org/jira/browse/SPARK-21583
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data.  
> It would be useful to be able to create a {{ColumnarBatch}} to allow row 
> based iteration over multiple {{ArrowColumnVector}} s.  This would avoid 
> extra copying to translate column elements into rows and be more efficient 
> memory usage while increasing performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21583) Create a ColumnarBatch with ArrowColumnVectors for row based iteration

2017-07-31 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-21583:


 Summary: Create a ColumnarBatch with ArrowColumnVectors for row 
based iteration
 Key: SPARK-21583
 URL: https://issues.apache.org/jira/browse/SPARK-21583
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Bryan Cutler


The existing {{ArrowColumnVector}} creates a read-only vector of Arrow data.  
It would be useful to be able to create a {{ColumnarBatch}} to allow row based 
iteration over multiple {{ArrowColumnVector}}s.  This would avoid extra copying 
to translate column elements into rows and be more efficient memory usage while 
increasing performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21579) dropTempView has a critical BUG

2017-07-31 Thread ant_nebula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ant_nebula updated SPARK-21579:
---
Description: 
When I dropTempView dwd_table1 only, the dependent table dwd_table2 also disappears 
from http://127.0.0.1:4040/storage/. 
This affects versions 2.1.1 and 2.2.0; 2.1.0 does not have this problem.


{code:java}
val spark = 
SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), 
Row("p5", 40), Row("p6", 15))
val schema = new StructType().add(StructField("name", 
StringType)).add(StructField("age", IntegerType))

val rowRDD = spark.sparkContext.parallelize(rows, 3)
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("ods_table")
spark.sql("cache table ods_table")

spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
spark.sql("cache table dwd_table2 as select * from dwd_table1 where name='p1'")
spark.catalog.dropTempView("dwd_table1")
//spark.catalog.dropTempView("ods_table")

spark.sql("select * from dwd_table2").show()
{code}

It will keep ods_table1 in memory even though it is no longer used. This wastes 
memory, especially when my dependency graph is much more complex.
!screenshot-1.png!

  was:
when I dropTempView dwd_table1 only, sub table dwd_table2 also disappear from 
http://127.0.0.1:4040/storage/. 
It affect version 2.1.1 and 2.2.0, 2.1.0 is ok for this problem.


{code:java}
val spark = 
SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), 
Row("p5", 40), Row("p6", 15))
val schema = new StructType().add(StructField("name", 
StringType)).add(StructField("age", IntegerType))

val rowRDD = spark.sparkContext.parallelize(rows, 3)
val df = spark.createDataFrame(rowRDD, schema)
df.createOrReplaceTempView("ods_table")
spark.sql("cache table ods_table")

spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
spark.sql("cache table dwd_table2 as select * from dwd_table1 where name='p1'")
spark.catalog.dropTempView("dwd_table1")
//spark.catalog.dropTempView("ods_table")

spark.sql("select * from dwd_table2").show()
{code}



> dropTempView has a critical BUG
> ---
>
> Key: SPARK-21579
> URL: https://issues.apache.org/jira/browse/SPARK-21579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: ant_nebula
>Priority: Critical
> Attachments: screenshot-1.png
>
>
> when I dropTempView dwd_table1 only, sub table dwd_table2 also disappear from 
> http://127.0.0.1:4040/storage/. 
> It affect version 2.1.1 and 2.2.0, 2.1.0 is ok for this problem.
> {code:java}
> val spark = 
> SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
> val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), 
> Row("p5", 40), Row("p6", 15))
> val schema = new StructType().add(StructField("name", 
> StringType)).add(StructField("age", IntegerType))
> val rowRDD = spark.sparkContext.parallelize(rows, 3)
> val df = spark.createDataFrame(rowRDD, schema)
> df.createOrReplaceTempView("ods_table")
> spark.sql("cache table ods_table")
> spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
> spark.sql("cache table dwd_table2 as select * from dwd_table1 where 
> name='p1'")
> spark.catalog.dropTempView("dwd_table1")
> //spark.catalog.dropTempView("ods_table")
> spark.sql("select * from dwd_table2").show()
> {code}
> It will keep ods_table1 in memory, although it will not been used anymore. It 
> waste memory, especially when my service diagram much more complex
> !screenshot-1.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table

2017-07-31 Thread ant_nebula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107374#comment-16107374
 ] 

ant_nebula commented on SPARK-19765:


This change is quite problematic for us.
It no longer supports the following scenario:
https://issues.apache.org/jira/browse/SPARK-21579

> UNCACHE TABLE should also un-cache all cached plans that refer to this table
> 
>
> Key: SPARK-19765
> URL: https://issues.apache.org/jira/browse/SPARK-19765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>  Labels: release_notes
> Fix For: 2.1.1, 2.2.0
>
>
> DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, 
> UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all 
> the cached plans that refer to this table



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21579) dropTempView has a critical BUG

2017-07-31 Thread ant_nebula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107371#comment-16107371
 ] 

ant_nebula commented on SPARK-21579:


It will keep ods_table1 in memory even though it is no longer used. This wastes 
memory, especially when my dependency graph is much more complex.
!screenshot-1.png!

> dropTempView has a critical BUG
> ---
>
> Key: SPARK-21579
> URL: https://issues.apache.org/jira/browse/SPARK-21579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: ant_nebula
>Priority: Critical
> Attachments: screenshot-1.png
>
>
> when I dropTempView dwd_table1 only, sub table dwd_table2 also disappear from 
> http://127.0.0.1:4040/storage/. 
> It affect version 2.1.1 and 2.2.0, 2.1.0 is ok for this problem.
> {code:java}
> val spark = 
> SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
> val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), 
> Row("p5", 40), Row("p6", 15))
> val schema = new StructType().add(StructField("name", 
> StringType)).add(StructField("age", IntegerType))
> val rowRDD = spark.sparkContext.parallelize(rows, 3)
> val df = spark.createDataFrame(rowRDD, schema)
> df.createOrReplaceTempView("ods_table")
> spark.sql("cache table ods_table")
> spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
> spark.sql("cache table dwd_table2 as select * from dwd_table1 where 
> name='p1'")
> spark.catalog.dropTempView("dwd_table1")
> //spark.catalog.dropTempView("ods_table")
> spark.sql("select * from dwd_table2").show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21579) dropTempView has a critical BUG

2017-07-31 Thread ant_nebula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ant_nebula updated SPARK-21579:
---
Attachment: screenshot-1.png

> dropTempView has a critical BUG
> ---
>
> Key: SPARK-21579
> URL: https://issues.apache.org/jira/browse/SPARK-21579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: ant_nebula
>Priority: Critical
> Attachments: screenshot-1.png
>
>
> when I dropTempView dwd_table1 only, sub table dwd_table2 also disappear from 
> http://127.0.0.1:4040/storage/. 
> It affect version 2.1.1 and 2.2.0, 2.1.0 is ok for this problem.
> {code:java}
> val spark = 
> SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
> val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), 
> Row("p5", 40), Row("p6", 15))
> val schema = new StructType().add(StructField("name", 
> StringType)).add(StructField("age", IntegerType))
> val rowRDD = spark.sparkContext.parallelize(rows, 3)
> val df = spark.createDataFrame(rowRDD, schema)
> df.createOrReplaceTempView("ods_table")
> spark.sql("cache table ods_table")
> spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
> spark.sql("cache table dwd_table2 as select * from dwd_table1 where 
> name='p1'")
> spark.catalog.dropTempView("dwd_table1")
> //spark.catalog.dropTempView("ods_table")
> spark.sql("select * from dwd_table2").show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21561) spark-streaming-kafka-010 DSteam is not pulling anything from Kafka

2017-07-31 Thread Vlad Badelita (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107324#comment-16107324
 ] 

Vlad Badelita commented on SPARK-21561:
---

Sorry, I didn't know where to ask about this issue. I have since found out it was 
an actual bug that someone else also experienced:
https://issues.apache.org/jira/browse/SPARK-18779

However, it was a Kafka bug, not a spark-streaming one, and it was solved here:
https://issues.apache.org/jira/browse/KAFKA-4547

Changing the kafka-clients version to 0.10.2.1 fixed it for me.
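
For reference, a minimal sketch of how that pin can be expressed, assuming an sbt build (the Spark version below just mirrors the affected version in this report):

{code:scala}
// Keep the 0.10 integration, but force the newer kafka-clients that contains the KAFKA-4547 fix.
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.1"
dependencyOverrides += "org.apache.kafka" % "kafka-clients" % "0.10.2.1"
{code}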

> spark-streaming-kafka-010 DSteam is not pulling anything from Kafka
> ---
>
> Key: SPARK-21561
> URL: https://issues.apache.org/jira/browse/SPARK-21561
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
>Reporter: Vlad Badelita
>  Labels: kafka-0.10, spark-streaming
>
> I am trying to use spark-streaming-kafka-0.10 to pull messages from a kafka 
> topic(broker version 0.10). I have checked that messages are being produced 
> and used a KafkaConsumer to pull them successfully. Now, when I try to use 
> the spark streaming api, I am not getting anything. If I just use 
> KafkaUtils.createRDD and specify some offset ranges manually it works. But 
> when, I try to use createDirectStream, all the rdds are empty and when I 
> check the partition offsets it simply reports that all partitions are 0. Here 
> is what I tried:
> {code:scala}
>  val sparkConf = new SparkConf().setAppName("kafkastream")
>  val ssc = new StreamingContext(sparkConf, Seconds(3))
>  val topics = Array("my_topic")
>  val kafkaParams = Map[String, Object](
>"bootstrap.servers" -> "hostname:6667"
>"key.deserializer" -> classOf[StringDeserializer],
>"value.deserializer" -> classOf[StringDeserializer],
>"group.id" -> "my_group",
>"auto.offset.reset" -> "earliest",
>"enable.auto.commit" -> (true: java.lang.Boolean)
>  )
>  val stream = KafkaUtils.createDirectStream[String, String](
>ssc,
>PreferConsistent,
>Subscribe[String, String](topics, kafkaParams)
>  )
>  stream.foreachRDD { rdd =>
>val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>rdd.foreachPartition { iter =>
>  val o: OffsetRange = offsetRanges(TaskContext.get.partitionId)
>  println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
>}
>val rddCount = rdd.count()
>println("rdd count: ", rddCount)
>// stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
>  }
>  ssc.start()
>  ssc.awaitTermination()
> {code}
> All partitions show offset ranges from 0 to 0 and all rdds are empty. I would 
> like it to start from the beginning of a partition but also pick up 
> everything that is being produced to it.
> I have also tried using spark-streaming-kafka-0.8 and it does work. I think 
> it is a 0.10 issue because everything else works fine. Thank you!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21579) dropTempView has a critical BUG

2017-07-31 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107296#comment-16107296
 ] 

Takeshi Yamamuro edited comment on SPARK-21579 at 7/31/17 1:17 PM:
---

I think this is an expected behaviour in Spark; `cache1` is a sub-tree of 
`cache2`, so if you uncache `cache1`, Spark also needs to uncache `cache2`.
{code}
scala> Seq(("name1", 28)).toDF("name", "age").createOrReplaceTempView("base")
scala> sql("cache table cache1 as select * from base where age >= 25")
scala> sql("cache table cache2 as select * from cache1 where name = 'p1'")
scala> spark.table("cache1").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache1`
 +- *Project [_1#2 AS name#5, _2#3 AS age#6]
+- *Filter (_2#3 >= 25)
   +- LocalTableScan [_1#2, _2#3]

scala> spark.table("cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache2`
 +- *Filter (isnotnull(name#5) && (name#5 = p1))
+- InMemoryTableScan [name#5, age#6], [isnotnull(name#5), (name#5 = 
p1)]
  +- InMemoryRelation [name#5, age#6], true, 1, 
StorageLevel(disk, memory, deserialized, 1 replicas), `cache1`
+- *Project [_1#2 AS name#5, _2#3 AS age#6]
   +- *Filter (_2#3 >= 25)
  +- LocalTableScan [_1#2, _2#3]
{code}

Probably, you better do this;
{code}
scala> sql("create temporary view tmp1 as select * from base where age >= 25")
scala> sql("cache table cache1 as select * from tmp1")
scala> sql("cache table cache2 as select * from tmp1 where name = 'p1'")
scala> catalog.dropTempView("cache1")
// scala> sql("select * from cache1").explain --> Throws AnalysisException 
`Table or view not found`
scala> sql("select * from cache2").explain
scala> sql("select * from cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache2`
 +- *Project [_1#2 AS name#5, _2#3 AS age#6]
+- *Filter (((_2#3 >= 25) && isnotnull(_1#2)) && (_1#2 = p1))
   +- LocalTableScan [_1#2, _2#3]
{code}


was (Author: maropu):
I think this is an expected behaviour in Spark;  `cache1` is a sub-tree of 
`cache2`, so if you uncache `cache1`, spark also uncaches `cache2`.
{code}
scala> Seq(("name1", 28)).toDF("name", "age").createOrReplaceTempView("base")
scala> sql("cache table cache1 as select * from base where age >= 25")
scala> sql("cache table cache2 as select * from cache1 where name = 'p1'")
scala> spark.table("cache1").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache1`
 +- *Project [_1#2 AS name#5, _2#3 AS age#6]
+- *Filter (_2#3 >= 25)
   +- LocalTableScan [_1#2, _2#3]

scala> spark.table("cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache2`
 +- *Filter (isnotnull(name#5) && (name#5 = p1))
+- InMemoryTableScan [name#5, age#6], [isnotnull(name#5), (name#5 = 
p1)]
  +- InMemoryRelation [name#5, age#6], true, 1, 
StorageLevel(disk, memory, deserialized, 1 replicas), `cache1`
+- *Project [_1#2 AS name#5, _2#3 AS age#6]
   +- *Filter (_2#3 >= 25)
  +- LocalTableScan [_1#2, _2#3]
{code}

Probably, you better do this;
{code}
scala> sql("create temporary view tmp1 as select * from base where age >= 25")
scala> sql("cache table cache1 as select * from tmp1")
scala> sql("cache table cache2 as select * from tmp1 where name = 'p1'")
scala> catalog.dropTempView("cache1")
// scala> sql("select * from cache1").explain --> Throws AnalysisException 
`Table or view not found`
scala> sql("select * from cache2").explain
scala> sql("select * from cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache2`
 +- *Project [_1#2 AS name#5, _2#3 AS age#6]
+- *Filter (((_2#3 >= 25) && isnotnull(_1#2)) && (_1#2 = p1))
   +- LocalTableScan [_1#2, _2#3]
{code}

> dropTempView has a critical BUG
> ---
>
> Key: SPARK-21579
> URL: https://issues.apache.org/jira/browse/SPARK-21579
> Project: Spark
>  Issue Type: Bug
>  

[jira] [Commented] (SPARK-21579) dropTempView has a critical BUG

2017-07-31 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107296#comment-16107296
 ] 

Takeshi Yamamuro commented on SPARK-21579:
--

I think this is an expected behaviour;  `cache1` is a sub-tree of `cache2`, so 
if you uncache `cache1`, spark also uncaches `cache2`.
{code}
scala> Seq(("name1", 28)).toDF("name", "age").createOrReplaceTempView("base")
scala> sql("cache table cache1 as select * from base where age >= 25")
scala> sql("cache table cache2 as select * from cache1 where name = 'p1'")
scala> spark.table("cache1").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache1`
 +- *Project [_1#2 AS name#5, _2#3 AS age#6]
+- *Filter (_2#3 >= 25)
   +- LocalTableScan [_1#2, _2#3]

scala> spark.table("cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache2`
 +- *Filter (isnotnull(name#5) && (name#5 = p1))
+- InMemoryTableScan [name#5, age#6], [isnotnull(name#5), (name#5 = 
p1)]
  +- InMemoryRelation [name#5, age#6], true, 1, 
StorageLevel(disk, memory, deserialized, 1 replicas), `cache1`
+- *Project [_1#2 AS name#5, _2#3 AS age#6]
   +- *Filter (_2#3 >= 25)
  +- LocalTableScan [_1#2, _2#3]
{code}

Probably, you better do this;
{code}
scala> sql("create temporary view tmp1 as select * from base where age >= 25")
scala> sql("cache table cache1 as select * from tmp1")
scala> sql("cache table cache2 as select * from tmp1 where name = 'p1'")
scala> catalog.dropTempView("cache1")
// scala> sql("select * from cache1").explain --> Throws AnalysisException 
`Table or view not found`
scala> sql("select * from cache2").explain
scala> sql("select * from cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache2`
 +- *Project [_1#2 AS name#5, _2#3 AS age#6]
+- *Filter (((_2#3 >= 25) && isnotnull(_1#2)) && (_1#2 = p1))
   +- LocalTableScan [_1#2, _2#3]
{code}

> dropTempView has a critical BUG
> ---
>
> Key: SPARK-21579
> URL: https://issues.apache.org/jira/browse/SPARK-21579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: ant_nebula
>Priority: Critical
>
> when I dropTempView dwd_table1 only, sub table dwd_table2 also disappear from 
> http://127.0.0.1:4040/storage/. 
> It affect version 2.1.1 and 2.2.0, 2.1.0 is ok for this problem.
> {code:java}
> val spark = 
> SparkSession.builder.master("local").appName("sparkTest").getOrCreate()
> val rows = Seq(Row("p1", 30), Row("p2", 20), Row("p3", 25), Row("p4", 10), 
> Row("p5", 40), Row("p6", 15))
> val schema = new StructType().add(StructField("name", 
> StringType)).add(StructField("age", IntegerType))
> val rowRDD = spark.sparkContext.parallelize(rows, 3)
> val df = spark.createDataFrame(rowRDD, schema)
> df.createOrReplaceTempView("ods_table")
> spark.sql("cache table ods_table")
> spark.sql("cache table dwd_table1 as select * from ods_table where age>=25")
> spark.sql("cache table dwd_table2 as select * from dwd_table1 where 
> name='p1'")
> spark.catalog.dropTempView("dwd_table1")
> //spark.catalog.dropTempView("ods_table")
> spark.sql("select * from dwd_table2").show()
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21579) dropTempView has a critical BUG

2017-07-31 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107296#comment-16107296
 ] 

Takeshi Yamamuro edited comment on SPARK-21579 at 7/31/17 1:16 PM:
---

I think this is an expected behaviour in Spark;  `cache1` is a sub-tree of 
`cache2`, so if you uncache `cache1`, spark also uncaches `cache2`.
{code}
scala> Seq(("name1", 28)).toDF("name", "age").createOrReplaceTempView("base")
scala> sql("cache table cache1 as select * from base where age >= 25")
scala> sql("cache table cache2 as select * from cache1 where name = 'p1'")
scala> spark.table("cache1").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache1`
 +- *Project [_1#2 AS name#5, _2#3 AS age#6]
+- *Filter (_2#3 >= 25)
   +- LocalTableScan [_1#2, _2#3]

scala> spark.table("cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache2`
 +- *Filter (isnotnull(name#5) && (name#5 = p1))
+- InMemoryTableScan [name#5, age#6], [isnotnull(name#5), (name#5 = 
p1)]
  +- InMemoryRelation [name#5, age#6], true, 1, 
StorageLevel(disk, memory, deserialized, 1 replicas), `cache1`
+- *Project [_1#2 AS name#5, _2#3 AS age#6]
   +- *Filter (_2#3 >= 25)
  +- LocalTableScan [_1#2, _2#3]
{code}

Probably, you better do this;
{code}
scala> sql("create temporary view tmp1 as select * from base where age >= 25")
scala> sql("cache table cache1 as select * from tmp1")
scala> sql("cache table cache2 as select * from tmp1 where name = 'p1'")
scala> catalog.dropTempView("cache1")
// scala> sql("select * from cache1").explain --> Throws AnalysisException 
`Table or view not found`
scala> sql("select * from cache2").explain
scala> sql("select * from cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache2`
 +- *Project [_1#2 AS name#5, _2#3 AS age#6]
+- *Filter (((_2#3 >= 25) && isnotnull(_1#2)) && (_1#2 = p1))
   +- LocalTableScan [_1#2, _2#3]
{code}


was (Author: maropu):
I think this is an expected behaviour;  `cache1` is a sub-tree of `cache2`, so 
if you uncache `cache1`, spark also uncaches `cache2`.
{code}
scala> Seq(("name1", 28)).toDF("name", "age").createOrReplaceTempView("base")
scala> sql("cache table cache1 as select * from base where age >= 25")
scala> sql("cache table cache2 as select * from cache1 where name = 'p1'")
scala> spark.table("cache1").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache1`
 +- *Project [_1#2 AS name#5, _2#3 AS age#6]
+- *Filter (_2#3 >= 25)
   +- LocalTableScan [_1#2, _2#3]

scala> spark.table("cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache2`
 +- *Filter (isnotnull(name#5) && (name#5 = p1))
+- InMemoryTableScan [name#5, age#6], [isnotnull(name#5), (name#5 = 
p1)]
  +- InMemoryRelation [name#5, age#6], true, 1, 
StorageLevel(disk, memory, deserialized, 1 replicas), `cache1`
+- *Project [_1#2 AS name#5, _2#3 AS age#6]
   +- *Filter (_2#3 >= 25)
  +- LocalTableScan [_1#2, _2#3]
{code}

Probably, you better do this;
{code}
scala> sql("create temporary view tmp1 as select * from base where age >= 25")
scala> sql("cache table cache1 as select * from tmp1")
scala> sql("cache table cache2 as select * from tmp1 where name = 'p1'")
scala> catalog.dropTempView("cache1")
// scala> sql("select * from cache1").explain --> Throws AnalysisException 
`Table or view not found`
scala> sql("select * from cache2").explain
scala> sql("select * from cache2").explain
== Physical Plan ==
InMemoryTableScan [name#5, age#6]
   +- InMemoryRelation [name#5, age#6], true, 1, StorageLevel(disk, memory, 
deserialized, 1 replicas), `cache2`
 +- *Project [_1#2 AS name#5, _2#3 AS age#6]
+- *Filter (((_2#3 >= 25) && isnotnull(_1#2)) && (_1#2 = p1))
   +- LocalTableScan [_1#2, _2#3]
{code}

> dropTempView has a critical BUG
> ---
>
> Key: SPARK-21579
> URL: https://issues.apache.org/jira/browse/SPARK-21579
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>

[jira] [Commented] (SPARK-21559) Remove Mesos fine-grained mode

2017-07-31 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107147#comment-16107147
 ] 

Stavros Kontopoulos commented on SPARK-21559:
-

Pr here: https://github.com/apache/spark/pull/18784

> Remove Mesos fine-grained mode
> --
>
> Key: SPARK-21559
> URL: https://issues.apache.org/jira/browse/SPARK-21559
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>
> After discussing this with people from Mesosphere we agreed that it is time 
> to remove fine grained mode. Plans are to improve cluster mode to cover any 
> benefits may existed when using fine grained mode.
>  [~susanxhuynh]
> Previous status of this can be found here:
> https://issues.apache.org/jira/browse/SPARK-11857



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21559) Remove Mesos fine-grained mode

2017-07-31 Thread Stavros Kontopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107147#comment-16107147
 ] 

Stavros Kontopoulos edited comment on SPARK-21559 at 7/31/17 11:11 AM:
---

PR here: https://github.com/apache/spark/pull/18784


was (Author: skonto):
Pr here: https://github.com/apache/spark/pull/18784

> Remove Mesos fine-grained mode
> --
>
> Key: SPARK-21559
> URL: https://issues.apache.org/jira/browse/SPARK-21559
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Stavros Kontopoulos
>
> After discussing this with people from Mesosphere we agreed that it is time 
> to remove fine grained mode. Plans are to improve cluster mode to cover any 
> benefits may existed when using fine grained mode.
>  [~susanxhuynh]
> Previous status of this can be found here:
> https://issues.apache.org/jira/browse/SPARK-11857



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21578) Consolidate redundant SparkContext constructors due to SI-8479

2017-07-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21578:
-

  Assignee: Dongjoon Hyun
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

> Consolidate redundant SparkContext constructors due to SI-8479
> --
>
> Key: SPARK-21578
> URL: https://issues.apache.org/jira/browse/SPARK-21578
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> Due to SI-8479, 
> [SPARK-1093|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L148-L181]
>  introduced redundant constructors.
> Since [SI-8479|https://issues.scala-lang.org/browse/SI-8479] is fixed in 
> Scala 2.10.5 and Scala 2.11.1. We can remove them safely.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21544) Test jar of some module should not install or deploy twice

2017-07-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21544:
-

Assignee: zhoukang
Priority: Minor  (was: Major)

> Test jar of some module should not install or deploy twice
> --
>
> Key: SPARK-21544
> URL: https://issues.apache.org/jira/browse/SPARK-21544
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.1, 2.1.0
>Reporter: zhoukang
>Assignee: zhoukang
>Priority: Minor
>
> For moudle below:
> common/network-common
> streaming
> sql/core
> sql/catalyst
> tests.jar will install or deploy twice.Like:
> {code:java}
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Writing tracking file 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/_remote.repositories
> [DEBUG] Installing 
> org.apache.spark:spark-streaming_2.11:2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata.xml
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/maven-metadata-local.xml
> [DEBUG] Installing org.apache.spark:spark-streaming_2.11/maven-metadata.xml 
> to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/maven-metadata-local.xml
> [INFO] Installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
> [DEBUG] Skipped re-installing 
> /home/mi/Work/Spark/scala2.11/spark/streaming/target/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar
>  to 
> /home/mi/.m2/repository/org/apache/spark/spark-streaming_2.11/2.1.0-mdh2.1.0.1-SNAPSHOT/spark-streaming_2.11-2.1.0-mdh2.1.0.1-SNAPSHOT-tests.jar,
>  seems unchanged
> {code}
> The reason is below:
> {code:java}
> [DEBUG]   (f) artifact = 
> org.apache.spark:spark-streaming_2.11:jar:2.1.0-mdh2.1.0.1-SNAPSHOT
> [DEBUG]   (f) attachedArtifacts = 
> [org.apache.spark:spark-streaming_2.11:test-jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:jar:tests:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark
> -streaming_2.11:java-source:sources:2.1.0-mdh2.1.0.1-SNAPSHOT, 
> org.apache.spark:spark-streaming_2.11:java-source:test-sources:2.1.0-mdh2.1.0.1-SNAPSHOT,
>  org.apache.spark:spark-streaming_2.11:javadoc:javadoc:2.1.0
> -mdh2.1.0.1-SNAPSHOT]
> {code}
> when executing 'mvn deploy' to nexus during release.I will fail since release 
> nexus can not be override.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21581) Spark 2.x distinct return incorrect result

2017-07-31 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107081#comment-16107081
 ] 

Takeshi Yamamuro commented on SPARK-21581:
--

This is an expected behaviour in spark-2.x and the behaviour changes when 
corrupt records hit.

{code}
// Spark-v1.6.3
scala> sqlContext.read.json("simple.json").show
+---+--++
|   name|salary| url|
+---+--++
| staff1| 600.0|http://example.ho...|
| staff2| 700.0|http://example.ho...|
| staff3| 800.0|http://example.ho...|
| staff4| 900.0|http://example.ho...|
| staff5|1000.0|http://example.ho...|
| staff6|  null|http://example.ho...|
| staff7|  null|http://example.ho...|
| staff8|  null|http://example.ho...|
| staff9|  null|http://example.ho...|
|staff10|  null|http://example.ho...|
+---+--++

// master
scala> spark.read.schema("name STRING, salary DOUBLE, url STRING, _malformed 
STRING").option("columnNameOfCorruptRecord", 
"_malformed").json("/Users/maropu/Desktop/simple.json").show
+--+--+++
|  name|salary| url|  _malformed|
+--+--+++
|staff1| 600.0|http://example.ho...|null|
|staff2| 700.0|http://example.ho...|null|
|staff3| 800.0|http://example.ho...|null|
|staff4| 900.0|http://example.ho...|null|
|staff5|1000.0|http://example.ho...|null|
|  null|  null|null|{"url": "http://e...|
|  null|  null|null|{"url": "http://e...|
|  null|  null|null|{"url": "http://e...|
|  null|  null|null|{"url": "http://e...|
|  null|  null|null|{"url": "http://e...|
+--+--+++
{code}

> Spark 2.x distinct return incorrect result
> --
>
> Key: SPARK-21581
> URL: https://issues.apache.org/jira/browse/SPARK-21581
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: shengyao piao
>
> Hi all
> I'm using Spark2.x on cdh5.11
> I have a json file as follows.
> ・sample.json
> {code}
> {"url": "http://example.hoge/staff1;, "name": "staff1", "salary":600.0}
> {"url": "http://example.hoge/staff2;, "name": "staff2", "salary":700}
> {"url": "http://example.hoge/staff3;, "name": "staff3", "salary":800}
> {"url": "http://example.hoge/staff4;, "name": "staff4", "salary":900}
> {"url": "http://example.hoge/staff5;, "name": "staff5", "salary":1000.0}
> {"url": "http://example.hoge/staff6;, "name": "staff6", "salary":""}
> {"url": "http://example.hoge/staff7;, "name": "staff7", "salary":""}
> {"url": "http://example.hoge/staff8;, "name": "staff8", "salary":""}
> {"url": "http://example.hoge/staff9;, "name": "staff9", "salary":""}
> {"url": "http://example.hoge/staff10;, "name": "staff10", "salary":""}
> {code}
> And I try to read this file and distinct.
> ・spark code
> {code}
> val s=spark.read.json("sample.json")
> s.count
> res13: Long = 10
> s.distinct.count
> res14: Long = 6< - It's should be 10
> {code}
> I know the cause of incorrect result is by mixed type in salary field.
> But when I try the same code in Spark 1.6 the result will be 10.
> So I think it's a bug in Spark 2.x.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21581) Spark 2.x distinct return incorrect result

2017-07-31 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107081#comment-16107081
 ] 

Takeshi Yamamuro edited comment on SPARK-21581 at 7/31/17 10:08 AM:


This is an expected behaviour in spark-2.x and the behaviour changes when 
corrupt records hit.

{code}
// Spark-v1.6.3
scala> sqlContext.read.json("simple.json").show
+---+--++
|   name|salary| url|
+---+--++
| staff1| 600.0|http://example.ho...|
| staff2| 700.0|http://example.ho...|
| staff3| 800.0|http://example.ho...|
| staff4| 900.0|http://example.ho...|
| staff5|1000.0|http://example.ho...|
| staff6|  null|http://example.ho...|
| staff7|  null|http://example.ho...|
| staff8|  null|http://example.ho...|
| staff9|  null|http://example.ho...|
|staff10|  null|http://example.ho...|
+---+--++

// master
scala> spark.read.schema("name STRING, salary DOUBLE, url STRING, _malformed 
STRING").option("columnNameOfCorruptRecord", 
"_malformed").json("simple.json").show
+--+--+++
|  name|salary| url|  _malformed|
+--+--+++
|staff1| 600.0|http://example.ho...|null|
|staff2| 700.0|http://example.ho...|null|
|staff3| 800.0|http://example.ho...|null|
|staff4| 900.0|http://example.ho...|null|
|staff5|1000.0|http://example.ho...|null|
|  null|  null|null|{"url": "http://e...|
|  null|  null|null|{"url": "http://e...|
|  null|  null|null|{"url": "http://e...|
|  null|  null|null|{"url": "http://e...|
|  null|  null|null|{"url": "http://e...|
+--+--+++
{code}


was (Author: maropu):
This is an expected behaviour in spark-2.x and the behaviour changes when 
corrupt records hit.

{code}
// Spark-v1.6.3
scala> sqlContext.read.json("simple.json").show
+---+--++
|   name|salary| url|
+---+--++
| staff1| 600.0|http://example.ho...|
| staff2| 700.0|http://example.ho...|
| staff3| 800.0|http://example.ho...|
| staff4| 900.0|http://example.ho...|
| staff5|1000.0|http://example.ho...|
| staff6|  null|http://example.ho...|
| staff7|  null|http://example.ho...|
| staff8|  null|http://example.ho...|
| staff9|  null|http://example.ho...|
|staff10|  null|http://example.ho...|
+---+--++

// master
scala> spark.read.schema("name STRING, salary DOUBLE, url STRING, _malformed 
STRING").option("columnNameOfCorruptRecord", 
"_malformed").json("/Users/maropu/Desktop/simple.json").show
+--+--+++
|  name|salary| url|  _malformed|
+--+--+++
|staff1| 600.0|http://example.ho...|null|
|staff2| 700.0|http://example.ho...|null|
|staff3| 800.0|http://example.ho...|null|
|staff4| 900.0|http://example.ho...|null|
|staff5|1000.0|http://example.ho...|null|
|  null|  null|null|{"url": "http://e...|
|  null|  null|null|{"url": "http://e...|
|  null|  null|null|{"url": "http://e...|
|  null|  null|null|{"url": "http://e...|
|  null|  null|null|{"url": "http://e...|
+--+--+++
{code}

> Spark 2.x distinct return incorrect result
> --
>
> Key: SPARK-21581
> URL: https://issues.apache.org/jira/browse/SPARK-21581
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: shengyao piao
>
> Hi all
> I'm using Spark2.x on cdh5.11
> I have a json file as follows.
> ・sample.json
> {code}
> {"url": "http://example.hoge/staff1;, "name": "staff1", "salary":600.0}
> {"url": "http://example.hoge/staff2;, "name": "staff2", "salary":700}
> {"url": "http://example.hoge/staff3;, "name": "staff3", "salary":800}
> {"url": "http://example.hoge/staff4;, "name": "staff4", "salary":900}
> {"url": "http://example.hoge/staff5;, "name": "staff5", "salary":1000.0}
> {"url": "http://example.hoge/staff6;, "name": "staff6", "salary":""}
> {"url": "http://example.hoge/staff7;, "name": "staff7", "salary":""}
> {"url": "http://example.hoge/staff8;, "name": "staff8", "salary":""}
> {"url": "http://example.hoge/staff9;, "name": "staff9", "salary":""}
> {"url": "http://example.hoge/staff10;, "name": "staff10", "salary":""}
> {code}
> And I try to read this file and distinct.
> ・spark code
> {code}

[jira] [Resolved] (SPARK-21099) INFO Log Message Using Incorrect Executor Idle Timeout Value

2017-07-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21099.
---
Resolution: Won't Fix

> INFO Log Message Using Incorrect Executor Idle Timeout Value
> 
>
> Key: SPARK-21099
> URL: https://issues.apache.org/jira/browse/SPARK-21099
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Hazem Mahmoud
>Priority: Trivial
>
> INFO log message is using the wrong idle timeout 
> (spark.dynamicAllocation.executorIdleTimeout) when printing the message that 
> the executor holding the RDD cache is being removed.
> INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been 
> idle for 30 seconds (new desired total will be 0)
> It should be using spark.dynamicAllocation.cachedExecutorIdleTimeout when the 
> RDD cache timeout is reached. I was able to confirm this by doing the 
> following:
> 1. Update spark-defaults.conf to set the following:
> executorIdleTimeout=30
> cachedExecutorIdleTimeout=20
> 2. Update log4j.properties to set the following:
> shell.log.level=INFO
> 3. Run the following in spark-shell:
> scala> val textFile = sc.textFile("/user/spark/applicationHistory/app_1234")
> scala> textFile.cache().count()
> 4. After 30 secs you will see 2 timeout messages, but of which are 30 secs 
> (whereas one *should* be for 20 secs)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21582) DataFrame.withColumnRenamed cause huge performance overhead

2017-07-31 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107014#comment-16107014
 ] 

Takeshi Yamamuro commented on SPARK-21582:
--

I feel making ~900 dataframes is heavy...

> DataFrame.withColumnRenamed cause huge performance overhead
> ---
>
> Key: SPARK-21582
> URL: https://issues.apache.org/jira/browse/SPARK-21582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: GuangFancui(ISCAS)
> Attachments: 4654.stack
>
>
> Table "item_feature" (DataFrame) has over 900 columns. 
> When I use
> {code:java}
> val nameSequeceExcept = Set("gid","category_name","merchant_id")
> val df1 = spark.table("item_feature")
> val newdf1 = df1.schema.map(_.name).filter(name => 
> !nameSequeceExcept.contains(name)).foldLeft(df1)((df1, name) => 
> df1.withColumnRenamed(name, name + "_1" ))
> {code}
> It took over 30 minutes.
> *PID* in stack file is *0x126d*
> It seems that _transform_ took too long time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20564) a lot of executor failures when the executor number is more than 2000

2017-07-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20564.
---
Resolution: Won't Fix

> a lot of executor failures when the executor number is more than 2000
> -
>
> Key: SPARK-20564
> URL: https://issues.apache.org/jira/browse/SPARK-20564
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.6.2, 2.1.0
>Reporter: Hua Liu
>Priority: Minor
>
> When we used more than 2000 executors in a Spark application, we noticed that a 
> large number of executors could not connect to the driver and as a result were 
> marked as failed. In some cases, the number of failed executors reached twice 
> the requested executor count, so applications retried and could eventually 
> fail.
> This is because YarnAllocator requests all missing containers every 
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, 
> YarnAllocator can ask for and get over 2000 containers in one request and 
> then launch them almost simultaneously. These thousands of executors try to 
> retrieve Spark props and register with the driver within seconds. However, the 
> driver handles executor registration, stop, removal and Spark props retrieval 
> in one thread, and it cannot handle such a large number of RPCs within a short 
> period of time. As a result, some executors cannot retrieve Spark props and/or 
> register. These executors are then marked as failed, causing executor removal 
> and aggravating the overloading of the driver, which leads to even more 
> executor failures.
> This patch adds an extra configuration, 
> spark.yarn.launchContainer.count.simultaneously, which caps the maximum 
> number of containers that the driver can ask for in every 
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of 
> executors grows steadily, the number of executor failures is reduced, and 
> applications can reach the desired number of executors faster.
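
A conceptual sketch of the capping idea described above (my own illustration, not the actual YarnAllocator code or the proposed patch):

{code:java}
// Instead of asking YARN for every missing container at once per heartbeat,
// ask for at most `maxPerHeartbeat` of them and let the total ramp up
// gradually over several heartbeats.
def containersToRequest(missingContainers: Int, maxPerHeartbeat: Int): Int =
  math.min(missingContainers, maxPerHeartbeat)

// e.g. with 2000 missing containers and a cap of 200, the allocator would
// request 200 per heartbeat and reach the target after ~10 heartbeats.
val firstHeartbeat = containersToRequest(missingContainers = 2000, maxPerHeartbeat = 200)
{code}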



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21582) DataFrame.withColumnRenamed cause huge performance overhead

2017-07-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107009#comment-16107009
 ] 

Sean Owen commented on SPARK-21582:
---

This is a really slow way to rename the columns. I think you want to just set 
the schema manually, or pass the column names in a call to toDF()? You have a 
chain of 900 operations here.
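
A minimal sketch of the toDF() approach suggested above, reusing the names from the ticket's snippet:

{code:java}
// Rename every column except the excluded ones in a single toDF call,
// instead of chaining ~900 withColumnRenamed operations.
val nameSequeceExcept = Set("gid", "category_name", "merchant_id")
val df1 = spark.table("item_feature")
val newdf1 = df1.toDF(
  df1.columns.map(c => if (nameSequeceExcept.contains(c)) c else c + "_1"): _*)
{code}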

> DataFrame.withColumnRenamed cause huge performance overhead
> ---
>
> Key: SPARK-21582
> URL: https://issues.apache.org/jira/browse/SPARK-21582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: GuangFancui(ISCAS)
> Attachments: 4654.stack
>
>
> Table "item_feature" (DataFrame) has over 900 columns. 
> When I use
> {code:java}
> val nameSequeceExcept = Set("gid","category_name","merchant_id")
> val df1 = spark.table("item_feature")
> val newdf1 = df1.schema.map(_.name).filter(name => 
> !nameSequeceExcept.contains(name)).foldLeft(df1)((df1, name) => 
> df1.withColumnRenamed(name, name + "_1" ))
> {code}
> It took over 30 minutes.
> The *PID* in the stack file is *0x126d*.
> It seems that _transform_ took too long.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21577) Issue is handling too many aggregations

2017-07-31 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106971#comment-16106971
 ] 

Hyukjin Kwon commented on SPARK-21577:
--

No ... I mean sending it to u...@spark.apache.org as described in the link I 
used so that other Spark users / developers can receive your email and discuss 
...

> Issue is handling too many aggregations 
> 
>
> Key: SPARK-21577
> URL: https://issues.apache.org/jira/browse/SPARK-21577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Cloudera CDH 1.8.3
> Spark 1.6.0
>Reporter: Kannan Subramanian
>
> My requirement: reading a table from Hive (size around 1.6 TB), I have to 
> do more than 200 aggregation operations, mostly avg, sum and std_dev. The 
> Spark application's total execution time is more than 12 hours. To optimize 
> the code I used shuffle partitioning and memory tuning, but it was not 
> helpful. Please note that I ran the same query in Hive on MapReduce; the MR 
> job completed in only around 5 hours. Kindly let me know whether there is any 
> way to optimize, or a more efficient way of handling, multiple aggregation 
> operations.
> val inputDataDF = hiveContext.read.parquet("/inputparquetData")
> inputDataDF.groupBy("seq_no","year", 
> "month","radius").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"),  
> avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),sum("sl"),sum($"PA"),sum($"DS")...
>  like 200 columns)
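
A minimal sketch of building such a long aggregation list programmatically; this only changes how the query is written, not the execution plan, and the column lists below are partial and illustrative:

{code:java}
import org.apache.spark.sql.functions.{avg, col, count, sum}

// Build the ~200 aggregation expressions as a collection instead of writing
// each call by hand. inputDataDF is the DataFrame from the snippet above.
val avgCols = Seq("Emp", "Ntw", "Age", "DAll", "PAll", "DSum", "dol")
val sumCols = Seq("sl", "PA", "DS")  // extend with the remaining columns
val aggExprs = count(col("Dseq")) +:
  (avgCols.map(c => avg(col(c))) ++ sumCols.map(c => sum(col(c))))

val result = inputDataDF
  .groupBy("seq_no", "year", "month", "radius")
  .agg(aggExprs.head, aggExprs.tail: _*)
{code}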



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21577) Issue is handling too many aggregations

2017-07-31 Thread Kannan Subramanian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106970#comment-16106970
 ] 

Kannan Subramanian commented on SPARK-21577:


Thanks Hyukjin Kwon,
 I have sent an email about the performance issue with a large number of 
aggregations to gurwls...@gmail.com

-Kannan


> Issue is handling too many aggregations 
> 
>
> Key: SPARK-21577
> URL: https://issues.apache.org/jira/browse/SPARK-21577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Cloudera CDH 1.8.3
> Spark 1.6.0
>Reporter: Kannan Subramanian
>
> My requirement: reading a table from Hive (size around 1.6 TB), I have to 
> do more than 200 aggregation operations, mostly avg, sum and std_dev. The 
> Spark application's total execution time is more than 12 hours. To optimize 
> the code I used shuffle partitioning and memory tuning, but it was not 
> helpful. Please note that I ran the same query in Hive on MapReduce; the MR 
> job completed in only around 5 hours. Kindly let me know whether there is any 
> way to optimize, or a more efficient way of handling, multiple aggregation 
> operations.
> val inputDataDF = hiveContext.read.parquet("/inputparquetData")
> inputDataDF.groupBy("seq_no","year", 
> "month","radius").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"),  
> avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),sum("sl"),sum($"PA"),sum($"DS")...
>  like 200 columns)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21582) DataFrame.withColumnRenamed cause huge performance overhead

2017-07-31 Thread GuangFancui(ISCAS) (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

GuangFancui(ISCAS) updated SPARK-21582:
---
Attachment: 4654.stack

PID is 0x126d

> DataFrame.withColumnRenamed cause huge performance overhead
> ---
>
> Key: SPARK-21582
> URL: https://issues.apache.org/jira/browse/SPARK-21582
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: GuangFancui(ISCAS)
> Attachments: 4654.stack
>
>
> Table "item_feature" (DataFrame) has over 900 columns. 
> When I use
> {code:java}
> val nameSequeceExcept = Set("gid","category_name","merchant_id")
> val df1 = spark.table("item_feature")
> val newdf1 = df1.schema.map(_.name).filter(name => 
> !nameSequeceExcept.contains(name)).foldLeft(df1)((df1, name) => 
> df1.withColumnRenamed(name, name + "_1" ))
> {code}
> It took over 30 minutes.
> The *PID* in the stack file is *0x126d*.
> It seems that _transform_ took too long.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21582) DataFrame.withColumnRenamed cause huge performance overhead

2017-07-31 Thread GuangFancui(ISCAS) (JIRA)
GuangFancui(ISCAS) created SPARK-21582:
--

 Summary: DataFrame.withColumnRenamed cause huge performance 
overhead
 Key: SPARK-21582
 URL: https://issues.apache.org/jira/browse/SPARK-21582
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: GuangFancui(ISCAS)


Table "item_feature" (DataFrame) has over 900 columns. 
When I use

{code:java}
val nameSequeceExcept = Set("gid","category_name","merchant_id")
val df1 = spark.table("item_feature")
val newdf1 = df1.schema.map(_.name).filter(name => 
!nameSequeceExcept.contains(name)).foldLeft(df1)((df1, name) => 
df1.withColumnRenamed(name, name + "_1" ))
{code}


It took over 30 minutes.

The *PID* in the stack file is *0x126d*.
It seems that _transform_ took too long.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19043) Make SparkSQLSessionManager more configurable

2017-07-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19043.
---
Resolution: Won't Fix

> Make SparkSQLSessionManager more configurable
> -
>
> Key: SPARK-19043
> URL: https://issues.apache.org/jira/browse/SPARK-19043
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Kent Yao
>Priority: Minor
>
> With the Spark Thrift Server we can only set the size of the HiveServer2 
> background operation thread pool. In HiveServer2 itself, however, we can set 
> both the background operation thread pool size and the thread keep-alive 
> time. I think we can make the behavior the same as Hive's.
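
For context, a sketch of the HiveServer2 properties this refers to, to the best of my knowledge (values are illustrative only):

{code}
hive.server2.async.exec.threads=100
hive.server2.async.exec.keepalive.time=10s
{code}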



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21567) Dataset with Tuple of type alias throws error

2017-07-31 Thread Tomasz Bartczak (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106920#comment-16106920
 ] 

Tomasz Bartczak commented on SPARK-21567:
-

It's good that there is a workaround.

However, as I think about it, shouldn't Spark's implicits derive encoders 
correctly for basic data types, case classes, and other primitives like tuples?

I am not sure whether it is a Spark bug or a Scala bug, but from a user 
perspective it is unexpected behaviour.
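
One plausible workaround, offered purely as an assumption since the ticket does not spell out the one referred to above: ascribe the concrete tuple type so the encoder is derived from (Int, Int) rather than the alias.

{code:java}
// Hypothetical workaround sketch (assumes object C from the description and
// spark.implicits._ in scope): give the mapped value the concrete tuple type
// instead of the alias, so encoder derivation never sees C.TwoInt.
Seq(1).toDS().map(_ => ("", C.tupleTypeAlias): (String, (Int, Int)))
{code}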

> Dataset with Tuple of type alias throws error
> -
>
> Key: SPARK-21567
> URL: https://issues.apache.org/jira/browse/SPARK-21567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: verified for spark 2.1.1 and 2.2.0 in sbt build
>Reporter: Tomasz Bartczak
>
> When returning from a map a tuple that contains another tuple defined as a 
> type alias, we receive an error.
> minimal reproducible case:
> having a structure like this:
> {code}
> object C {
>   type TwoInt = (Int,Int)
>   def tupleTypeAlias: TwoInt = (1,1)
> }
> {code}
> when I do:
> {code}
> Seq(1).toDS().map(_ => ("",C.tupleTypeAlias))
> {code}
> I get exception:
> {code}
> type T1 is not a class
> scala.ScalaReflectionException: type T1 is not a class
>   at scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275)
>   at 
> scala.reflect.internal.Symbols$SymbolContextApiImpl.asClass(Symbols.scala:84)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:682)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:84)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:614)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
>   at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
>   at 
> org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
>   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
> {code}
> In Spark 2.1.1 the last exception was 'head of an empty list'.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21577) Issue is handling too many aggregations

2017-07-31 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106855#comment-16106855
 ] 

Hyukjin Kwon edited comment on SPARK-21577 at 7/31/17 6:26 AM:
---

Please check out https://spark.apache.org/community.html and I guess you can 
start the thread.


was (Author: hyukjin.kwon):
Please check out https://spark.apache.org/community.html.

> Issue is handling too many aggregations 
> 
>
> Key: SPARK-21577
> URL: https://issues.apache.org/jira/browse/SPARK-21577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Cloudera CDH 1.8.3
> Spark 1.6.0
>Reporter: Kannan Subramanian
>
> My requirement: reading a table from Hive (size around 1.6 TB), I have to 
> do more than 200 aggregation operations, mostly avg, sum and std_dev. The 
> Spark application's total execution time is more than 12 hours. To optimize 
> the code I used shuffle partitioning and memory tuning, but it was not 
> helpful. Please note that I ran the same query in Hive on MapReduce; the MR 
> job completed in only around 5 hours. Kindly let me know whether there is any 
> way to optimize, or a more efficient way of handling, multiple aggregation 
> operations.
> val inputDataDF = hiveContext.read.parquet("/inputparquetData")
> inputDataDF.groupBy("seq_no","year", 
> "month","radius").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"),  
> avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),sum("sl"),sum($"PA"),sum($"DS")...
>  like 200 columns)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21577) Issue is handling too many aggregations

2017-07-31 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106855#comment-16106855
 ] 

Hyukjin Kwon commented on SPARK-21577:
--

Please check out https://spark.apache.org/community.html.

> Issue is handling too many aggregations 
> 
>
> Key: SPARK-21577
> URL: https://issues.apache.org/jira/browse/SPARK-21577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Cloudera CDH 1.8.3
> Spark 1.6.0
>Reporter: Kannan Subramanian
>
> My requirement: reading a table from Hive (size around 1.6 TB), I have to 
> do more than 200 aggregation operations, mostly avg, sum and std_dev. The 
> Spark application's total execution time is more than 12 hours. To optimize 
> the code I used shuffle partitioning and memory tuning, but it was not 
> helpful. Please note that I ran the same query in Hive on MapReduce; the MR 
> job completed in only around 5 hours. Kindly let me know whether there is any 
> way to optimize, or a more efficient way of handling, multiple aggregation 
> operations.
> val inputDataDF = hiveContext.read.parquet("/inputparquetData")
> inputDataDF.groupBy("seq_no","year", 
> "month","radius").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"),  
> avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),sum("sl"),sum($"PA"),sum($"DS")...
>  like 200 columns)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21577) Issue is handling too many aggregations

2017-07-31 Thread Kannan Subramanian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16106848#comment-16106848
 ] 

Kannan Subramanian commented on SPARK-21577:


Please let me know if there is any mail thread started for the above issue?

Thanks

> Issue is handling too many aggregations 
> 
>
> Key: SPARK-21577
> URL: https://issues.apache.org/jira/browse/SPARK-21577
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Cloudera CDH 1.8.3
> Spark 1.6.0
>Reporter: Kannan Subramanian
>
> My requirement: reading a table from Hive (size around 1.6 TB), I have to 
> do more than 200 aggregation operations, mostly avg, sum and std_dev. The 
> Spark application's total execution time is more than 12 hours. To optimize 
> the code I used shuffle partitioning and memory tuning, but it was not 
> helpful. Please note that I ran the same query in Hive on MapReduce; the MR 
> job completed in only around 5 hours. Kindly let me know whether there is any 
> way to optimize, or a more efficient way of handling, multiple aggregation 
> operations.
> val inputDataDF = hiveContext.read.parquet("/inputparquetData")
> inputDataDF.groupBy("seq_no","year", 
> "month","radius").agg(count($"Dseq"),avg($"Emp"),avg($"Ntw"),avg($"Age"),  
> avg($"DAll"),avg($"PAll"),avg($"DSum"),avg($"dol"),sum("sl"),sum($"PA"),sum($"DS")...
>  like 200 columns)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org