[jira] [Updated] (SPARK-21358) Argument of repartitionandsortwithinpartitions at pyspark

2017-07-09 Thread chie hayashida (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chie hayashida updated SPARK-21358:
---
Summary: Argument of repartitionandsortwithinpartitions at pyspark  (was: 
variable of repartitionandsortwithinpartitions at pyspark)

> Argument of repartitionandsortwithinpartitions at pyspark
> -
>
> Key: SPARK-21358
> URL: https://issues.apache.org/jira/browse/SPARK-21358
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Examples
>Affects Versions: 2.1.1
>Reporter: chie hayashida
>Priority: Minor
>
> In rdd.py, the implementation of repartitionAndSortWithinPartitions is as below.
> ```
>    def repartitionAndSortWithinPartitions(self, numPartitions=None, partitionFunc=portable_hash,
>                                           ascending=True, keyfunc=lambda x: x):
> ```
> And in the documentation, there is the following sample script.
> ```
> >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
> >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
> ```
> The third argument (ascending) is expected to be a boolean, so I think the following
> script is better.
> ```
> >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
> >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21358) variable of repartitionandsortwithinpartitions at pyspark

2017-07-09 Thread chie hayashida (JIRA)
chie hayashida created SPARK-21358:
--

 Summary: variable of repartitionandsortwithinpartitions at pyspark
 Key: SPARK-21358
 URL: https://issues.apache.org/jira/browse/SPARK-21358
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Examples
Affects Versions: 2.1.1
Reporter: chie hayashida
Priority: Minor


In rdd.py, the implementation of repartitionAndSortWithinPartitions is as below.

```
   def repartitionAndSortWithinPartitions(self, numPartitions=None, partitionFunc=portable_hash,
                                          ascending=True, keyfunc=lambda x: x):
```
And in the documentation, there is the following sample script.

```
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
```

The third argument (ascending) is expected to be a boolean, so I think the following
script is better.
```
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
```
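
For reference, a minimal runnable sketch of the suggested call (it assumes an active
SparkContext `sc`, as in a pyspark shell); the output follows from partitioning by key
parity and sorting ascending within each partition.

```
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
>>> rdd2.glom().collect()  # keys with the same parity land in the same partition
[[(0, 5), (0, 8), (2, 6)], [(1, 3), (3, 8), (3, 8)]]
```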




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21357) FileInputDStream not remove out of date RDD

2017-07-09 Thread dadazheng (JIRA)
dadazheng created SPARK-21357:
-

 Summary: FileInputDStream not remove out of date RDD
 Key: SPARK-21357
 URL: https://issues.apache.org/jira/browse/SPARK-21357
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.1.0
Reporter: dadazheng


The method org.apache.spark.streaming.dstream.FileInputDStream.clearMetadata 
(at line 166) does not remove out-of-date RDDs from generatedRDDs.
This causes a memory leak and eventually OOM.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21083) Store zero size and row count after analyzing empty table

2017-07-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21083:

Fix Version/s: 2.1.2

> Store zero size and row count after analyzing empty table
> -
>
> Key: SPARK-21083
> URL: https://issues.apache.org/jira/browse/SPARK-21083
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2017-07-09 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Comment: was deleted

(was: !http://example.com/image.png!)

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
> Fix For: 2.1.1
>
> Attachments: test1.JPG, test2.JPG, test.JPG
>
>
> When there are large 'case when' expressions in Spark SQL, the CodeGenerator 
> fails to compile the generated code. 
> The error message is followed by a huge dump of the generated source code, and 
> compilation ultimately fails.
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> apply_9$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
>  of class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  grows beyond 64 KB.
> It seems that SPARK-13242 solved this problem in Spark 1.6.2; however, it 
> appears again in Spark 2.1.1. 
> https://issues.apache.org/jira/browse/SPARK-13242.
> Is there something wrong? 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2017-07-09 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: test2.JPG

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
> Fix For: 2.1.1
>
> Attachments: test1.JPG, test2.JPG, test.JPG
>
>
> When there are large 'case when' expressions in Spark SQL, the CodeGenerator 
> fails to compile the generated code. 
> The error message is followed by a huge dump of the generated source code, and 
> compilation ultimately fails.
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> apply_9$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
>  of class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  grows beyond 64 KB.
> It seems that SPARK-13242 solved this problem in Spark 1.6.2; however, it 
> appears again in Spark 2.1.1. 
> https://issues.apache.org/jira/browse/SPARK-13242.
> Is there something wrong? 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2017-07-09 Thread fengchaoge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079799#comment-16079799
 ] 

fengchaoge commented on SPARK-21337:


OK, I will have a try. Thank you very much.

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
> Fix For: 2.1.1
>
> Attachments: test1.JPG, test.JPG
>
>
> When there are large 'case when' expressions in Spark SQL, the CodeGenerator 
> fails to compile the generated code. 
> The error message is followed by a huge dump of the generated source code, and 
> compilation ultimately fails.
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> apply_9$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
>  of class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  grows beyond 64 KB.
> It seems that SPARK-13242 solved this problem in Spark 1.6.2; however, it 
> appears again in Spark 2.1.1. 
> https://issues.apache.org/jira/browse/SPARK-13242.
> Is there something wrong? 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2017-07-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079795#comment-16079795
 ] 

Hyukjin Kwon commented on SPARK-21337:
--

I think it would be nicer to narrow this down by executing the query 
multiple times after removing apparently unrelated parts. For example, I don't 
think we need many fields to reproduce this problem.
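
For instance, a minimal sketch along these lines might help (it assumes an active
SparkSession {{spark}}; the column name and chain length are illustrative only, not
the reporter's query):

{code}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(100).withColumn("v", F.col("id") % 100)

# Grow N until the "grows beyond 64 KB" codegen error appears (or does not).
N = 2000
expr = F.when(F.col("v") == 0, 0)
for i in range(1, N):
    expr = expr.when(F.col("v") == i, i)

df.select(expr.otherwise(-1).alias("big_case_when")).show()
{code}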

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
> Fix For: 2.1.1
>
> Attachments: test1.JPG, test.JPG
>
>
> When there are large 'case when' expressions in Spark SQL, the CodeGenerator 
> fails to compile the generated code. 
> The error message is followed by a huge dump of the generated source code, and 
> compilation ultimately fails.
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> apply_9$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
>  of class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  grows beyond 64 KB.
> It seems that SPARK-13242 solved this problem in Spark 1.6.2; however, it 
> appears again in Spark 2.1.1. 
> https://issues.apache.org/jira/browse/SPARK-13242.
> Is there something wrong? 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21356) CSV datasource failed to parse a value having newline in its value

2017-07-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21356.
--
Resolution: Invalid

I am resolving this as the workaround is easy and I am not sure it 
makes sense to allow a newline in a value without quotes for now.

> CSV datasource failed to parse a value having newline in its value
> --
>
> Key: SPARK-21356
> URL: https://issues.apache.org/jira/browse/SPARK-21356
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> This is related to SPARK-21355. I guess this is also a rather corner case. 
> I found this while testing SPARK-21289.
> It looks like a bug in Univocity.
> The code below fails to parse a newline in the value.
> {code}
> scala> spark.read.csv(Seq("a\nb", "abc").toDS).show()
> +---+
> |_c0|
> +---+
> |  a|
> |abc|
> +---+
> {code}
> But it can easily be worked around with quotes, as below:
> {code}
> scala> spark.read.csv(Seq("\"a\nb\"", "abc").toDS).show()
> +---+
> |_c0|
> +---+
> |a
> b|
> |abc|
> +---+
> {code}
> Meaning this works:
> with the file below:
> {code}
> "a
> b",abc
> {code}
> {code}
> scala> spark.read.option("multiLine", true).csv("tmp.csv").show()
> +---+---+
> |_c0|_c1|
> +---+---+
> |a
> b|abc|
> +---+---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2017-07-09 Thread fengchaoge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079773#comment-16079773
 ] 

fengchaoge commented on SPARK-21337:


1. create database GBD_DM_PAC_SAFE;
2. use GBD_DM_PAC_SAFE;
3. create table app_claim_assess_rule_granularity;
The SQL is like this, just for testing:

SELECT x___sql___.2jjg AS cjjg, x___sql___.3jjg AS djjg, (((CASE WHEN ((CASE 
WHEN (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
= 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
END) < 0.20001) THEN 1 WHEN ((CASE WHEN (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
= 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
END) < 0.5) THEN 2 ELSE 3 END) * 10) + (CASE WHEN ((CASE WHEN x___sql___.jcbj = 
0 THEN CAST(NULL AS DOUBLE) ELSE x___sql___.impairment_amount / x___sql___.jcbj 
END) < 300) THEN 1 WHEN ((CASE WHEN x___sql___.jcbj = 0 THEN CAST(NULL AS 
DOUBLE) ELSE x___sql___.impairment_amount / x___sql___.jcbj END) < 2000) THEN 2 
ELSE 3 END)) AS calculation_0290210162047568, x___sql___.updated_date AS 
calculation_0910125090644141, (CASE WHEN (x___sql___.small_type = '01') THEN 
'人工报价' ELSE (CASE WHEN (x___sql___.small_type = '02') THEN '指导人' ELSE 
x___sql___.small_type END) END) AS calculation_1700125090616887, 
x___sql___.impairment_amount AS calculation_1750125100625463, (CASE WHEN ((CASE 
WHEN (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
= 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
END) < 0.20001) THEN 1 WHEN ((CASE WHEN (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
= 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
END) < 0.5) THEN 2 ELSE 3 END) AS calculation_2170210160935298, (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
AS calculation_2390124154901057, (CASE WHEN (x___sql___.application_code = 
'DSFS') THEN '定损发送规则' ELSE x___sql___.application_code END) AS 
calculation_2770125090429540, x___sql___.rule_name AS 
calculation_3060125090537403, (CASE WHEN ((CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
< 10) THEN '暂不考虑规则' ELSE (CASE WHEN CASE WHEN ((CASE WHEN (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
= 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
END) < 0.20001) THEN 1 WHEN ((CASE WHEN (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
= 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
END) < 0.5) THEN 2 ELSE 3 END) * 10) + (CASE WHEN ((CASE WHEN x___sql___.jcbj = 
0 THEN CAST(NULL AS DOUBLE) ELSE x___sql___.impairment_amount / x___sql___.jcbj 
END) < 300) THEN 1 WHEN ((CASE WHEN x___sql___.jcbj = 0 THEN CAST(NULL AS 
DOUBLE) ELSE x___sql___.impairment_amount / x___sql___.jcbj END) < 2000) THEN 2 
ELSE 3 END)) = 13) THEN '最紧要优化规则' WHEN ((CASE WHEN ((CASE WHEN (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
= 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN 
(x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT 
(x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) 
END) < 

[jira] [Commented] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2017-07-09 Thread fengchaoge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079790#comment-16079790
 ] 

fengchaoge commented on SPARK-21337:


The attachments show what actually happened. I have no idea about code generation; 
can someone help? Thanks very much.

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
> Fix For: 2.1.1
>
> Attachments: test1.JPG, test.JPG
>
>
> When there are large 'case when' expressions in Spark SQL, the CodeGenerator 
> fails to compile the generated code. 
> The error message is followed by a huge dump of the generated source code, and 
> compilation ultimately fails.
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> apply_9$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
>  of class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  grows beyond 64 KB.
> It seems that SPARK-13242 solved this problem in Spark 1.6.2; however, it 
> appears again in Spark 2.1.1. 
> https://issues.apache.org/jira/browse/SPARK-13242.
> Is there something wrong? 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21355) JSON datasource failed to parse a value having newline in its value

2017-07-09 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21355.
--
Resolution: Invalid

I am resolving this per https://stackoverflow.com/a/42073.

{quote}
This is of course correct, but I'd like to add the reason for having to do 
this: the JSON spec at ietf.org/rfc/rfc4627.txt contains this sentence in 
section 2.5: "All Unicode characters may be placed within the quotation marks 
except for the characters that must be escaped: quotation mark, reverse 
solidus, and the control characters (U+0000 through U+001F)." Since a newline 
is a control character, it must be escaped.
{quote}
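
As a quick illustration of that point, escaping the newline as \n inside the JSON
string parses fine (a hedged sketch, assuming an active SparkSession {{spark}}):

{code}
# The JSON value contains an escaped newline (\n) rather than a raw one,
# so it parses into a string with an embedded newline as expected.
rdd = spark.sparkContext.parallelize(['{"f": "a\\nb"}', '{"f": "abc"}'])
spark.read.json(rdd).show()
{code}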

> JSON datasource failed to parse a value having newline in its value
> ---
>
> Key: SPARK-21355
> URL: https://issues.apache.org/jira/browse/SPARK-21355
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> I guess this is a rather corner case. I found this while testing SPARK-21289.
> It looks like a bug in Jackson.
> The code below fails to parse a newline in the value.
> {code}
> scala> spark.read.json(Seq("{ \"f\": \"a\nb\"}", "{ \"f\": 
> \"abc\"}").toDS).show()
> +---++
> |_corrupt_record|   f|
> +---++
> |  { "f": "a
> b"}|null|
> |   null| abc|
> +---++
> {code}
> Meaning this also does not work
> with the JSON files as below:
> {code}
> {"f": "
> d",  "f0": 3}
> {code}
> {code}
> scala> spark.read.option("multiLine", true).json("tmp.json").show()
> ++
> | _corrupt_record|
> ++
> |{"f": "
> d",  "f0"...|
> ++
> {code}
> Of course, the code below works:
> {code}
> scala> spark.read.json(Seq("{ \"f\": \"ab\"}", "{ \"f\": 
> \"abc\"}").toDS).show()
> +---+
> |  f|
> +---+
> | ab|
> |abc|
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2017-07-09 Thread fengchaoge (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079785#comment-16079785
 ] 

fengchaoge commented on SPARK-21337:


Thank you very much. What should I do next? Thank you for your guidance.

The table is like this:
CREATE TABLE app_claim_assess_rule_granularity(
  report_no string,   
  case_times string,  
  id_clm_channel_process string,  
  loss_object_no string,  
  assess_times string,
  loss_name string,   
  max_loss_amount string, 
  impairment_amount string,   
  rule_code string,   
  rule_name string,   
  application_code string,
  brand_name string,  
  manufacturer_name string,   
  series_name string, 
  group_name string,  
  model_name string,  
  end_case_date string,   
  updated_date string,
  assess_um string,   
  car_mark string,
  garage_code string, 
  garage_name string  , 
  garage_type string  , 
  privilege_group_name string  , 
  small_type string,
  is_transfer string,  
  praepostor_type string,  
  channel_type string,  
  channel_flag string,  
  loss_type string, 
  loss_agree_amount string,  
  loss_count_agree string,   
  department_code string,   
  department_code_01 string,  
  department_code_02_v string,   
  department_code_03 string,   
  department_code_04 string,  
  department_code_name_01 string,   
  department_code_name_02 string,  
  department_code_name_03 string,  
  department_code_name_04 string,  
  assess_dept_code string,   
  verify_department_code_01 string,  
  verify_department_code_02 string,  
  verify_department_code_03 string,  
  verify_department_code_04 string,   
  verify_department_code_name_01 string,  
  verify_department_code_name_02 string,  
  verify_department_code_name_03 string,   
  verify_department_code_name_04 string,  
  assess_quote_price_um string,  
  assess_guide_um string,  
  assess_center_guide_um string,  
  rule_type string,  
  loss_count_assess string,  
  loss_name_rank string,  
  loss_name_rule_rank string,   
  both_trigger string)
PARTITIONED BY ( 
  department_code_02 string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' 
WITH SERDEPROPERTIES ( 
  'field.delim'='\u0001', 
  'serialization.format'='\u0001') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.RCFileInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.RCFileOutputFormat'
LOCATION
  
'hdfs://hdp-hdfs01/user/hive/warehouse/gbd_dm_pac_safe.db/app_claim_assess_rule_granularity'
TBLPROPERTIES (
  'transient_lastDdlTime'='1499412897')

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
> Fix For: 2.1.1
>
> Attachments: test1.JPG, test.JPG
>
>
> When there are large 'case when' expressions in Spark SQL, the CodeGenerator 
> fails to compile the generated code. 
> The error message is followed by a huge dump of the generated source code, and 
> compilation ultimately fails.
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> apply_9$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
>  of class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  grows beyond 64 KB.
> It seems that SPARK-13242 solved this problem in Spark 1.6.2; however, it 
> appears again in Spark 2.1.1. 
> https://issues.apache.org/jira/browse/SPARK-13242.
> Is there something wrong? 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2017-07-09 Thread fengchaoge (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fengchaoge updated SPARK-21337:
---
Attachment: test1.JPG
test.JPG

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
> Fix For: 2.1.1
>
> Attachments: test1.JPG, test.JPG
>
>
> When there are large 'case when' expressions in Spark SQL, the CodeGenerator 
> fails to compile the generated code. 
> The error message is followed by a huge dump of the generated source code, and 
> compilation ultimately fails.
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> apply_9$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
>  of class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  grows beyond 64 KB.
> It seems that SPARK-13242 solved this problem in Spark 1.6.2; however, it 
> appears again in Spark 2.1.1. 
> https://issues.apache.org/jira/browse/SPARK-13242.
> Is there something wrong? 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19659) Fetch big blocks to disk when shuffle-read

2017-07-09 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19659.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

The major work is done in 2.2.0, but it's disabled by default because it cannot 
talk to the old shuffle service.
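
For anyone wanting to try it out, a hedged sketch only: the gating knob appears to be
{{spark.reducer.maxReqSizeShuffleToMem}} (also listed under SPARK-21353 later in this
digest), and the threshold below is purely illustrative.

{code}
from pyspark.sql import SparkSession

# Illustrative threshold: shuffle blocks larger than ~200 MB are fetched to disk
# instead of being held in memory during shuffle read.
spark = (SparkSession.builder
         .appName("shuffle-fetch-to-disk-demo")
         .config("spark.reducer.maxReqSizeShuffleToMem", str(200 * 1024 * 1024))
         .getOrCreate())
{code}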

> Fetch big blocks to disk when shuffle-read
> --
>
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
> Fix For: 2.2.0
>
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
>
> Currently the whole block is fetched into memory (off-heap by default) during 
> shuffle read. A block is defined by (shuffleId, mapId, reduceId), so it can be 
> large in skew situations. If OOM happens during shuffle read, the job will be 
> killed and users will be notified to "Consider boosting 
> spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more 
> memory can resolve the OOM. However, this approach is not perfectly suitable 
> for production environments, especially for data warehouses.
> Using Spark SQL as the data engine in a warehouse, users hope to have a unified 
> parameter (e.g. memory) with less resource wasted (resource allocated but not 
> used).
> It's not always easy to predict skew situations; when they happen, it makes sense 
> to fetch remote blocks to disk for shuffle read rather than
> kill the job because of OOM. This approach was mentioned during the discussion 
> in SPARK-3019 by [~sandyr] and [~mridulm80].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB

2017-07-09 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079776#comment-16079776
 ] 

Hyukjin Kwon commented on SPARK-21337:
--

No, just copying and pasting the SQL without further investigation does not 
help. Also, I can't reproduce this with the steps you gave on master.
In such a case, I believe it should be narrowed down. This sounds like a request 
for investigation rather than a description of an issue.

> SQL which has large ‘case when’ expressions may cause code generation beyond 
> 64KB
> -
>
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
>Reporter: fengchaoge
> Fix For: 2.1.1
>
> Attachments: test1.JPG, test.JPG
>
>
> When there are large 'case when' expressions in Spark SQL, the CodeGenerator 
> fails to compile the generated code. 
> The error message is followed by a huge dump of the generated source code, and 
> compilation ultimately fails.
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.janino.JaninoRuntimeException: Code of method 
> apply_9$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
>  of class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  grows beyond 64 KB.
> It seems that SPARK-13242 solved this problem in Spark 1.6.2; however, it 
> appears again in Spark 2.1.1. 
> https://issues.apache.org/jira/browse/SPARK-13242.
> Is there something wrong? 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21355) JSON datasource failed to parse a value having newline in its value

2017-07-09 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-21355:


 Summary: JSON datasource failed to parse a value having newline in 
its value
 Key: SPARK-21355
 URL: https://issues.apache.org/jira/browse/SPARK-21355
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
Priority: Minor


I guess this is a rather corner case. I found this while testing SPARK-21289.
It looks like a bug in Jackson.

The code below fails to parse a newline in the value.

{code}
scala> spark.read.json(Seq("{ \"f\": \"a\nb\"}", "{ \"f\": 
\"abc\"}").toDS).show()
+---++
|_corrupt_record|   f|
+---++
|  { "f": "a
b"}|null|
|   null| abc|
+---++
{code}

Meaning this also does not work

with the JSON files as below:

{code}
{"f": "
d",  "f0": 3}
{code}


{code}
scala> spark.read.option("multiLine", true).json("tmp.json").show()
++
| _corrupt_record|
++
|{"f": "
d",  "f0"...|
++
{code}

Of course, the code below works:

{code}
scala> spark.read.json(Seq("{ \"f\": \"ab\"}", "{ \"f\": \"abc\"}").toDS).show()
+---+
|  f|
+---+
| ab|
|abc|
+---+
{code}





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21356) CSV datasource failed to parse a value having newline in its value

2017-07-09 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-21356:


 Summary: CSV datasource failed to parse a value having newline in 
its value
 Key: SPARK-21356
 URL: https://issues.apache.org/jira/browse/SPARK-21356
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
Priority: Trivial


This is related to SPARK-21355. I guess this is also a rather corner case. I 
found this while testing SPARK-21289.

It looks like a bug in Univocity.

The code below fails to parse a newline in the value.

{code}
scala> spark.read.csv(Seq("a\nb", "abc").toDS).show()
+---+
|_c0|
+---+
|  a|
|abc|
+---+
{code}

But it can easily be worked around with quotes, as below:

{code}
scala> spark.read.csv(Seq("\"a\nb\"", "abc").toDS).show()
+---+
|_c0|
+---+
|a
b|
|abc|
+---+
{code}

Meaning this works:

with the file below:

{code}
"a
b",abc
{code}


{code}
scala> spark.read.option("multiLine", true).csv("tmp.csv").show()
+---+---+
|_c0|_c1|
+---+---+
|a
b|abc|
+---+---+
{code}
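
For completeness, a hedged PySpark equivalent of the workaround above (the file path
is illustrative; it assumes an active SparkSession {{spark}}):

{code}
# Write a CSV file whose first value contains a quoted embedded newline,
# then read it back with the multiLine option.
with open("/tmp/newline.csv", "w") as f:
    f.write('"a\nb",abc\n')

spark.read.option("multiLine", True).csv("/tmp/newline.csv").show()
{code}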



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21354) INPUT FILE related functions do not support more than one sources

2017-07-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21354:


Assignee: Xiao Li  (was: Apache Spark)

> INPUT FILE related functions do not support more than one sources
> -
>
> Key: SPARK-21354
> URL: https://issues.apache.org/jira/browse/SPARK-21354
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {noformat}
> hive> select *, INPUT__FILE__NAME FROM t1, t2;
> FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One 
> Tables/Subqueries
> {noformat}
> The built-in functions {{input_file_name}}, {{input_file_block_start}}, and 
> {{input_file_block_length}} do not support more than one source, just as in 
> Hive. Currently, we do not block it and the outputs are ambiguous.
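
A hedged illustration of that ambiguity ({{t1}} and {{t2}} are hypothetical file-backed
tables; it assumes an active SparkSession {{spark}}):

{code}
from pyspark.sql import functions as F

# With two file-backed sources in one query, it is unclear which file
# input_file_name() should report, hence the ambiguous output.
joined = spark.table("t1").crossJoin(spark.table("t2"))
joined.select("*", F.input_file_name().alias("input_file")).show(truncate=False)
{code}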



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21354) INPUT FILE related functions do not support more than one sources

2017-07-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079743#comment-16079743
 ] 

Apache Spark commented on SPARK-21354:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/18580

> INPUT FILE related functions do not support more than one sources
> -
>
> Key: SPARK-21354
> URL: https://issues.apache.org/jira/browse/SPARK-21354
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {noformat}
> hive> select *, INPUT__FILE__NAME FROM t1, t2;
> FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One 
> Tables/Subqueries
> {noformat}
> The built-in functions {{input_file_name}}, {{input_file_block_start}}, and 
> {{input_file_block_length}} do not support more than one source, just as in 
> Hive. Currently, we do not block it and the outputs are ambiguous.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-07-09 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079751#comment-16079751
 ] 

Dongjoon Hyun edited comment on SPARK-21349 at 7/9/17 11:29 PM:


This issue is not about blindly raising the threshold, 100K. The default value 
will be the same for all users. What I mean is that tasks have become bigger 
than they were 3 years ago. Currently, it complains more frequently and misleadingly. 
For some apps, it always complains, so it becomes a meaningless warning, too.

Is there any reason or criterion behind deciding on 100K as the threshold at that 
time, three years ago? Then, at least, can we reevaluate the threshold?


was (Author: dongjoon):
This issue is not about blindly raising the threshold, 100K. The default value 
will be the same for all users. What I mean is that tasks have become bigger 
than they were 3 years ago. Currently, it complains more frequently and misleadingly.

Is there any reason or criterion behind deciding on 100K as the threshold at that 
time, three years ago? Then, at least, can we reevaluate the threshold?

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, 
> SPARK-2185. Although this is just a warning message, this issue tries to make 
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.
> According to the Jenkins log, we also have 123 warnings even in our unit test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-07-09 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079751#comment-16079751
 ] 

Dongjoon Hyun edited comment on SPARK-21349 at 7/9/17 11:28 PM:


This issue is not about blindly raising the threshold, 100K. The default value 
will be the same for all users. What I mean is that tasks have become bigger 
than they were 3 years ago. Currently, it complains more frequently and misleadingly.

Is there any reason or criterion behind deciding on 100K as the threshold at that 
time, three years ago? Then, at least, can we reevaluate the threshold?


was (Author: dongjoon):
This issue is not about blindly raising the threshold, 100K. The default value 
will be the same for all users. What I mean is that tasks have become bigger 
than they were 3 years ago. Currently, it complains more frequently and misleadingly.

Is there any reason or criterion behind deciding on 100K as the threshold at that 
time, three years ago? Then, at least, can we evaluate the threshold?

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, 
> SPARK-2185. Although this is just a warning message, this issue tries to make 
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.
> According to the Jenkins log, we also have 123 warnings even in our unit test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-07-09 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079751#comment-16079751
 ] 

Dongjoon Hyun commented on SPARK-21349:
---

This issue is not about blindly raising the threshold, 100K. The default value 
will be the same for all users. What I mean is that tasks have become bigger 
than they were 3 years ago. Currently, it complains more frequently and misleadingly.

Is there any reason or criterion behind deciding on 100K as the threshold at that 
time, three years ago? Then, at least, can we evaluate the threshold?

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, 
> SPARK-2185. Although this is just a warning message, this issue tries to make 
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.
> According to the Jenkins log, we also have 123 warnings even in our unit test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-07-09 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079746#comment-16079746
 ] 

Kay Ousterhout commented on SPARK-21349:


Does that mean we should just raise this threshold for all users?

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, 
> SPARK-2185. Although this is just a warning message, this issue tries to make 
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.
> According to the Jenkins log, we also have 123 warnings even in our unit test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21354) INPUT FILE related functions do not support more than one sources

2017-07-09 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21354:

Description: 
{noformat}
hive> select *, INPUT__FILE__NAME FROM t1, t2;
FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One 
Tables/Subqueries
{noformat}

The built-in functions {{input_file_name}}, {{input_file_block_start}}, and 
{{input_file_block_length}} do not support more than one source, just as in 
Hive. Currently, we do not block it and the outputs are ambiguous.

  was:
{noformat}
hive> select *, INPUT__FILE__NAME FROM t1, t2;
FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One 
Tables/Subqueries
{noformat}

The built-in functions {{input_file_name}}, {{input_file_block_start}}, and 
{{input_file_block_length}} do not support more than one source, just as in 
Hive


> INPUT FILE related functions do not support more than one sources
> -
>
> Key: SPARK-21354
> URL: https://issues.apache.org/jira/browse/SPARK-21354
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> {noformat}
> hive> select *, INPUT__FILE__NAME FROM t1, t2;
> FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One 
> Tables/Subqueries
> {noformat}
> The built-in functions {{input_file_name}}, {{input_file_block_start}}, and 
> {{input_file_block_length}} do not support more than one source, just as in 
> Hive. Currently, we do not block it and the outputs are ambiguous.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21354) INPUT FILE related functions do not support more than one sources

2017-07-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21354:


Assignee: Apache Spark  (was: Xiao Li)

> INPUT FILE related functions do not support more than one sources
> -
>
> Key: SPARK-21354
> URL: https://issues.apache.org/jira/browse/SPARK-21354
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> {noformat}
> hive> select *, INPUT__FILE__NAME FROM t1, t2;
> FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One 
> Tables/Subqueries
> {noformat}
> The built-in functions {{input_file_name}}, {{input_file_block_start}}, and 
> {{input_file_block_length}} do not support more than one source, just as in 
> Hive. Currently, we do not block it and the outputs are ambiguous.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21354) INPUT FILE related functions do not support more than one sources

2017-07-09 Thread Xiao Li (JIRA)
Xiao Li created SPARK-21354:
---

 Summary: INPUT FILE related functions do not support more than one 
sources
 Key: SPARK-21354
 URL: https://issues.apache.org/jira/browse/SPARK-21354
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 2.0.2, 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li


{noformat}
hive> select *, INPUT__FILE__NAME FROM t1, t2;
FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One 
Tables/Subqueries
{noformat}

The built-in functions {{input_file_name}}, {{input_file_block_start}}, and 
{{input_file_block_length}} do not support more than one source, just as in 
Hive



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-07-09 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21349:
--
Description: 
Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, 
SPARK-2185. Although this is just a warning message, this issue tries to make 
`TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.

According to the Jenkins log, we also have 123 warnings even in our unit test.

  was:Since Spark 1.1.0, Spark emits warning when task size exceeds a 
threshold, SPARK-2185. Although this is just a warning message, this issue 
tries to make `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for 
advanced users.


> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, 
> SPARK-2185. Although this is just a warning message, this issue tries to make 
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.
> According to the Jenkins log, we also have 123 warnings even in our unit test.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-07-09 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079728#comment-16079728
 ] 

Dongjoon Hyun commented on SPARK-21349:
---

Thank you for the advice, [~kayousterhout]!
For usability: we are building more complex tasks (like Spark SQL) than we did 3 years 
ago, and this now always happens for certain user apps.
For those apps, it would be great if we could configure this. In addition, we can make 
this an `internal` configuration instead of a constant.

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, 
> SPARK-2185. Although this is just a warning message, this issue tries to make 
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-07-09 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079724#comment-16079724
 ] 

Kay Ousterhout commented on SPARK-21349:


Is this a major usability issue (and what's the use case where task sizes are 
regularly > 100KB)?  I'm hesitant to make this a configuration parameter -- 
Spark already has a huge number of configuration parameters, making it hard for 
users to figure out which ones are relevant for them.

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, 
> SPARK-2185. Although this is just a warning message, this issue tries to make 
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable

2017-07-09 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-21349:
--
Description: Since Spark 1.1.0, Spark emits warning when task size exceeds 
a threshold, SPARK-2185. Although this is just a warning message, this issue 
tries to make `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for 
advanced users.  (was: Although this is just a warning message, this issue 
tries to make `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for 
advanced users.)

> Make TASK_SIZE_TO_WARN_KB configurable
> --
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, 
> SPARK-2185. Although this is just a warning message, this issue tries to make 
> `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21353) add checkValue in spark.internal.config about how to correctly set configurations

2017-07-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079633#comment-16079633
 ] 

Apache Spark commented on SPARK-21353:
--

User 'heary-cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/18555

> add checkValue in spark.internal.config about how to correctly set 
> configurations
> -
>
> Key: SPARK-21353
> URL: https://issues.apache.org/jira/browse/SPARK-21353
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>Priority: Minor
>
> add checkValue for configurations:
> spark.driver.memory
> spark.executor.memory
> spark.blockManager.port
> spark.task.cpus
> spark.dynamicAllocation.minExecutors
> spark.task.maxFailures
> spark.blacklist.task.maxTaskAttemptsPerExecutor
> spark.blacklist.task.maxTaskAttemptsPerNode
> spark.blacklist.application.maxFailedTasksPerExecutor
> spark.blacklist.stage.maxFailedTasksPerExecutor
> spark.blacklist.application.maxFailedExecutorsPerNode
> spark.blacklist.stage.maxFailedExecutorsPerNode
> spark.scheduler.listenerbus.metrics.maxListenerClassesTimed
> spark.ui.retainedTasks
> spark.blockManager.port
> spark.driver.blockManager.port
> spark.files.maxPartitionBytes
> spark.files.openCostInBytes
> spark.shuffle.accurateBlockThreshold
> spark.shuffle.registration.timeout
> spark.shuffle.registration.maxAttempts
> spark.reducer.maxReqSizeShuffleToMem
> and we copy the document from 
> http://spark.apache.org/docs/latest/configuration.html
> spark.driver.userClassPathFirst
> spark.driver.memory
> spark.executor.userClassPathFirst
> spark.executor.memory
> spark.yarn.isPython
> spark.task.cpus
> spark.dynamicAllocation.minExecutors
> spark.dynamicAllocation.initialExecutors
> spark.dynamicAllocation.maxExecutors
> spark.shuffle.service.enabled
> spark.submit.pyFiles
> spark.task.maxFailures
> spark.blacklist.enabled
> spark.blacklist.task.maxTaskAttemptsPerExecutor
> spark.blacklist.task.maxTaskAttemptsPerNode
> spark.blacklist.stage.maxFailedTasksPerExecutor
> spark.blacklist.stage.maxFailedExecutorsPerNode
> spark.pyspark.driver.python
> spark.pyspark.python
> spark.ui.retainedTasks
> spark.io.encryption.enabled
> spark.io.encryption.keygen.algorithm
> spark.io.encryption.keySizeBits
> spark.authenticate.enableSaslEncryption
> spark.authenticate



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21353) add checkValue in spark.internal.config about how to correctly set configurations

2017-07-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21353:


Assignee: Apache Spark

> add checkValue in spark.internal.config about how to correctly set 
> configurations
> -
>
> Key: SPARK-21353
> URL: https://issues.apache.org/jira/browse/SPARK-21353
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>Assignee: Apache Spark
>Priority: Minor
>
> add checkValue for configurations:
> spark.driver.memory
> spark.executor.memory
> spark.blockManager.port
> spark.task.cpus
> spark.dynamicAllocation.minExecutors
> spark.task.maxFailures
> spark.blacklist.task.maxTaskAttemptsPerExecutor
> spark.blacklist.task.maxTaskAttemptsPerNode
> spark.blacklist.application.maxFailedTasksPerExecutor
> spark.blacklist.stage.maxFailedTasksPerExecutor
> spark.blacklist.application.maxFailedExecutorsPerNode
> spark.blacklist.stage.maxFailedExecutorsPerNode
> spark.scheduler.listenerbus.metrics.maxListenerClassesTimed
> spark.ui.retainedTasks
> spark.blockManager.port
> spark.driver.blockManager.port
> spark.files.maxPartitionBytes
> spark.files.openCostInBytes
> spark.shuffle.accurateBlockThreshold
> spark.shuffle.registration.timeout
> spark.shuffle.registration.maxAttempts
> spark.reducer.maxReqSizeShuffleToMem
> and copy the documentation for the following configurations from 
> http://spark.apache.org/docs/latest/configuration.html:
> spark.driver.userClassPathFirst
> spark.driver.memory
> spark.executor.userClassPathFirst
> spark.executor.memory
> spark.yarn.isPython
> spark.task.cpus
> spark.dynamicAllocation.minExecutors
> spark.dynamicAllocation.initialExecutors
> spark.dynamicAllocation.maxExecutors
> spark.shuffle.service.enabled
> spark.submit.pyFiles
> spark.task.maxFailures
> spark.blacklist.enabled
> spark.blacklist.task.maxTaskAttemptsPerExecutor
> spark.blacklist.task.maxTaskAttemptsPerNode
> spark.blacklist.stage.maxFailedTasksPerExecutor
> spark.blacklist.stage.maxFailedExecutorsPerNode
> spark.pyspark.driver.python
> spark.pyspark.python
> spark.ui.retainedTasks
> spark.io.encryption.enabled
> spark.io.encryption.keygen.algorithm
> spark.io.encryption.keySizeBits
> spark.authenticate.enableSaslEncryption
> spark.authenticate



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21353) add checkValue in spark.internal.config about how to correctly set configurations

2017-07-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21353:


Assignee: (was: Apache Spark)

> add checkValue in spark.internal.config about how to correctly set 
> configurations
> -
>
> Key: SPARK-21353
> URL: https://issues.apache.org/jira/browse/SPARK-21353
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: caoxuewen
>Priority: Minor
>
> add checkValue for configurations:
> spark.driver.memory
> spark.executor.memory
> spark.blockManager.port
> spark.task.cpus
> spark.dynamicAllocation.minExecutors
> spark.task.maxFailures
> spark.blacklist.task.maxTaskAttemptsPerExecutor
> spark.blacklist.task.maxTaskAttemptsPerNode
> spark.blacklist.application.maxFailedTasksPerExecutor
> spark.blacklist.stage.maxFailedTasksPerExecutor
> spark.blacklist.application.maxFailedExecutorsPerNode
> spark.blacklist.stage.maxFailedExecutorsPerNode
> spark.scheduler.listenerbus.metrics.maxListenerClassesTimed
> spark.ui.retainedTasks
> spark.blockManager.port
> spark.driver.blockManager.port
> spark.files.maxPartitionBytes
> spark.files.openCostInBytes
> spark.shuffle.accurateBlockThreshold
> spark.shuffle.registration.timeout
> spark.shuffle.registration.maxAttempts
> spark.reducer.maxReqSizeShuffleToMem
> and copy the documentation for the following configurations from 
> http://spark.apache.org/docs/latest/configuration.html:
> spark.driver.userClassPathFirst
> spark.driver.memory
> spark.executor.userClassPathFirst
> spark.executor.memory
> spark.yarn.isPython
> spark.task.cpus
> spark.dynamicAllocation.minExecutors
> spark.dynamicAllocation.initialExecutors
> spark.dynamicAllocation.maxExecutors
> spark.shuffle.service.enabled
> spark.submit.pyFiles
> spark.task.maxFailures
> spark.blacklist.enabled
> spark.blacklist.task.maxTaskAttemptsPerExecutor
> spark.blacklist.task.maxTaskAttemptsPerNode
> spark.blacklist.stage.maxFailedTasksPerExecutor
> spark.blacklist.stage.maxFailedExecutorsPerNode
> spark.pyspark.driver.python
> spark.pyspark.python
> spark.ui.retainedTasks
> spark.io.encryption.enabled
> spark.io.encryption.keygen.algorithm
> spark.io.encryption.keySizeBits
> spark.authenticate.enableSaslEncryption
> spark.authenticate



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21353) add checkValue in spark.internal.config about how to correctly set configurations

2017-07-09 Thread caoxuewen (JIRA)
caoxuewen created SPARK-21353:
-

 Summary: add checkValue in spark.internal.config about how to 
correctly set configurations
 Key: SPARK-21353
 URL: https://issues.apache.org/jira/browse/SPARK-21353
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: caoxuewen
Priority: Minor


add checkValue for the following configurations (see the sketch after the lists below):

spark.driver.memory
spark.executor.memory
spark.blockManager.port
spark.task.cpus
spark.dynamicAllocation.minExecutors
spark.task.maxFailures
spark.blacklist.task.maxTaskAttemptsPerExecutor
spark.blacklist.task.maxTaskAttemptsPerNode
spark.blacklist.application.maxFailedTasksPerExecutor
spark.blacklist.stage.maxFailedTasksPerExecutor
spark.blacklist.application.maxFailedExecutorsPerNode
spark.blacklist.stage.maxFailedExecutorsPerNode
spark.scheduler.listenerbus.metrics.maxListenerClassesTimed
spark.ui.retainedTasks
spark.blockManager.port
spark.driver.blockManager.port
spark.files.maxPartitionBytes
spark.files.openCostInBytes
spark.shuffle.accurateBlockThreshold
spark.shuffle.registration.timeout
spark.shuffle.registration.maxAttempts
spark.reducer.maxReqSizeShuffleToMem
and copy the documentation for the following configurations from 
http://spark.apache.org/docs/latest/configuration.html:

spark.driver.userClassPathFirst
spark.driver.memory
spark.executor.userClassPathFirst
spark.executor.memory
spark.yarn.isPython
spark.task.cpus
spark.dynamicAllocation.minExecutors
spark.dynamicAllocation.initialExecutors
spark.dynamicAllocation.maxExecutors
spark.shuffle.service.enabled
spark.submit.pyFiles
spark.task.maxFailures
spark.blacklist.enabled
spark.blacklist.task.maxTaskAttemptsPerExecutor
spark.blacklist.task.maxTaskAttemptsPerNode
spark.blacklist.stage.maxFailedTasksPerExecutor
spark.blacklist.stage.maxFailedExecutorsPerNode
spark.pyspark.driver.python
spark.pyspark.python
spark.ui.retainedTasks
spark.io.encryption.enabled
spark.io.encryption.keygen.algorithm
spark.io.encryption.keySizeBits
spark.authenticate.enableSaslEncryption
spark.authenticate
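
For illustration, a minimal sketch of what one such bound could look like with ConfigBuilder's checkValue, as it might appear inside org.apache.spark.internal.config's package object; the chosen entry, bound, and message are assumptions, not the proposed patch:

{code}
// Hypothetical sketch only: reject non-positive values for spark.task.cpus at
// definition time instead of letting a bad value surface later in the scheduler.
private[spark] val CPUS_PER_TASK = ConfigBuilder("spark.task.cpus")
  .doc("Number of cores to allocate for each task.")
  .intConf
  .checkValue(cpus => cpus > 0,
    "The number of cores per task must be a positive integer.")
  .createWithDefault(1)
{code}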



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21352) Memory Usage in Spark Streaming

2017-07-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-21352:
---

> Memory Usage in Spark Streaming
> ---
>
> Key: SPARK-21352
> URL: https://issues.apache.org/jira/browse/SPARK-21352
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Spark Submit, YARN
>Affects Versions: 2.1.1
>Reporter: Shubham Gupta
>  Labels: newbie
>
> I am trying to figure out the memory used by executors for a Spark Streaming 
> job. For data I am using the rest endpoint for Spark AllExecutors and just 
> summing up the metrics totalDuration * spark.executor.memory for every 
> executor and then emitting the final sum as the memory usage.
> But this comes out to be very small for an application that ran the whole day; 
> is something wrong with the logic? I am also using dynamic allocation, and 
> executorIdleTimeout is 5 seconds.
> I am also assuming that if an executor was removed due to the idle timeout and 
> then allocated to some other task, its totalDuration will be increased by the 
> time the executor took to execute this new task.
> https://stackoverflow.com/questions/44995212/spark-streaming-memory-usage-doubts



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21352) Memory Usage in Spark Streaming

2017-07-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-21352.
-

> Memory Usage in Spark Streaming
> ---
>
> Key: SPARK-21352
> URL: https://issues.apache.org/jira/browse/SPARK-21352
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Spark Submit, YARN
>Affects Versions: 2.1.1
>Reporter: Shubham Gupta
>  Labels: newbie
>
> I am trying to figure out the memory used by executors for a Spark Streaming 
> job. For data I am using the rest endpoint for Spark AllExecutors and just 
> summing up the metrics totalDuration * spark.executor.memory for every 
> executor and then emitting the final sum as the memory usage.
> But this comes out to be very small for an application that ran the whole day; 
> is something wrong with the logic? I am also using dynamic allocation, and 
> executorIdleTimeout is 5 seconds.
> I am also assuming that if an executor was removed due to the idle timeout and 
> then allocated to some other task, its totalDuration will be increased by the 
> time the executor took to execute this new task.
> https://stackoverflow.com/questions/44995212/spark-streaming-memory-usage-doubts



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21352) Memory Usage in Spark Streaming

2017-07-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21352.
---
Resolution: Invalid

> Memory Usage in Spark Streaming
> ---
>
> Key: SPARK-21352
> URL: https://issues.apache.org/jira/browse/SPARK-21352
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Spark Submit, YARN
>Affects Versions: 2.1.1
>Reporter: Shubham Gupta
>  Labels: newbie
>
> I am trying to figure out the memory used by executors for a Spark Streaming 
> job. For data I am using the rest endpoint for Spark AllExecutors and just 
> summing up the metrics totalDuration * spark.executor.memory for every 
> executor and then emitting the final sum as the memory usage.
> But this comes out to be very small for an application that ran the whole day; 
> is something wrong with the logic? I am also using dynamic allocation, and 
> executorIdleTimeout is 5 seconds.
> I am also assuming that if an executor was removed due to the idle timeout and 
> then allocated to some other task, its totalDuration will be increased by the 
> time the executor took to execute this new task.
> https://stackoverflow.com/questions/44995212/spark-streaming-memory-usage-doubts



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21352) Memory Usage in Spark Streaming

2017-07-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21352.
---
Resolution: Fixed

That does not imply this is the right place. This is for proposing changes and 
diagnosing specific bugs. Do not reopen JIRAs.

> Memory Usage in Spark Streaming
> ---
>
> Key: SPARK-21352
> URL: https://issues.apache.org/jira/browse/SPARK-21352
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Spark Submit, YARN
>Affects Versions: 2.1.1
>Reporter: Shubham Gupta
>  Labels: newbie
>
> I am trying to figure out the memory used by executors for a Spark Streaming 
> job. For data I am using the rest endpoint for Spark AllExecutors and just 
> summing up the metrics totalDuration * spark.executor.memory for every 
> executor and then emitting the final sum as the memory usage.
> But this comes out to be very small for an application that ran the whole day; 
> is something wrong with the logic? I am also using dynamic allocation, and 
> executorIdleTimeout is 5 seconds.
> I am also assuming that if an executor was removed due to the idle timeout and 
> then allocated to some other task, its totalDuration will be increased by the 
> time the executor took to execute this new task.
> https://stackoverflow.com/questions/44995212/spark-streaming-memory-usage-doubts



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21352) Memory Usage in Spark Streaming

2017-07-09 Thread Shubham Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Gupta updated SPARK-21352:
--
Description: 
I am trying to figure out the memory used by executors for a Spark Streaming 
job. For data I am using the rest endpoint for Spark AllExecutors and just 
summing up the metrics totalDuration * spark.executor.memory for every executor 
and then emitting the final sum as the memory usage.

But this comes out to be very small for an application that ran the whole day; 
is something wrong with the logic? I am also using dynamic allocation, and 
executorIdleTimeout is 5 seconds.

I am also assuming that if an executor was removed due to the idle timeout and 
then allocated to some other task, its totalDuration will be increased by the 
time the executor took to execute this new task.

https://stackoverflow.com/questions/44995212/spark-streaming-memory-usage-doubts

  was:
I am trying to figure out the memory used by executors for a Spark Streaming 
job. For data I am using the rest endpoint for Spark AllExecutors and just 
summing up the metrics totalDuration * spark.executor.memory for every executor 
and then emitting the final sum as the memory usage.

But this comes out to be very small for an application that ran the whole day; 
is something wrong with the logic? I am also using dynamic allocation, and 
executorIdleTimeout is 5 seconds.

I am also assuming that if an executor was removed due to the idle timeout and 
then allocated to some other task, its totalDuration will be increased by the 
time the executor took to execute this new task.


> Memory Usage in Spark Streaming
> ---
>
> Key: SPARK-21352
> URL: https://issues.apache.org/jira/browse/SPARK-21352
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Spark Submit, YARN
>Affects Versions: 2.1.1
>Reporter: Shubham Gupta
>  Labels: newbie
>
> I am trying to figure out the memory used by executors for a Spark Streaming 
> job. For data I am using the rest endpoint for Spark AllExecutors and just 
> summing up the metrics totalDuration * spark.executor.memory for every 
> executor and then emitting the final sum as the memory usage.
> But this comes out to be very small for an application that ran the whole day; 
> is something wrong with the logic? I am also using dynamic allocation, and 
> executorIdleTimeout is 5 seconds.
> I am also assuming that if an executor was removed due to the idle timeout and 
> then allocated to some other task, its totalDuration will be increased by the 
> time the executor took to execute this new task.
> https://stackoverflow.com/questions/44995212/spark-streaming-memory-usage-doubts



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21352) Memory Usage in Spark Streaming

2017-07-09 Thread Shubham Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Gupta reopened SPARK-21352:
---

No solution has been provided for the problem, and Stack Overflow is not helping either.


> Memory Usage in Spark Streaming
> ---
>
> Key: SPARK-21352
> URL: https://issues.apache.org/jira/browse/SPARK-21352
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Spark Submit, YARN
>Affects Versions: 2.1.1
>Reporter: Shubham Gupta
>  Labels: newbie
>
> I am trying to figure out the memory used by executors for a Spark Streaming 
> job. For data I am using the rest endpoint for Spark AllExecutors and just 
> summing up the metrics totalDuration * spark.executor.memory for every 
> executor and then emitting the final sum as the memory usage.
> But this comes out to be very small for an application that ran the whole day; 
> is something wrong with the logic? I am also using dynamic allocation, and 
> executorIdleTimeout is 5 seconds.
> I am also assuming that if an executor was removed due to the idle timeout and 
> then allocated to some other task, its totalDuration will be increased by the 
> time the executor took to execute this new task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21083) Store zero size and row count after analyzing empty table

2017-07-09 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-21083:

Fix Version/s: 2.2.1

> Store zero size and row count after analyzing empty table
> -
>
> Key: SPARK-21083
> URL: https://issues.apache.org/jira/browse/SPARK-21083
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.2.1, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset

2017-07-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079543#comment-16079543
 ] 

Apache Spark commented on SPARK-18016:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18579

> Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
> -
>
> Key: SPARK-18016
> URL: https://issues.apache.org/jira/browse/SPARK-18016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Aleksander Eskilson
>Assignee: Aleksander Eskilson
> Fix For: 2.3.0
>
>
> When attempting to encode collections of large Java objects to Datasets 
> having very wide or deeply nested schemas, code generation can fail, yielding:
> {code}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for 
> class 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection
>  has grown past JVM limit of 0x
>   at 
> org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439)
>   at 
> org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358)
>   at 
> org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894)
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420)
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369)
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345)
>   at 
> 
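
As a rough illustration of the failure mode rather than a case taken from this ticket, a projection over a few thousand generated columns can push the generated code toward this limit; the column count and local-mode setup below are assumptions:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object WideSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("wide-schema-sketch")
      .getOrCreate()
    // Assumption: ~4000 generated columns is wide enough to stress the
    // constant pool of the generated projection on affected versions.
    val cols = (0 until 4000).map(i => lit(i).as(s"c$i"))
    val wide = spark.range(1).select(cols: _*)
    wide.collect() // forces codegen for the wide projection
    spark.stop()
  }
}
{code}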

[jira] [Resolved] (SPARK-21352) Memory Usage in Spark Streaming

2017-07-09 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21352.
---
Resolution: Invalid

Please point questions to StackOverflow or the mailing list.

> Memory Usage in Spark Streaming
> ---
>
> Key: SPARK-21352
> URL: https://issues.apache.org/jira/browse/SPARK-21352
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Spark Submit, YARN
>Affects Versions: 2.1.1
>Reporter: Shubham Gupta
>  Labels: newbie
>
> I am trying to figure out the memory used by executors for a Spark Streaming 
> job. For data I am using the rest endpoint for Spark AllExecutors and just 
> summing up the metrics totalDuration * spark.executor.memory for every 
> executor and then emitting the final sum as the memory usage.
> But this comes out to be very small for an application that ran the whole day; 
> is something wrong with the logic? I am also using dynamic allocation, and 
> executorIdleTimeout is 5 seconds.
> I am also assuming that if an executor was removed due to the idle timeout and 
> then allocated to some other task, its totalDuration will be increased by the 
> time the executor took to execute this new task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21332) Incorrect result type inferred for some decimal expressions

2017-07-09 Thread Anton Okolnychyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079525#comment-16079525
 ] 

Anton Okolnychyi commented on SPARK-21332:
--

I know the root cause and will submit a PR soon.

> Incorrect result type inferred for some decimal expressions
> ---
>
> Key: SPARK-21332
> URL: https://issues.apache.org/jira/browse/SPARK-21332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: Alexander Shkapsky
>
> Decimal expressions do not always follow the type inference rules explained 
> in DecimalPrecision.scala. An incorrect result type is produced when the 
> expression contains more than 2 decimals.
> For example:
> spark-sql> CREATE TABLE Decimals(decimal_26_6 DECIMAL(26,6)); 
> ...
> spark-sql> describe decimals;
> ...
> decimal_26_6  decimal(26,6)   NULL
> spark-sql> explain select decimal_26_6 * decimal_26_6 from decimals;
> ...
> == Physical Plan ==
> *Project [CheckOverflow((decimal_26_6#99 * decimal_26_6#99), 
> DecimalType(38,12)) AS (decimal_26_6 * decimal_26_6)#100]
> +- HiveTableScan [decimal_26_6#99], MetastoreRelation default, decimals
> However:
> spark-sql> explain select decimal_26_6 * decimal_26_6 * decimal_26_6 from 
> decimals;
> ...
> == Physical Plan ==
> *Project [CheckOverflow((cast(CheckOverflow((decimal_26_6#104 * 
> decimal_26_6#104), DecimalType(38,12)) as decimal(26,6)) * decimal_26_6#104), 
> DecimalType(38,12)) AS ((decimal_26_6 * decimal_26_6) * decimal_26_6)#105]
> +- HiveTableScan [decimal_26_6#104], MetastoreRelation default, decimals
> The expected result type is DecimalType(38,18).
> In Hive 1.1.0:
> hive> explain select decimal_26_6 * decimal_26_6 from decimals;
> OK
> STAGE DEPENDENCIES:
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-0
> Fetch Operator
>   limit: -1
>   Processor Tree:
> TableScan
>   alias: decimals
>   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
> stats: NONE
>   Select Operator
> expressions: (decimal_26_6 * decimal_26_6) (type: decimal(38,12))
> outputColumnNames: _col0
> Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
> stats: NONE
> ListSink
> Time taken: 0.772 seconds, Fetched: 17 row(s)
> hive> explain select decimal_26_6 * decimal_26_6 * decimal_26_6 from decimals;
> OK
> STAGE DEPENDENCIES:
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-0
> Fetch Operator
>   limit: -1
>   Processor Tree:
> TableScan
>   alias: decimals
>   Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
> stats: NONE
>   Select Operator
> expressions: ((decimal_26_6 * decimal_26_6) * decimal_26_6) 
> (type: decimal(38,18))
> outputColumnNames: _col0
> Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column 
> stats: NONE
> ListSink
> Time taken: 0.064 seconds, Fetched: 17 row(s)
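
A short worked derivation, assuming the usual multiplication rule d(p1,s1) * d(p2,s2) => d(p1+p2+1, s1+s2) bounded to a maximum precision of 38, shows where the expected (38,18) comes from and where Spark diverges:

{code}
Step 1: decimal(26,6)  * decimal(26,6) => decimal(53,12) => bounded to decimal(38,12)
Step 2: decimal(38,12) * decimal(26,6) => decimal(65,18) => bounded to decimal(38,18)   <- expected

Spark casts the Step 1 intermediate back to decimal(26,6) before Step 2,
which is why the final CheckOverflow reports DecimalType(38,12) instead of (38,18).
{code}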



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21352) Memory Usage in Spark Streaming

2017-07-09 Thread Shubham Gupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubham Gupta updated SPARK-21352:
--
Issue Type: Improvement  (was: Bug)

> Memory Usage in Spark Streaming
> ---
>
> Key: SPARK-21352
> URL: https://issues.apache.org/jira/browse/SPARK-21352
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, Spark Submit, YARN
>Affects Versions: 2.1.1
>Reporter: Shubham Gupta
>  Labels: newbie
>
> I am trying to figure out the memory used by executors for a Spark Streaming 
> job. For data I am using the rest endpoint for Spark AllExecutors and just 
> summing up the metrics totalDuration * spark.executor.memory for every 
> executor and then emitting the final sum as the memory usage.
> But this comes out to be very small for an application that ran the whole day; 
> is something wrong with the logic? I am also using dynamic allocation, and 
> executorIdleTimeout is 5 seconds.
> I am also assuming that if an executor was removed due to the idle timeout and 
> then allocated to some other task, its totalDuration will be increased by the 
> time the executor took to execute this new task.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21352) Memory Usage in Spark Streaming

2017-07-09 Thread Shubham Gupta (JIRA)
Shubham Gupta created SPARK-21352:
-

 Summary: Memory Usage in Spark Streaming
 Key: SPARK-21352
 URL: https://issues.apache.org/jira/browse/SPARK-21352
 Project: Spark
  Issue Type: Bug
  Components: DStreams, Spark Submit, YARN
Affects Versions: 2.1.1
Reporter: Shubham Gupta


I am trying to figure out the memory used by executors for a Spark Streaming 
job. For data I am using the rest endpoint for Spark AllExecutors and just 
summing up the metrics totalDuration * spark.executor.memory for every executor 
and then emitting the final sum as the memory usage.

But this comes out to be very small for an application that ran the whole day; 
is something wrong with the logic? I am also using dynamic allocation, and 
executorIdleTimeout is 5 seconds.

I am also assuming that if an executor was removed due to the idle timeout and 
then allocated to some other task, its totalDuration will be increased by the 
time the executor took to execute this new task.
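
For reference, a rough sketch of the computation described above, assuming a fixed executor size and the standard monitoring REST endpoint on port 4040; the host, application-id handling, and regex-based field extraction are illustrative only:

{code}
import scala.io.Source

object ExecutorMemoryUsageSketch {
  def main(args: Array[String]): Unit = {
    val appId = args(0)                               // id of the running application
    val executorMemoryBytes = 4L * 1024 * 1024 * 1024 // assumes spark.executor.memory=4g
    val url = s"http://localhost:4040/api/v1/applications/$appId/allexecutors"

    val json = Source.fromURL(url).mkString
    // crude extraction of every "totalDuration" value (milliseconds) from the JSON
    val totalDurationMs = """"totalDuration"\s*:\s*(\d+)""".r
      .findAllMatchIn(json)
      .map(_.group(1).toLong)
      .sum

    // the estimate described in the issue: summed duration times executor memory
    println(s"approx usage: ${totalDurationMs * executorMemoryBytes} byte-milliseconds")
  }
}
{code}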



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel

2017-07-09 Thread 颜发才

[ 
https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079507#comment-16079507
 ] 

Yan Facai (颜发才) commented on SPARK-21341:
-

Yes, [~sowen] is right. Why not use the save and load methods?

> Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel 
> -
>
> Key: SPARK-21341
> URL: https://issues.apache.org/jira/browse/SPARK-21341
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Zied Sellami
>
> I am using sparkContext.saveAsObjectFile to save a complex object containing a 
> pipelineModel with a Word2Vec ML Transformer. When I load the object and call 
> myPipelineModel.transform, Word2VecModel raises a null pointer error at line 
> 292 of Word2Vec.scala ("wordVectors.getVectors"). I resolved the problem by 
> removing the @transient annotation on val wordVectors and the @transient lazy 
> val on the getVectors function.
> - Why are these 2 vals transient?
> - Is there a way to add a boolean option on the Word2Vec Transformer to force 
> the serialization of wordVectors?
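
Following up on the comment above, a minimal sketch of the suggested route: persist the fitted pipeline with ML's own writer instead of sparkContext.saveAsObjectFile, so transient fields are rebuilt on load. The path and DataFrame names are placeholders:

{code}
import org.apache.spark.ml.PipelineModel

val modelPath = "/tmp/word2vec-pipeline-model"      // placeholder location
myPipelineModel.write.overwrite().save(modelPath)   // myPipelineModel: the fitted PipelineModel

val restored = PipelineModel.load(modelPath)
val scored = restored.transform(inputDF)            // inputDF: placeholder input DataFrame
{code}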



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21083) Store zero size and row count after analyzing empty table

2017-07-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079493#comment-16079493
 ] 

Apache Spark commented on SPARK-21083:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/18577

> Store zero size and row count after analyzing empty table
> -
>
> Key: SPARK-21083
> URL: https://issues.apache.org/jira/browse/SPARK-21083
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2017-07-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21351:


Assignee: (was: Apache Spark)

> Update nullability based on children's output in optimized logical plan
> ---
>
> Key: SPARK-21351
> URL: https://issues.apache.org/jira/browse/SPARK-21351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In the master branch, optimized plans do not respect the nullability change that 
> a `Filter` with `IsNotNull` implies.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
> scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
> scala> val targetQuery = bIsNotNull.distinct
> scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
> res5: Boolean = true
> scala> targetQuery.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[b#19], functions=[], output=[b#19])
> +- Exchange hashpartitioning(b#19, 200)
>+- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
>   +- *Project [_2#16 AS b#19]
>  +- *Filter isnotnull(_2#16)
> +- LocalTableScan [_1#15, _2#16]
> Generated code:
> ...
> /* 124 */   protected void processNext() throws java.io.IOException {
> ...
> /* 132 */ // output the result
> /* 133 */
> /* 134 */ while (agg_mapIter.next()) {
> /* 135 */   wholestagecodegen_numOutputRows.add(1);
> /* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
> /* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
> /* 138 */
> /* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
> /* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
> /* 141 */   agg_rowWriter1.zeroOutNullBytes();
> /* 142 */
> // We don't need this NULL check because NULL is filtered out 
> in `$"b" =!=2`
> /* 143 */   if (agg_isNull4) {
> /* 144 */ agg_rowWriter1.setNullAt(0);
> /* 145 */   } else {
> /* 146 */ agg_rowWriter1.write(0, agg_value4);
> /* 147 */   }
> /* 148 */   append(agg_result1);
> /* 149 */
> /* 150 */   if (shouldStop()) return;
> /* 151 */ }
> /* 152 */
> /* 153 */ agg_mapIter.close();
> /* 154 */ if (agg_sorter == null) {
> /* 155 */   agg_hashMap.free();
> /* 156 */ }
> /* 157 */   }
> /* 158 */
> /* 159 */ }
> {code}
> In line 143, we don't need this NULL check because NULL is filtered out 
> by `$"b" =!= 2`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2017-07-09 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21351:


Assignee: Apache Spark

> Update nullability based on children's output in optimized logical plan
> ---
>
> Key: SPARK-21351
> URL: https://issues.apache.org/jira/browse/SPARK-21351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> In the master branch, optimized plans do not respect the nullability change that 
> a `Filter` with `IsNotNull` implies.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
> scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
> scala> val targetQuery = bIsNotNull.distinct
> scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
> res5: Boolean = true
> scala> targetQuery.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[b#19], functions=[], output=[b#19])
> +- Exchange hashpartitioning(b#19, 200)
>+- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
>   +- *Project [_2#16 AS b#19]
>  +- *Filter isnotnull(_2#16)
> +- LocalTableScan [_1#15, _2#16]
> Generated code:
> ...
> /* 124 */   protected void processNext() throws java.io.IOException {
> ...
> /* 132 */ // output the result
> /* 133 */
> /* 134 */ while (agg_mapIter.next()) {
> /* 135 */   wholestagecodegen_numOutputRows.add(1);
> /* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
> /* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
> /* 138 */
> /* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
> /* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
> /* 141 */   agg_rowWriter1.zeroOutNullBytes();
> /* 142 */
> // We don't need this NULL check because NULL is filtered out 
> in `$"b" =!=2`
> /* 143 */   if (agg_isNull4) {
> /* 144 */ agg_rowWriter1.setNullAt(0);
> /* 145 */   } else {
> /* 146 */ agg_rowWriter1.write(0, agg_value4);
> /* 147 */   }
> /* 148 */   append(agg_result1);
> /* 149 */
> /* 150 */   if (shouldStop()) return;
> /* 151 */ }
> /* 152 */
> /* 153 */ agg_mapIter.close();
> /* 154 */ if (agg_sorter == null) {
> /* 155 */   agg_hashMap.free();
> /* 156 */ }
> /* 157 */   }
> /* 158 */
> /* 159 */ }
> {code}
> In line 143, we don't need this NULL check because NULL is filtered out 
> by `$"b" =!= 2`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2017-07-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079480#comment-16079480
 ] 

Apache Spark commented on SPARK-21351:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/18576

> Update nullability based on children's output in optimized logical plan
> ---
>
> Key: SPARK-21351
> URL: https://issues.apache.org/jira/browse/SPARK-21351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In the master branch, optimized plans do not respect the nullability change that 
> a `Filter` with `IsNotNull` implies.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
> scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
> scala> val targetQuery = bIsNotNull.distinct
> scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
> res5: Boolean = true
> scala> targetQuery.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 ==
> *HashAggregate(keys=[b#19], functions=[], output=[b#19])
> +- Exchange hashpartitioning(b#19, 200)
>+- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
>   +- *Project [_2#16 AS b#19]
>  +- *Filter isnotnull(_2#16)
> +- LocalTableScan [_1#15, _2#16]
> Generated code:
> ...
> /* 124 */   protected void processNext() throws java.io.IOException {
> ...
> /* 132 */ // output the result
> /* 133 */
> /* 134 */ while (agg_mapIter.next()) {
> /* 135 */   wholestagecodegen_numOutputRows.add(1);
> /* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
> /* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
> /* 138 */
> /* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
> /* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
> /* 141 */   agg_rowWriter1.zeroOutNullBytes();
> /* 142 */
> // We don't need this NULL check because NULL is filtered out 
> in `$"b" =!=2`
> /* 143 */   if (agg_isNull4) {
> /* 144 */ agg_rowWriter1.setNullAt(0);
> /* 145 */   } else {
> /* 146 */ agg_rowWriter1.write(0, agg_value4);
> /* 147 */   }
> /* 148 */   append(agg_result1);
> /* 149 */
> /* 150 */   if (shouldStop()) return;
> /* 151 */ }
> /* 152 */
> /* 153 */ agg_mapIter.close();
> /* 154 */ if (agg_sorter == null) {
> /* 155 */   agg_hashMap.free();
> /* 156 */ }
> /* 157 */   }
> /* 158 */
> /* 159 */ }
> {code}
> In line 143, we don't need this NULL check because NULL is filtered out 
> by `$"b" =!= 2`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21083) Store zero size and row count after analyzing empty table

2017-07-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079474#comment-16079474
 ] 

Apache Spark commented on SPARK-21083:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/18575

> Store zero size and row count after analyzing empty table
> -
>
> Key: SPARK-21083
> URL: https://issues.apache.org/jira/browse/SPARK-21083
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2017-07-09 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-21351:
-
Description: 
In the master branch, optimized plans do not respect the nullability change that 
a `Filter` with `IsNotNull` implies.
This generates unnecessary code for NULL checks. For example:

{code}
scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
scala> val targetQuery = bIsNotNull.distinct
scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
res5: Boolean = true

scala> targetQuery.debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
*HashAggregate(keys=[b#19], functions=[], output=[b#19])
+- Exchange hashpartitioning(b#19, 200)
   +- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
  +- *Project [_2#16 AS b#19]
 +- *Filter isnotnull(_2#16)
+- LocalTableScan [_1#15, _2#16]

Generated code:
...
/* 124 */   protected void processNext() throws java.io.IOException {
...
/* 132 */ // output the result
/* 133 */
/* 134 */ while (agg_mapIter.next()) {
/* 135 */   wholestagecodegen_numOutputRows.add(1);
/* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
/* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
/* 138 */
/* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
/* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
/* 141 */   agg_rowWriter1.zeroOutNullBytes();
/* 142 */
// We don't need this NULL check because NULL is filtered out 
in `$"b" =!=2`
/* 143 */   if (agg_isNull4) {
/* 144 */ agg_rowWriter1.setNullAt(0);
/* 145 */   } else {
/* 146 */ agg_rowWriter1.write(0, agg_value4);
/* 147 */   }
/* 148 */   append(agg_result1);
/* 149 */
/* 150 */   if (shouldStop()) return;
/* 151 */ }
/* 152 */
/* 153 */ agg_mapIter.close();
/* 154 */ if (agg_sorter == null) {
/* 155 */   agg_hashMap.free();
/* 156 */ }
/* 157 */   }
/* 158 */
/* 159 */ }
{code}

In line 143, we don't need this NULL check because NULL is filtered out by 
`$"b" =!= 2`.

  was:
In the master branch, optimized plans do not respect the nullability change that 
a `Filter` with `IsNotNull` implies. This generates unnecessary code for NULL 
checks. For example:

{code}
scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
scala> val targetQuery = bIsNotNull.distinct
scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
res5: Boolean = true

scala> targetQuery.debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
*HashAggregate(keys=[b#19], functions=[], output=[b#19])
+- Exchange hashpartitioning(b#19, 200)
   +- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
  +- *Project [_2#16 AS b#19]
 +- *Filter isnotnull(_2#16)
+- LocalTableScan [_1#15, _2#16]

Generated code:
...
/* 124 */   protected void processNext() throws java.io.IOException {
...
/* 132 */ // output the result
/* 133 */
/* 134 */ while (agg_mapIter.next()) {
/* 135 */   wholestagecodegen_numOutputRows.add(1);
/* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
/* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
/* 138 */
/* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
/* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
/* 141 */   agg_rowWriter1.zeroOutNullBytes();
/* 142 */
// We don't need this NULL check because NULL is filtered out 
in `$"b" =!=2`
/* 143 */   if (agg_isNull4) {
/* 144 */ agg_rowWriter1.setNullAt(0);
/* 145 */   } else {
/* 146 */ agg_rowWriter1.write(0, agg_value4);
/* 147 */   }
/* 148 */   append(agg_result1);
/* 149 */
/* 150 */   if (shouldStop()) return;
/* 151 */ }
/* 152 */
/* 153 */ agg_mapIter.close();
/* 154 */ if (agg_sorter == null) {
/* 155 */   agg_hashMap.free();
/* 156 */ }
/* 157 */   }
/* 158 */
/* 159 */ }
{code}

In line 143, we don't need this NULL check because NULL is filtered out by 
`$"b" =!= 2`.


> Update nullability based on children's output in optimized logical plan
> ---
>
> Key: SPARK-21351
> URL: https://issues.apache.org/jira/browse/SPARK-21351
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In the master branch, optimized plans do not respect the nullability change that 
> a `Filter` with `IsNotNull` implies.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df = 

[jira] [Created] (SPARK-21351) Update nullability based on children's output in optimized logical plan

2017-07-09 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-21351:


 Summary: Update nullability based on children's output in 
optimized logical plan
 Key: SPARK-21351
 URL: https://issues.apache.org/jira/browse/SPARK-21351
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.1
Reporter: Takeshi Yamamuro
Priority: Minor


In the master branch, optimized plans do not respect the nullability change that 
a `Filter` with `IsNotNull` implies. This generates unnecessary code for NULL 
checks. For example:

{code}
scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
scala> val targetQuery = bIsNotNull.distinct
scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
res5: Boolean = true

scala> targetQuery.debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
*HashAggregate(keys=[b#19], functions=[], output=[b#19])
+- Exchange hashpartitioning(b#19, 200)
   +- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
  +- *Project [_2#16 AS b#19]
 +- *Filter isnotnull(_2#16)
+- LocalTableScan [_1#15, _2#16]

Generated code:
...
/* 124 */   protected void processNext() throws java.io.IOException {
...
/* 132 */ // output the result
/* 133 */
/* 134 */ while (agg_mapIter.next()) {
/* 135 */   wholestagecodegen_numOutputRows.add(1);
/* 136 */   UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
/* 137 */   UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
/* 138 */
/* 139 */   boolean agg_isNull4 = agg_aggKey.isNullAt(0);
/* 140 */   int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
/* 141 */   agg_rowWriter1.zeroOutNullBytes();
/* 142 */
// We don't need this NULL check because NULL is filtered out 
in `$"b" =!=2`
/* 143 */   if (agg_isNull4) {
/* 144 */ agg_rowWriter1.setNullAt(0);
/* 145 */   } else {
/* 146 */ agg_rowWriter1.write(0, agg_value4);
/* 147 */   }
/* 148 */   append(agg_result1);
/* 149 */
/* 150 */   if (shouldStop()) return;
/* 151 */ }
/* 152 */
/* 153 */ agg_mapIter.close();
/* 154 */ if (agg_sorter == null) {
/* 155 */   agg_hashMap.free();
/* 156 */ }
/* 157 */   }
/* 158 */
/* 159 */ }
{code}

In line 143, we don't need this NULL check because NULL is filtered out by 
`$"b" =!= 2`.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org