[jira] [Updated] (SPARK-21358) Argument of repartitionandsortwithinpartitions at pyspark
[ https://issues.apache.org/jira/browse/SPARK-21358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chie hayashida updated SPARK-21358: --- Summary: Argument of repartitionandsortwithinpartitions at pyspark (was: variable of repartitionandsortwithinpartitions at pyspark)

> Argument of repartitionandsortwithinpartitions at pyspark
> Key: SPARK-21358
> URL: https://issues.apache.org/jira/browse/SPARK-21358
> Project: Spark
> Issue Type: Improvement
> Components: Documentation, Examples
> Affects Versions: 2.1.1
> Reporter: chie hayashida
> Priority: Minor
>
> In rdd.py, the implementation of repartitionAndSortWithinPartitions is as follows:
> ```
> def repartitionAndSortWithinPartitions(self, numPartitions=None,
>                                        partitionFunc=portable_hash,
>                                        ascending=True, keyfunc=lambda x: x):
> ```
> The documentation contains the following sample script:
> ```
> >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
> >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
> ```
> The third argument (ascending) is expected to be a boolean, so I think the following script is better:
> ```
> >>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
> >>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
> ```

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21358) variable of repartitionandsortwithinpartitions at pyspark
chie hayashida created SPARK-21358: -- Summary: variable of repartitionandsortwithinpartitions at pyspark
Key: SPARK-21358
URL: https://issues.apache.org/jira/browse/SPARK-21358
Project: Spark
Issue Type: Improvement
Components: Documentation, Examples
Affects Versions: 2.1.1
Reporter: chie hayashida
Priority: Minor

In rdd.py, the implementation of repartitionAndSortWithinPartitions is as follows:

```
def repartitionAndSortWithinPartitions(self, numPartitions=None,
                                       partitionFunc=portable_hash,
                                       ascending=True, keyfunc=lambda x: x):
```

The documentation contains the following sample script:

```
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, 2)
```

The third argument (ascending) is expected to be a boolean, so I think the following script is better:

```
>>> rdd = sc.parallelize([(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)])
>>> rdd2 = rdd.repartitionAndSortWithinPartitions(2, lambda x: x % 2, True)
```

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
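The behavior being discussed can be checked without a Spark cluster. The sketch below is plain Python, not PySpark itself; it mimics what repartitionAndSortWithinPartitions does with the sample data from the report, and shows why the third positional argument is the boolean ascending flag rather than a number:

```python
def repartition_and_sort(pairs, num_partitions, partition_func, ascending=True):
    # Mimic RDD.repartitionAndSortWithinPartitions: route each (key, value)
    # pair to a partition using partition_func, then sort each partition by key.
    partitions = [[] for _ in range(num_partitions)]
    for kv in pairs:
        partitions[partition_func(kv[0]) % num_partitions].append(kv)
    return [sorted(p, key=lambda kv: kv[0], reverse=not ascending)
            for p in partitions]

data = [(0, 5), (3, 8), (2, 6), (0, 8), (3, 8), (1, 3)]
# Even keys land in partition 0, odd keys in partition 1; each is key-sorted.
parts = repartition_and_sort(data, 2, lambda x: x % 2, True)
print(parts)
```

Passing `2` instead of `True` as the third argument still "works" in Python because `2` is truthy, which is exactly why the documentation example is misleading.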
[jira] [Created] (SPARK-21357) FileInputDStream not remove out of date RDD
dadazheng created SPARK-21357: - Summary: FileInputDStream not remove out of date RDD
Key: SPARK-21357
URL: https://issues.apache.org/jira/browse/SPARK-21357
Project: Spark
Issue Type: Bug
Components: DStreams
Affects Versions: 2.1.0
Reporter: dadazheng

The method org.apache.spark.streaming.dstream.FileInputDStream.clearMetadata (at line 166) does not remove out-of-date RDDs from generatedRDDs. This causes a memory leak and eventually an OOM. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
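For illustration, the cleanup the reporter expects amounts to dropping generatedRDDs entries older than a remember-duration threshold. This is a hypothetical plain-Python sketch of that idea (the real logic lives in Scala inside DStream/FileInputDStream, and the names here are invented):

```python
def clear_metadata(generated_rdds, current_time, remember_duration):
    # Drop entries whose batch time is older than
    # (current_time - remember_duration), which is what the reporter
    # expects clearMetadata to do for generatedRDDs.
    threshold = current_time - remember_duration
    for t in [t for t in generated_rdds if t < threshold]:
        del generated_rdds[t]
    return generated_rdds

rdds = {100: "rdd-a", 200: "rdd-b", 300: "rdd-c"}
clear_metadata(rdds, current_time=300, remember_duration=150)
print(sorted(rdds))  # only batch times >= 150 remain
```

If entries are never deleted, the map grows with every batch for the lifetime of the streaming job, which matches the reported OOM symptom.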
[jira] [Updated] (SPARK-21083) Store zero size and row count after analyzing empty table
[ https://issues.apache.org/jira/browse/SPARK-21083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-21083: Fix Version/s: 2.1.2 > Store zero size and row count after analyzing empty table > - > > Key: SPARK-21083 > URL: https://issues.apache.org/jira/browse/SPARK-21083 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Zhenhua Wang >Assignee: Zhenhua Wang > Fix For: 2.1.2, 2.2.1, 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Comment: was deleted (was: !http://example.com/image.png!)

> SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
> Key: SPARK-21337
> URL: https://issues.apache.org/jira/browse/SPARK-21337
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.1
> Environment: spark-2.1.1-hadoop-2.6.0-cdh-5.4.2
> Reporter: fengchaoge
> Fix For: 2.1.1
>
> Attachments: test1.JPG, test2.JPG, test.JPG
>
> When there are large 'case when' expressions in Spark SQL, the CodeGenerator fails to compile them.
> The error message is followed by a huge dump of the generated source code, and compilation finally fails:
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Code of method apply_9$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V of class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection grows beyond 64 KB.
> It seems that SPARK-13242 solved this problem in Spark 1.6.2; however, it appears again in Spark 2.1.1 (https://issues.apache.org/jira/browse/SPARK-13242).
> Is there something wrong?

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: test2.JPG -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079799#comment-16079799 ] fengchaoge commented on SPARK-21337: OK, I will have a try. Thank you very much. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079795#comment-16079795 ] Hyukjin Kwon commented on SPARK-21337: -- I think it would be nicer to narrow this down by executing the query multiple times after removing apparently unrelated parts. For example, I don't think we need this many fields to reproduce the problem. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
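The narrowing-down suggested above can be automated: generate CASE WHEN queries of increasing size and bisect for the point at which codegen fails. A hedged sketch of such a generator (the table and column names are made up, and the failing branch count will vary by schema and Spark version):

```python
def big_case_when(num_branches, col="c0", table="t"):
    # Build a SELECT with an N-branch CASE WHEN, useful for bisecting the
    # query size at which generated code exceeds the JVM's 64KB method limit.
    branches = " ".join(
        f"WHEN {col} = {i} THEN {i * 10}" for i in range(num_branches)
    )
    return f"SELECT CASE {branches} ELSE -1 END AS result FROM {table}"

sql = big_case_when(3)
print(sql)
```

Running `spark.sql(big_case_when(n))` for growing `n` (assuming a registered table) isolates the threshold without any of the business logic in the original query.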
[jira] [Resolved] (SPARK-21356) CSV datasource failed to parse a value having newline in its value
[ https://issues.apache.org/jira/browse/SPARK-21356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21356. -- Resolution: Invalid I am resolving this as the workaround is easy, and I am not sure it makes sense to allow unquoted newlines in values for now.

> CSV datasource failed to parse a value having newline in its value
> Key: SPARK-21356
> URL: https://issues.apache.org/jira/browse/SPARK-21356
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Hyukjin Kwon
> Priority: Trivial
>
> This is related to SPARK-21355. I guess this is also a rather corner case; I found it while testing SPARK-21289.
> It looks like a bug in Univocity. The code below fails to parse the newline in the value:
> {code}
> scala> spark.read.csv(Seq("a\nb", "abc").toDS).show()
> +---+
> |_c0|
> +---+
> |  a|
> |abc|
> +---+
> {code}
> But it can easily be worked around with quotes, as below:
> {code}
> scala> spark.read.csv(Seq("\"a\nb\"", "abc").toDS).show()
> +---+
> |_c0|
> +---+
> |a
> b|
> |abc|
> +---+
> {code}
> Meaning this works, with the file below:
> {code}
> "a
> b",abc
> {code}
> {code}
> scala> spark.read.option("multiLine", true).csv("tmp.csv").show()
> +---+---+
> |_c0|_c1|
> +---+---+
> |a
> b|abc|
> +---+---+
> {code}

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
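The quoting workaround follows the usual CSV convention, which Python's standard csv module also implements: a field containing a newline must be quoted to round-trip. A minimal sketch, independent of Spark:

```python
import csv
import io

# Write a row whose first field contains a newline. With the default
# QUOTE_MINIMAL dialect, the writer quotes that field automatically.
buf = io.StringIO()
csv.writer(buf).writerow(["a\nb", "abc"])
print(buf.getvalue())

# Reading it back recovers the embedded newline, because the reader
# treats newlines inside quoted fields as part of the value.
rows = list(csv.reader(io.StringIO(buf.getvalue())))
print(rows)
```

Without the quotes, any standards-following CSV parser (Univocity included) has no way to distinguish an embedded newline from a record separator, which is why the unquoted case in the report cannot parse.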
[jira] [Commented] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079773#comment-16079773 ] fengchaoge commented on SPARK-21337: 1. create database GBD_DM_PAC_SAFE; 2. use GBD_DM_PAC_SAFE; 3. create table app_claim_assess_rule_granularity; SQL like this,just for test: SELECT x___sql___.2jjg AS cjjg, x___sql___.3jjg AS djjg, (((CASE WHEN ((CASE WHEN (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) = 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) END) < 0.20001) THEN 1 WHEN ((CASE WHEN (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) = 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) END) < 0.5) THEN 2 ELSE 3 END) * 10) + (CASE WHEN ((CASE WHEN x___sql___.jcbj = 0 THEN CAST(NULL AS DOUBLE) ELSE x___sql___.impairment_amount / x___sql___.jcbj END) < 300) THEN 1 WHEN ((CASE WHEN x___sql___.jcbj = 0 THEN CAST(NULL AS DOUBLE) ELSE x___sql___.impairment_amount / x___sql___.jcbj END) < 2000) THEN 2 ELSE 3 END)) AS calculation_0290210162047568, x___sql___.updated_date AS calculation_0910125090644141, (CASE WHEN (x___sql___.small_type = '01') THEN '人工报价' ELSE (CASE WHEN (x___sql___.small_type = '02') THEN '指导人' ELSE x___sql___.small_type END) END) AS calculation_1700125090616887, x___sql___.impairment_amount AS calculation_1750125100625463, (CASE WHEN ((CASE WHEN (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE 
CAST(NULL AS INT) END) = 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) END) < 0.20001) THEN 1 WHEN ((CASE WHEN (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) = 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) END) < 0.5) THEN 2 ELSE 3 END) AS calculation_2170210160935298, (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) AS calculation_2390124154901057, (CASE WHEN (x___sql___.application_code = 'DSFS') THEN '定损发送规则' ELSE x___sql___.application_code END) AS calculation_2770125090429540, x___sql___.rule_name AS calculation_3060125090537403, (CASE WHEN ((CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) < 10) THEN '暂不考虑规则' ELSE (CASE WHEN CASE WHEN ((CASE WHEN (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) = 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) END) < 0.20001) THEN 1 WHEN ((CASE WHEN (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) = 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 
WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) END) < 0.5) THEN 2 ELSE 3 END) * 10) + (CASE WHEN ((CASE WHEN x___sql___.jcbj = 0 THEN CAST(NULL AS DOUBLE) ELSE x___sql___.impairment_amount / x___sql___.jcbj END) < 300) THEN 1 WHEN ((CASE WHEN x___sql___.jcbj = 0 THEN CAST(NULL AS DOUBLE) ELSE x___sql___.impairment_amount / x___sql___.jcbj END) < 2000) THEN 2 ELSE 3 END)) = 13) THEN '最紧要优化规则' WHEN ((CASE WHEN ((CASE WHEN (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) = 0 THEN CAST(NULL AS DOUBLE) ELSE CAST(x___sql___.jcbj AS DOUBLE) / (CASE WHEN (x___sql___.id_clm_channel_process IS NULL) THEN 0 WHEN NOT (x___sql___.id_clm_channel_process IS NULL) THEN 1 ELSE CAST(NULL AS INT) END) END) <
[jira] [Commented] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079790#comment-16079790 ] fengchaoge commented on SPARK-21337: The attachments show what actually happened. I have no idea about code generation; can someone help? Thanks very much. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21355) JSON datasource failed to parse a value having newline in its value
[ https://issues.apache.org/jira/browse/SPARK-21355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-21355. -- Resolution: Invalid I am resolving this per https://stackoverflow.com/a/42073.

{quote}
This is of course correct, but I'd like to add the reason for having to do this: the JSON spec at ietf.org/rfc/rfc4627.txt contains this sentence in section 2.5: "All Unicode characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark, reverse solidus, and the control characters (U+0000 through U+001F)." Since a newline is a control character, it must be escaped.
{quote}

> JSON datasource failed to parse a value having newline in its value
> Key: SPARK-21355
> URL: https://issues.apache.org/jira/browse/SPARK-21355
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.0
> Reporter: Hyukjin Kwon
> Priority: Minor
>
> I guess this is a rather corner case; I found it while testing SPARK-21289.
> It looks like a bug in Jackson. The code below fails to parse the newline in the value:
> {code}
> scala> spark.read.json(Seq("{ \"f\": \"a\nb\"}", "{ \"f\": \"abc\"}").toDS).show()
> +---------------+----+
> |_corrupt_record|   f|
> +---------------+----+
> |      { "f": "a
> b"}|null|
> |           null| abc|
> +---------------+----+
> {code}
> Meaning this also does not work with a JSON file as below:
> {code}
> {"f": "
> d", "f0": 3}
> {code}
> {code}
> scala> spark.read.option("multiLine", true).json("tmp.json").show()
> +--------------------+
> |     _corrupt_record|
> +--------------------+
> |{"f": "
> d", "f0"...|
> +--------------------+
> {code}
> Of course, the code below works:
> {code}
> scala> spark.read.json(Seq("{ \"f\": \"ab\"}", "{ \"f\": \"abc\"}").toDS).show()
> +---+
> |  f|
> +---+
> | ab|
> |abc|
> +---+
> {code}

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
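The RFC rule quoted above is easy to demonstrate outside of Spark or Jackson; Python's standard json parser applies the same rule, rejecting a raw control character inside a string while accepting the escaped form:

```python
import json

# Per RFC 4627 section 2.5, control characters (U+0000 through U+001F)
# must be escaped inside a JSON string, so a raw newline is invalid.
try:
    json.loads('{"f": "a\nb"}')  # the Python literal '\n' is a real newline
except json.JSONDecodeError as e:
    print("raw newline rejected:", e.msg)

# The escaped form '\\n' passes the two characters backslash + n to the
# parser, which is the legal JSON escape sequence for a newline.
print(json.loads('{"f": "a\\nb"}'))
```

So the `_corrupt_record` behavior in the report is the parser following the spec, which supports resolving the issue as Invalid.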
[jira] [Commented] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079785#comment-16079785 ] fengchaoge commented on SPARK-21337: Thank you very much. What should I do next? Thank you for your guidance. The table is like this:

CREATE TABLE app_claim_assess_rule_granularity(
  report_no string, case_times string, id_clm_channel_process string, loss_object_no string,
  assess_times string, loss_name string, max_loss_amount string, impairment_amount string,
  rule_code string, rule_name string, application_code string, brand_name string,
  manufacturer_name string, series_name string, group_name string, model_name string,
  end_case_date string, updated_date string, assess_um string, car_mark string,
  garage_code string, garage_name string, garage_type string, privilege_group_name string,
  small_type string, is_transfer string, praepostor_type string, channel_type string,
  channel_flag string, loss_type string, loss_agree_amount string, loss_count_agree string,
  department_code string, department_code_01 string, department_code_02_v string,
  department_code_03 string, department_code_04 string, department_code_name_01 string,
  department_code_name_02 string, department_code_name_03 string, department_code_name_04 string,
  assess_dept_code string, verify_department_code_01 string, verify_department_code_02 string,
  verify_department_code_03 string, verify_department_code_04 string,
  verify_department_code_name_01 string, verify_department_code_name_02 string,
  verify_department_code_name_03 string, verify_department_code_name_04 string,
  assess_quote_price_um string, assess_guide_um string, assess_center_guide_um string,
  rule_type string, loss_count_assess string, loss_name_rank string, loss_name_rule_rank string,
  both_trigger string)
PARTITIONED BY (department_code_02 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
WITH SERDEPROPERTIES ( 'field.delim'='\u0001',
'serialization.format'='\u0001') STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.RCFileOutputFormat' LOCATION 'hdfs://hdp-hdfs01/user/hive/warehouse/gbd_dm_pac_safe.db/app_claim_assess_rule_granularity' TBLPROPERTIES ( 'transient_lastDdlTime'='1499412897') -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fengchaoge updated SPARK-21337: --- Attachment: test1.JPG test.JPG -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19659. -- Resolution: Fixed Fix Version/s: 2.2.0 The major work was done in 2.2.0, but it's disabled by default because it cannot talk with the old shuffle service.

> Fetch big blocks to disk when shuffle-read
> Key: SPARK-19659
> URL: https://issues.apache.org/jira/browse/SPARK-19659
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 2.1.0
> Reporter: jin xing
> Assignee: jin xing
> Fix For: 2.2.0
>
> Attachments: SPARK-19659-design-v1.pdf, SPARK-19659-design-v2.pdf
>
> Currently the whole block is fetched into memory (off-heap by default) during shuffle-read. A block is identified by (shuffleId, mapId, reduceId), so it can be large in skew situations. If OOM happens during shuffle read, the job is killed and users are told to "Consider boosting spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more memory can resolve the OOM, but this approach is not well suited to production environments, especially data warehouses.
> Using Spark SQL as the data engine in a warehouse, users hope for a unified parameter (e.g. memory) with less resource wasted (resource that is allocated but not used).
> It's not always easy to predict skew situations; when they happen, it makes sense to fetch remote blocks to disk for shuffle-read rather than kill the job because of OOM. This approach was mentioned during the discussion in SPARK-3019, by [~sandyr] and [~mridulm80].

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
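The idea behind the fix can be sketched abstractly: per-block, anything above a size threshold is streamed to disk instead of being buffered in memory, so one skewed block cannot OOM the reducer. This is a hypothetical simplification for illustration only, not Spark's actual code; the function and threshold names are invented:

```python
def plan_fetches(block_sizes, max_in_mem_bytes):
    # Decide, per shuffle block, whether to fetch into memory or stream
    # to disk. Blocks at or above the threshold go to disk, bounding the
    # reducer's memory use regardless of skew (simplified sketch of the
    # SPARK-19659 approach).
    return [("disk" if size >= max_in_mem_bytes else "memory", size)
            for size in block_sizes]

# One 500MB skewed block among small ones: only it is spilled to disk.
plan = plan_fetches([10_000, 500_000_000, 20_000],
                    max_in_mem_bytes=100_000_000)
print(plan)
```

The trade-off is extra disk I/O on the skewed path in exchange for a predictable memory bound, which matches the warehouse use case described in the issue.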
[jira] [Commented] (SPARK-21337) SQL which has large ‘case when’ expressions may cause code generation beyond 64KB
[ https://issues.apache.org/jira/browse/SPARK-21337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079776#comment-16079776 ] Hyukjin Kwon commented on SPARK-21337: -- No, just copying and pasting the SQL without further investigation does not help. Also, I can't reproduce this on master with the steps you gave. In such a case, I believe the query should be narrowed down; this sounds like asking for an investigation rather than describing an issue. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21355) JSON datasource failed to parse a value having newline in its value
Hyukjin Kwon created SPARK-21355: Summary: JSON datasource failed to parse a value having newline in its value
Key: SPARK-21355
URL: https://issues.apache.org/jira/browse/SPARK-21355
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
Priority: Minor

I guess this is a rather corner case; I found it while testing SPARK-21289. It looks like a bug in Jackson.

The code below fails to parse the newline in the value:

{code}
scala> spark.read.json(Seq("{ \"f\": \"a\nb\"}", "{ \"f\": \"abc\"}").toDS).show()
+---------------+----+
|_corrupt_record|   f|
+---------------+----+
|      { "f": "a
b"}|null|
|           null| abc|
+---------------+----+
{code}

Meaning this also does not work with a JSON file as below:

{code}
{"f": "
d", "f0": 3}
{code}

{code}
scala> spark.read.option("multiLine", true).json("tmp.json").show()
+--------------------+
|     _corrupt_record|
+--------------------+
|{"f": "
d", "f0"...|
+--------------------+
{code}

Of course, the code below works:

{code}
scala> spark.read.json(Seq("{ \"f\": \"ab\"}", "{ \"f\": \"abc\"}").toDS).show()
+---+
|  f|
+---+
| ab|
|abc|
+---+
{code}

-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21356) CSV datasource failed to parse a value having newline in its value
Hyukjin Kwon created SPARK-21356: Summary: CSV datasource failed to parse a value having newline in its value Key: SPARK-21356 URL: https://issues.apache.org/jira/browse/SPARK-21356 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Hyukjin Kwon Priority: Trivial This is related to SPARK-21355. I guess this is also a rather corner case. I found this while testing SPARK-21289. It looks like a bug in Univocity. The code below fails to parse a newline in the value.
{code}
scala> spark.read.csv(Seq("a\nb", "abc").toDS).show()
+---+
|_c0|
+---+
|  a|
|abc|
+---+
{code}
But it can easily be worked around with quotes, as below:
{code}
scala> spark.read.csv(Seq("\"a\nb\"", "abc").toDS).show()
+---+
|_c0|
+---+
|a
b|
|abc|
+---+
{code}
Meaning this works with the file below:
{code}
"a
b",abc
{code}
{code}
scala> spark.read.option("multiLine", true).csv("tmp.csv").show()
+---+---+
|_c0|_c1|
+---+---+
|a
b|abc|
+---+---+
{code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
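The workaround relies on the standard CSV convention: a newline terminates a record unless the field containing it is quoted. Python's `csv` module follows the same rule, which makes it a convenient stand-in for Univocity in a small sketch:

```python
import csv
import io

# Per the common CSV convention (RFC 4180), a bare newline splits records,
# while a newline inside a double-quoted field stays part of that field.
unquoted = 'a\nb,abc\n'   # the bare newline splits this into two records
quoted = '"a\nb",abc\n'   # quoting keeps the newline inside one field

rows_unquoted = list(csv.reader(io.StringIO(unquoted)))
rows_quoted = list(csv.reader(io.StringIO(quoted)))

print(rows_unquoted)  # [['a'], ['b', 'abc']]
print(rows_quoted)    # [['a\nb', 'abc']]
```

This is why quoting the value (plus `multiLine` when reading files) makes the Spark example work.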
[jira] [Assigned] (SPARK-21354) INPUT FILE related functions do not support more than one sources
[ https://issues.apache.org/jira/browse/SPARK-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21354: Assignee: Xiao Li (was: Apache Spark) > INPUT FILE related functions do not support more than one sources > - > > Key: SPARK-21354 > URL: https://issues.apache.org/jira/browse/SPARK-21354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {noformat} > hive> select *, INPUT__FILE__NAME FROM t1, t2; > FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One > Tables/Subqueries > {noformat} > The built-in functions {{input_file_name}}, {{input_file_block_start}}, and > {{input_file_block_length}} do not support more than one source, just as Hive does not. Currently, we do not block such queries, and the outputs are ambiguous. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21354) INPUT FILE related functions do not support more than one sources
[ https://issues.apache.org/jira/browse/SPARK-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079743#comment-16079743 ] Apache Spark commented on SPARK-21354: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/18580 > INPUT FILE related functions do not support more than one sources > - > > Key: SPARK-21354 > URL: https://issues.apache.org/jira/browse/SPARK-21354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {noformat} > hive> select *, INPUT__FILE__NAME FROM t1, t2; > FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One > Tables/Subqueries > {noformat} > The built-in functions {{input_file_name}}, {{input_file_block_start}}, and > {{input_file_block_length}} do not support more than one source, just as Hive does not. Currently, we do not block such queries, and the outputs are ambiguous. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079751#comment-16079751 ] Dongjoon Hyun edited comment on SPARK-21349 at 7/9/17 11:29 PM: This issue is not about blindly raising the threshold, 100KB. The default value will be the same for all users. What I mean is that task sizes have become bigger now than they were 3 years ago. Currently, it complains more frequently and misleadingly. For some apps, it always complains, so it becomes a meaningless warning, too. Was there any reason or criterion for choosing 100KB as the threshold three years ago? Then, at least, can we reevaluate the threshold? was (Author: dongjoon): This issue is not about blindly raising the threshold, 100KB. The default value will be the same for all users. What I mean is that task sizes have become bigger now than they were 3 years ago. Currently, it complains more frequently and misleadingly. Was there any reason or criterion for choosing 100KB as the threshold three years ago? Then, at least, can we reevaluate the threshold? > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits a warning when the task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` a normal Spark configuration for advanced users. > According to the Jenkins logs, we also have 123 warnings even in our unit tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079751#comment-16079751 ] Dongjoon Hyun edited comment on SPARK-21349 at 7/9/17 11:28 PM: This issue is not about blindly raising the threshold, 100KB. The default value will be the same for all users. What I mean is that task sizes have become bigger now than they were 3 years ago. Currently, it complains more frequently and misleadingly. Was there any reason or criterion for choosing 100KB as the threshold three years ago? Then, at least, can we reevaluate the threshold? was (Author: dongjoon): This issue is not about blindly raising the threshold, 100KB. The default value will be the same for all users. What I mean is that task sizes have become bigger now than they were 3 years ago. Currently, it complains more frequently and misleadingly. Was there any reason or criterion for choosing 100KB as the threshold three years ago? Then, at least, can we evaluate the threshold? > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits a warning when the task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` a normal Spark configuration for advanced users. > According to the Jenkins logs, we also have 123 warnings even in our unit tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079751#comment-16079751 ] Dongjoon Hyun commented on SPARK-21349: --- This issue is not about blindly raising the threshold, 100KB. The default value will be the same for all users. What I mean is that task sizes have become bigger now than they were 3 years ago. Currently, it complains more frequently and misleadingly. Was there any reason or criterion for choosing 100KB as the threshold three years ago? Then, at least, can we evaluate the threshold? > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits a warning when the task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` a normal Spark configuration for advanced users. > According to the Jenkins logs, we also have 123 warnings even in our unit tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
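The warning under discussion is, in essence, a threshold check on the size of a serialized task. A hedged sketch of that logic in Python, using pickle as a stand-in serializer; the function and variable names are illustrative, not Spark's actual internals:

```python
import pickle

# Sketch of the check being debated: warn when a serialized task exceeds a
# threshold. In Spark the constant is TASK_SIZE_TO_WARN_KB = 100; making it
# configurable would mean reading the threshold from the conf instead of
# hard-coding it. Names here are illustrative, not Spark's internals.
TASK_SIZE_TO_WARN_KB = 100

def check_task_size(task_payload, threshold_kb=TASK_SIZE_TO_WARN_KB):
    size = len(pickle.dumps(task_payload))  # serialized size in bytes
    if size > threshold_kb * 1024:
        return f"WARN: task is {size / 1024:.0f} KB, exceeds {threshold_kb} KB"
    return None

print(check_task_size(list(range(10))))          # None: small task, no warning
print(check_task_size(bytes(200 * 1024)) is not None)  # True: ~200 KB payload warns
```

The debate above is then about whether `threshold_kb` should stay a constant or become a (possibly internal) configuration entry.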
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079746#comment-16079746 ] Kay Ousterhout commented on SPARK-21349: Does that mean we should just raise this threshold for all users? > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. > According to the Jenkins log, we also have 123 warnings even in our unit test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21354) INPUT FILE related functions do not support more than one sources
[ https://issues.apache.org/jira/browse/SPARK-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21354: Description: {noformat} hive> select *, INPUT__FILE__NAME FROM t1, t2; FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One Tables/Subqueries {noformat} The built-in functions {{input_file_name}}, {{input_file_block_start}}, and {{input_file_block_length}} do not support more than one source, just as Hive does not. Currently, we do not block such queries, and the outputs are ambiguous. was: {noformat} hive> select *, INPUT__FILE__NAME FROM t1, t2; FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One Tables/Subqueries {noformat} The built-in functions {{input_file_name}}, {{input_file_block_start}}, and {{input_file_block_length}} do not support more than one source, just as Hive does not > INPUT FILE related functions do not support more than one sources > - > > Key: SPARK-21354 > URL: https://issues.apache.org/jira/browse/SPARK-21354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > > {noformat} > hive> select *, INPUT__FILE__NAME FROM t1, t2; > FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One > Tables/Subqueries > {noformat} > The built-in functions {{input_file_name}}, {{input_file_block_start}}, and > {{input_file_block_length}} do not support more than one source, just as Hive does not. Currently, we do not block such queries, and the outputs are ambiguous. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21354) INPUT FILE related functions do not support more than one sources
[ https://issues.apache.org/jira/browse/SPARK-21354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21354: Assignee: Apache Spark (was: Xiao Li) > INPUT FILE related functions do not support more than one sources > - > > Key: SPARK-21354 > URL: https://issues.apache.org/jira/browse/SPARK-21354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.1, 2.2.0 >Reporter: Xiao Li >Assignee: Apache Spark > > {noformat} > hive> select *, INPUT__FILE__NAME FROM t1, t2; > FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One > Tables/Subqueries > {noformat} > The built-in functions {{input_file_name}}, {{input_file_block_start}}, and > {{input_file_block_length}} do not support more than one source, just as Hive does not. Currently, we do not block such queries, and the outputs are ambiguous. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21354) INPUT FILE related functions do not support more than one sources
Xiao Li created SPARK-21354: --- Summary: INPUT FILE related functions do not support more than one sources Key: SPARK-21354 URL: https://issues.apache.org/jira/browse/SPARK-21354 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 2.0.2, 2.2.0 Reporter: Xiao Li Assignee: Xiao Li {noformat} hive> select *, INPUT__FILE__NAME FROM t1, t2; FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One Tables/Subqueries {noformat} The built-in functions {{input_file_name}}, {{input_file_block_start}}, and {{input_file_block_length}} do not support more than one source, just as Hive does not -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21349: -- Description: Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, SPARK-2185. Although this is just a warning message, this issue tries to make `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. According to the Jenkins log, we also have 123 warnings even in our unit test. was:Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, SPARK-2185. Although this is just a warning message, this issue tries to make `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. > According to the Jenkins log, we also have 123 warnings even in our unit test. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079728#comment-16079728 ] Dongjoon Hyun commented on SPARK-21349: --- Thank you for the advice, [~kayousterhout]! Regarding usability: we are building more complex tasks (e.g., Spark SQL) than we were three years ago. This now always happens for certain user apps. For those apps, it would be great to be able to configure this. In addition, we can make this an `internal` configuration instead of a constant. > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits a warning when the task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` a normal Spark configuration for advanced users. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079724#comment-16079724 ] Kay Ousterhout commented on SPARK-21349: Is this a major usability issue (and what's the use case where task sizes are regularly > 100KB)? I'm hesitant to make this a configuration parameter -- Spark already has a huge number of configuration parameters, making it hard for users to figure out which ones are relevant for them. > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-21349: -- Description: Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, SPARK-2185. Although this is just a warning message, this issue tries to make `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. (was: Although this is just a warning message, this issue tries to make `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users.) > Make TASK_SIZE_TO_WARN_KB configurable > -- > > Key: SPARK-21349 > URL: https://issues.apache.org/jira/browse/SPARK-21349 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.3, 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since Spark 1.1.0, Spark emits warning when task size exceeds a threshold, > SPARK-2185. Although this is just a warning message, this issue tries to make > `TASK_SIZE_TO_WARN_KB` into a normal Spark configuration for advanced users. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21353) add checkValue in spark.internal.config about how to correctly set configurations
[ https://issues.apache.org/jira/browse/SPARK-21353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079633#comment-16079633 ] Apache Spark commented on SPARK-21353: -- User 'heary-cao' has created a pull request for this issue: https://github.com/apache/spark/pull/18555 > add checkValue in spark.internal.config about how to correctly set > configurations > - > > Key: SPARK-21353 > URL: https://issues.apache.org/jira/browse/SPARK-21353 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: caoxuewen >Priority: Minor > > add checkValue for configurations: > spark.driver.memory > spark.executor.memory > spark.blockManager.port > spark.task.cpus > spark.dynamicAllocation.minExecutors > spark.task.maxFailures > spark.blacklist.task.maxTaskAttemptsPerExecutor > spark.blacklist.task.maxTaskAttemptsPerNode > spark.blacklist.application.maxFailedTasksPerExecutor > spark.blacklist.stage.maxFailedTasksPerExecutor > spark.blacklist.application.maxFailedExecutorsPerNode > spark.blacklist.stage.maxFailedExecutorsPerNode > spark.scheduler.listenerbus.metrics.maxListenerClassesTimed > spark.ui.retainedTasks > spark.blockManager.port > spark.driver.blockManager.port > spark.files.maxPartitionBytes > spark.files.openCostInBytes > spark.shuffle.accurateBlockThreshold > spark.shuffle.registration.timeout > spark.shuffle.registration.maxAttempts > spark.reducer.maxReqSizeShuffleToMem > and we copy the document from > http://spark.apache.org/docs/latest/configuration.html > spark.driver.userClassPathFirst > spark.driver.memory > spark.executor.userClassPathFirst > spark.executor.memory > spark.yarn.isPython > spark.task.cpus > spark.dynamicAllocation.minExecutors > spark.dynamicAllocation.initialExecutors > spark.dynamicAllocation.maxExecutors > spark.shuffle.service.enabled > spark.submit.pyFiles > spark.task.maxFailures > spark.blacklist.enabled > spark.blacklist.task.maxTaskAttemptsPerExecutor > 
spark.blacklist.task.maxTaskAttemptsPerNode > spark.blacklist.stage.maxFailedTasksPerExecutor > spark.blacklist.stage.maxFailedExecutorsPerNode > spark.pyspark.driver.python > spark.pyspark.python > spark.ui.retainedTasks > spark.io.encryption.enabled > spark.io.encryption.keygen.algorithm > spark.io.encryption.keySizeBits > spark.authenticate.enableSaslEncryption > spark.authenticate -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21353) add checkValue in spark.internal.config about how to correctly set configurations
[ https://issues.apache.org/jira/browse/SPARK-21353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21353: Assignee: Apache Spark > add checkValue in spark.internal.config about how to correctly set > configurations > - > > Key: SPARK-21353 > URL: https://issues.apache.org/jira/browse/SPARK-21353 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: caoxuewen >Assignee: Apache Spark >Priority: Minor > > add checkValue for configurations: > spark.driver.memory > spark.executor.memory > spark.blockManager.port > spark.task.cpus > spark.dynamicAllocation.minExecutors > spark.task.maxFailures > spark.blacklist.task.maxTaskAttemptsPerExecutor > spark.blacklist.task.maxTaskAttemptsPerNode > spark.blacklist.application.maxFailedTasksPerExecutor > spark.blacklist.stage.maxFailedTasksPerExecutor > spark.blacklist.application.maxFailedExecutorsPerNode > spark.blacklist.stage.maxFailedExecutorsPerNode > spark.scheduler.listenerbus.metrics.maxListenerClassesTimed > spark.ui.retainedTasks > spark.blockManager.port > spark.driver.blockManager.port > spark.files.maxPartitionBytes > spark.files.openCostInBytes > spark.shuffle.accurateBlockThreshold > spark.shuffle.registration.timeout > spark.shuffle.registration.maxAttempts > spark.reducer.maxReqSizeShuffleToMem > and we copy the document from > http://spark.apache.org/docs/latest/configuration.html > spark.driver.userClassPathFirst > spark.driver.memory > spark.executor.userClassPathFirst > spark.executor.memory > spark.yarn.isPython > spark.task.cpus > spark.dynamicAllocation.minExecutors > spark.dynamicAllocation.initialExecutors > spark.dynamicAllocation.maxExecutors > spark.shuffle.service.enabled > spark.submit.pyFiles > spark.task.maxFailures > spark.blacklist.enabled > spark.blacklist.task.maxTaskAttemptsPerExecutor > spark.blacklist.task.maxTaskAttemptsPerNode > spark.blacklist.stage.maxFailedTasksPerExecutor > 
spark.blacklist.stage.maxFailedExecutorsPerNode > spark.pyspark.driver.python > spark.pyspark.python > spark.ui.retainedTasks > spark.io.encryption.enabled > spark.io.encryption.keygen.algorithm > spark.io.encryption.keySizeBits > spark.authenticate.enableSaslEncryption > spark.authenticate -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21353) add checkValue in spark.internal.config about how to correctly set configurations
[ https://issues.apache.org/jira/browse/SPARK-21353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21353: Assignee: (was: Apache Spark) > add checkValue in spark.internal.config about how to correctly set > configurations > - > > Key: SPARK-21353 > URL: https://issues.apache.org/jira/browse/SPARK-21353 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: caoxuewen >Priority: Minor > > add checkValue for configurations: > spark.driver.memory > spark.executor.memory > spark.blockManager.port > spark.task.cpus > spark.dynamicAllocation.minExecutors > spark.task.maxFailures > spark.blacklist.task.maxTaskAttemptsPerExecutor > spark.blacklist.task.maxTaskAttemptsPerNode > spark.blacklist.application.maxFailedTasksPerExecutor > spark.blacklist.stage.maxFailedTasksPerExecutor > spark.blacklist.application.maxFailedExecutorsPerNode > spark.blacklist.stage.maxFailedExecutorsPerNode > spark.scheduler.listenerbus.metrics.maxListenerClassesTimed > spark.ui.retainedTasks > spark.blockManager.port > spark.driver.blockManager.port > spark.files.maxPartitionBytes > spark.files.openCostInBytes > spark.shuffle.accurateBlockThreshold > spark.shuffle.registration.timeout > spark.shuffle.registration.maxAttempts > spark.reducer.maxReqSizeShuffleToMem > and we copy the document from > http://spark.apache.org/docs/latest/configuration.html > spark.driver.userClassPathFirst > spark.driver.memory > spark.executor.userClassPathFirst > spark.executor.memory > spark.yarn.isPython > spark.task.cpus > spark.dynamicAllocation.minExecutors > spark.dynamicAllocation.initialExecutors > spark.dynamicAllocation.maxExecutors > spark.shuffle.service.enabled > spark.submit.pyFiles > spark.task.maxFailures > spark.blacklist.enabled > spark.blacklist.task.maxTaskAttemptsPerExecutor > spark.blacklist.task.maxTaskAttemptsPerNode > spark.blacklist.stage.maxFailedTasksPerExecutor > 
spark.blacklist.stage.maxFailedExecutorsPerNode > spark.pyspark.driver.python > spark.pyspark.python > spark.ui.retainedTasks > spark.io.encryption.enabled > spark.io.encryption.keygen.algorithm > spark.io.encryption.keySizeBits > spark.authenticate.enableSaslEncryption > spark.authenticate -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21353) add checkValue in spark.internal.config about how to correctly set configurations
caoxuewen created SPARK-21353: - Summary: add checkValue in spark.internal.config about how to correctly set configurations Key: SPARK-21353 URL: https://issues.apache.org/jira/browse/SPARK-21353 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: caoxuewen Priority: Minor add checkValue for configurations: spark.driver.memory spark.executor.memory spark.blockManager.port spark.task.cpus spark.dynamicAllocation.minExecutors spark.task.maxFailures spark.blacklist.task.maxTaskAttemptsPerExecutor spark.blacklist.task.maxTaskAttemptsPerNode spark.blacklist.application.maxFailedTasksPerExecutor spark.blacklist.stage.maxFailedTasksPerExecutor spark.blacklist.application.maxFailedExecutorsPerNode spark.blacklist.stage.maxFailedExecutorsPerNode spark.scheduler.listenerbus.metrics.maxListenerClassesTimed spark.ui.retainedTasks spark.blockManager.port spark.driver.blockManager.port spark.files.maxPartitionBytes spark.files.openCostInBytes spark.shuffle.accurateBlockThreshold spark.shuffle.registration.timeout spark.shuffle.registration.maxAttempts spark.reducer.maxReqSizeShuffleToMem and we copy the document from http://spark.apache.org/docs/latest/configuration.html spark.driver.userClassPathFirst spark.driver.memory spark.executor.userClassPathFirst spark.executor.memory spark.yarn.isPython spark.task.cpus spark.dynamicAllocation.minExecutors spark.dynamicAllocation.initialExecutors spark.dynamicAllocation.maxExecutors spark.shuffle.service.enabled spark.submit.pyFiles spark.task.maxFailures spark.blacklist.enabled spark.blacklist.task.maxTaskAttemptsPerExecutor spark.blacklist.task.maxTaskAttemptsPerNode spark.blacklist.stage.maxFailedTasksPerExecutor spark.blacklist.stage.maxFailedExecutorsPerNode spark.pyspark.driver.python spark.pyspark.python spark.ui.retainedTasks spark.io.encryption.enabled spark.io.encryption.keygen.algorithm spark.io.encryption.keySizeBits spark.authenticate.enableSaslEncryption spark.authenticate -- 
This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
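The checkValue pattern proposed for these entries pairs each configuration key with a validation predicate and a message explaining how to set the value correctly. A minimal sketch of that pattern in Python; the class and entry names below are illustrative, not Spark's actual internals:

```python
# Sketch of a config entry that validates its value on read, in the spirit
# of the checkValue proposal above. Names are illustrative, not Spark's.
class ConfigEntry:
    def __init__(self, key, default, check, message):
        self.key, self.default = key, default
        self.check, self.message = check, message

    def resolve(self, conf):
        value = conf.get(self.key, self.default)
        if not self.check(value):
            raise ValueError(f"{self.key}: {self.message} (got {value!r})")
        return value

TASK_CPUS = ConfigEntry(
    "spark.task.cpus", 1,
    check=lambda v: isinstance(v, int) and v >= 1,
    message="must be a positive integer",
)

print(TASK_CPUS.resolve({}))  # 1: the default passes the check
try:
    TASK_CPUS.resolve({"spark.task.cpus": 0})
except ValueError as e:
    print("rejected:", e)     # invalid values fail fast with a clear message
```

The benefit is that a bad value fails at configuration time with an actionable message, instead of surfacing later as an obscure runtime error.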
[jira] [Reopened] (SPARK-21352) Memory Usage in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-21352: --- > Memory Usage in Spark Streaming > --- > > Key: SPARK-21352 > URL: https://issues.apache.org/jira/browse/SPARK-21352 > Project: Spark > Issue Type: Improvement > Components: DStreams, Spark Submit, YARN >Affects Versions: 2.1.1 >Reporter: Shubham Gupta > Labels: newbie > > I am trying to figure out the memory used by executors for a Spark Streaming > job. For the data I am using the REST endpoint for Spark AllExecutors, summing up the metric totalDuration * spark.executor.memory for every > executor, and then emitting the final sum as the memory usage. > But this comes out to be very small for an application which ran the whole day; is something wrong with the logic? Also, I am using dynamic allocation and > executorIdleTimeout is 5 seconds. > I am also assuming that if some executor was removed due to the idle > timeout and then allocated to some other task, its totalDuration will > be increased by the amount of time the executor took to execute the new > task. > https://stackoverflow.com/questions/44995212/spark-streaming-memory-usage-doubts -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-21352) Memory Usage in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-21352. - > Memory Usage in Spark Streaming > --- > > Key: SPARK-21352 > URL: https://issues.apache.org/jira/browse/SPARK-21352 > Project: Spark > Issue Type: Improvement > Components: DStreams, Spark Submit, YARN >Affects Versions: 2.1.1 >Reporter: Shubham Gupta > Labels: newbie > > I am trying to figure out the memory used by executors for a Spark Streaming > job. For the data I am using the REST endpoint for Spark AllExecutors, summing up the metric totalDuration * spark.executor.memory for every > executor, and then emitting the final sum as the memory usage. > But this comes out to be very small for an application which ran the whole day; is something wrong with the logic? Also, I am using dynamic allocation and > executorIdleTimeout is 5 seconds. > I am also assuming that if some executor was removed due to the idle > timeout and then allocated to some other task, its totalDuration will > be increased by the amount of time the executor took to execute the new > task. > https://stackoverflow.com/questions/44995212/spark-streaming-memory-usage-doubts -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21352) Memory Usage in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21352. --- Resolution: Invalid
[jira] [Resolved] (SPARK-21352) Memory Usage in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21352. --- Resolution: Fixed That does not imply this is the right place. This is for proposing changes and diagnosing specific bugs. Do not reopen JIRAs.
[jira] [Updated] (SPARK-21352) Memory Usage in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Gupta updated SPARK-21352: -- Description: I am trying to figure out the memory used by executors for a Spark Streaming job. For data I am using the REST endpoint for Spark AllExecutors, summing up the metric totalDuration * spark.executor.memory for every executor, and then emitting the final sum as the memory usage. But this comes out very small for an application that ran the whole day; is something wrong with the logic? Also, I am using dynamic allocation and executorIdleTimeout is 5 seconds. I am also assuming that if some executor was removed due to idle timeout and then was allocated to some other task, its totalDuration will be increased by the amount of time taken by the executor to execute this new task. https://stackoverflow.com/questions/44995212/spark-streaming-memory-usage-doubts was: I am trying to figure out the memory used by executors for a Spark Streaming job. For data I am using the REST endpoint for Spark AllExecutors, summing up the metric totalDuration * spark.executor.memory for every executor, and then emitting the final sum as the memory usage. But this comes out very small for an application that ran the whole day; is something wrong with the logic? Also, I am using dynamic allocation and executorIdleTimeout is 5 seconds. I am also assuming that if some executor was removed due to idle timeout and then was allocated to some other task, its totalDuration will be increased by the amount of time taken by the executor to execute this new task.
[jira] [Reopened] (SPARK-21352) Memory Usage in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Gupta reopened SPARK-21352: --- There is no solution provided for the problem, and Stack Overflow is not helping either.
[jira] [Updated] (SPARK-21083) Store zero size and row count after analyzing empty table
[ https://issues.apache.org/jira/browse/SPARK-21083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-21083: Fix Version/s: 2.2.1 > Store zero size and row count after analyzing empty table > - > > Key: SPARK-21083 > URL: https://issues.apache.org/jira/browse/SPARK-21083 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 2.3.0 > Reporter: Zhenhua Wang > Assignee: Zhenhua Wang > Fix For: 2.2.1, 2.3.0
[jira] [Commented] (SPARK-18016) Code Generation: Constant Pool Past Limit for Wide/Nested Dataset
[ https://issues.apache.org/jira/browse/SPARK-18016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079543#comment-16079543 ] Apache Spark commented on SPARK-18016: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/18579 > Code Generation: Constant Pool Past Limit for Wide/Nested Dataset > - > > Key: SPARK-18016 > URL: https://issues.apache.org/jira/browse/SPARK-18016 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Aleksander Eskilson >Assignee: Aleksander Eskilson > Fix For: 2.3.0 > > > When attempting to encode collections of large Java objects to Datasets > having very wide or deeply nested schemas, code generation can fail, yielding: > {code} > Caused by: org.codehaus.janino.JaninoRuntimeException: Constant pool for > class > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection > has grown past JVM limit of 0x > at > org.codehaus.janino.util.ClassFile.addToConstantPool(ClassFile.java:499) > at > org.codehaus.janino.util.ClassFile.addConstantNameAndTypeInfo(ClassFile.java:439) > at > org.codehaus.janino.util.ClassFile.addConstantMethodrefInfo(ClassFile.java:358) > at > org.codehaus.janino.UnitCompiler.writeConstantMethodrefInfo(UnitCompiler.java:4) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4547) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at 
org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) > at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377) > at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) > at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) > at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374) > at > org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369) > at > org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) > at >
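For context on the "grown past JVM limit of 0x" error above: class files index their constant pool with an unsigned 16-bit value, so a single class is capped at 0xFFFF entries. A rough sketch of why very wide generated projections can exceed the cap; the per-field entry count here is a hypothetical illustration, not a measured constant:

```python
# JVM class files address the constant pool with a u2 index, so one class
# can hold at most 0xFFFF (65535) constant-pool entries.
JVM_CONSTANT_POOL_LIMIT = 0xFFFF

def estimated_pool_entries(num_fields, entries_per_field=5):
    """Assumed cost model: each generated field/accessor contributes several
    pool entries (UTF-8 name, descriptor, NameAndType, Fieldref, ...).
    The factor of 5 is hypothetical, for illustration only."""
    return num_fields * entries_per_field

# A very wide or deeply nested schema easily exceeds the cap under this model.
print(estimated_pool_entries(20000) > JVM_CONSTANT_POOL_LIMIT)  # True
```

Under this (assumed) model a schema with tens of thousands of generated fields cannot compile into a single class, which is consistent with the janino failure quoted above.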
[jira] [Resolved] (SPARK-21352) Memory Usage in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21352. --- Resolution: Invalid Please point questions to StackOverflow or the mailing list.
[jira] [Commented] (SPARK-21332) Incorrect result type inferred for some decimal expressions
[ https://issues.apache.org/jira/browse/SPARK-21332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079525#comment-16079525 ] Anton Okolnychyi commented on SPARK-21332: -- I know the root cause and will submit a PR soon. > Incorrect result type inferred for some decimal expressions > --- > > Key: SPARK-21332 > URL: https://issues.apache.org/jira/browse/SPARK-21332 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.0.2 > Reporter: Alexander Shkapsky > > Decimal expressions do not always follow the type inference rules explained > in DecimalPrecision.scala. An incorrect result type is produced when the > expression contains more than two decimals. > For example: > spark-sql> CREATE TABLE Decimals(decimal_26_6 DECIMAL(26,6)); > ... > spark-sql> describe decimals; > ... > decimal_26_6 decimal(26,6) NULL > spark-sql> explain select decimal_26_6 * decimal_26_6 from decimals; > ... > == Physical Plan == > *Project [CheckOverflow((decimal_26_6#99 * decimal_26_6#99), > DecimalType(38,12)) AS (decimal_26_6 * decimal_26_6)#100] > +- HiveTableScan [decimal_26_6#99], MetastoreRelation default, decimals > However: > spark-sql> explain select decimal_26_6 * decimal_26_6 * decimal_26_6 from > decimals; > ... > == Physical Plan == > *Project [CheckOverflow((cast(CheckOverflow((decimal_26_6#104 * > decimal_26_6#104), DecimalType(38,12)) as decimal(26,6)) * decimal_26_6#104), > DecimalType(38,12)) AS ((decimal_26_6 * decimal_26_6) * decimal_26_6)#105] > +- HiveTableScan [decimal_26_6#104], MetastoreRelation default, decimals > The expected result type is DecimalType(38,18).
> In Hive 1.1.0: > hive> explain select decimal_26_6 * decimal_26_6 from decimals; > OK > STAGE DEPENDENCIES: > Stage-0 is a root stage > STAGE PLANS: > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > TableScan > alias: decimals > Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column > stats: NONE > Select Operator > expressions: (decimal_26_6 * decimal_26_6) (type: decimal(38,12)) > outputColumnNames: _col0 > Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column > stats: NONE > ListSink > Time taken: 0.772 seconds, Fetched: 17 row(s) > hive> explain select decimal_26_6 * decimal_26_6 * decimal_26_6 from decimals; > OK > STAGE DEPENDENCIES: > Stage-0 is a root stage > STAGE PLANS: > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > TableScan > alias: decimals > Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column > stats: NONE > Select Operator > expressions: ((decimal_26_6 * decimal_26_6) * decimal_26_6) > (type: decimal(38,18)) > outputColumnNames: _col0 > Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column > stats: NONE > ListSink > Time taken: 0.064 seconds, Fetched: 17 row(s)
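A quick way to see why DecimalType(38,18) is the expected result type: under exact decimal arithmetic, the scale of a product is the sum of the operands' scales, so three scale-6 operands need scale 18. A minimal sketch with Python's decimal module (illustration of the arithmetic only, not Spark's DecimalPrecision code):

```python
from decimal import Decimal, getcontext

getcontext().prec = 50  # enough precision to keep the products exact

def scale(d):
    # Number of fractional digits of an exact Decimal value.
    return -d.as_tuple().exponent

a = Decimal("1.000001")   # scale 6, like a DECIMAL(26,6) value
print(scale(a * a))       # 12 -> matches DecimalType(38,12) for two operands
print(scale(a * a * a))   # 18 -> three operands need scale 18, hence (38,18)
```

Spark's intermediate cast back to decimal(26,6) in the physical plan above is what discards the extra fractional digits and yields (38,12) instead.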
[jira] [Updated] (SPARK-21352) Memory Usage in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-21352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubham Gupta updated SPARK-21352: -- Issue Type: Improvement (was: Bug)
[jira] [Created] (SPARK-21352) Memory Usage in Spark Streaming
Shubham Gupta created SPARK-21352: - Summary: Memory Usage in Spark Streaming Key: SPARK-21352 URL: https://issues.apache.org/jira/browse/SPARK-21352 Project: Spark Issue Type: Bug Components: DStreams, Spark Submit, YARN Affects Versions: 2.1.1 Reporter: Shubham Gupta I am trying to figure out the memory used by executors for a Spark Streaming job. For data I am using the REST endpoint for Spark AllExecutors, summing up the metric totalDuration * spark.executor.memory for every executor, and then emitting the final sum as the memory usage. But this comes out very small for an application that ran the whole day; is something wrong with the logic? Also, I am using dynamic allocation and executorIdleTimeout is 5 seconds. I am also assuming that if some executor was removed due to idle timeout and then was allocated to some other task, its totalDuration will be increased by the amount of time taken by the executor to execute this new task.
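The computation the reporter describes can be sketched as below. The sample records are hypothetical stand-ins for the AllExecutors REST response; note that totalDuration counts only time spent executing tasks, not executor lifetime, which is one plausible reason the sum comes out small:

```python
def estimate_memory_usage(executors, executor_memory_mb):
    """Sum totalDuration (ms) * configured executor memory (MB) over all
    executors, as described in the report. The result is in MB*ms."""
    return sum(e["totalDuration"] * executor_memory_mb for e in executors)

# Hypothetical sample shaped like entries from the AllExecutors endpoint.
sample = [
    {"id": "1", "totalDuration": 120000},  # 2 minutes of task time
    {"id": "2", "totalDuration": 60000},   # 1 minute of task time
]
print(estimate_memory_usage(sample, 4096))  # 737280000 MB*ms
```

In a real run the executor list would come from the monitoring REST API rather than a literal, and an idle executor would accumulate no totalDuration at all, so the estimate systematically undercounts reserved memory.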
[jira] [Commented] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel
[ https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079507#comment-16079507 ] Yan Facai (颜发才) commented on SPARK-21341: - Yes, [~sowen] is right. Why not use the save and load methods? > Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel > - > > Key: SPARK-21341 > URL: https://issues.apache.org/jira/browse/SPARK-21341 > Project: Spark > Issue Type: Bug > Components: ML > Affects Versions: 2.1.1 > Reporter: Zied Sellami > > I am using sparkContext.saveAsObjectFile to save a complex object containing a > pipelineModel with a Word2Vec ML Transformer. When I load the object and call > myPipelineModel.transform, Word2VecModel raises a null pointer error at line > 292 of Word2Vec.scala, "wordVectors.getVectors". I resolved the problem by > removing the @transient annotation on val wordVectors and the @transient lazy > val on the getVectors function. > - Why are these 2 vals transient? > - Any solution to add a boolean flag on the Word2Vec Transformer to force > the serialization of wordVectors?
[jira] [Commented] (SPARK-21083) Store zero size and row count after analyzing empty table
[ https://issues.apache.org/jira/browse/SPARK-21083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079493#comment-16079493 ] Apache Spark commented on SPARK-21083: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/18577
[jira] [Assigned] (SPARK-21351) Update nullability based on children's output in optimized logical plan
[ https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21351: Assignee: (was: Apache Spark) > Update nullability based on children's output in optimized logical plan > --- > > Key: SPARK-21351 > URL: https://issues.apache.org/jira/browse/SPARK-21351 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Takeshi Yamamuro >Priority: Minor > > In the master, optimized plans do not respect the nullability that `Filter` > might change when having `IsNotNull`. > This generates unnecessary code for NULL checks. For example: > {code} > scala> val df = Seq((Some(1), Some(2))).toDF("a", "b") > scala> val bIsNotNull = df.where($"b" =!= 2).select($"b") > scala> val targetQuery = bIsNotNull.distinct > scala> val targetQuery.queryExecution.optimizedPlan.output(0).nullable > res5: Boolean = true > scala> targetQuery.debugCodegen > Found 2 WholeStageCodegen subtrees. > == Subtree 1 / 2 == > *HashAggregate(keys=[b#19], functions=[], output=[b#19]) > +- Exchange hashpartitioning(b#19, 200) >+- *HashAggregate(keys=[b#19], functions=[], output=[b#19]) > +- *Project [_2#16 AS b#19] > +- *Filter isnotnull(_2#16) > +- LocalTableScan [_1#15, _2#16] > Generated code: > ... > /* 124 */ protected void processNext() throws java.io.IOException { > ... > /* 132 */ // output the result > /* 133 */ > /* 134 */ while (agg_mapIter.next()) { > /* 135 */ wholestagecodegen_numOutputRows.add(1); > /* 136 */ UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey(); > /* 137 */ UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue(); > /* 138 */ > /* 139 */ boolean agg_isNull4 = agg_aggKey.isNullAt(0); > /* 140 */ int agg_value4 = agg_isNull4 ? 
-1 : (agg_aggKey.getInt(0)); > /* 141 */ agg_rowWriter1.zeroOutNullBytes(); > /* 142 */ > // We don't need this NULL check because NULL is filtered out > in `$"b" =!=2` > /* 143 */ if (agg_isNull4) { > /* 144 */ agg_rowWriter1.setNullAt(0); > /* 145 */ } else { > /* 146 */ agg_rowWriter1.write(0, agg_value4); > /* 147 */ } > /* 148 */ append(agg_result1); > /* 149 */ > /* 150 */ if (shouldStop()) return; > /* 151 */ } > /* 152 */ > /* 153 */ agg_mapIter.close(); > /* 154 */ if (agg_sorter == null) { > /* 155 */ agg_hashMap.free(); > /* 156 */ } > /* 157 */ } > /* 158 */ > /* 159 */ } > {code} > In the line 143, we don't need this NULL check because NULL is filtered out > in `$"b" =!=2`.
[jira] [Assigned] (SPARK-21351) Update nullability based on children's output in optimized logical plan
[ https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21351: Assignee: Apache Spark
[jira] [Commented] (SPARK-21351) Update nullability based on children's output in optimized logical plan
[ https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079480#comment-16079480 ] Apache Spark commented on SPARK-21351: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/18576
[jira] [Commented] (SPARK-21083) Store zero size and row count after analyzing empty table
[ https://issues.apache.org/jira/browse/SPARK-21083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16079474#comment-16079474 ] Apache Spark commented on SPARK-21083: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/18575
[jira] [Updated] (SPARK-21351) Update nullability based on children's output in optimized logical plan
[ https://issues.apache.org/jira/browse/SPARK-21351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takeshi Yamamuro updated SPARK-21351:
-------------------------------------
    Description:
In the master, optimized plans do not respect the nullability that a `Filter` with `IsNotNull` can narrow. This generates unnecessary code for NULL checks. For example:
{code}
scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
scala> val targetQuery = bIsNotNull.distinct
scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
res5: Boolean = true

scala> targetQuery.debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
*HashAggregate(keys=[b#19], functions=[], output=[b#19])
+- Exchange hashpartitioning(b#19, 200)
   +- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
      +- *Project [_2#16 AS b#19]
         +- *Filter isnotnull(_2#16)
            +- LocalTableScan [_1#15, _2#16]

Generated code:
...
/* 124 */   protected void processNext() throws java.io.IOException {
...
/* 132 */     // output the result
/* 133 */
/* 134 */     while (agg_mapIter.next()) {
/* 135 */       wholestagecodegen_numOutputRows.add(1);
/* 136 */       UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
/* 137 */       UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
/* 138 */
/* 139 */       boolean agg_isNull4 = agg_aggKey.isNullAt(0);
/* 140 */       int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
/* 141 */       agg_rowWriter1.zeroOutNullBytes();
/* 142 */       // We don't need this NULL check because NULL is filtered out in `$"b" =!= 2`
/* 143 */       if (agg_isNull4) {
/* 144 */         agg_rowWriter1.setNullAt(0);
/* 145 */       } else {
/* 146 */         agg_rowWriter1.write(0, agg_value4);
/* 147 */       }
/* 148 */       append(agg_result1);
/* 149 */
/* 150 */       if (shouldStop()) return;
/* 151 */     }
/* 152 */
/* 153 */     agg_mapIter.close();
/* 154 */     if (agg_sorter == null) {
/* 155 */       agg_hashMap.free();
/* 156 */     }
/* 157 */   }
/* 158 */
/* 159 */ }
{code}
At line 143 of the generated code, the NULL check is unnecessary because NULLs are already filtered out by `$"b" =!= 2`.

> Update nullability based on children's output in optimized logical plan
> -----------------------------------------------------------------------
>
>                 Key: SPARK-21351
>                 URL: https://issues.apache.org/jira/browse/SPARK-21351
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.1
>            Reporter: Takeshi Yamamuro
>            Priority: Minor
>
> In the master, optimized plans do not respect the nullability that a `Filter`
> with `IsNotNull` can narrow.
> This generates unnecessary code for NULL checks. For example:
> {code}
> scala> val df =
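The rule the ticket asks for can be sketched outside Spark as a tiny plan model: walk the optimized plan bottom-up and overwrite each attribute's nullability with what the children's output reports. The following is a minimal, hypothetical Python model, not Spark's actual Catalyst classes; every name in it (`Attr`, `FilterNotNull`, `update_nullability`, etc.) is invented for illustration.

```python
# Minimal model of "update nullability based on children's output".
# NOT Spark's real classes; names are hypothetical.
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int
    nullable: bool

@dataclass(frozen=True)
class Scan:
    attrs: List[Attr]
    def output(self) -> List[Attr]:
        return self.attrs

@dataclass(frozen=True)
class FilterNotNull:
    # Models Filter(IsNotNull(attr)): its output narrows attr's nullability.
    attr: Attr
    child: object
    def output(self) -> List[Attr]:
        return [replace(a, nullable=False) if a.expr_id == self.attr.expr_id else a
                for a in self.child.output()]

@dataclass(frozen=True)
class Project:
    # Models the reported bug: the project list keeps its original (stale)
    # nullability instead of picking up the narrowing done by the Filter below.
    project_list: List[Attr]
    child: object
    def output(self) -> List[Attr]:
        return self.project_list

def update_nullability(plan):
    """Proposed rule: propagate children's output nullability bottom-up."""
    if isinstance(plan, Project):
        child = update_nullability(plan.child)
        by_id = {a.expr_id: a.nullable for a in child.output()}
        new_list = [replace(a, nullable=by_id.get(a.expr_id, a.nullable))
                    for a in plan.project_list]
        return Project(new_list, child)
    if isinstance(plan, FilterNotNull):
        return FilterNotNull(plan.attr, update_nullability(plan.child))
    return plan

# Mirrors the ticket's example: Project [_2#16 AS b#19] over Filter isnotnull(_2#16).
b = Attr("b", 16, nullable=True)
plan = Project([b], FilterNotNull(b, Scan([b])))
print(plan.output()[0].nullable)                      # True: stale, as reported
print(update_nullability(plan).output()[0].nullable)  # False: after the rule
```

With the stale nullability (`True`), whole-stage codegen has to emit the `if (agg_isNull4)` branch shown above; a rule like this would let it skip the NULL check entirely.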
[jira] [Created] (SPARK-21351) Update nullability based on children's output in optimized logical plan
Takeshi Yamamuro created SPARK-21351:
------------------------------------

             Summary: Update nullability based on children's output in optimized logical plan
                 Key: SPARK-21351
                 URL: https://issues.apache.org/jira/browse/SPARK-21351
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.1.1
            Reporter: Takeshi Yamamuro
            Priority: Minor


In the master, optimized plans do not respect the nullability that a `Filter` with `IsNotNull` can narrow. This generates unnecessary code for NULL checks. For example:
{code}
scala> val df = Seq((Some(1), Some(2))).toDF("a", "b")
scala> val bIsNotNull = df.where($"b" =!= 2).select($"b")
scala> val targetQuery = bIsNotNull.distinct
scala> targetQuery.queryExecution.optimizedPlan.output(0).nullable
res5: Boolean = true

scala> targetQuery.debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 ==
*HashAggregate(keys=[b#19], functions=[], output=[b#19])
+- Exchange hashpartitioning(b#19, 200)
   +- *HashAggregate(keys=[b#19], functions=[], output=[b#19])
      +- *Project [_2#16 AS b#19]
         +- *Filter isnotnull(_2#16)
            +- LocalTableScan [_1#15, _2#16]

Generated code:
...
/* 124 */   protected void processNext() throws java.io.IOException {
...
/* 132 */     // output the result
/* 133 */
/* 134 */     while (agg_mapIter.next()) {
/* 135 */       wholestagecodegen_numOutputRows.add(1);
/* 136 */       UnsafeRow agg_aggKey = (UnsafeRow) agg_mapIter.getKey();
/* 137 */       UnsafeRow agg_aggBuffer = (UnsafeRow) agg_mapIter.getValue();
/* 138 */
/* 139 */       boolean agg_isNull4 = agg_aggKey.isNullAt(0);
/* 140 */       int agg_value4 = agg_isNull4 ? -1 : (agg_aggKey.getInt(0));
/* 141 */       agg_rowWriter1.zeroOutNullBytes();
/* 142 */       // We don't need this NULL check because NULL is filtered out in `$"b" =!= 2`
/* 143 */       if (agg_isNull4) {
/* 144 */         agg_rowWriter1.setNullAt(0);
/* 145 */       } else {
/* 146 */         agg_rowWriter1.write(0, agg_value4);
/* 147 */       }
/* 148 */       append(agg_result1);
/* 149 */
/* 150 */       if (shouldStop()) return;
/* 151 */     }
/* 152 */
/* 153 */     agg_mapIter.close();
/* 154 */     if (agg_sorter == null) {
/* 155 */       agg_hashMap.free();
/* 156 */     }
/* 157 */   }
/* 158 */
/* 159 */ }
{code}
At line 143 of the generated code, the NULL check is unnecessary because NULLs are already filtered out by `$"b" =!= 2`.