[jira] [Updated] (SPARK-26576) Broadcast hint not applied to partitioned table
     [ https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Zhuge updated SPARK-26576:
-------------------------------
    Summary: Broadcast hint not applied to partitioned table  (was: Broadcast hint not applied to partitioned Parquet table)

> Broadcast hint not applied to partitioned table
> -----------------------------------------------
>
>                  Key: SPARK-26576
>                  URL: https://issues.apache.org/jira/browse/SPARK-26576
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 2.2.2, 2.3.2, 2.4.0
>             Reporter: John Zhuge
>             Priority: Major
>
> The broadcast hint is not applied to a partitioned Parquet table. Below, "SortMergeJoin" is chosen incorrectly and "ResolvedHint (broadcast)" is dropped from the optimized plan.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_with_part (val STRING) PARTITIONED BY (dateint INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_with_part`
> :  +- Relation[val#28,dateint#29] parquet
> +- ResolvedHint (broadcast)
>    +- SubqueryAlias `jzhuge`.`parquet_with_part`
>       +- Relation[val#32,dateint#33] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>    :- SubqueryAlias `jzhuge`.`parquet_with_part`
>    :  +- Relation[val#28,dateint#29] parquet
>    +- ResolvedHint (broadcast)
>       +- SubqueryAlias `jzhuge`.`parquet_with_part`
>          +- Relation[val#32,dateint#33] parquet
> == Optimized Logical Plan ==
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>    :- Project [val#28, dateint#29]
>    :  +- Filter isnotnull(dateint#29)
>    :     +- Relation[val#28,dateint#29] parquet
>    +- Project [val#32, dateint#33]
>       +- Filter isnotnull(dateint#33)
>          +- Relation[val#32,dateint#33] parquet
> == Physical Plan ==
> *(5) Project [dateint#29, val#28, val#32]
> +- *(5) SortMergeJoin [dateint#29], [dateint#33], Inner
>    :- *(2) Sort [dateint#29 ASC NULLS FIRST], false, 0
>    :  +- Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, 500), coordinator[target post-shuffle partition size: 67108864]
>    :     +- *(1) FileScan parquet jzhuge.parquet_with_part[val#28,dateint#29] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dateint#29)], PushedFilters: [], ReadSchema: struct
>    +- *(4) Sort [dateint#33 ASC NULLS FIRST], false, 0
>       +- ReusedExchange [val#32, dateint#33], Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, 500), coordinator[target post-shuffle partition size: 67108864]
> {noformat}
> The broadcast hint is applied to a Parquet table without partitions. Below, "BroadcastHashJoin" is chosen as expected.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_no_part (val STRING, dateint INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_no_part")).map(df => df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_no_part`
> :  +- Relation[val#44,dateint#45] parquet
> +- ResolvedHint (broadcast)
>    +- SubqueryAlias `jzhuge`.`parquet_no_part`
>       +- Relation[val#50,dateint#51] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>    :- SubqueryAlias `jzhuge`.`parquet_no_part`
>    :  +- Relation[val#44,dateint#45] parquet
>    +- ResolvedHint (broadcast)
>       +- SubqueryAlias `jzhuge`.`parquet_no_part`
>          +- Relation[val#50,dateint#51] parquet
> == Optimized Logical Plan ==
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>    :- Filter isnotnull(dateint#45)
>    :  +- Relation[val#44,dateint#45] parquet
>    +- ResolvedHint (broadcast)
>       +- Filter isnotnull(dateint#51)
>          +- Relation[val#50,dateint#51] parquet
> == Physical Plan ==
> *(2) Project [dateint#45, val#44, val#50]
> +- *(2) BroadcastHashJoin [dateint#45], [dateint#51], Inner, BuildRight
>    :- *(2) Project [val#44, dateint#45]
>    :  +- *(2) Filter isnotnull(dateint#45)
>    :     +- *(2) FileScan parquet jzhuge.parquet_no_part[val#44,dateint#45] Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(dateint)], ReadSchema: struct
>    +- BroadcastExchange
[jira] [Commented] (SPARK-26491) Use ConfigEntry for hardcoded configs for test categories.
     [ https://issues.apache.org/jira/browse/SPARK-26491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739167#comment-16739167 ]

Dongjoon Hyun commented on SPARK-26491:
---------------------------------------
The broken K8S integration compilation is fixed via https://github.com/apache/spark/pull/23505 .

> Use ConfigEntry for hardcoded configs for test categories.
> ----------------------------------------------------------
>
>                  Key: SPARK-26491
>                  URL: https://issues.apache.org/jira/browse/SPARK-26491
>              Project: Spark
>           Issue Type: Sub-task
>           Components: Spark Core
>     Affects Versions: 3.0.0
>             Reporter: Takuya Ueshin
>             Assignee: Marco Gaido
>             Priority: Major
>              Fix For: 3.0.0
>
> Make the following hardcoded configs use ConfigEntry.
> {code}
> spark.test
> spark.testing
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles
     [ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739083#comment-16739083 ]

deshanxiao commented on SPARK-26570:
------------------------------------
[~hyukjin.kwon] OK, I will try it. Thank you!

> Out of memory when InMemoryFileIndex bulkListLeafFiles
> ------------------------------------------------------
>
>                  Key: SPARK-26570
>                  URL: https://issues.apache.org/jira/browse/SPARK-26570
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 2.3.2
>             Reporter: deshanxiao
>             Priority: Major
>          Attachments: screenshot-1.png
>
> *bulkListLeafFiles* collects every FileStatus in memory for each query, which may cause an OOM on the driver. I hit this problem on Spark 2.3.2; the latest version may have the same problem.
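The memory behavior described in SPARK-26570 can be illustrated outside Spark. This is a hedged, Spark-free Python sketch (function names are illustrative, not Spark's Scala implementation): an eager listing holds every file status at once, while a generator yields one status at a time.

```python
import os

def list_leaf_files_eager(root):
    """Collect every (path, size) pair into one list, like bulkListLeafFiles:
    peak memory grows with the total number of files."""
    statuses = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            statuses.append((path, os.stat(path).st_size))
    return statuses

def list_leaf_files_lazy(root):
    """Yield one (path, size) pair at a time; peak memory stays constant
    regardless of how many files are under root."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            yield (path, os.stat(path).st_size)
```

Both walks visit the same files; only the eager variant materializes the whole listing in driver-side memory at once, which is the shape of the OOM reported here.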
[jira] [Created] (SPARK-26586) Streaming queries should have isolated SparkSessions and confs
Mukul Murthy created SPARK-26586:
------------------------------------

             Summary: Streaming queries should have isolated SparkSessions and confs
                 Key: SPARK-26586
                 URL: https://issues.apache.org/jira/browse/SPARK-26586
             Project: Spark
          Issue Type: Bug
          Components: SQL, Structured Streaming
    Affects Versions: 2.4.0, 2.3.0
            Reporter: Mukul Murthy


When a stream is started, the stream's config is supposed to be frozen, and all batches run with the config captured at start time. However, due to a race condition in stream creation, updating a conf value in the active Spark session immediately after starting a stream can leak the updated value into the stream. The problem is that when StreamingQueryManager creates a MicroBatchExecution (or ContinuousExecution), it passes in the shared Spark session, and that session is not cloned until StreamExecution.start() is called. DataStreamWriter.start() should not return until the SparkSession is cloned.
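The race described above can be modeled without Spark. This is a toy Python sketch under stated assumptions (the `StreamingQuery` class, `allow_clone`, and the conf dict are all illustrative, not Spark APIs): if the config snapshot is taken on the worker thread after `start()` returns, a caller's update slips in; snapshotting before `start()` returns closes the window.

```python
import copy
import threading

class StreamingQuery:
    """Toy model of the SPARK-26586 race: the session conf should be
    frozen before start() returns, not lazily on the worker thread."""

    def __init__(self, session_conf, clone_before_start):
        self._shared = session_conf
        self._may_clone = threading.Event()
        # Fixed behavior: snapshot the conf before control returns to caller.
        self.conf = copy.deepcopy(session_conf) if clone_before_start else None

    def start(self):
        t = threading.Thread(target=self._run)
        t.start()
        return t

    def allow_clone(self):
        """Stand-in for the scheduling delay before the worker runs."""
        self._may_clone.set()

    def _run(self):
        if self.conf is None:
            # Buggy behavior: clone happens after the caller may have
            # already mutated the shared conf.
            self._may_clone.wait()
            self.conf = copy.deepcopy(self._shared)
```

With the lazy clone, a conf update made "immediately after starting a stream" is visible to the stream; with the eager clone it is not, which is the fix the report proposes.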
[jira] [Commented] (SPARK-26570) Out of memory when InMemoryFileIndex bulkListLeafFiles
     [ https://issues.apache.org/jira/browse/SPARK-26570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738955#comment-16738955 ]

Hyukjin Kwon commented on SPARK-26570:
--------------------------------------
Would you be able to test this in a newer version of Spark?

> Out of memory when InMemoryFileIndex bulkListLeafFiles
> ------------------------------------------------------
>
>                  Key: SPARK-26570
>                  URL: https://issues.apache.org/jira/browse/SPARK-26570
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 2.3.2
>             Reporter: deshanxiao
>             Priority: Major
>          Attachments: screenshot-1.png
>
> *bulkListLeafFiles* collects every FileStatus in memory for each query, which may cause an OOM on the driver. I hit this problem on Spark 2.3.2; the latest version may have the same problem.
[jira] [Resolved] (SPARK-26574) Cloud sql stronge
     [ https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-26574.
----------------------------------
    Resolution: Invalid

> Cloud sql stronge
> -----------------
>
>                  Key: SPARK-26574
>                  URL: https://issues.apache.org/jira/browse/SPARK-26574
>              Project: Spark
>           Issue Type: Bug
>           Components: Kubernetes, Mesos, SQL
>     Affects Versions: 2.3.2
>             Reporter: Roufique Hossain
>             Priority: Major
>    Original Estimate: 8,509h
>   Remaining Estimate: 8,509h
>
[jira] [Updated] (SPARK-26574) Cloud sql stronge
     [ https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-26574:
---------------------------------
    Fix Version/s:     (was: 0.8.2)

> Cloud sql stronge
> -----------------
>
>                  Key: SPARK-26574
>                  URL: https://issues.apache.org/jira/browse/SPARK-26574
>              Project: Spark
>           Issue Type: Bug
>           Components: jenkins, Kubernetes, Mesos, SQL
>     Affects Versions: 2.3.2
>             Reporter: Roufique Hossain
>             Priority: Major
>               Labels: PA
>    Original Estimate: 8,509h
>   Remaining Estimate: 8,509h
>
[jira] [Commented] (SPARK-26585) [K8S] Add additional integration tests for K8s Scheduler Backend
     [ https://issues.apache.org/jira/browse/SPARK-26585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738956#comment-16738956 ]

Nagaram Prasad Addepally commented on SPARK-26585:
--------------------------------------------------
https://github.com/apache/spark/pull/23504

> [K8S] Add additional integration tests for K8s Scheduler Backend
> ----------------------------------------------------------------
>
>                  Key: SPARK-26585
>                  URL: https://issues.apache.org/jira/browse/SPARK-26585
>              Project: Spark
>           Issue Type: Test
>           Components: Kubernetes
>     Affects Versions: 3.0.0
>             Reporter: Nagaram Prasad Addepally
>             Priority: Major
>
> I have reviewed the Kubernetes integration tests and found that the following cases are missing for testing scheduler backend functionality:
> * Run application with driver and executor image specified independently
> * Request Pods with custom CPU and Limits
> * Request Pods with custom Memory and memory overhead factor
> * Request Pods with custom Memory and memory overhead
> * Pods are relaunched on failures (as per spark.kubernetes.executor.lostCheck.maxAttempts)
> Logging this Jira to add these tests.
[jira] [Assigned] (SPARK-26585) [K8S] Add additional integration tests for K8s Scheduler Backend
     [ https://issues.apache.org/jira/browse/SPARK-26585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26585:
------------------------------------
    Assignee:     (was: Apache Spark)

> [K8S] Add additional integration tests for K8s Scheduler Backend
> ----------------------------------------------------------------
>
>                  Key: SPARK-26585
>                  URL: https://issues.apache.org/jira/browse/SPARK-26585
>              Project: Spark
>           Issue Type: Test
>           Components: Kubernetes
>     Affects Versions: 3.0.0
>             Reporter: Nagaram Prasad Addepally
>             Priority: Major
>
> I have reviewed the Kubernetes integration tests and found that the following cases are missing for testing scheduler backend functionality:
> * Run application with driver and executor image specified independently
> * Request Pods with custom CPU and Limits
> * Request Pods with custom Memory and memory overhead factor
> * Request Pods with custom Memory and memory overhead
> * Pods are relaunched on failures (as per spark.kubernetes.executor.lostCheck.maxAttempts)
> Logging this Jira to add these tests.
[jira] [Assigned] (SPARK-26585) [K8S] Add additional integration tests for K8s Scheduler Backend
     [ https://issues.apache.org/jira/browse/SPARK-26585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-26585:
------------------------------------
    Assignee: Apache Spark

> [K8S] Add additional integration tests for K8s Scheduler Backend
> ----------------------------------------------------------------
>
>                  Key: SPARK-26585
>                  URL: https://issues.apache.org/jira/browse/SPARK-26585
>              Project: Spark
>           Issue Type: Test
>           Components: Kubernetes
>     Affects Versions: 3.0.0
>             Reporter: Nagaram Prasad Addepally
>             Assignee: Apache Spark
>             Priority: Major
>
> I have reviewed the Kubernetes integration tests and found that the following cases are missing for testing scheduler backend functionality:
> * Run application with driver and executor image specified independently
> * Request Pods with custom CPU and Limits
> * Request Pods with custom Memory and memory overhead factor
> * Request Pods with custom Memory and memory overhead
> * Pods are relaunched on failures (as per spark.kubernetes.executor.lostCheck.maxAttempts)
> Logging this Jira to add these tests.
[jira] [Commented] (SPARK-26579) SparkML DecisionTree, how does the algorithm identify categorical features?
     [ https://issues.apache.org/jira/browse/SPARK-26579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738954#comment-16738954 ]

Hyukjin Kwon commented on SPARK-26579:
--------------------------------------
Let's ask questions on the mailing list rather than filing a JIRA here. You could get a better answer there.

> SparkML DecisionTree, how does the algorithm identify categorical features?
> ---------------------------------------------------------------------------
>
>                  Key: SPARK-26579
>                  URL: https://issues.apache.org/jira/browse/SPARK-26579
>              Project: Spark
>           Issue Type: Question
>           Components: ML
>     Affects Versions: 2.4.0
>          Environment: os: Centos7
> software: pyspark.
>             Reporter: Xufeng Wang
>             Priority: Major
>
> I am confused about the decision tree and other tree-based models. My current project involves data with both nominal and continuous features. I have converted the nominal data to continuous values using the StringIndexer transformer from the ml.feature module, then vector-assembled all the feature values into a vector-type column named "features". The feature values, as I see them, are all of double datatype.
> While I kept getting the "maxBins should be larger than the largest number for all categorical features" error, and corrected the maxBins size accordingly, I still see some features (continuous from the beginning) with values bigger than my maxBins size. Since the pipeline works with a maxBins that is not bigger than some continuous values, the algorithm apparently picks automatically which features are categorical and which are continuous. But how did it figure out which is which, when all of the features are of double datatype?
> Another question, if anyone can help: what is the tree type for the Spark decision tree? Is it CART or something else?
> Last question: what are the procedures for treating categorical features in tree-based algorithms?
> Thank you in advance.
[jira] [Resolved] (SPARK-26579) SparkML DecisionTree, how does the algorithm identify categorical features?
     [ https://issues.apache.org/jira/browse/SPARK-26579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-26579.
----------------------------------
    Resolution: Invalid

> SparkML DecisionTree, how does the algorithm identify categorical features?
> ---------------------------------------------------------------------------
>
>                  Key: SPARK-26579
>                  URL: https://issues.apache.org/jira/browse/SPARK-26579
>              Project: Spark
>           Issue Type: Question
>           Components: ML
>     Affects Versions: 2.4.0
>          Environment: os: Centos7
> software: pyspark.
>             Reporter: Xufeng Wang
>             Priority: Major
>
> I am confused about the decision tree and other tree-based models. My current project involves data with both nominal and continuous features. I have converted the nominal data to continuous values using the StringIndexer transformer from the ml.feature module, then vector-assembled all the feature values into a vector-type column named "features". The feature values, as I see them, are all of double datatype.
> While I kept getting the "maxBins should be larger than the largest number for all categorical features" error, and corrected the maxBins size accordingly, I still see some features (continuous from the beginning) with values bigger than my maxBins size. Since the pipeline works with a maxBins that is not bigger than some continuous values, the algorithm apparently picks automatically which features are categorical and which are continuous. But how did it figure out which is which, when all of the features are of double datatype?
> Another question, if anyone can help: what is the tree type for the Spark decision tree? Is it CART or something else?
> Last question: what are the procedures for treating categorical features in tree-based algorithms?
> Thank you in advance.
[jira] [Updated] (SPARK-26574) Cloud sql stronge
     [ https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-26574:
---------------------------------
    Flags:     (was: Patch,Important)

> Cloud sql stronge
> -----------------
>
>                  Key: SPARK-26574
>                  URL: https://issues.apache.org/jira/browse/SPARK-26574
>              Project: Spark
>           Issue Type: Bug
>           Components: Kubernetes, Mesos, SQL
>     Affects Versions: 2.3.2
>             Reporter: Roufique Hossain
>             Priority: Major
>    Original Estimate: 8,509h
>   Remaining Estimate: 8,509h
>
[jira] [Updated] (SPARK-26574) Cloud sql stronge
     [ https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-26574:
---------------------------------
    External issue URL:     (was: https://pakegecloud.atlassian.net)

> Cloud sql stronge
> -----------------
>
>                  Key: SPARK-26574
>                  URL: https://issues.apache.org/jira/browse/SPARK-26574
>              Project: Spark
>           Issue Type: Bug
>           Components: Kubernetes, Mesos, SQL
>     Affects Versions: 2.3.2
>             Reporter: Roufique Hossain
>             Priority: Major
>    Original Estimate: 8,509h
>   Remaining Estimate: 8,509h
>
[jira] [Updated] (SPARK-26574) Cloud sql stronge
     [ https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-26574:
---------------------------------
    Labels:     (was: PA)

> Cloud sql stronge
> -----------------
>
>                  Key: SPARK-26574
>                  URL: https://issues.apache.org/jira/browse/SPARK-26574
>              Project: Spark
>           Issue Type: Bug
>           Components: jenkins, Kubernetes, Mesos, SQL
>     Affects Versions: 2.3.2
>             Reporter: Roufique Hossain
>             Priority: Major
>    Original Estimate: 8,509h
>   Remaining Estimate: 8,509h
>
[jira] [Commented] (SPARK-26574) Cloud sql stronge
     [ https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738952#comment-16738952 ]

Hyukjin Kwon commented on SPARK-26574:
--------------------------------------
Please fill the JIRA description, and reopen.

> Cloud sql stronge
> -----------------
>
>                  Key: SPARK-26574
>                  URL: https://issues.apache.org/jira/browse/SPARK-26574
>              Project: Spark
>           Issue Type: Bug
>           Components: Kubernetes, Mesos, SQL
>     Affects Versions: 2.3.2
>             Reporter: Roufique Hossain
>             Priority: Major
>    Original Estimate: 8,509h
>   Remaining Estimate: 8,509h
>
[jira] [Updated] (SPARK-26574) Cloud sql stronge
     [ https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-26574:
---------------------------------
    External issue ID:     (was: roufi...@rtat.net)

> Cloud sql stronge
> -----------------
>
>                  Key: SPARK-26574
>                  URL: https://issues.apache.org/jira/browse/SPARK-26574
>              Project: Spark
>           Issue Type: Bug
>           Components: Kubernetes, Mesos, SQL
>     Affects Versions: 2.3.2
>             Reporter: Roufique Hossain
>             Priority: Major
>    Original Estimate: 8,509h
>   Remaining Estimate: 8,509h
>
[jira] [Updated] (SPARK-26574) Cloud sql stronge
     [ https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-26574:
---------------------------------
    Component/s:     (was: jenkins)

> Cloud sql stronge
> -----------------
>
>                  Key: SPARK-26574
>                  URL: https://issues.apache.org/jira/browse/SPARK-26574
>              Project: Spark
>           Issue Type: Bug
>           Components: Kubernetes, Mesos, SQL
>     Affects Versions: 2.3.2
>             Reporter: Roufique Hossain
>             Priority: Major
>    Original Estimate: 8,509h
>   Remaining Estimate: 8,509h
>
[jira] [Updated] (SPARK-26574) Cloud sql stronge
     [ https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-26574:
---------------------------------
    Shepherd:     (was: pakegecloud.atlassian.net)

> Cloud sql stronge
> -----------------
>
>                  Key: SPARK-26574
>                  URL: https://issues.apache.org/jira/browse/SPARK-26574
>              Project: Spark
>           Issue Type: Bug
>           Components: jenkins, Kubernetes, Mesos, SQL
>     Affects Versions: 2.3.2
>             Reporter: Roufique Hossain
>             Priority: Major
>               Labels: PA
>    Original Estimate: 8,509h
>   Remaining Estimate: 8,509h
>
[jira] [Updated] (SPARK-26574) Cloud sql stronge
     [ https://issues.apache.org/jira/browse/SPARK-26574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-26574:
---------------------------------
    Target Version/s:     (was: 2.4.0)

> Cloud sql stronge
> -----------------
>
>                  Key: SPARK-26574
>                  URL: https://issues.apache.org/jira/browse/SPARK-26574
>              Project: Spark
>           Issue Type: Bug
>           Components: jenkins, Kubernetes, Mesos, SQL
>     Affects Versions: 2.3.2
>             Reporter: Roufique Hossain
>             Priority: Major
>               Labels: PA
>              Fix For: 0.8.2
>
>    Original Estimate: 8,509h
>   Remaining Estimate: 8,509h
>
[jira] [Resolved] (SPARK-26581) Spark Dataset write JSON with Multiline
     [ https://issues.apache.org/jira/browse/SPARK-26581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-26581.
----------------------------------
    Resolution: Invalid

Also, the multiline concept is not applicable to the write side. Let's also ask questions on the Spark mailing list before filing an issue.

> Spark Dataset write JSON with Multiline
> ---------------------------------------
>
>                  Key: SPARK-26581
>                  URL: https://issues.apache.org/jira/browse/SPARK-26581
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 2.3.0
>             Reporter: Anil
>             Priority: Major
>
> Hi,
> Spark currently can only write a JSON file as a single node; if I have multiple lines or nodes, Spark writes the nodes with curly braces "\{ }" without a comma "," between the nodes, and there are no square brackets at the start and end of the file. How to achieve this? I am trying to write the JSON file like:
> ds.write().format("JSON").option("multiline","true").save(path);
> Please help on this.
[jira] [Commented] (SPARK-26581) Spark Dataset write JSON with Multiline
     [ https://issues.apache.org/jira/browse/SPARK-26581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738948#comment-16738948 ]

Hyukjin Kwon commented on SPARK-26581:
--------------------------------------
{{multiline}} is not supported as a write option. You can easily do it via manual conversion with the DataFrame APIs. For instance:

{code}
ds.toJSON.mapPartitions { iter =>
  // write [ for the first line, and ] for the last line
}.write.text("...")
{code}

> Spark Dataset write JSON with Multiline
> ---------------------------------------
>
>                  Key: SPARK-26581
>                  URL: https://issues.apache.org/jira/browse/SPARK-26581
>              Project: Spark
>           Issue Type: Bug
>           Components: SQL
>     Affects Versions: 2.3.0
>             Reporter: Anil
>             Priority: Major
>
> Hi,
> Spark currently can only write a JSON file as a single node; if I have multiple lines or nodes, Spark writes the nodes with curly braces "\{ }" without a comma "," between the nodes, and there are no square brackets at the start and end of the file. How to achieve this? I am trying to write the JSON file like:
> ds.write().format("JSON").option("multiline","true").save(path);
> Please help on this.
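The bracketing idea sketched in the comment above can be spelled out in plain Python (an illustrative sketch, not a Spark API; `write_json_array` is a hypothetical helper): emit `[` first, a comma between records, and `]` last.

```python
import json

def write_json_array(records, path):
    """Write records as one JSON array: '[' first, ',' between elements,
    ']' last -- the manual conversion the comment alludes to."""
    with open(path, "w") as f:
        f.write("[")
        for i, rec in enumerate(records):
            if i:
                f.write(",")
            f.write(json.dumps(rec))
        f.write("]")
```

The resulting file parses as a single JSON array, which is what the reporter was after; in Spark the same bracketing would have to happen per partition, as the `mapPartitions` sketch suggests.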
[jira] [Commented] (SPARK-26585) [K8S] Add additional integration tests for K8s Scheduler Backend
     [ https://issues.apache.org/jira/browse/SPARK-26585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738944#comment-16738944 ]

Nagaram Prasad Addepally commented on SPARK-26585:
--------------------------------------------------
I am working on adding these tests.

> [K8S] Add additional integration tests for K8s Scheduler Backend
> ----------------------------------------------------------------
>
>                  Key: SPARK-26585
>                  URL: https://issues.apache.org/jira/browse/SPARK-26585
>              Project: Spark
>           Issue Type: Test
>           Components: Kubernetes
>     Affects Versions: 3.0.0
>             Reporter: Nagaram Prasad Addepally
>             Priority: Major
>
> I have reviewed the Kubernetes integration tests and found that the following cases are missing for testing scheduler backend functionality:
> * Run application with driver and executor image specified independently
> * Request Pods with custom CPU and Limits
> * Request Pods with custom Memory and memory overhead factor
> * Request Pods with custom Memory and memory overhead
> * Pods are relaunched on failures (as per spark.kubernetes.executor.lostCheck.maxAttempts)
> Logging this Jira to add these tests.
[jira] [Created] (SPARK-26585) [K8S] Add additional integration tests for K8s Scheduler Backend
Nagaram Prasad Addepally created SPARK-26585:
------------------------------------------------

             Summary: [K8S] Add additional integration tests for K8s Scheduler Backend
                 Key: SPARK-26585
                 URL: https://issues.apache.org/jira/browse/SPARK-26585
             Project: Spark
          Issue Type: Test
          Components: Kubernetes
    Affects Versions: 3.0.0
            Reporter: Nagaram Prasad Addepally


I have reviewed the Kubernetes integration tests and found that the following cases are missing for testing scheduler backend functionality:
* Run application with driver and executor image specified independently
* Request Pods with custom CPU and Limits
* Request Pods with custom Memory and memory overhead factor
* Request Pods with custom Memory and memory overhead
* Pods are relaunched on failures (as per spark.kubernetes.executor.lostCheck.maxAttempts)

Logging this Jira to add these tests.
[jira] [Comment Edited] (SPARK-10781) Allow certain number of failed tasks and allow job to succeed
     [ https://issues.apache.org/jira/browse/SPARK-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738932#comment-16738932 ]

nxet edited comment on SPARK-10781 at 1/10/19 2:52 AM:
-------------------------------------------------------
I met the same problem: some empty sequence files cause the failure of the whole job, but the same input runs normally via MR (mapreduce.map.failures.maxpercent, mapreduce.reduce.failures.maxpercent). The following are my source files:

_116.1 M 348.3 M /20181226/1545753600402.lzo_deflate_
_97.0 M 290.9 M /20181226/1545754236750.lzo_deflate_
_113.3 M 339.8 M /20181226/1545754856515.lzo_deflate_
_126.5 M 379.5 M /20181226/1545753600402.lzo_deflate_
_92.9 M 278.6 M /20181226/1545754233009.lzo_deflate_
_117.7 M 353.2 M /20181226/1545754850857.lzo_deflate_
_0 M 0 M /20181226/1545755455381.lzo_deflate_
_0 M 0 M /20181226/1545756056457.lzo_deflate_

was (Author: nxet):
I met the same problem: some empty sequence files cause the failure of the whole job, but the same input runs normally via MR (mapreduce.map.failures.maxpercent, mapreduce.reduce.failures.maxpercent). The following are my source files:

_116.1 M 348.3 M /20181226/1545753600402.lzo_deflate
97.0 M 290.9 M /20181226/1545754236750.lzo_deflate
113.3 M 339.8 M /20181226/1545754856515.lzo_deflate
126.5 M 379.5 M /20181226/1545753600402.lzo_deflate
92.9 M 278.6 M /20181226/1545754233009.lzo_deflate
117.7 M 353.2 M /20181226/1545754850857.lzo_deflate
0 M 0 M /20181226/1545755455381.lzo_deflate
0 M 0 M /20181226/1545756056457.lzo_deflate_

> Allow certain number of failed tasks and allow job to succeed
> -------------------------------------------------------------
>
>                  Key: SPARK-10781
>                  URL: https://issues.apache.org/jira/browse/SPARK-10781
>              Project: Spark
>           Issue Type: Improvement
>           Components: Spark Core
>     Affects Versions: 1.5.0
>             Reporter: Thomas Graves
>             Priority: Major
>          Attachments: SPARK_10781_Proposed_Solution.pdf
>
> MapReduce has the configs mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent, which allow a certain percent of tasks to fail while the job still succeeds.
> This could be a useful feature in Spark also, if a job doesn't need all the tasks to be successful.
[jira] [Commented] (SPARK-10781) Allow certain number of failed tasks and allow job to succeed
     [ https://issues.apache.org/jira/browse/SPARK-10781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738932#comment-16738932 ]

nxet commented on SPARK-10781:
------------------------------
I met the same problem: some empty sequence files cause the failure of the whole job, but the same input runs normally via MR (mapreduce.map.failures.maxpercent, mapreduce.reduce.failures.maxpercent). The following are my source files:

_116.1 M 348.3 M /20181226/1545753600402.lzo_deflate
97.0 M 290.9 M /20181226/1545754236750.lzo_deflate
113.3 M 339.8 M /20181226/1545754856515.lzo_deflate
126.5 M 379.5 M /20181226/1545753600402.lzo_deflate
92.9 M 278.6 M /20181226/1545754233009.lzo_deflate
117.7 M 353.2 M /20181226/1545754850857.lzo_deflate
0 M 0 M /20181226/1545755455381.lzo_deflate
0 M 0 M /20181226/1545756056457.lzo_deflate_

> Allow certain number of failed tasks and allow job to succeed
> -------------------------------------------------------------
>
>                  Key: SPARK-10781
>                  URL: https://issues.apache.org/jira/browse/SPARK-10781
>              Project: Spark
>           Issue Type: Improvement
>           Components: Spark Core
>     Affects Versions: 1.5.0
>             Reporter: Thomas Graves
>             Priority: Major
>          Attachments: SPARK_10781_Proposed_Solution.pdf
>
> MapReduce has the configs mapreduce.map.failures.maxpercent and mapreduce.reduce.failures.maxpercent, which allow a certain percent of tasks to fail while the job still succeeds.
> This could be a useful feature in Spark also, if a job doesn't need all the tasks to be successful.
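The failures.maxpercent semantics requested in SPARK-10781 reduce to a simple threshold check. A hedged Python sketch (the function and its signature are illustrative, not any MapReduce or Spark API):

```python
def job_succeeds(failed_tasks, total_tasks, max_failure_percent):
    """Mirror mapreduce.map.failures.maxpercent: the job succeeds as long
    as the failed fraction of tasks stays within the configured percent."""
    if total_tasks == 0:
        return True  # vacuously successful: nothing to fail
    return 100.0 * failed_tasks / total_tasks <= max_failure_percent
```

With `max_failure_percent = 0` this degenerates to today's all-tasks-must-succeed behavior, which is why the feature is strictly opt-in.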
[jira] [Resolved] (SPARK-26546) Caching of DateTimeFormatter
[ https://issues.apache.org/jira/browse/SPARK-26546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26546. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23462 [https://github.com/apache/spark/pull/23462] > Caching of DateTimeFormatter > > > Key: SPARK-26546 > URL: https://issues.apache.org/jira/browse/SPARK-26546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Currently, instances of java.time.format.DateTimeFormatter are built each > time a new instance of Iso8601DateFormatter or Iso8601TimestampFormatter > is created, which is a time-consuming operation because it must parse the > timestamp/date patterns. It could be useful to create a cache with key = > (pattern, locale) and value = instance of java.time.format.DateTimeFormatter.
[jira] [Assigned] (SPARK-26546) Caching of DateTimeFormatter
[ https://issues.apache.org/jira/browse/SPARK-26546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26546: Assignee: Maxim Gekk > Caching of DateTimeFormatter > > > Key: SPARK-26546 > URL: https://issues.apache.org/jira/browse/SPARK-26546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Currently, instances of java.time.format.DateTimeFormatter are built each > time a new instance of Iso8601DateFormatter or Iso8601TimestampFormatter > is created, which is a time-consuming operation because it must parse the > timestamp/date patterns. It could be useful to create a cache with key = > (pattern, locale) and value = instance of java.time.format.DateTimeFormatter.
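The cache proposed in SPARK-26546 (key = (pattern, locale), value = a shared `DateTimeFormatter`) can be sketched with a `ConcurrentHashMap`. This is only an illustration of the idea, not the implementation that was merged; the `FormatterCache` class and its key encoding are assumptions:

```java
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the proposed cache: key = (pattern, locale), value = a shared
// DateTimeFormatter. Illustrative only; the actual Spark code may differ.
public class FormatterCache {
    private static final Map<String, DateTimeFormatter> CACHE =
            new ConcurrentHashMap<>();

    public static DateTimeFormatter get(String pattern, Locale locale) {
        // Combine pattern and locale into one map key; '\u0000' cannot
        // appear in either component, so the encoding is unambiguous.
        String key = pattern + '\u0000' + locale.toLanguageTag();
        // computeIfAbsent parses the pattern only on the first request;
        // subsequent calls reuse the cached immutable formatter.
        return CACHE.computeIfAbsent(key,
                k -> DateTimeFormatter.ofPattern(pattern, locale));
    }
}
```

`DateTimeFormatter` is immutable and thread-safe, which is what makes sharing one instance per (pattern, locale) safe here.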
[jira] [Resolved] (SPARK-26493) spark.sql.extensions should support multiple extensions
[ https://issues.apache.org/jira/browse/SPARK-26493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26493. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23398 [https://github.com/apache/spark/pull/23398] > spark.sql.extensions should support multiple extensions > --- > > Key: SPARK-26493 > URL: https://issues.apache.org/jira/browse/SPARK-26493 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jamison Bennett >Assignee: Jamison Bennett >Priority: Minor > Labels: starter > Fix For: 3.0.0 > > > The spark.sql.extensions configuration option should support multiple > extensions. It is currently possible to load multiple extensions using the > programmatic interface (e.g. > SparkSession.builder().master("..").withExtensions(sparkSessionExtensions1).withExtensions(sparkSessionExtensions2).getOrCreate() > ) but the same cannot currently be done with the command line options > without writing a wrapper extension that combines multiple extensions. > > Allowing multiple spark.sql.extensions would let the extensions be easily > changed on the command line or via the configuration file. Multiple > extensions could be specified using a comma-separated list of class names. > Allowing multiple extensions should maintain backwards compatibility because > existing spark.sql.extensions configuration settings shouldn't contain a > comma, since the value is a class name.
[jira] [Assigned] (SPARK-26493) spark.sql.extensions should support multiple extensions
[ https://issues.apache.org/jira/browse/SPARK-26493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26493: Assignee: Jamison Bennett > spark.sql.extensions should support multiple extensions > --- > > Key: SPARK-26493 > URL: https://issues.apache.org/jira/browse/SPARK-26493 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jamison Bennett >Assignee: Jamison Bennett >Priority: Minor > Labels: starter > > The spark.sql.extensions configuration option should support multiple > extensions. It is currently possible to load multiple extensions using the > programmatic interface (e.g. > SparkSession.builder().master("..").withExtensions(sparkSessionExtensions1).withExtensions(sparkSessionExtensions2).getOrCreate() > ) but the same cannot currently be done with the command line options > without writing a wrapper extension that combines multiple extensions. > > Allowing multiple spark.sql.extensions would let the extensions be easily > changed on the command line or via the configuration file. Multiple > extensions could be specified using a comma-separated list of class names. > Allowing multiple extensions should maintain backwards compatibility because > existing spark.sql.extensions configuration settings shouldn't contain a > comma, since the value is a class name.
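The comma-separated form proposed above can be sketched as a simple split-and-trim of the config value. This is a hypothetical illustration, not the code merged in PR 23398:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of parsing a comma-separated spark.sql.extensions
// value into extension class names; the real Spark implementation may differ.
public class ExtensionConf {

    public static List<String> parseExtensionClasses(String confValue) {
        List<String> classes = new ArrayList<>();
        if (confValue == null) {
            return classes;
        }
        for (String name : confValue.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                classes.add(trimmed); // each entry is a class name to instantiate
            }
        }
        return classes;
    }
}
```

Note how the backwards-compatibility argument falls out of this scheme: an existing single-class value contains no comma, so it parses to a one-element list and behaves exactly as before.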
[jira] [Assigned] (SPARK-26584) Remove `spark.sql.orc.copyBatchToSpark` internal configuration
[ https://issues.apache.org/jira/browse/SPARK-26584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26584: Assignee: (was: Apache Spark) > Remove `spark.sql.orc.copyBatchToSpark` internal configuration > -- > > Key: SPARK-26584 > URL: https://issues.apache.org/jira/browse/SPARK-26584 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > This issue aims to remove internal ORC configuration to simplify the code > path for Spark 3.0.0.
[jira] [Assigned] (SPARK-26584) Remove `spark.sql.orc.copyBatchToSpark` internal configuration
[ https://issues.apache.org/jira/browse/SPARK-26584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26584: Assignee: Apache Spark > Remove `spark.sql.orc.copyBatchToSpark` internal configuration > -- > > Key: SPARK-26584 > URL: https://issues.apache.org/jira/browse/SPARK-26584 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > This issue aims to remove internal ORC configuration to simplify the code > path for Spark 3.0.0.
[jira] [Created] (SPARK-26584) Remove `spark.sql.orc.copyBatchToSpark` internal configuration
Dongjoon Hyun created SPARK-26584: - Summary: Remove `spark.sql.orc.copyBatchToSpark` internal configuration Key: SPARK-26584 URL: https://issues.apache.org/jira/browse/SPARK-26584 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Dongjoon Hyun This issue aims to remove internal ORC configuration to simplify the code path for Spark 3.0.0.
[jira] [Updated] (SPARK-26583) Add `paranamer` dependency to `core` module
[ https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26583: -- Component/s: Build > Add `paranamer` dependency to `core` module > --- > > Key: SPARK-26583 > URL: https://issues.apache.org/jira/browse/SPARK-26583 > Project: Spark > Issue Type: Bug > Components: Build, Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > With the Scala-2.12 profile, a Spark application fails even though Spark itself > is okay. For example, our documented `SimpleApp` example compiles successfully > but fails at runtime because it doesn't use `paranamer 2.8` and hits SPARK-22128. > https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html > {code} > $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple --- > [INFO] my.test:simple:jar:1.0-SNAPSHOT > [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO] \- org.apache.avro:avro:jar:1.8.2:compile > [INFO] \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile > {code}
[jira] [Updated] (SPARK-26583) Add `paranamer` dependency to `core` module
[ https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26583: -- Priority: Major (was: Blocker) > Add `paranamer` dependency to `core` module > --- > > Key: SPARK-26583 > URL: https://issues.apache.org/jira/browse/SPARK-26583 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > With the Scala-2.12 profile, a Spark application fails even though Spark itself > is okay. For example, our documented `SimpleApp` example compiles successfully > but fails at runtime because it doesn't use `paranamer 2.8` and hits SPARK-22128. > https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html > {code} > $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple --- > [INFO] my.test:simple:jar:1.0-SNAPSHOT > [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO] \- org.apache.avro:avro:jar:1.8.2:compile > [INFO] \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile > {code}
[jira] [Updated] (SPARK-26583) Add `paranamer` dependency to `core` module
[ https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26583: -- Target Version/s: 2.4.1, 3.0.0 > Add `paranamer` dependency to `core` module > --- > > Key: SPARK-26583 > URL: https://issues.apache.org/jira/browse/SPARK-26583 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > With the Scala-2.12 profile, a Spark application fails even though Spark itself > is okay. For example, our documented `SimpleApp` example compiles successfully > but fails at runtime because it doesn't use `paranamer 2.8` and hits SPARK-22128. > https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html > {code} > $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple --- > [INFO] my.test:simple:jar:1.0-SNAPSHOT > [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO] \- org.apache.avro:avro:jar:1.8.2:compile > [INFO] \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile > {code}
[jira] [Updated] (SPARK-26583) Add `paranamer` dependency to `core` module
[ https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26583: -- Description: With the Scala-2.12 profile, a Spark application fails even though Spark itself is okay. For example, our documented `SimpleApp` example compiles successfully but fails at runtime because it doesn't use `paranamer 2.8` and hits SPARK-22128. https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html {code} $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple --- [INFO] my.test:simple:jar:1.0-SNAPSHOT [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile [INFO] \- org.apache.avro:avro:jar:1.8.2:compile [INFO] \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile {code} was: With the Scala-2.12 profile, a Spark application fails even though Spark itself is okay. For example, our documented `SimpleApp` example fails because it doesn't use `paranamer 2.8` and hits SPARK-22128.
https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html {code} $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple --- [INFO] my.test:simple:jar:1.0-SNAPSHOT [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile [INFO] \- org.apache.avro:avro:jar:1.8.2:compile [INFO] \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile {code} > Add `paranamer` dependency to `core` module > --- > > Key: SPARK-26583 > URL: https://issues.apache.org/jira/browse/SPARK-26583 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > With the Scala-2.12 profile, a Spark application fails even though Spark itself > is okay. For example, our documented `SimpleApp` example compiles successfully > but fails at runtime because it doesn't use `paranamer 2.8` and hits SPARK-22128. > https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html > {code} > $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple --- > [INFO] my.test:simple:jar:1.0-SNAPSHOT > [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO] \- org.apache.avro:avro:jar:1.8.2:compile > [INFO] \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile > {code}
[jira] [Commented] (SPARK-26583) Add `paranamer` to `core` module
[ https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738746#comment-16738746 ] Dongjoon Hyun commented on SPARK-26583: --- This is a minor issue, but it should be fixed before Spark 3.0.0 because it affects Spark applications. For the Spark 2.4.0 Scala 2.12 artifacts, the situation is the same. Of course, users can add a `paranamer` dependency to their pom. > Add `paranamer` to `core` module > > > Key: SPARK-26583 > URL: https://issues.apache.org/jira/browse/SPARK-26583 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > With the Scala-2.12 profile, a Spark application fails even though Spark itself > is okay. For example, our documented `SimpleApp` example fails because it doesn't use > `paranamer 2.8` and hits SPARK-22128. > https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html > {code} > $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple --- > [INFO] my.test:simple:jar:1.0-SNAPSHOT > [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO] \- org.apache.avro:avro:jar:1.8.2:compile > [INFO] \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile > {code}
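The user-side workaround mentioned in the comment above (pinning `paranamer` in the application's own pom) could look roughly like the following config fragment. The version 2.8 comes from the issue text and SPARK-22128; treat this as a sketch, since the exact version an application needs may depend on its Avro version:

{code}
<!-- Workaround sketch: pin paranamer 2.8 in the application's pom.xml so
     Avro reflection works under Scala 2.12 (see SPARK-22128). -->
<dependency>
  <groupId>com.thoughtworks.paranamer</groupId>
  <artifactId>paranamer</artifactId>
  <version>2.8</version>
</dependency>
{code}

Declaring the dependency directly overrides the transitive `paranamer 2.7` pulled in via avro 1.8.2, as shown in the dependency tree above.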
[jira] [Updated] (SPARK-26583) Add `paranamer` dependency to `core` module
[ https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26583: -- Summary: Add `paranamer` dependency to `core` module (was: Add `paranamer` to `core` module) > Add `paranamer` dependency to `core` module > --- > > Key: SPARK-26583 > URL: https://issues.apache.org/jira/browse/SPARK-26583 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > With the Scala-2.12 profile, a Spark application fails even though Spark itself > is okay. For example, our documented `SimpleApp` example fails because it doesn't use > `paranamer 2.8` and hits SPARK-22128. > https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html > {code} > $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple --- > [INFO] my.test:simple:jar:1.0-SNAPSHOT > [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO] \- org.apache.avro:avro:jar:1.8.2:compile > [INFO] \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile > {code}
[jira] [Created] (SPARK-26583) Add `paranamer` to `core` module
Dongjoon Hyun created SPARK-26583: - Summary: Add `paranamer` to `core` module Key: SPARK-26583 URL: https://issues.apache.org/jira/browse/SPARK-26583 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0, 3.0.0 Reporter: Dongjoon Hyun With the Scala-2.12 profile, a Spark application fails even though Spark itself is okay. For example, our documented `SimpleApp` example fails because it doesn't use `paranamer 2.8` and hits SPARK-22128. https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html {code} $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple --- [INFO] my.test:simple:jar:1.0-SNAPSHOT [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile [INFO] \- org.apache.avro:avro:jar:1.8.2:compile [INFO] \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile {code}
[jira] [Updated] (SPARK-26583) Add `paranamer` to `core` module
[ https://issues.apache.org/jira/browse/SPARK-26583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26583: -- Priority: Blocker (was: Major) > Add `paranamer` to `core` module > > > Key: SPARK-26583 > URL: https://issues.apache.org/jira/browse/SPARK-26583 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Blocker > > With the Scala-2.12 profile, a Spark application fails even though Spark itself > is okay. For example, our documented `SimpleApp` example fails because it doesn't use > `paranamer 2.8` and hits SPARK-22128. > https://dist.apache.org/repos/dist/dev/spark/3.0.0-SNAPSHOT-2019_01_09_13_59-e853afb-docs/_site/quick-start.html > {code} > $ mvn dependency:tree -Dincludes=com.thoughtworks.paranamer > [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ simple --- > [INFO] my.test:simple:jar:1.0-SNAPSHOT > [INFO] \- org.apache.spark:spark-sql_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO]\- org.apache.spark:spark-core_2.12:jar:3.0.0-SNAPSHOT:compile > [INFO] \- org.apache.avro:avro:jar:1.8.2:compile > [INFO] \- com.thoughtworks.paranamer:paranamer:jar:2.7:compile > {code}
[jira] [Updated] (SPARK-26576) Broadcast hint not applied to partitioned Parquet table
[ https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-26576: --- Affects Version/s: 2.2.2 > Broadcast hint not applied to partitioned Parquet table > --- > > Key: SPARK-26576 > URL: https://issues.apache.org/jira/browse/SPARK-26576 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.2, 2.4.0 >Reporter: John Zhuge >Priority: Major > > The broadcast hint is not applied to a partitioned Parquet table. Below, > "SortMergeJoin" is chosen incorrectly and "ResolvedHint (broadcast)" is removed > in the Optimized Plan. > {noformat} > scala> spark.sql("CREATE TABLE jzhuge.parquet_with_part (val STRING) > PARTITIONED BY (dateint INT) STORED AS parquet") > scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") > scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => > df.join(broadcast(df), "dateint").explain(true)) > == Parsed Logical Plan == > 'Join UsingJoin(Inner,List(dateint)) > :- SubqueryAlias `jzhuge`.`parquet_with_part` > : +- Relation[val#28,dateint#29] parquet > +- ResolvedHint (broadcast) >+- SubqueryAlias `jzhuge`.`parquet_with_part` > +- Relation[val#32,dateint#33] parquet > == Analyzed Logical Plan == > dateint: int, val: string, val: string > Project [dateint#29, val#28, val#32] > +- Join Inner, (dateint#29 = dateint#33) >:- SubqueryAlias `jzhuge`.`parquet_with_part` >: +- Relation[val#28,dateint#29] parquet >+- ResolvedHint (broadcast) > +- SubqueryAlias `jzhuge`.`parquet_with_part` > +- Relation[val#32,dateint#33] parquet > == Optimized Logical Plan == > Project [dateint#29, val#28, val#32] > +- Join Inner, (dateint#29 = dateint#33) >:- Project [val#28, dateint#29] >: +- Filter isnotnull(dateint#29) >: +- Relation[val#28,dateint#29] parquet >+- Project [val#32, dateint#33] > +- Filter isnotnull(dateint#33) > +- Relation[val#32,dateint#33] parquet > == Physical Plan == > *(5) Project [dateint#29, val#28, val#32] > +- *(5)
SortMergeJoin [dateint#29], [dateint#33], Inner >:- *(2) Sort [dateint#29 ASC NULLS FIRST], false, 0 >: +- Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, > 500), coordinator[target post-shuffle partition size: 67108864] >: +- *(1) FileScan parquet jzhuge.parquet_with_part[val#28,dateint#29] > Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], > PartitionCount: 0, PartitionFilters: [isnotnull(dateint#29)], PushedFilters: > [], ReadSchema: struct >+- *(4) Sort [dateint#33 ASC NULLS FIRST], false, 0 > +- ReusedExchange [val#32, dateint#33], Exchange(coordinator id: > 55629191) hashpartitioning(dateint#29, 500), coordinator[target post-shuffle > partition size: 67108864] > {noformat} > Broadcast hint is applied to Parquet table without partition. Below > "BroadcastHashJoin" is chosen as expected. > {noformat} > scala> spark.sql("CREATE TABLE jzhuge.parquet_no_part (val STRING, dateint > INT) STORED AS parquet") > scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") > scala> Seq(spark.table("jzhuge.parquet_no_part")).map(df => > df.join(broadcast(df), "dateint").explain(true)) > == Parsed Logical Plan == > 'Join UsingJoin(Inner,List(dateint)) > :- SubqueryAlias `jzhuge`.`parquet_no_part` > : +- Relation[val#44,dateint#45] parquet > +- ResolvedHint (broadcast) >+- SubqueryAlias `jzhuge`.`parquet_no_part` > +- Relation[val#50,dateint#51] parquet > == Analyzed Logical Plan == > dateint: int, val: string, val: string > Project [dateint#45, val#44, val#50] > +- Join Inner, (dateint#45 = dateint#51) >:- SubqueryAlias `jzhuge`.`parquet_no_part` >: +- Relation[val#44,dateint#45] parquet >+- ResolvedHint (broadcast) > +- SubqueryAlias `jzhuge`.`parquet_no_part` > +- Relation[val#50,dateint#51] parquet > == Optimized Logical Plan == > Project [dateint#45, val#44, val#50] > +- Join Inner, (dateint#45 = dateint#51) >:- Filter isnotnull(dateint#45) >: +- Relation[val#44,dateint#45] parquet >+- ResolvedHint (broadcast) > +- Filter 
isnotnull(dateint#51) > +- Relation[val#50,dateint#51] parquet > == Physical Plan == > *(2) Project [dateint#45, val#44, val#50] > +- *(2) BroadcastHashJoin [dateint#45], [dateint#51], Inner, BuildRight >:- *(2) Project [val#44, dateint#45] >: +- *(2) Filter isnotnull(dateint#45) >: +- *(2) FileScan parquet jzhuge.parquet_no_part[val#44,dateint#45] > Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], > PartitionFilters: [], PushedFilters: [IsNotNull(dateint)], ReadSchema: > struct >+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, > true] as bigint))) > +- *(1) Project
[jira] [Resolved] (SPARK-26448) retain the difference between 0.0 and -0.0
[ https://issues.apache.org/jira/browse/SPARK-26448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-26448. - Resolution: Fixed Fix Version/s: 3.0.0 > retain the difference between 0.0 and -0.0 > -- > > Key: SPARK-26448 > URL: https://issues.apache.org/jira/browse/SPARK-26448 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > >
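The distinction SPARK-26448 retains is visible in plain Java: IEEE 754 `==` treats 0.0 and -0.0 as equal, while their bit patterns (and the total order used by `Double.compare`) differ. A minimal illustration, not Spark code:

```java
// Minimal illustration of why 0.0 vs -0.0 needs special handling:
// `==` says they are equal, but the sign bit makes their representations differ.
public class SignedZero {

    public static boolean primitiveEqual(double a, double b) {
        return a == b; // IEEE 754: 0.0 == -0.0 is true
    }

    public static boolean bitsEqual(double a, double b) {
        // doubleToLongBits exposes the sign bit, so 0.0 and -0.0 differ here.
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }
}
```

Any engine that hashes or sorts by the raw bits therefore sees two distinct values where arithmetic comparison sees one, which is exactly the tension the issue addresses.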
[jira] [Created] (SPARK-26582) Update the optimizer rule NormalizeFloatingNumbers
Xiao Li created SPARK-26582: --- Summary: Update the optimizer rule NormalizeFloatingNumbers Key: SPARK-26582 URL: https://issues.apache.org/jira/browse/SPARK-26582 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Xiao Li Assignee: Dilip Biswal See the discussion in https://github.com/apache/spark/pull/23388/files#r244421992
[jira] [Commented] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738622#comment-16738622 ] Thomas Graves commented on SPARK-24374: --- [~luzengxiang] Are you just saying that when Spark tries to kill the tasks running TensorFlow, they don't really get killed? This could be TensorFlow; see spark.task.reaper.killTimeout. > SPIP: Support Barrier Execution Mode in Apache Spark > > > Key: SPARK-24374 > URL: https://issues.apache.org/jira/browse/SPARK-24374 > Project: Spark > Issue Type: Epic > Components: ML, Spark Core >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen, SPIP > Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf > > > (See details in the linked/attached SPIP doc.) > {quote} > The proposal here is to add a new scheduling model to Apache Spark so users > can properly embed distributed DL training as a Spark stage to simplify the > distributed training workflow. For example, Horovod uses MPI to implement > all-reduce to accelerate distributed TensorFlow training. The computation > model is different from MapReduce used by Spark. In Spark, a task in a stage > doesn't depend on any other tasks in the same stage, and hence it can be > scheduled independently. In MPI, all workers start at the same time and pass > messages around. To embed this workload in Spark, we need to introduce a new > scheduling model, tentatively named "barrier scheduling", which launches > tasks at the same time and provides users enough information and tooling to > embed distributed DL training. Spark can also provide an extra layer of fault > tolerance in case some tasks failed in the middle, where Spark would abort > all tasks and restart the stage. > {quote}
[jira] [Resolved] (SPARK-26414) Race between SparkContext and YARN AM can cause NPE in UI setup code
[ https://issues.apache.org/jira/browse/SPARK-26414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-26414. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 3.0.0 > Race between SparkContext and YARN AM can cause NPE in UI setup code > > > Key: SPARK-26414 > URL: https://issues.apache.org/jira/browse/SPARK-26414 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 3.0.0 > > > There's a super narrow race between the SparkContext and the AM startup code: > - SC starts the AM and waits for it to go into running state > - AM goes into running state, unblocking SC > - AM sends AmIpFilter config to SC, adds the filter to the list and then the > filter configs > - unblocked SC is in the middle of setting up the UI and sees only the > filter, but not the configs > Then you get this: > {noformat} > ERROR org.apache.spark.SparkContext - Error initializing SparkContext. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.init(AmIpFilter.java:81) > at > org.spark_project.jetty.servlet.FilterHolder.initialize(FilterHolder.java:139) > at > org.spark_project.jetty.servlet.ServletHandler.initialize(ServletHandler.java:881) > at > org.spark_project.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:349) > at > org.spark_project.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:778) > at > org.spark_project.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:262) > at > org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:520) > at > org.apache.spark.ui.WebUI$$anonfun$attachHandler$1.apply(WebUI.scala:96) > at > org.apache.spark.ui.WebUI$$anonfun$attachHandler$1.apply(WebUI.scala:96) > at scala.Option.foreach(Option.scala:257) > at org.apache.spark.ui.WebUI.attachHandler(WebUI.scala:96) > at > org.apache.spark.SparkContext$$anonfun$22$$anonfun$apply$8.apply(SparkContext.scala:522) > at > org.apache.spark.SparkContext$$anonfun$22$$anonfun$apply$8.apply(SparkContext.scala:522) > at scala.Option.foreach(Option.scala:257) > {noformat}
[jira] [Commented] (SPARK-26414) Race between SparkContext and YARN AM can cause NPE in UI setup code
[ https://issues.apache.org/jira/browse/SPARK-26414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738580#comment-16738580 ] Marcelo Vanzin commented on SPARK-26414: I actually ended up fixing this in the change for SPARK-24522, since it required adding some thread-safety to the code in question here. > Race between SparkContext and YARN AM can cause NPE in UI setup code > > > Key: SPARK-26414 > URL: https://issues.apache.org/jira/browse/SPARK-26414 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > There's a super narrow race between the SparkContext and the AM startup code: > - SC starts the AM and waits for it to go into running state > - AM goes into running state, unblocking SC > - AM sends AmIpFilter config to SC, adds the filter to the list and then the > filter configs > - unblocked SC is in the middle of setting up the UI and sees only the > filter, but not the configs > Then you get this: > {noformat} > ERROR org.apache.spark.SparkContext - Error initializing SparkContext. 
> java.lang.NullPointerException > at > org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter.init(AmIpFilter.java:81) > at > org.spark_project.jetty.servlet.FilterHolder.initialize(FilterHolder.java:139) > at > org.spark_project.jetty.servlet.ServletHandler.initialize(ServletHandler.java:881) > at > org.spark_project.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:349) > at > org.spark_project.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:778) > at > org.spark_project.jetty.servlet.ServletContextHandler.doStart(ServletContextHandler.java:262) > at > org.spark_project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:68) > at org.apache.spark.ui.ServerInfo.addHandler(JettyUtils.scala:520) > at > org.apache.spark.ui.WebUI$$anonfun$attachHandler$1.apply(WebUI.scala:96) > at > org.apache.spark.ui.WebUI$$anonfun$attachHandler$1.apply(WebUI.scala:96) > at scala.Option.foreach(Option.scala:257) > at org.apache.spark.ui.WebUI.attachHandler(WebUI.scala:96) > at > org.apache.spark.SparkContext$$anonfun$22$$anonfun$apply$8.apply(SparkContext.scala:522) > at > org.apache.spark.SparkContext$$anonfun$22$$anonfun$apply$8.apply(SparkContext.scala:522) > at scala.Option.foreach(Option.scala:257) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
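The race described above is a classic check-then-act window: the filter name becomes visible before its configuration does. The following is a minimal, Spark-free sketch of that pattern and of the thread-safety fix the comment mentions; the class and field names are hypothetical stand-ins, not the actual AmIpFilter/SparkContext internals.

```python
import threading

class FilterRegistry:
    """Toy stand-in for the filter/config handoff between the YARN AM and the
    SparkContext UI setup. Without the lock, a reader thread could observe a
    registered filter whose config has not been published yet -- the analogue
    of the NPE in AmIpFilter.init()."""

    def __init__(self):
        self._lock = threading.Lock()
        self._filters = []   # filter names, in registration order
        self._configs = {}   # filter name -> config dict

    def add_filter(self, name, config):
        # Publish the filter name and its config atomically, so no reader
        # can see one without the other.
        with self._lock:
            self._filters.append(name)
            self._configs[name] = config

    def snapshot(self):
        # Readers take the same lock, so every filter they see has a config.
        with self._lock:
            return [(name, self._configs[name]) for name in self._filters]

reg = FilterRegistry()
am = threading.Thread(target=reg.add_filter,
                      args=("AmIpFilter", {"proxyHosts": "host1"}))
am.start()
am.join()
for name, cfg in reg.snapshot():
    assert cfg is not None  # config is always present alongside the filter
```

The point of the sketch is only the locking discipline: both the writer (filter plus config) and the reader go through one lock, closing the window in which the filter is visible but its config is not.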
[jira] [Resolved] (SPARK-25484) Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark
[ https://issues.apache.org/jira/browse/SPARK-25484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25484. --- Resolution: Fixed Assignee: Peter Toth Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/22617 > Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark > -- > > Key: SPARK-25484 > URL: https://issues.apache.org/jira/browse/SPARK-25484 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Peter Toth >Priority: Major > Fix For: 3.0.0 > > > Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark to print the output as a > separate file.
[jira] [Comment Edited] (SPARK-26576) Broadcast hint not applied to partitioned Parquet table
[ https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737964#comment-16737964 ] John Zhuge edited comment on SPARK-26576 at 1/9/19 5:12 PM: No issue on the master branch. Please note "rightHint=(broadcast)" for the Join in Optimized Plan. {noformat} scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => df.join(broadcast(df), "dateint").explain(true)) == Parsed Logical Plan == 'Join UsingJoin(Inner,List(dateint)) :- SubqueryAlias `jzhuge`.`parquet_with_part` : +- Relation[val#34,dateint#35] parquet +- ResolvedHint (broadcast) +- SubqueryAlias `jzhuge`.`parquet_with_part` +- Relation[val#40,dateint#41] parquet == Analyzed Logical Plan == dateint: int, val: string, val: string Project [dateint#35, val#34, val#40] +- Join Inner, (dateint#35 = dateint#41) :- SubqueryAlias `jzhuge`.`parquet_with_part` : +- Relation[val#34,dateint#35] parquet +- ResolvedHint (broadcast) +- SubqueryAlias `jzhuge`.`parquet_with_part` +- Relation[val#40,dateint#41] parquet == Optimized Logical Plan == Project [dateint#35, val#34, val#40] +- Join Inner, (dateint#35 = dateint#41), rightHint=(broadcast) :- Project [val#34, dateint#35] : +- Filter isnotnull(dateint#35) : +- Relation[val#34,dateint#35] parquet +- Project [val#40, dateint#41] +- Filter isnotnull(dateint#41) +- Relation[val#40,dateint#41] parquet == Physical Plan == *(2) Project [dateint#35, val#34, val#40] +- *(2) BroadcastHashJoin [dateint#35], [dateint#41], Inner, BuildRight :- *(2) FileScan parquet jzhuge.parquet_with_part[val#34,dateint#35] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dateint#35)], PushedFilters: [], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, true] as bigint))) +- *(1) FileScan parquet jzhuge.parquet_with_part[val#40,dateint#41] Batched: true, DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dateint#41)], PushedFilters: [], ReadSchema: struct {noformat} >From a quick look at the source, EliminateResolvedHint pulls broadcast hint >into Join and eliminates the ResolvedHint node. It is called before >PruneFileSourcePartitions so the above code in >PhysicalOperation.collectProjectsAndFilters is never called on master branch >for the few cases I tried. was (Author: jzhuge): No issue on the master branch. Please note "rightHint=(broadcast)" for the Join in Optimized Plan. {noformat} scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => df.join(broadcast(df), "dateint").explain(true)) == Parsed Logical Plan == 'Join UsingJoin(Inner,List(dateint)) :- SubqueryAlias `jzhuge`.`parquet_with_part` : +- Relation[val#34,dateint#35] parquet +- ResolvedHint (broadcast) +- SubqueryAlias `jzhuge`.`parquet_with_part` +- Relation[val#40,dateint#41] parquet == Analyzed Logical Plan == dateint: int, val: string, val: string Project [dateint#35, val#34, val#40] +- Join Inner, (dateint#35 = dateint#41) :- SubqueryAlias `jzhuge`.`parquet_with_part` : +- Relation[val#34,dateint#35] parquet +- ResolvedHint (broadcast) +- SubqueryAlias `jzhuge`.`parquet_with_part` +- Relation[val#40,dateint#41] parquet == Optimized Logical Plan == Project [dateint#35, val#34, val#40] +- Join Inner, (dateint#35 = dateint#41), rightHint=(broadcast) :- Project [val#34, dateint#35] : +- Filter isnotnull(dateint#35) : +- Relation[val#34,dateint#35] parquet +- Project [val#40, dateint#41] +- Filter isnotnull(dateint#41) +- Relation[val#40,dateint#41] parquet == Physical Plan == *(2) Project [dateint#35, val#34, val#40] +- *(2) BroadcastHashJoin [dateint#35], [dateint#41], Inner, BuildRight :- *(2) FileScan parquet jzhuge.parquet_with_part[val#34,dateint#35] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dateint#35)], PushedFilters: [], 
ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, true] as bigint))) +- *(1) FileScan parquet jzhuge.parquet_with_part[val#40,dateint#41] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dateint#41)], PushedFilters: [], ReadSchema: struct {noformat} >From a quick look at the source, EliminateResolvedHint pulls broadcast hint >into Join and eliminates the ResolvedHint node. It is called before >PruneFileSourcePartitions so the above code in >PhysicalOperation.collectProjectsAndFilters is never called on master branch. > Broadcast hint not applied to partitioned Parquet table >
[jira] [Assigned] (SPARK-26254) Move delegation token providers into a separate project
[ https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26254: Assignee: Apache Spark > Move delegation token providers into a separate project > --- > > Key: SPARK-26254 > URL: https://issues.apache.org/jira/browse/SPARK-26254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Apache Spark >Priority: Major > > There was a discussion in > [PR#22598|https://github.com/apache/spark/pull/22598] that there are several > provided dependencies inside core project which shouldn't be there (for ex. > hive and kafka). This jira is to solve this problem.
[jira] [Assigned] (SPARK-26254) Move delegation token providers into a separate project
[ https://issues.apache.org/jira/browse/SPARK-26254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26254: Assignee: (was: Apache Spark) > Move delegation token providers into a separate project > --- > > Key: SPARK-26254 > URL: https://issues.apache.org/jira/browse/SPARK-26254 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Priority: Major > > There was a discussion in > [PR#22598|https://github.com/apache/spark/pull/22598] that there are several > provided dependencies inside core project which shouldn't be there (for ex. > hive and kafka). This jira is to solve this problem.
[jira] [Created] (SPARK-26581) Spark Dataset write JSON with Multiline
Anil created SPARK-26581: Summary: Spark Dataset write JSON with Multiline Key: SPARK-26581 URL: https://issues.apache.org/jira/browse/SPARK-26581 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Anil Hi, Spark currently writes JSON output with one record per line: each record is an object in curly braces "{ }", with no comma "," between records and no square brackets at the start and end of the file. How can I achieve a single multi-line JSON array instead? I am trying to write the JSON file like: ds.write().format("JSON").option("multiline","true").save(path); Please help with this.
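What the reporter is seeing is the JSON Lines format that Spark's json writer emits (the multiline option only affects reading, not writing). A common workaround is to post-process the written part files into a single JSON array. A minimal, Spark-free sketch of that post-processing step, with made-up file names:

```python
import json
import os
import tempfile

def json_lines_to_array(src_path, dst_path):
    """Wrap a JSON Lines file (one JSON object per line, as Spark writes it)
    into a single file containing one JSON array."""
    with open(src_path) as src:
        records = [json.loads(line) for line in src if line.strip()]
    with open(dst_path, "w") as dst:
        json.dump(records, dst, indent=2)  # brackets and commas added here
    return records

# Simulate a part file as a Spark executor would write it: one object per line,
# no commas, no surrounding brackets.
part = os.path.join(tempfile.mkdtemp(), "part-00000.json")
with open(part, "w") as f:
    f.write('{"val": "a", "dateint": 20190101}\n'
            '{"val": "b", "dateint": 20190102}\n')

out = part + ".array.json"
records = json_lines_to_array(part, out)
```

For large outputs the same idea can be done in a streaming fashion (write `[`, copy lines joined by commas, write `]`) instead of loading every record into memory.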
[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738190#comment-16738190 ] Fabian Höring commented on SPARK-25433: --- [~dongjoon] [~hyukjin.kwon] If you think the blog post is interesting to other spark users maybe it can be shared somehow on the mailing list or to watchers of this ticket SPARK-13587. I didn't want to spam that's why I have only referenced it here. > Add support for PEX in PySpark > -- > > Key: SPARK-25433 > URL: https://issues.apache.org/jira/browse/SPARK-25433 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.2 >Reporter: Fabian Höring >Priority: Minor > > The goal of this ticket is to ship and use custom code inside the spark > executors using [PEX|https://github.com/pantsbuild/pex] > This currently works fine with > [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] > (disadvantages are that you have a separate conda package repo and ship the > python interpreter all the time) > Basically the workflow is > * to zip the local conda environment ([conda > pack|https://github.com/conda/conda-pack] also works) > * ship it to each executor as an archive > * modify PYSPARK_PYTHON to the local conda environment > I think it can work the same way with virtual env. There is the SPARK-13587 > ticket to provide nice entry points to spark-submit and SparkContext but > zipping your local virtual env and then just changing PYSPARK_PYTHON env > variable should already work. > I also have seen this > [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. > But recreating the virtual env each time doesn't seem to be a very scalable > solution. If you have hundreds of executors it will retrieve the packages on > each excecutor and recreate your virtual environment each time. Same problem > with this proposal SPARK-16367 from what I understood. 
> Another problem with virtual env is that your local environment is not easily > shippable to another machine. In particular there is the relocatable option > (see > [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], > > [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)] > which makes it very complicated for the user to ship the virtual env and be > sure it works. > And here is where pex comes in. It is a nice way to create a single > executable zip file with all dependencies included. You have the pex command > line tool to build your package and when it is built you are sure it works. > This is in my opinion the most elegant way to ship python code (better than > virtual env and conda) > The problem why it doesn't work out of the box is that there can be only one > single entry point. So just shipping the pex files and setting PYSPARK_PYTHON > to the pex files doesn't work. You can nevertheless tune the env variable > [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] > and runtime to provide different entry points. > PR: [https://github.com/apache/spark/pull/22422/files] > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
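The entry-point tuning the description refers to can be sketched without Spark: PYSPARK_PYTHON points the executors at the pex file, and PEX_MODULE (documented in the pex variables reference linked above) overrides the pex archive's single entry point. The pex file name and module below are hypothetical examples.

```python
import os

def pex_spark_env(pex_file, entry_module):
    """Build the environment for running PySpark from a PEX archive, per the
    approach described above: executors invoke the pex as their Python
    interpreter, and PEX_MODULE selects the entry point at runtime."""
    env = dict(os.environ)
    env["PYSPARK_PYTHON"] = pex_file     # executors run python from the pex
    env["PEX_MODULE"] = entry_module     # override the pex's single entry point
    return env

# Hypothetical pex file and entry module, for illustration only.
env = pex_spark_env("./myapp.pex", "myapp.job:main")
```

The returned dict would then be passed to whatever launches spark-submit (e.g. `subprocess.run([...], env=env)`), so the setting reaches both the driver and, via the cluster manager, the executors.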
[jira] [Commented] (SPARK-25433) Add support for PEX in PySpark
[ https://issues.apache.org/jira/browse/SPARK-25433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738125#comment-16738125 ] Dongjoon Hyun commented on SPARK-25433: --- Thank you for the pointer, [~hyukjin.kwon] and [~fhoering]. > Add support for PEX in PySpark > -- > > Key: SPARK-25433 > URL: https://issues.apache.org/jira/browse/SPARK-25433 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.2.2 >Reporter: Fabian Höring >Priority: Minor > > The goal of this ticket is to ship and use custom code inside the spark > executors using [PEX|https://github.com/pantsbuild/pex] > This currently works fine with > [conda|https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html] > (disadvantages are that you have a separate conda package repo and ship the > python interpreter all the time) > Basically the workflow is > * to zip the local conda environment ([conda > pack|https://github.com/conda/conda-pack] also works) > * ship it to each executor as an archive > * modify PYSPARK_PYTHON to the local conda environment > I think it can work the same way with virtual env. There is the SPARK-13587 > ticket to provide nice entry points to spark-submit and SparkContext but > zipping your local virtual env and then just changing PYSPARK_PYTHON env > variable should already work. > I also have seen this > [blogpost|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]. > But recreating the virtual env each time doesn't seem to be a very scalable > solution. If you have hundreds of executors it will retrieve the packages on > each excecutor and recreate your virtual environment each time. Same problem > with this proposal SPARK-16367 from what I understood. > Another problem with virtual env is that your local environment is not easily > shippable to another machine. 
In particular there is the relocatable option > (see > [https://virtualenv.pypa.io/en/stable/userguide/#making-environments-relocatable], > > [https://stackoverflow.com/questions/7153113/virtualenv-relocatable-does-it-really-work)] > which makes it very complicated for the user to ship the virtual env and be > sure it works. > And here is where pex comes in. It is a nice way to create a single > executable zip file with all dependencies included. You have the pex command > line tool to build your package and when it is built you are sure it works. > This is in my opinion the most elegant way to ship python code (better than > virtual env and conda) > The problem why it doesn't work out of the box is that there can be only one > single entry point. So just shipping the pex files and setting PYSPARK_PYTHON > to the pex files doesn't work. You can nevertheless tune the env variable > [PEX_MODULE|https://pex.readthedocs.io/en/stable/api/index.html#module-pex.variables] > and runtime to provide different entry points. > PR: [https://github.com/apache/spark/pull/22422/files] > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26580) remove Scala 2.11 hack for Scala UDF
[ https://issues.apache.org/jira/browse/SPARK-26580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26580: Assignee: Wenchen Fan (was: Apache Spark) > remove Scala 2.11 hack for Scala UDF > > > Key: SPARK-26580 > URL: https://issues.apache.org/jira/browse/SPARK-26580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Assigned] (SPARK-26580) remove Scala 2.11 hack for Scala UDF
[ https://issues.apache.org/jira/browse/SPARK-26580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26580: Assignee: Apache Spark (was: Wenchen Fan) > remove Scala 2.11 hack for Scala UDF > > > Key: SPARK-26580 > URL: https://issues.apache.org/jira/browse/SPARK-26580 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major >
[jira] [Updated] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Oscar Bonilla updated SPARK-26365: -- Affects Version/s: 2.4.0 > spark-submit for k8s cluster doesn't propagate exit code > > > Key: SPARK-26365 > URL: https://issues.apache.org/jira/browse/SPARK-26365 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Submit >Affects Versions: 2.3.2, 2.4.0 >Reporter: Oscar Bonilla >Priority: Minor > > When launching apps using spark-submit in a kubernetes cluster, if the Spark > application fails (returns exit code = 1, for example), spark-submit will > still exit gracefully and return exit code = 0. > This is problematic, since there's no way to know if there's been a problem > with the Spark application.
[jira] [Commented] (SPARK-26576) Broadcast hint not applied to partitioned Parquet table
[ https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737964#comment-16737964 ] John Zhuge commented on SPARK-26576: No issue on the master branch. Please note "rightHint=(broadcast)" for the Join in Optimized Plan. {noformat} scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => df.join(broadcast(df), "dateint").explain(true)) == Parsed Logical Plan == 'Join UsingJoin(Inner,List(dateint)) :- SubqueryAlias `jzhuge`.`parquet_with_part` : +- Relation[val#34,dateint#35] parquet +- ResolvedHint (broadcast) +- SubqueryAlias `jzhuge`.`parquet_with_part` +- Relation[val#40,dateint#41] parquet == Analyzed Logical Plan == dateint: int, val: string, val: string Project [dateint#35, val#34, val#40] +- Join Inner, (dateint#35 = dateint#41) :- SubqueryAlias `jzhuge`.`parquet_with_part` : +- Relation[val#34,dateint#35] parquet +- ResolvedHint (broadcast) +- SubqueryAlias `jzhuge`.`parquet_with_part` +- Relation[val#40,dateint#41] parquet == Optimized Logical Plan == Project [dateint#35, val#34, val#40] +- Join Inner, (dateint#35 = dateint#41), rightHint=(broadcast) :- Project [val#34, dateint#35] : +- Filter isnotnull(dateint#35) : +- Relation[val#34,dateint#35] parquet +- Project [val#40, dateint#41] +- Filter isnotnull(dateint#41) +- Relation[val#40,dateint#41] parquet == Physical Plan == *(2) Project [dateint#35, val#34, val#40] +- *(2) BroadcastHashJoin [dateint#35], [dateint#41], Inner, BuildRight :- *(2) FileScan parquet jzhuge.parquet_with_part[val#34,dateint#35] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dateint#35)], PushedFilters: [], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, true] as bigint))) +- *(1) FileScan parquet jzhuge.parquet_with_part[val#40,dateint#41] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[], 
PartitionCount: 0, PartitionFilters: [isnotnull(dateint#41)], PushedFilters: [], ReadSchema: struct {noformat} >From a quick look at the source, EliminateResolvedHint pulls broadcast hint >into Join and eliminates the ResolvedHint node. It is called before >PruneFileSourcePartitions so the above code in >PhysicalOperation.collectProjectsAndFilters is never called on master branch. > Broadcast hint not applied to partitioned Parquet table > --- > > Key: SPARK-26576 > URL: https://issues.apache.org/jira/browse/SPARK-26576 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2, 2.4.0 >Reporter: John Zhuge >Priority: Major > > Broadcast hint is not applied to partitioned Parquet table. Below > "SortMergeJoin" is chosen incorrectly and "ResolvedHit(broadcast)" is removed > in Optimized Plan. > {noformat} > scala> spark.sql("CREATE TABLE jzhuge.parquet_with_part (val STRING) > PARTITIONED BY (dateint INT) STORED AS parquet") > scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") > scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => > df.join(broadcast(df), "dateint").explain(true)) > == Parsed Logical Plan == > 'Join UsingJoin(Inner,List(dateint)) > :- SubqueryAlias `jzhuge`.`parquet_with_part` > : +- Relation[val#28,dateint#29] parquet > +- ResolvedHint (broadcast) >+- SubqueryAlias `jzhuge`.`parquet_with_part` > +- Relation[val#32,dateint#33] parquet > == Analyzed Logical Plan == > dateint: int, val: string, val: string > Project [dateint#29, val#28, val#32] > +- Join Inner, (dateint#29 = dateint#33) >:- SubqueryAlias `jzhuge`.`parquet_with_part` >: +- Relation[val#28,dateint#29] parquet >+- ResolvedHint (broadcast) > +- SubqueryAlias `jzhuge`.`parquet_with_part` > +- Relation[val#32,dateint#33] parquet > == Optimized Logical Plan == > Project [dateint#29, val#28, val#32] > +- Join Inner, (dateint#29 = dateint#33) >:- Project [val#28, dateint#29] >: +- Filter isnotnull(dateint#29) >: +- Relation[val#28,dateint#29] parquet 
>+- Project [val#32, dateint#33] > +- Filter isnotnull(dateint#33) > +- Relation[val#32,dateint#33] parquet > == Physical Plan == > *(5) Project [dateint#29, val#28, val#32] > +- *(5) SortMergeJoin [dateint#29], [dateint#33], Inner >:- *(2) Sort [dateint#29 ASC NULLS FIRST], false, 0 >: +- Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, > 500), coordinator[target post-shuffle partition size: 67108864] >: +- *(1) FileScan parquet jzhuge.parquet_with_part[val#28,dateint#29] > Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], > PartitionCount: 0, PartitionFilters: