[jira] [Commented] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir
[ https://issues.apache.org/jira/browse/SPARK-31170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089274#comment-17089274 ] Wenchen Fan commented on SPARK-31170: - Yea, it's a long-standing bug, but the fix is non-trivial and seems a bit risky for 2.4. cc [~dongjoon] [~holdenkarau] > Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir > > > Key: SPARK-31170 > URL: https://issues.apache.org/jira/browse/SPARK-31170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao > Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > In Spark CLI, we create a hive CliSessionState and it does not load the > hive-site.xml. So the configurations in hive-site.xml will not take effect > as they do in other Spark-Hive integration apps. > Also, the warehouse directory is not correctly picked. If the `default` > database does not exist, the CliSessionState will create one the first > time it talks to the metastore. The `Location` of the default DB will be > neither the value of spark.sql.warehouse.dir nor the user-specified value of > hive.metastore.warehouse.dir, but the default value of > hive.metastore.warehouse.dir, which will always be `/user/hive/warehouse`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
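For context, the expected behaviour is that the warehouse location comes from hive-site.xml or spark.sql.warehouse.dir rather than the hard-coded Hive default. A minimal hive-site.xml sketch (the path below is illustrative, not a value taken from this issue):

```xml
<!-- hive-site.xml sketch; /apps/hive/warehouse is an illustrative path -->
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/apps/hive/warehouse</value>
  </property>
</configuration>
```

Note that since Spark 2.0, spark.sql.warehouse.dir is documented to take precedence over hive.metastore.warehouse.dir when both are set; the bug here is that the CLI's CliSessionState bypasses both and falls back to `/user/hive/warehouse`.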
[jira] [Created] (SPARK-31515) Canonicalize Cast should consider the value of needTimeZone
Yuanjian Li created SPARK-31515: --- Summary: Canonicalize Cast should consider the value of needTimeZone Key: SPARK-31515 URL: https://issues.apache.org/jira/browse/SPARK-31515 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: If we don't need to use `timeZone` information casting `from` type to `to` type, then the timeZoneId should not influence the canonicalize result. Reporter: Yuanjian Li
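A language-agnostic sketch of the idea (hypothetical names, not Spark's actual Cast code): when a cast between two types never consults the time zone, the timeZoneId should be dropped before computing the canonical form, so two otherwise-identical casts compare equal.

```python
# Hypothetical sketch of the proposed canonicalization; the set of
# time-zone-sensitive pairs below is illustrative, not Spark's real table.
TZ_SENSITIVE = {("string", "timestamp"), ("timestamp", "string"), ("date", "timestamp")}

def needs_time_zone(from_type, to_type):
    # Mirrors the role of Cast.needsTimeZone: only some casts use the zone.
    return (from_type, to_type) in TZ_SENSITIVE

def canonicalize(from_type, to_type, time_zone_id):
    # If the zone cannot affect the cast's result, it must not affect
    # the canonical form either.
    tz = time_zone_id if needs_time_zone(from_type, to_type) else None
    return (from_type, to_type, tz)
```

With this, `canonicalize("int", "long", "UTC")` and `canonicalize("int", "long", "Asia/Shanghai")` produce the same canonical form, while time-zone-sensitive casts still keep the zone.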
[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache
[ https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089245#comment-17089245 ] Zhou Jiashuai commented on SPARK-26385: --- I also use Spark Structured Streaming with Kafka. After I upgraded to spark-2.4.4-bin-hadoop2.7, I hit this problem; before that, with SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809, there was no problem. My problem is similar to [~rajeevkumar]'s. With spark-2.4.4-bin-hadoop2.7 (run in yarn cluster mode), new tokens are created after the application has been running for 18 hours. However, after 24 hours the application throws an exception and stops. The following is the first error log entry, which shows it picks the older token (sequenceNumber=11994) rather than the new token (sequenceNumber=12010) when it tries to write to the HDFS checkpoint location. {code:java}
20/04/16 23:21:36 ERROR Utils: Aborting task
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (token for username: HDFS_DELEGATION_TOKEN owner=usern...@bdp.com, renewer=yarn, realUser=, issueDate=1586962952299, maxDate=1587567752299, sequenceNumber=11994, masterKeyId=484) can't be found in cache
 at org.apache.hadoop.ipc.Client.call(Client.java:1475)
 at org.apache.hadoop.ipc.Client.call(Client.java:1412)
 at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
 at com.sun.proxy.$Proxy21.create(Unknown Source)
 at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
 at sun.reflect.GeneratedMethodAccessor65.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
 at com.sun.proxy.$Proxy22.create(Unknown Source)
 at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
 at org.apache.hadoop.hdfs.DFSClient.primitiveCreate(DFSClient.java:1750)
 at org.apache.hadoop.fs.Hdfs.createInternal(Hdfs.java:102)
 at org.apache.hadoop.fs.Hdfs.createInternal(Hdfs.java:58)
 at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:584)
 at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:686)
 at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:682)
 at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
 at org.apache.hadoop.fs.FileContext.create(FileContext.java:688)
 at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createTempFile(CheckpointFileManager.scala:311)
 at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.<init>(CheckpointFileManager.scala:133)
 at org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.<init>(CheckpointFileManager.scala:136)
 at org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createAtomic(CheckpointFileManager.scala:318)
 at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.deltaFileStream$lzycompute(HDFSBackedStateStoreProvider.scala:95)
 at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.deltaFileStream(HDFSBackedStateStoreProvider.scala:95)
 at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.compressedStream$lzycompute(HDFSBackedStateStoreProvider.scala:96)
 at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.compressedStream(HDFSBackedStateStoreProvider.scala:96)
 at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.put(HDFSBackedStateStoreProvider.scala:109)
 at org.apache.spark.sql.execution.streaming.state.StreamingAggregationStateManagerImplV1.put(StreamingAggregationStateManager.scala:121)
 at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3$$anon$2.getNext(statefulOperators.scala:381)
 at org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3$$anon$2.getNext(statefulOperators.scala:370)
 at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
 at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.agg_doAggregateWithKeys_0$(Unknown Source) at
[jira] [Updated] (SPARK-31514) Kerberos: Spark UGI credentials are not getting passed down to Hive
[ https://issues.apache.org/jira/browse/SPARK-31514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sanchay Javeria updated SPARK-31514: Description: I'm using Spark 2.4 on a Kerberos-enabled cluster where I'm trying to run a query via the {{spark-sql}} shell. The simplified setup looks like this: spark-sql shell running on one host in a Yarn cluster -> external hive-metastore running on one host -> S3 to store table data. When I launch the {{spark-sql}} shell with DEBUG logging enabled, this is what I see in the logs: {code:java}
> bin/spark-sql --proxy-user proxy_user
...
DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for proxy_user against hive/_h...@realm.com at thrift://hive-metastore:9083
DEBUG UserGroupInformation: PrivilegedAction as:spark/spark_h...@realm.com (auth:KERBEROS) from:org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130){code} This means that Spark made a call to fetch the delegation token from the Hive metastore and then added it to the list of credentials for the UGI. [This is the piece of code|https://github.com/apache/spark/blob/branch-2.4/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L129] that does that. I also verified in the metastore logs that the {{get_delegation_token()}} call was being made. Now when I run a simple query like {{create table test_table (id int) location "s3://some/prefix";}} I get hit with an AWS credentials error. I modified the Hive metastore code and added this right before the file system in Hadoop is initialized ([org/apache/hadoop/hive/metastore/Warehouse.java|#L116]): {code:java}
public static FileSystem getFs(Path f, Configuration conf) throws MetaException {
  try {
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    LOG.info("UGI information: " + ugi);
    Collection<Token<? extends TokenIdentifier>> tokens = ugi.getCredentials().getAllTokens();
    for (Token<? extends TokenIdentifier> token : tokens) {
      LOG.info(token);
    }
  } catch (IOException e) {
    e.printStackTrace();
  }
  ...
{code} In the metastore logs, this does print the correct UGI information: {code:java}
UGI information: proxy_user (auth:PROXY) via hive/hive-metast...@realm.com (auth:KERBEROS){code} but there are no tokens present in the UGI. It looks like the [Spark code|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala#L101] adds it with the alias {{hive.server2.delegation.token}}, but I don't see it in the UGI. This makes me suspect that somehow the UGI scope is isolated and not shared between spark-sql and the Hive metastore. How do I go about solving this? Any help will be really appreciated!
[jira] [Created] (SPARK-31514) Kerberos: Spark UGI credentials are not getting passed down to Hive
Sanchay Javeria created SPARK-31514: --- Summary: Kerberos: Spark UGI credentials are not getting passed down to Hive Key: SPARK-31514 URL: https://issues.apache.org/jira/browse/SPARK-31514 Project: Spark Issue Type: Question Components: SQL Affects Versions: 2.4.4 Reporter: Sanchay Javeria I'm using Spark 2.4 on a Kerberos-enabled cluster where I'm trying to run a query via the {{spark-sql}} shell. The simplified setup looks like this: spark-sql shell running on one host in a Yarn cluster -> external hive-metastore running on one host -> S3 to store table data. When I launch the {{spark-sql}} shell with DEBUG logging enabled, this is what I see in the logs: {code:java}
> bin/spark-sql --proxy-user proxy_user
...
DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for proxy_user against hive/_h...@realm.com at thrift://hive-metastore:9083
DEBUG UserGroupInformation: PrivilegedAction as:spark/spark_h...@realm.com (auth:KERBEROS) from:org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130){code} This means that Spark made a call to fetch the delegation token from the Hive metastore and then added it to the list of credentials for the UGI. [This is the piece of code|https://github.com/apache/spark/blob/branch-2.4/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L129] that does that. I also verified in the metastore logs that the {{get_delegation_token()}} call was being made. Now when I run a simple query like {{create table test_table (id int) location "s3://some/prefix";}} I get hit with an AWS credentials error. I modified the Hive metastore code and added this right before the file system in Hadoop is initialized ([org/apache/hadoop/hive/metastore/Warehouse.java|#L116]): {code:java}
public static FileSystem getFs(Path f, Configuration conf) throws MetaException {
  try {
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    LOG.info("UGI information: " + ugi);
    Collection<Token<? extends TokenIdentifier>> tokens = ugi.getCredentials().getAllTokens();
    for (Token<? extends TokenIdentifier> token : tokens) {
      LOG.info(token);
    }
  } catch (IOException e) {
    e.printStackTrace();
  }
  ...
{code} In the metastore logs, this does print the correct UGI information: {code:java}
UGI information: proxy_user (auth:PROXY) via hive/hive-metast...@realm.com (auth:KERBEROS){code} but there are no tokens present in the UGI. It looks like the [Spark code|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala#L101] adds it with the alias {{hive.server2.delegation.token}}, but I don't see it in the UGI. This makes me suspect that somehow the UGI scope is isolated and not shared between spark-sql and the Hive metastore. How do I go about solving this? Any help will be really appreciated!
[jira] [Updated] (SPARK-31509) Recommend user to disable delay scheduling for barrier taskset
[ https://issues.apache.org/jira/browse/SPARK-31509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31509: - Affects Version/s: (was: 3.1.0) 2.4.0 > Recommend user to disable delay scheduling for barrier taskset > -- > > Key: SPARK-31509 > URL: https://issues.apache.org/jira/browse/SPARK-31509 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: wuyi >Priority: Major > > Currently, a barrier taskset can only be scheduled when all tasks are launched > at the same time. As a result, delay scheduling becomes the biggest obstacle > to getting a barrier taskset scheduled, and an application with a barrier taskset can easily > fail when delay scheduling is on. So we should probably > recommend that users disable delay scheduling while using barrier > execution. > > BTW, we could still support delay scheduling with barrier tasksets if we find > a way to resolve SPARK-24818
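As a sketch of what that recommendation would look like in practice: delay scheduling in Spark is driven by the locality wait, so disabling it amounts to setting the wait to zero in spark-defaults.conf (the value below is the obvious choice for disabling it, not one taken from this issue):

```properties
# spark-defaults.conf sketch: disable delay (locality-wait) scheduling,
# e.g. for jobs that use barrier tasksets
spark.locality.wait=0s
```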
[jira] [Updated] (SPARK-31507) Remove millennium, century, decade, millisecond, microsecond and epoch from extract function
[ https://issues.apache.org/jira/browse/SPARK-31507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31507: Description: Extracting millennium, century, decade, millisecond, microsecond and epoch from datetime is neither ANSI standard nor quite common in modern SQL platforms. None of the systems listed below support these except PostgreSQL. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm https://prestodb.io/docs/current/functions/datetime.html https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/SIkE2wnHyQBnU4AGWRZSRw > Remove millennium, century, decade, millisecond, microsecond and epoch from > extract function > > > Key: SPARK-31507 > URL: https://issues.apache.org/jira/browse/SPARK-31507 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > > Extracting millennium, century, decade, millisecond, microsecond and epoch > from datetime is neither ANSI standard nor quite common in modern SQL > platforms. None of the systems listed below support these except PostgreSQL. > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF > https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm > https://prestodb.io/docs/current/functions/datetime.html > https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html > https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts > https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT > https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/SIkE2wnHyQBnU4AGWRZSRw
[jira] [Resolved] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe
[ https://issues.apache.org/jira/browse/SPARK-31511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31511. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28286 [https://github.com/apache/spark/pull/28286] > Make BytesToBytesMap iterator() thread-safe > --- > > Key: SPARK-31511 > URL: https://issues.apache.org/jira/browse/SPARK-31511 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.1.0 > > > BytesToBytesMap currently has a thread safe and unsafe iterator method. This > is somewhat confusing. We should make iterator() thread safe and remove the > safeIterator() function.
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089190#comment-17089190 ] Marcelo Masiero Vanzin commented on SPARK-1537: --- Well, the only thing to start with is the existing SHS code. EventLoggingListener + FsHistoryProvider. > Add integration with Yarn's Application Timeline Server > --- > > Key: SPARK-1537 > URL: https://issues.apache.org/jira/browse/SPARK-1537 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Marcelo Masiero Vanzin >Priority: Major > Attachments: SPARK-1537.txt, spark-1573.patch > > > It would be nice to have Spark integrate with Yarn's Application Timeline > Server (see YARN-321, YARN-1530). This would allow users running Spark on > Yarn to have a single place to go for all their history needs, and avoid > having to manage a separate service (Spark's built-in server). > At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, > although there is still some ongoing work. But the basics are there, and I > wouldn't expect them to change (much) at this point.
[jira] [Commented] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089179#comment-17089179 ] Erik Krogen commented on SPARK-31418: - PR was posted by [~vsowrirajan] here: https://github.com/apache/spark/pull/28287 > Blacklisting feature aborts Spark job without retrying for max num retries in > case of Dynamic allocation > > > Key: SPARK-31418 > URL: https://issues.apache.org/jira/browse/SPARK-31418 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0, 2.4.5 >Reporter: Venkata krishnan Sowrirajan >Priority: Major > > With Spark blacklisting, if a task fails on an executor, the executor gets > blacklisted for the task. In order to retry the task, it checks whether there are > idle blacklisted executors which can be killed and replaced to retry the task; > if not, it aborts the job without doing the max retries. > In the context of dynamic allocation this could be handled better: instead of killing > a blacklisted idle executor (it's possible there are no idle blacklisted > executors), request an additional executor and retry the task. > This can be easily reproduced with a simple job like the one below, although this > example should fail eventually; it is just to show that it's not retried > spark.task.maxFailures times: > {code:java} > def test(a: Int) = { a.asInstanceOf[String] } > sc.parallelize(1 to 10, 10).map(x => test(x)).collect > {code} > with dynamic allocation enabled and min executors set to 1. But there are > various other cases where this can fail as well.
[jira] [Resolved] (SPARK-31349) Document builtin aggregate function
[ https://issues.apache.org/jira/browse/SPARK-31349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31349. -- Fix Version/s: 3.0.0 Assignee: Takeshi Yamamuro Resolution: Fixed Fixed in SPARK-31429 > Document builtin aggregate function > --- > > Key: SPARK-31349 > URL: https://issues.apache.org/jira/browse/SPARK-31349 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: kevin yu >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 3.0.0 > >
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089159#comment-17089159 ] Daniel Templeton commented on SPARK-1537: - Thanks for the response, [~vanzin]. Yeah, I think we would be interested in exploring the work involved to do the integration. We're in the process of introducing Spark into Hadoop clusters that primarily run Scalding today. We're using ATSv2 as the store for all of the Scalding metrics, so it would make sense for us to do the same with Spark. Any required reading that we should do as we decide how best to tackle this? Pointers, tips, tricks, potholes, or any other info would be welcome. Thanks! > Add integration with Yarn's Application Timeline Server > --- > > Key: SPARK-1537 > URL: https://issues.apache.org/jira/browse/SPARK-1537 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Marcelo Masiero Vanzin >Priority: Major > Attachments: SPARK-1537.txt, spark-1573.patch > > > It would be nice to have Spark integrate with Yarn's Application Timeline > Server (see YARN-321, YARN-1530). This would allow users running Spark on > Yarn to have a single place to go for all their history needs, and avoid > having to manage a separate service (Spark's built-in server). > At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, > although there is still some ongoing work. But the basics are there, and I > wouldn't expect them to change (much) at this point.
[jira] [Resolved] (SPARK-31369) Add Documentation for JSON functions
[ https://issues.apache.org/jira/browse/SPARK-31369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31369. -- Fix Version/s: 3.0.0 Assignee: Takeshi Yamamuro Resolution: Fixed Fixed in SPARK-31429. > Add Documentation for JSON functions > > > Key: SPARK-31369 > URL: https://issues.apache.org/jira/browse/SPARK-31369 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Rakesh Raushan >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.0.0 > >
[jira] [Created] (SPARK-31513) Improve the structure of auto-generated built-in function pages in SQL references
Takeshi Yamamuro created SPARK-31513: Summary: Improve the structure of auto-generated built-in function pages in SQL references Key: SPARK-31513 URL: https://issues.apache.org/jira/browse/SPARK-31513 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 3.1.0 Reporter: Takeshi Yamamuro We now put examples per group in one section in the built-in function documents like this (see SPARK-31429): {code} ### Aggregate Functions Examples ### Window Functions Examples ... {code} But it might be better to put everything about one function (signature, description, examples, etc.) together in one place. So we need more discussion to improve the structure of the page.
[jira] [Created] (SPARK-31512) make window function using order by optional
philipse created SPARK-31512: --- Summary: make window function using order by optional Key: SPARK-31512 URL: https://issues.apache.org/jira/browse/SPARK-31512 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.5 Environment: spark2.4.5 hadoop2.7.7 Reporter: philipse Hi all, In other SQL dialects, ORDER BY is not required when using a window function; we could make it optional here too. Below is the case: *select row_number()over() from test1* Error: org.apache.spark.sql.AnalysisException: Window function row_number() requires window to be ordered, please add ORDER BY clause. For example SELECT row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY window_ordering) from table; (state=,code=0) So I suggest making it optional; otherwise we will hit this error when migrating SQL from other dialects, such as Hive.
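For comparison, SQLite (3.25+) is one dialect that accepts row_number() with an empty OVER (), which is the behaviour the reporter expects; the sketch below assumes a Python build linked against a recent SQLite, and the table contents are illustrative:

```python
import sqlite3

# SQLite >= 3.25 accepts an empty OVER (); the numbering order is then
# unspecified, but each row still gets a distinct number 1..N.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test1 (id INTEGER)")
conn.executemany("INSERT INTO test1 VALUES (?)", [(10,), (20,), (30,)])
rows = [r[0] for r in conn.execute("SELECT row_number() OVER () FROM test1")]
print(sorted(rows))  # prints [1, 2, 3]
```

Spark instead rejects the empty window specification at analysis time, which is the incompatibility this issue asks to relax.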
[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server
[ https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089137#comment-17089137 ] Marcelo Masiero Vanzin commented on SPARK-1537: --- [~templedf] sorry forgot to reply. ATSv1 wasn't a good match for this, and by the time ATSv2 was developed, interest in this feature had long lost traction in the Spark community. So this was closed. Also you probably can do this without requiring the code to live in Spark. But if you actually want to contribute the integration, there's nothing preventing you from opening a new bug and posting a PR. > Add integration with Yarn's Application Timeline Server > --- > > Key: SPARK-1537 > URL: https://issues.apache.org/jira/browse/SPARK-1537 > Project: Spark > Issue Type: New Feature > Components: YARN >Reporter: Marcelo Masiero Vanzin >Priority: Major > Attachments: SPARK-1537.txt, spark-1573.patch > > > It would be nice to have Spark integrate with Yarn's Application Timeline > Server (see YARN-321, YARN-1530). This would allow users running Spark on > Yarn to have a single place to go for all their history needs, and avoid > having to manage a separate service (Spark's built-in server). > At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, > although there is still some ongoing work. But the basics are there, and I > wouldn't expect them to change (much) at this point.
[jira] [Updated] (SPARK-31233) Enhance RpcTimeoutException Log Message
[ https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31233: -- Summary: Enhance RpcTimeoutException Log Message (was: TimeoutException contains null remoteAddr) > Enhance RpcTimeoutException Log Message > --- > > Key: SPARK-31233 > URL: https://issues.apache.org/jira/browse/SPARK-31233 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Yi Huang >Assignee: Yi Huang >Priority: Minor > Fix For: 3.1.0 > > > Application log: > {code:java} > Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures > timed out after [800 seconds]. This timeout is controlled by > spark.network.timeout: > org.apache.spark.rpc.RpcTimeout.org > $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) > {code} > Driver log: > {code:java} > [block-manager-ask-thread-pool-149] WARN > org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - > Cannot receive any reply from null in 800 seconds. This timeout is controlled > by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot > receive any reply from null in 800 seconds. This timeout is controlled by > spark.network.timeout{code} > The log message does not provide the RpcAddress of the destination RpcEndpoint. > It is due to > {noformat} > * The `rpcAddress` may be null, in which case the endpoint is registered via > a client-only > * connection and can only be reached via the client that sent the endpoint > reference.{noformat} > Solution: > use the rpcAddress from the client of the NettyRpcEndpointRef when the > endpoint resides in client mode. >
[jira] [Assigned] (SPARK-31233) TimeoutException contains null remoteAddr
[ https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31233: - Assignee: Yi Huang > TimeoutException contains null remoteAddr > - > > Key: SPARK-31233 > URL: https://issues.apache.org/jira/browse/SPARK-31233 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Yi Huang >Assignee: Yi Huang >Priority: Minor > > Application log: > {code:java} > Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures > timed out after [800 seconds]. This timeout is controlled by > spark.network.timeout: > org.apache.spark.rpc.RpcTimeout.org > $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) > {code} > Driver log: > {code:java} > [block-manager-ask-thread-pool-149] WARN > org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - > Cannot receive any reply from null in 800 seconds. This timeout is controlled > by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot > receive any reply from null in 800 seconds. This timeout is controlled by > spark.network.timeout{code} > The log message does not provide RpcAddress of the destination RpcEndpoint. > It is due to > {noformat} > * The `rpcAddress` may be null, in which case the endpoint is registered via > a client-only > * connection and can only be reached via the client that sent the endpoint > reference.{noformat} > Solution: > using rpcAdress from client of the NettyRpcEndpoingRef once such endpoint > resides in client mode. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31233) TimeoutException contains null remoteAddr
[ https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31233. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28002 [https://github.com/apache/spark/pull/28002] > TimeoutException contains null remoteAddr > - > > Key: SPARK-31233 > URL: https://issues.apache.org/jira/browse/SPARK-31233 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Yi Huang >Assignee: Yi Huang >Priority: Minor > Fix For: 3.1.0 > > > Application log: > {code:java} > Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures > timed out after [800 seconds]. This timeout is controlled by > spark.network.timeout: > org.apache.spark.rpc.RpcTimeout.org > $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) > {code} > Driver log: > {code:java} > [block-manager-ask-thread-pool-149] WARN > org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - > Cannot receive any reply from null in 800 seconds. This timeout is controlled > by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot > receive any reply from null in 800 seconds. This timeout is controlled by > spark.network.timeout{code} > The log message does not provide RpcAddress of the destination RpcEndpoint. > It is due to > {noformat} > * The `rpcAddress` may be null, in which case the endpoint is registered via > a client-only > * connection and can only be reached via the client that sent the endpoint > reference.{noformat} > Solution: > using rpcAdress from client of the NettyRpcEndpoingRef once such endpoint > resides in client mode. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31233) TimeoutException contains null remoteAddr
[ https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31233: -- Issue Type: Improvement (was: Bug) > TimeoutException contains null remoteAddr > - > > Key: SPARK-31233 > URL: https://issues.apache.org/jira/browse/SPARK-31233 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.4 >Reporter: Yi Huang >Priority: Minor > > Application log: > {code:java} > Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures > timed out after [800 seconds]. This timeout is controlled by > spark.network.timeout: > org.apache.spark.rpc.RpcTimeout.org > $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) > {code} > Driver log: > {code:java} > [block-manager-ask-thread-pool-149] WARN > org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - > Cannot receive any reply from null in 800 seconds. This timeout is controlled > by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot > receive any reply from null in 800 seconds. This timeout is controlled by > spark.network.timeout{code} > The log message does not provide RpcAddress of the destination RpcEndpoint. > It is due to > {noformat} > * The `rpcAddress` may be null, in which case the endpoint is registered via > a client-only > * connection and can only be reached via the client that sent the endpoint > reference.{noformat} > Solution: > using rpcAdress from client of the NettyRpcEndpoingRef once such endpoint > resides in client mode. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31233) TimeoutException contains null remoteAddr
[ https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31233: -- Affects Version/s: (was: 2.3.4) 3.1.0 > TimeoutException contains null remoteAddr > - > > Key: SPARK-31233 > URL: https://issues.apache.org/jira/browse/SPARK-31233 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Yi Huang >Priority: Minor > > Application log: > {code:java} > Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures > timed out after [800 seconds]. This timeout is controlled by > spark.network.timeout: > org.apache.spark.rpc.RpcTimeout.org > $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) > {code} > Driver log: > {code:java} > [block-manager-ask-thread-pool-149] WARN > org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - > Cannot receive any reply from null in 800 seconds. This timeout is controlled > by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot > receive any reply from null in 800 seconds. This timeout is controlled by > spark.network.timeout{code} > The log message does not provide RpcAddress of the destination RpcEndpoint. > It is due to > {noformat} > * The `rpcAddress` may be null, in which case the endpoint is registered via > a client-only > * connection and can only be reached via the client that sent the endpoint > reference.{noformat} > Solution: > using rpcAdress from client of the NettyRpcEndpoingRef once such endpoint > resides in client mode. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation
[ https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30162. --- Fix Version/s: 3.0.0 Resolution: Fixed Please feel free to reopen this if you are facing a bug with 3.0.0-RC1. For me, it looks correctly. > Add PushedFilters to metadata in Parquet DSv2 implementation > > > Key: SPARK-30162 > URL: https://issues.apache.org/jira/browse/SPARK-30162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: pyspark 3.0 preview > Ubuntu/Centos > pyarrow 0.14.1 >Reporter: Nasir Ali >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from > 2020-01-01 21-01-32.png, partition_pruning.png > > > Filters are not pushed down in Spark 3.0 preview. Also the output of > "explain" method is different. It is hard to debug in 3.0 whether filters > were pushed down or not. Below code could reproduce the bug: > > {code:java} > // code placeholder > df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"), > ("usr1",13.00, "2018-03-11T12:27:18+00:00"), > ("usr1",25.00, "2018-03-12T11:27:18+00:00"), > ("usr1",20.00, "2018-03-13T15:27:18+00:00"), > ("usr1",17.00, "2018-03-14T12:27:18+00:00"), > ("usr2",99.00, "2018-03-15T11:27:18+00:00"), > ("usr2",156.00, "2018-03-22T11:27:18+00:00"), > ("usr2",17.00, "2018-03-31T11:27:18+00:00"), > ("usr2",25.00, "2018-03-15T11:27:18+00:00"), > ("usr2",25.00, "2018-03-16T11:27:18+00:00") > ], >["user","id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = > spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True) > {code} > {code:java} > // Spark 2.4 output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#40 = usr2) 
> +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan == > Filter (isnotnull(user#40) && (user#40 = usr2)) > +- Relation[id#38,ts#39,user#40] parquet== Physical Plan == > *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, > PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], > ReadSchema: struct{code} > {code:java} > // Spark 3.0.0-preview output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed > Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#2 = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized > Logical Plan == > Filter (isnotnull(user#2) AND (user#2 = usr2)) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical > Plan == > *(1) Project [id#0, ts#1, user#2] > +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2)) >+- *(1) ColumnarToRow > +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: > InMemoryFileIndex[file:/home/cnali/data], ReadSchema: > struct > {code} > I have tested it on much larger dataset. Spark 3.0 tries to load whole data > and then apply filter. Whereas Spark 2.4 push down the filter. Above output > shows that Spark 2.4 applied partition filter but not the Spark 3.0 preview. > > Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and > it's hard to debug. spark.sql.orc.cache.stripe.details.size=1 doesn't > work. > > {code:java} > // pyspark 3 shell output > $ pyspark > Python 3.6.8 (default, Aug 7 2019, 17:28:10) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux > Type "help", "copyright", "credits" or "license" for more information. 
> Warning: Ignoring non-spark config property: > java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8 > 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be >
[jira] [Commented] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation
[ https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089041#comment-17089041 ] Dongjoon Hyun commented on SPARK-30162: --- In 3.0.0-RC1, I can see `number of partitions read: 1`. !partition_pruning.png! > Add PushedFilters to metadata in Parquet DSv2 implementation > > > Key: SPARK-30162 > URL: https://issues.apache.org/jira/browse/SPARK-30162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: pyspark 3.0 preview > Ubuntu/Centos > pyarrow 0.14.1 >Reporter: Nasir Ali >Assignee: Hyukjin Kwon >Priority: Minor > Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from > 2020-01-01 21-01-32.png, partition_pruning.png > > > Filters are not pushed down in Spark 3.0 preview. Also the output of > "explain" method is different. It is hard to debug in 3.0 whether filters > were pushed down or not. Below code could reproduce the bug: > > {code:java} > // code placeholder > df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"), > ("usr1",13.00, "2018-03-11T12:27:18+00:00"), > ("usr1",25.00, "2018-03-12T11:27:18+00:00"), > ("usr1",20.00, "2018-03-13T15:27:18+00:00"), > ("usr1",17.00, "2018-03-14T12:27:18+00:00"), > ("usr2",99.00, "2018-03-15T11:27:18+00:00"), > ("usr2",156.00, "2018-03-22T11:27:18+00:00"), > ("usr2",17.00, "2018-03-31T11:27:18+00:00"), > ("usr2",25.00, "2018-03-15T11:27:18+00:00"), > ("usr2",25.00, "2018-03-16T11:27:18+00:00") > ], >["user","id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = > spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True) > {code} > {code:java} > // Spark 2.4 output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#40 = usr2) > +- Relation[id#38,ts#39,user#40] parquet== 
Optimized Logical Plan == > Filter (isnotnull(user#40) && (user#40 = usr2)) > +- Relation[id#38,ts#39,user#40] parquet== Physical Plan == > *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, > PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], > ReadSchema: struct{code} > {code:java} > // Spark 3.0.0-preview output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed > Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#2 = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized > Logical Plan == > Filter (isnotnull(user#2) AND (user#2 = usr2)) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical > Plan == > *(1) Project [id#0, ts#1, user#2] > +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2)) >+- *(1) ColumnarToRow > +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: > InMemoryFileIndex[file:/home/cnali/data], ReadSchema: > struct > {code} > I have tested it on much larger dataset. Spark 3.0 tries to load whole data > and then apply filter. Whereas Spark 2.4 push down the filter. Above output > shows that Spark 2.4 applied partition filter but not the Spark 3.0 preview. > > Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and > it's hard to debug. spark.sql.orc.cache.stripe.details.size=1 doesn't > work. > > {code:java} > // pyspark 3 shell output > $ pyspark > Python 3.6.8 (default, Aug 7 2019, 17:28:10) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux > Type "help", "copyright", "credits" or "license" for more information. 
> Warning: Ignoring non-spark config property: > java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8 > 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be > overridden by the value set by the cluster manager (via
[jira] [Updated] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation
[ https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30162: -- Attachment: partition_pruning.png > Add PushedFilters to metadata in Parquet DSv2 implementation > > > Key: SPARK-30162 > URL: https://issues.apache.org/jira/browse/SPARK-30162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: pyspark 3.0 preview > Ubuntu/Centos > pyarrow 0.14.1 >Reporter: Nasir Ali >Assignee: Hyukjin Kwon >Priority: Minor > Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from > 2020-01-01 21-01-32.png, partition_pruning.png > > > Filters are not pushed down in Spark 3.0 preview. Also the output of > "explain" method is different. It is hard to debug in 3.0 whether filters > were pushed down or not. Below code could reproduce the bug: > > {code:java} > // code placeholder > df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"), > ("usr1",13.00, "2018-03-11T12:27:18+00:00"), > ("usr1",25.00, "2018-03-12T11:27:18+00:00"), > ("usr1",20.00, "2018-03-13T15:27:18+00:00"), > ("usr1",17.00, "2018-03-14T12:27:18+00:00"), > ("usr2",99.00, "2018-03-15T11:27:18+00:00"), > ("usr2",156.00, "2018-03-22T11:27:18+00:00"), > ("usr2",17.00, "2018-03-31T11:27:18+00:00"), > ("usr2",25.00, "2018-03-15T11:27:18+00:00"), > ("usr2",25.00, "2018-03-16T11:27:18+00:00") > ], >["user","id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = > spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True) > {code} > {code:java} > // Spark 2.4 output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#40 = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan == > Filter (isnotnull(user#40) && (user#40 = usr2)) > +- 
Relation[id#38,ts#39,user#40] parquet== Physical Plan == > *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, > PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], > ReadSchema: struct{code} > {code:java} > // Spark 3.0.0-preview output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed > Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#2 = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized > Logical Plan == > Filter (isnotnull(user#2) AND (user#2 = usr2)) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical > Plan == > *(1) Project [id#0, ts#1, user#2] > +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2)) >+- *(1) ColumnarToRow > +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: > InMemoryFileIndex[file:/home/cnali/data], ReadSchema: > struct > {code} > I have tested it on much larger dataset. Spark 3.0 tries to load whole data > and then apply filter. Whereas Spark 2.4 push down the filter. Above output > shows that Spark 2.4 applied partition filter but not the Spark 3.0 preview. > > Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and > it's hard to debug. spark.sql.orc.cache.stripe.details.size=1 doesn't > work. > > {code:java} > // pyspark 3 shell output > $ pyspark > Python 3.6.8 (default, Aug 7 2019, 17:28:10) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Ignoring non-spark config property: > java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8 > 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be > overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in > mesos/standalone/kubernetes and LOCAL_DIRS in YARN). > Welcome to >
[jira] [Comment Edited] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation
[ https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089031#comment-17089031 ] Dongjoon Hyun edited comment on SPARK-30162 at 4/21/20, 8:25 PM: - Hi, [~sowen]. Is there a reason for you to remove the fixed version? It's fixed since `3.0.0-preview2`, isn't it? **Apache Spark 3.0.0-preview2** {code} >>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo") >>> spark.read.parquet("/tmp/foo").filter("5 > id").explain() == Physical Plan == *(1) Project [id#3L] +- *(1) Filter (isnotnull(id#3L) AND (5 > id#3L)) +- *(1) ColumnarToRow +- BatchScan[id#3L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct, PushedFilters: [IsNotNull(id), LessThan(id,5)] {code} **Apache Spark 3.0.0-RC1** {code} >>> sql("set spark.sql.sources.useV1SourceList=''") DataFrame[key: string, value: string] >>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo") >>> spark.read.parquet("/tmp/foo").filter("5 > id").explain() == Physical Plan == *(1) Project [id#53L] +- *(1) Filter (isnotnull(id#53L) AND (5 > id#53L)) +- *(1) ColumnarToRow +- BatchScan[id#53L] ParquetScan DataFilters: [isnotnull(id#53L), (5 > id#53L)], Location: InMemoryFileIndex[file:/tmp/foo], PartitionFilters: [], ReadSchema: struct, PushedFilters: [IsNotNull(id), LessThan(id,5)] {code} was (Author: dongjoon): Hi, [~sowen]. Is there a reason for you to remove the fixed version? It's fixed since `3.0.0-preview2`, isn't it? 
**Apache Spark 3.0.0-preview2** {code} >>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo") >>> spark.read.parquet("/tmp/foo").filter("5 > id").explain() == Physical Plan == *(1) Project [id#3L] +- *(1) Filter (isnotnull(id#3L) AND (5 > id#3L)) +- *(1) ColumnarToRow +- BatchScan[id#3L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct, PushedFilters: [IsNotNull(id), LessThan(id,5)] {code} **Apache Spark 3.0.0-RC1** {code} >>> sql("set spark.sql.sources.useV1SourceList") DataFrame[key: string, value: string] >>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo") >>> spark.read.parquet("/tmp/foo").filter("5 > id").explain() == Physical Plan == *(1) Project [id#13L] +- *(1) Filter (isnotnull(id#13L) AND (5 > id#13L)) +- *(1) ColumnarToRow +- FileScan parquet [id#13L] Batched: true, DataFilters: [isnotnull(id#13L), (5 > id#13L)], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/foo], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,5)], ReadSchema: struct {code} > Add PushedFilters to metadata in Parquet DSv2 implementation > > > Key: SPARK-30162 > URL: https://issues.apache.org/jira/browse/SPARK-30162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: pyspark 3.0 preview > Ubuntu/Centos > pyarrow 0.14.1 >Reporter: Nasir Ali >Assignee: Hyukjin Kwon >Priority: Minor > Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from > 2020-01-01 21-01-32.png > > > Filters are not pushed down in Spark 3.0 preview. Also the output of > "explain" method is different. It is hard to debug in 3.0 whether filters > were pushed down or not. 
Below code could reproduce the bug: > > {code:java} > // code placeholder > df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"), > ("usr1",13.00, "2018-03-11T12:27:18+00:00"), > ("usr1",25.00, "2018-03-12T11:27:18+00:00"), > ("usr1",20.00, "2018-03-13T15:27:18+00:00"), > ("usr1",17.00, "2018-03-14T12:27:18+00:00"), > ("usr2",99.00, "2018-03-15T11:27:18+00:00"), > ("usr2",156.00, "2018-03-22T11:27:18+00:00"), > ("usr2",17.00, "2018-03-31T11:27:18+00:00"), > ("usr2",25.00, "2018-03-15T11:27:18+00:00"), > ("usr2",25.00, "2018-03-16T11:27:18+00:00") > ], >["user","id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = > spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True) > {code} > {code:java} > // Spark 2.4 output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#40 = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan == > Filter (isnotnull(user#40) && (user#40 = usr2)) > +- Relation[id#38,ts#39,user#40] parquet== Physical Plan == > *(1)
[jira] [Comment Edited] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation
[ https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089037#comment-17089037 ] Dongjoon Hyun edited comment on SPARK-30162 at 4/21/20, 8:18 PM: - Could you try 3.0.0-RC1, [~nasirali]? - https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/ was (Author: dongjoon): Could you try 3.0.0-RC1, [~nasirali]? > Add PushedFilters to metadata in Parquet DSv2 implementation > > > Key: SPARK-30162 > URL: https://issues.apache.org/jira/browse/SPARK-30162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: pyspark 3.0 preview > Ubuntu/Centos > pyarrow 0.14.1 >Reporter: Nasir Ali >Assignee: Hyukjin Kwon >Priority: Minor > Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from > 2020-01-01 21-01-32.png > > > Filters are not pushed down in Spark 3.0 preview. Also the output of > "explain" method is different. It is hard to debug in 3.0 whether filters > were pushed down or not. 
Below code could reproduce the bug: > > {code:java} > // code placeholder > df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"), > ("usr1",13.00, "2018-03-11T12:27:18+00:00"), > ("usr1",25.00, "2018-03-12T11:27:18+00:00"), > ("usr1",20.00, "2018-03-13T15:27:18+00:00"), > ("usr1",17.00, "2018-03-14T12:27:18+00:00"), > ("usr2",99.00, "2018-03-15T11:27:18+00:00"), > ("usr2",156.00, "2018-03-22T11:27:18+00:00"), > ("usr2",17.00, "2018-03-31T11:27:18+00:00"), > ("usr2",25.00, "2018-03-15T11:27:18+00:00"), > ("usr2",25.00, "2018-03-16T11:27:18+00:00") > ], >["user","id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = > spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True) > {code} > {code:java} > // Spark 2.4 output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#40 = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan == > Filter (isnotnull(user#40) && (user#40 = usr2)) > +- Relation[id#38,ts#39,user#40] parquet== Physical Plan == > *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, > PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], > ReadSchema: struct{code} > {code:java} > // Spark 3.0.0-preview output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed > Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#2 = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized > Logical Plan == > Filter (isnotnull(user#2) AND (user#2 = usr2)) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical > Plan == > *(1) Project [id#0, ts#1, user#2] 
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2)) >+- *(1) ColumnarToRow > +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: > InMemoryFileIndex[file:/home/cnali/data], ReadSchema: > struct > {code} > I have tested it on much larger dataset. Spark 3.0 tries to load whole data > and then apply filter. Whereas Spark 2.4 push down the filter. Above output > shows that Spark 2.4 applied partition filter but not the Spark 3.0 preview. > > Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and > it's hard to debug. spark.sql.orc.cache.stripe.details.size=1 doesn't > work. > > {code:java} > // pyspark 3 shell output > $ pyspark > Python 3.6.8 (default, Aug 7 2019, 17:28:10) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Ignoring non-spark config property: > java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8 > 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 19/12/09 07:05:36
[jira] [Commented] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation
[ https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089037#comment-17089037 ] Dongjoon Hyun commented on SPARK-30162: --- Could you try 3.0.0-RC1, [~nasirali]? > Add PushedFilters to metadata in Parquet DSv2 implementation > > > Key: SPARK-30162 > URL: https://issues.apache.org/jira/browse/SPARK-30162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: pyspark 3.0 preview > Ubuntu/Centos > pyarrow 0.14.1 >Reporter: Nasir Ali >Assignee: Hyukjin Kwon >Priority: Minor > Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from > 2020-01-01 21-01-32.png > > > Filters are not pushed down in Spark 3.0 preview. Also the output of > "explain" method is different. It is hard to debug in 3.0 whether filters > were pushed down or not. Below code could reproduce the bug: > > {code:java} > // code placeholder > df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"), > ("usr1",13.00, "2018-03-11T12:27:18+00:00"), > ("usr1",25.00, "2018-03-12T11:27:18+00:00"), > ("usr1",20.00, "2018-03-13T15:27:18+00:00"), > ("usr1",17.00, "2018-03-14T12:27:18+00:00"), > ("usr2",99.00, "2018-03-15T11:27:18+00:00"), > ("usr2",156.00, "2018-03-22T11:27:18+00:00"), > ("usr2",17.00, "2018-03-31T11:27:18+00:00"), > ("usr2",25.00, "2018-03-15T11:27:18+00:00"), > ("usr2",25.00, "2018-03-16T11:27:18+00:00") > ], >["user","id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = > spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True) > {code} > {code:java} > // Spark 2.4 output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#40 = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan == > Filter (isnotnull(user#40) && 
(user#40 = usr2)) > +- Relation[id#38,ts#39,user#40] parquet== Physical Plan == > *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, > PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], > ReadSchema: struct{code} > {code:java} > // Spark 3.0.0-preview output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed > Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#2 = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized > Logical Plan == > Filter (isnotnull(user#2) AND (user#2 = usr2)) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical > Plan == > *(1) Project [id#0, ts#1, user#2] > +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2)) >+- *(1) ColumnarToRow > +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: > InMemoryFileIndex[file:/home/cnali/data], ReadSchema: > struct > {code} > I have tested it on much larger dataset. Spark 3.0 tries to load whole data > and then apply filter. Whereas Spark 2.4 push down the filter. Above output > shows that Spark 2.4 applied partition filter but not the Spark 3.0 preview. > > Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and > it's hard to debug. spark.sql.orc.cache.stripe.details.size=1 doesn't > work. > > {code:java} > // pyspark 3 shell output > $ pyspark > Python 3.6.8 (default, Aug 7 2019, 17:28:10) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Ignoring non-spark config property: > java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8 > 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be > overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in > mesos/standalone/kubernetes and LOCAL_DIRS in
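The plans quoted above differ in one key line: Spark 2.4's FileScan shows `PartitionFilters: [isnotnull(user#40), (user#40 = usr2)]`, while the 3.0.0-preview BatchScan shows no partition filters at all, so every partition is read. The following is a minimal stdlib-only sketch (not Spark's implementation) of what partition-filter pushdown buys: with a Hive-style layout the partition value is encoded in the directory name (`user=usr2`), so a filter on the partition column can prune whole directories before any file is opened.

```python
# Minimal sketch of Hive-style partition pruning. Not Spark code: plain
# text files stand in for Parquet, and a directory-name check stands in
# for PartitionFilters. The point is that pruned partitions are never read.
import os
import tempfile

def write_partitioned(root, rows):
    """Write (user, value) rows under Hive-style user=<value> directories."""
    for user, value in rows:
        part_dir = os.path.join(root, f"user={user}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-0.txt"), "a") as f:
            f.write(f"{value}\n")

def scan_with_pruning(root, wanted_user):
    """Apply the partition filter to directory names first, then read."""
    files_read = 0
    values = []
    for entry in os.listdir(root):
        key, _, val = entry.partition("=")
        if key == "user" and val != wanted_user:
            continue  # pruned: this partition directory is never opened
        part_dir = os.path.join(root, entry)
        for name in os.listdir(part_dir):
            files_read += 1
            with open(os.path.join(part_dir, name)) as f:
                values.extend(line.strip() for line in f)
    return values, files_read

with tempfile.TemporaryDirectory() as root:
    write_partitioned(root, [("usr1", "17.0"), ("usr1", "13.0"), ("usr2", "99.0")])
    values, files_read = scan_with_pruning(root, "usr2")
    print(values, files_read)  # ['99.0'] 1 -- only usr2's file is touched
```

Without the pruning step (the 3.0.0-preview behavior reported here), every file under every `user=` directory would be opened and the filter applied row by row afterwards.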
[jira] [Comment Edited] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation
[ https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089031#comment-17089031 ] Dongjoon Hyun edited comment on SPARK-30162 at 4/21/20, 8:15 PM: - Hi, [~sowen]. Is there a reason for you to remove the fixed version? It's fixed since `3.0.0-preview2`, isn't it? **Apache Spark 3.0.0-preview2** {code} >>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo") >>> spark.read.parquet("/tmp/foo").filter("5 > id").explain() == Physical Plan == *(1) Project [id#3L] +- *(1) Filter (isnotnull(id#3L) AND (5 > id#3L)) +- *(1) ColumnarToRow +- BatchScan[id#3L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct, PushedFilters: [IsNotNull(id), LessThan(id,5)] {code} **Apache Spark 3.0.0-RC1** {code} >>> sql("set spark.sql.sources.useV1SourceList") DataFrame[key: string, value: string] >>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo") >>> spark.read.parquet("/tmp/foo").filter("5 > id").explain() == Physical Plan == *(1) Project [id#13L] +- *(1) Filter (isnotnull(id#13L) AND (5 > id#13L)) +- *(1) ColumnarToRow +- FileScan parquet [id#13L] Batched: true, DataFilters: [isnotnull(id#13L), (5 > id#13L)], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/foo], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,5)], ReadSchema: struct {code} was (Author: dongjoon): Hi, [~sowen]. Is there a reason for you to remove the fixed version? It's fixed since `3.0.0-preview2`, isn't it? 
{code} >>> spark.version '3.0.0-preview2' >>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo") >>> spark.read.parquet("/tmp/foo").filter("5 > id").explain() == Physical Plan == *(1) Project [id#3L] +- *(1) Filter (isnotnull(id#3L) AND (5 > id#3L)) +- *(1) ColumnarToRow +- BatchScan[id#3L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct, PushedFilters: [IsNotNull(id), LessThan(id,5)] {code} > Add PushedFilters to metadata in Parquet DSv2 implementation > > > Key: SPARK-30162 > URL: https://issues.apache.org/jira/browse/SPARK-30162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: pyspark 3.0 preview > Ubuntu/Centos > pyarrow 0.14.1 >Reporter: Nasir Ali >Assignee: Hyukjin Kwon >Priority: Minor > Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from > 2020-01-01 21-01-32.png > > > Filters are not pushed down in Spark 3.0 preview. Also the output of > "explain" method is different. It is hard to debug in 3.0 whether filters > were pushed down or not. 
Below code could reproduce the bug: > > {code:java} > // code placeholder > df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"), > ("usr1",13.00, "2018-03-11T12:27:18+00:00"), > ("usr1",25.00, "2018-03-12T11:27:18+00:00"), > ("usr1",20.00, "2018-03-13T15:27:18+00:00"), > ("usr1",17.00, "2018-03-14T12:27:18+00:00"), > ("usr2",99.00, "2018-03-15T11:27:18+00:00"), > ("usr2",156.00, "2018-03-22T11:27:18+00:00"), > ("usr2",17.00, "2018-03-31T11:27:18+00:00"), > ("usr2",25.00, "2018-03-15T11:27:18+00:00"), > ("usr2",25.00, "2018-03-16T11:27:18+00:00") > ], >["user","id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = > spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True) > {code} > {code:java} > // Spark 2.4 output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#40 = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan == > Filter (isnotnull(user#40) && (user#40 = usr2)) > +- Relation[id#38,ts#39,user#40] parquet== Physical Plan == > *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, > PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], > ReadSchema: struct{code} > {code:java} > // Spark 3.0.0-preview output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed > Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#2 = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized >
[jira] [Commented] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation
[ https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089034#comment-17089034 ] Nasir Ali commented on SPARK-30162: --- [~dongjoon] This bug has not been fixed. Please have a look at my previous comment and uploaded screenshots. Yes you added logs to debug but it doesn't perform filtering on partition key yet. > Add PushedFilters to metadata in Parquet DSv2 implementation > > > Key: SPARK-30162 > URL: https://issues.apache.org/jira/browse/SPARK-30162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: pyspark 3.0 preview > Ubuntu/Centos > pyarrow 0.14.1 >Reporter: Nasir Ali >Assignee: Hyukjin Kwon >Priority: Minor > Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from > 2020-01-01 21-01-32.png > > > Filters are not pushed down in Spark 3.0 preview. Also the output of > "explain" method is different. It is hard to debug in 3.0 whether filters > were pushed down or not. 
Below code could reproduce the bug: > > {code:java} > // code placeholder > df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"), > ("usr1",13.00, "2018-03-11T12:27:18+00:00"), > ("usr1",25.00, "2018-03-12T11:27:18+00:00"), > ("usr1",20.00, "2018-03-13T15:27:18+00:00"), > ("usr1",17.00, "2018-03-14T12:27:18+00:00"), > ("usr2",99.00, "2018-03-15T11:27:18+00:00"), > ("usr2",156.00, "2018-03-22T11:27:18+00:00"), > ("usr2",17.00, "2018-03-31T11:27:18+00:00"), > ("usr2",25.00, "2018-03-15T11:27:18+00:00"), > ("usr2",25.00, "2018-03-16T11:27:18+00:00") > ], >["user","id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = > spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True) > {code} > {code:java} > // Spark 2.4 output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#40 = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan == > Filter (isnotnull(user#40) && (user#40 = usr2)) > +- Relation[id#38,ts#39,user#40] parquet== Physical Plan == > *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, > PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], > ReadSchema: struct{code} > {code:java} > // Spark 3.0.0-preview output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed > Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#2 = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized > Logical Plan == > Filter (isnotnull(user#2) AND (user#2 = usr2)) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical > Plan == > *(1) Project [id#0, ts#1, user#2] 
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2)) >+- *(1) ColumnarToRow > +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: > InMemoryFileIndex[file:/home/cnali/data], ReadSchema: > struct > {code} > I have tested it on much larger dataset. Spark 3.0 tries to load whole data > and then apply filter. Whereas Spark 2.4 push down the filter. Above output > shows that Spark 2.4 applied partition filter but not the Spark 3.0 preview. > > Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and > it's hard to debug. spark.sql.orc.cache.stripe.details.size=1 doesn't > work. > > {code:java} > // pyspark 3 shell output > $ pyspark > Python 3.6.8 (default, Aug 7 2019, 17:28:10) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Ignoring non-spark config property: > java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8 > 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 19/12/09 07:05:36 WARN SparkConf: Note that
[jira] [Comment Edited] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation
[ https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089031#comment-17089031 ] Dongjoon Hyun edited comment on SPARK-30162 at 4/21/20, 8:10 PM: - Hi, [~sowen]. Is there a reason for you to remove the fixed version? It's fixed since `3.0.0-preview2`, isn't it? {code} >>> spark.version '3.0.0-preview2' >>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo") >>> spark.read.parquet("/tmp/foo").filter("5 > id").explain() == Physical Plan == *(1) Project [id#3L] +- *(1) Filter (isnotnull(id#3L) AND (5 > id#3L)) +- *(1) ColumnarToRow +- BatchScan[id#3L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct, PushedFilters: [IsNotNull(id), LessThan(id,5)] {code} was (Author: dongjoon): Hi, [~sowen]. Is there a reason for you to remove the fixed version? {code} >>> spark.version '3.0.0' >>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo") >>> spark.read.parquet("/tmp/foo").filter("5 > id").explain() == Physical Plan == *(1) Project [id#32L] +- *(1) Filter (isnotnull(id#32L) AND (5 > id#32L)) +- *(1) ColumnarToRow +- FileScan parquet [id#32L] Batched: true, DataFilters: [isnotnull(id#32L), (5 > id#32L)], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/foo], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,5)], ReadSchema: struct {code} > Add PushedFilters to metadata in Parquet DSv2 implementation > > > Key: SPARK-30162 > URL: https://issues.apache.org/jira/browse/SPARK-30162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: pyspark 3.0 preview > Ubuntu/Centos > pyarrow 0.14.1 >Reporter: Nasir Ali >Assignee: Hyukjin Kwon >Priority: Minor > Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from > 2020-01-01 21-01-32.png > > > Filters are not pushed down in Spark 3.0 preview. Also the output of > "explain" method is different. 
It is hard to debug in 3.0 whether filters > were pushed down or not. Below code could reproduce the bug: > > {code:java} > // code placeholder > df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"), > ("usr1",13.00, "2018-03-11T12:27:18+00:00"), > ("usr1",25.00, "2018-03-12T11:27:18+00:00"), > ("usr1",20.00, "2018-03-13T15:27:18+00:00"), > ("usr1",17.00, "2018-03-14T12:27:18+00:00"), > ("usr2",99.00, "2018-03-15T11:27:18+00:00"), > ("usr2",156.00, "2018-03-22T11:27:18+00:00"), > ("usr2",17.00, "2018-03-31T11:27:18+00:00"), > ("usr2",25.00, "2018-03-15T11:27:18+00:00"), > ("usr2",25.00, "2018-03-16T11:27:18+00:00") > ], >["user","id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = > spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True) > {code} > {code:java} > // Spark 2.4 output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#40 = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan == > Filter (isnotnull(user#40) && (user#40 = usr2)) > +- Relation[id#38,ts#39,user#40] parquet== Physical Plan == > *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, > PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], > ReadSchema: struct{code} > {code:java} > // Spark 3.0.0-preview output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed > Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#2 = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized > Logical Plan == > Filter (isnotnull(user#2) AND (user#2 = usr2)) > +- RelationV2[id#0, ts#1, user#2] parquet 
file:/home/cnali/data== Physical > Plan == > *(1) Project [id#0, ts#1, user#2] > +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2)) >+- *(1) ColumnarToRow > +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: > InMemoryFileIndex[file:/home/cnali/data], ReadSchema: > struct > {code} > I have tested it on much larger dataset. Spark 3.0 tries to load whole data > and then apply filter. Whereas Spark 2.4 push down the filter. Above output > shows
[jira] [Commented] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation
[ https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089031#comment-17089031 ] Dongjoon Hyun commented on SPARK-30162: --- Hi, [~sowen]. Is there a reason for you to remove the fixed version? {code} >>> spark.version '3.0.0' >>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo") >>> spark.read.parquet("/tmp/foo").filter("5 > id").explain() == Physical Plan == *(1) Project [id#32L] +- *(1) Filter (isnotnull(id#32L) AND (5 > id#32L)) +- *(1) ColumnarToRow +- FileScan parquet [id#32L] Batched: true, DataFilters: [isnotnull(id#32L), (5 > id#32L)], Format: Parquet, Location: InMemoryFileIndex[file:/tmp/foo], PartitionFilters: [], PushedFilters: [IsNotNull(id), LessThan(id,5)], ReadSchema: struct {code} > Add PushedFilters to metadata in Parquet DSv2 implementation > > > Key: SPARK-30162 > URL: https://issues.apache.org/jira/browse/SPARK-30162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: pyspark 3.0 preview > Ubuntu/Centos > pyarrow 0.14.1 >Reporter: Nasir Ali >Assignee: Hyukjin Kwon >Priority: Minor > Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from > 2020-01-01 21-01-32.png > > > Filters are not pushed down in Spark 3.0 preview. Also the output of > "explain" method is different. It is hard to debug in 3.0 whether filters > were pushed down or not. 
Below code could reproduce the bug: > > {code:java} > // code placeholder > df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"), > ("usr1",13.00, "2018-03-11T12:27:18+00:00"), > ("usr1",25.00, "2018-03-12T11:27:18+00:00"), > ("usr1",20.00, "2018-03-13T15:27:18+00:00"), > ("usr1",17.00, "2018-03-14T12:27:18+00:00"), > ("usr2",99.00, "2018-03-15T11:27:18+00:00"), > ("usr2",156.00, "2018-03-22T11:27:18+00:00"), > ("usr2",17.00, "2018-03-31T11:27:18+00:00"), > ("usr2",25.00, "2018-03-15T11:27:18+00:00"), > ("usr2",25.00, "2018-03-16T11:27:18+00:00") > ], >["user","id", "ts"]) > df = df.withColumn('ts', df.ts.cast('timestamp')) > df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = > spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True) > {code} > {code:java} > // Spark 2.4 output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#40 = usr2) > +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan == > Filter (isnotnull(user#40) && (user#40 = usr2)) > +- Relation[id#38,ts#39,user#40] parquet== Physical Plan == > *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, > Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, > PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], > ReadSchema: struct{code} > {code:java} > // Spark 3.0.0-preview output > == Parsed Logical Plan == > 'Filter ('user = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed > Logical Plan == > id: double, ts: timestamp, user: string > Filter (user#2 = usr2) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized > Logical Plan == > Filter (isnotnull(user#2) AND (user#2 = usr2)) > +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical > Plan == > *(1) Project [id#0, ts#1, user#2] 
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2)) >+- *(1) ColumnarToRow > +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: > InMemoryFileIndex[file:/home/cnali/data], ReadSchema: > struct > {code} > I have tested it on much larger dataset. Spark 3.0 tries to load whole data > and then apply filter. Whereas Spark 2.4 push down the filter. Above output > shows that Spark 2.4 applied partition filter but not the Spark 3.0 preview. > > Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and > it's hard to debug. spark.sql.orc.cache.stripe.details.size=1 doesn't > work. > > {code:java} > // pyspark 3 shell output > $ pyspark > Python 3.6.8 (default, Aug 7 2019, 17:28:10) > [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux > Type "help", "copyright", "credits" or "license" for more information. > Warning: Ignoring non-spark config property: > java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8 >
[jira] [Created] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe
Herman van Hövell created SPARK-31511: - Summary: Make BytesToBytesMap iterator() thread-safe Key: SPARK-31511 URL: https://issues.apache.org/jira/browse/SPARK-31511 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.5, 3.0.0 Reporter: Herman van Hövell Assignee: Herman van Hövell BytesToBytesMap currently has both a thread-safe and a thread-unsafe iterator method, which is somewhat confusing. We should make iterator() thread-safe and remove the safeIterator() function. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
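BytesToBytesMap is Java/unsafe-memory code inside Spark, so the following is only an illustrative Python sketch of the pattern the ticket proposes: expose a single iterator() that is safe under concurrent access (by serializing cursor advances with a lock) rather than shipping separate "safe" and "unsafe" variants.

```python
# Illustrative sketch only -- not the BytesToBytesMap implementation.
# A lock around __next__ makes one shared iterator safe to drain from
# multiple threads: each element is handed out exactly once.
import threading

class ThreadSafeIterator:
    """Serializes next() calls on an underlying iterator with a lock."""
    def __init__(self, underlying):
        self._it = iter(underlying)
        self._lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        with self._lock:  # one consumer advances the shared cursor at a time
            return next(self._it)

# Two threads draining one iterator: no element is lost or duplicated.
shared = ThreadSafeIterator(range(10000))
seen = []
seen_lock = threading.Lock()

def drain():
    for x in shared:
        with seen_lock:
            seen.append(x)

threads = [threading.Thread(target=drain) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(seen))  # 10000
```

The cost of the lock on every next() call is why the original code offered an unsafe variant; the ticket's position is that the confusion outweighs that saving.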
[jira] [Commented] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir
[ https://issues.apache.org/jira/browse/SPARK-31170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088997#comment-17088997 ] Nicholas Chammas commented on SPARK-31170: -- Isn't this also an issue in Spark 2.4.5? {code:java} $ spark-sql ... 20/04/21 15:12:26 INFO metastore: Mestastore configuration hive.metastore.warehouse.dir changed from /user/hive/warehouse to file:/Users/myusername/spark-warehouse ... spark-sql> create table test(a int); ... 20/04/21 15:12:48 WARN HiveMetaStore: Location: file:/user/hive/warehouse/test specified for non-external table:test 20/04/21 15:12:48 INFO FileUtils: Creating directory if it doesn't exist: file:/user/hive/warehouse/test Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/test is not a directory or unable to create one);{code} If I understood the problem correctly, then the fix should perhaps also be backported to 2.4.x. > Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir > > > Key: SPARK-31170 > URL: https://issues.apache.org/jira/browse/SPARK-31170 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > In Spark CLI, we create a hive CliSessionState and it does not load the > hive-site.xml. So the configurations in hive-site.xml will not take effect > as they do in other spark-hive integration apps. > Also, the warehouse directory is not picked up correctly. If the `default` > database does not exist, the CliSessionState will create one the first > time it talks to the metastore. The `Location` of the default DB will be > neither the value of spark.sql.warehouse.dir nor the user-specified value of > hive.metastore.warehouse.dir, but the default value of > hive.metastore.warehouse.dir, which will always be `/user/hive/warehouse`.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
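Spark's documented precedence (since 2.0) is that an explicit spark.sql.warehouse.dir wins, otherwise hive.metastore.warehouse.dir from hive-site.xml applies, otherwise the Hive default /user/hive/warehouse is used; the bug above is that Spark CLI's CliSessionState skipped the first two sources. A hypothetical helper (plain dicts stand in for the real SparkConf and hive-site.xml) makes the intended lookup order concrete:

```python
# Hypothetical illustration of the expected warehouse-dir precedence.
# resolve_warehouse_dir and its dict arguments are stand-ins, not Spark APIs.
DEFAULT_WAREHOUSE = "/user/hive/warehouse"  # Hive's built-in default

def resolve_warehouse_dir(spark_conf, hive_site):
    """Return the warehouse dir per documented precedence."""
    if "spark.sql.warehouse.dir" in spark_conf:
        return spark_conf["spark.sql.warehouse.dir"]       # 1. Spark conf
    if "hive.metastore.warehouse.dir" in hive_site:
        return hive_site["hive.metastore.warehouse.dir"]   # 2. hive-site.xml
    return DEFAULT_WAREHOUSE                               # 3. Hive default

print(resolve_warehouse_dir({"spark.sql.warehouse.dir": "/data/wh"}, {}))    # /data/wh
print(resolve_warehouse_dir({}, {"hive.metastore.warehouse.dir": "/hive"}))  # /hive
print(resolve_warehouse_dir({}, {}))                                         # /user/hive/warehouse
```

The symptom in both comments matches step 3 being taken unconditionally: the default database's Location ends up at /user/hive/warehouse regardless of what the user configured.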
[jira] [Updated] (SPARK-31381) Upgrade built-in Hive 2.3.6 to 2.3.7
[ https://issues.apache.org/jira/browse/SPARK-31381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-31381: Parent: SPARK-28684 Issue Type: Sub-task (was: Improvement) > Upgrade built-in Hive 2.3.6 to 2.3.7 > > > Key: SPARK-31381 > URL: https://issues.apache.org/jira/browse/SPARK-31381 > Project: Spark > Issue Type: Sub-task > Components: Build, SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > Hive 2.3.7 fixed these issues: > HIVE-21508: ClassCastException when initializing HiveMetaStoreClient on JDK10 > or newer > HIVE-21980:Parsing time can be high in case of deeply nested subqueries > HIVE-22249: Support Parquet through HCatalog -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31503) fix the SQL string of the TRIM functions
[ https://issues.apache.org/jira/browse/SPARK-31503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31503. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28281 [https://github.com/apache/spark/pull/28281] > fix the SQL string of the TRIM functions > > > Key: SPARK-31503 > URL: https://issues.apache.org/jira/browse/SPARK-31503 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088868#comment-17088868 ] Kushal Yellam commented on SPARK-10925: --- [~aseigneurin] pls help me on above issue. > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. 
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520) > at TestCase2$.main(TestCase2.scala:51) > at TestCase2.main(TestCase2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497)
[jira] [Commented] (SPARK-31474) Consistency between dayofweek/dow in extract expression and dayofweek function
[ https://issues.apache.org/jira/browse/SPARK-31474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088820#comment-17088820 ] Wenchen Fan commented on SPARK-31474: - Thanks for catching! I've pushed to fix to branch-3.0. > Consistancy between dayofweek/dow in extract expression and dayofweek function > -- > > Key: SPARK-31474 > URL: https://issues.apache.org/jira/browse/SPARK-31474 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:sql} > spark-sql> SELECT extract(dayofweek from '2009-07-26'); > 1 > spark-sql> SELECT extract(dow from '2009-07-26'); > 0 > spark-sql> SELECT extract(isodow from '2009-07-26'); > 7 > spark-sql> SELECT dayofweek('2009-07-26'); > 1 > spark-sql> SELECT weekday('2009-07-26'); > 6 > {code} > Currently, there are 4 types of day-of-week range: > the function dayofweek(2.3.0) and extracting dayofweek(2.4.0) result as of > Sunday(1) to Saturday(7) > extracting dow(3.0.0) results as of Sunday(0) to Saturday(6) > extracting isodow (3.0.0) results as of Monday(1) to Sunday(7) > the function weekday(2.4.0) results as of Monday(0) to Sunday(6) > Actually, extracting dayofweek and dow are both derived from PostgreSQL but > have different meanings. > https://issues.apache.org/jira/browse/SPARK-23903 > https://issues.apache.org/jira/browse/SPARK-28623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
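The four numbering conventions listed in the ticket can be reproduced with the Python standard library for the ticket's example date, 2009-07-26 (a Sunday); all four are simple shifts of the same underlying value:

```python
# The four day-of-week conventions from SPARK-31474, via stdlib datetime.
from datetime import date

d = date(2009, 7, 26)  # a Sunday

weekday = d.weekday()         # Monday(0)..Sunday(6)   -> weekday('2009-07-26')
isodow = d.isoweekday()       # Monday(1)..Sunday(7)   -> extract(isodow ...)
dow = d.isoweekday() % 7      # Sunday(0)..Saturday(6) -> extract(dow ...)
dayofweek = dow + 1           # Sunday(1)..Saturday(7) -> dayofweek('2009-07-26')

print(dayofweek, dow, isodow, weekday)  # 1 0 7 6, matching the spark-sql output
```

The fix discussed here is about keeping extract(dayofweek ...) aligned with the dayofweek() function (Sunday=1 convention), while dow and isodow keep their PostgreSQL-derived meanings.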
[jira] [Comment Edited] (SPARK-31474) Consistency between dayofweek/dow in extract expression and dayofweek function
[ https://issues.apache.org/jira/browse/SPARK-31474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088820#comment-17088820 ] Wenchen Fan edited comment on SPARK-31474 at 4/21/20, 4:02 PM: --- Thanks for catching! I've pushed a fix to branch-3.0. was (Author: cloud_fan): Thanks for catching! I've pushed to fix to branch-3.0. > Consistancy between dayofweek/dow in extract expression and dayofweek function > -- > > Key: SPARK-31474 > URL: https://issues.apache.org/jira/browse/SPARK-31474 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:sql} > spark-sql> SELECT extract(dayofweek from '2009-07-26'); > 1 > spark-sql> SELECT extract(dow from '2009-07-26'); > 0 > spark-sql> SELECT extract(isodow from '2009-07-26'); > 7 > spark-sql> SELECT dayofweek('2009-07-26'); > 1 > spark-sql> SELECT weekday('2009-07-26'); > 6 > {code} > Currently, there are 4 types of day-of-week range: > the function dayofweek(2.3.0) and extracting dayofweek(2.4.0) result as of > Sunday(1) to Saturday(7) > extracting dow(3.0.0) results as of Sunday(0) to Saturday(6) > extracting isodow (3.0.0) results as of Monday(1) to Sunday(7) > the function weekday(2.4.0) results as of Monday(0) to Sunday(6) > Actually, extracting dayofweek and dow are both derived from PostgreSQL but > have different meanings. > https://issues.apache.org/jira/browse/SPARK-23903 > https://issues.apache.org/jira/browse/SPARK-28623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31510) Set setwd in R documentation build
Hyukjin Kwon created SPARK-31510: Summary: Set setwd in R documentation build Key: SPARK-31510 URL: https://issues.apache.org/jira/browse/SPARK-31510 Project: Spark Issue Type: Improvement Components: R Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Assignee: Hyukjin Kwon {code} > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) Loading required package: usethis Error: Could not find package root, is your working directory inside a package? {code} Seems like in some environments it fails as above. https://stackoverflow.com/questions/52670051/how-to-troubleshoot-error-could-not-find-package-root https://groups.google.com/forum/#!topic/rdevtools/79jjjdc_wjg -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31510) Set setwd in R documentation build
[ https://issues.apache.org/jira/browse/SPARK-31510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31510: - Priority: Trivial (was: Major) > Set setwd in R documentation build > -- > > Key: SPARK-31510 > URL: https://issues.apache.org/jira/browse/SPARK-31510 > Project: Spark > Issue Type: Improvement > Components: R >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Trivial > > {code} > > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd")) > Loading required package: usethis > Error: Could not find package root, is your working directory inside a > package? > {code} > Seems like in some environments it fails as above. > https://stackoverflow.com/questions/52670051/how-to-troubleshoot-error-could-not-find-package-root > https://groups.google.com/forum/#!topic/rdevtools/79jjjdc_wjg -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache
[ https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088785#comment-17088785 ] Rajeev Kumar edited comment on SPARK-26385 at 4/21/20, 3:31 PM: I am also facing the same issue for spark-2.4.4-bin-hadoop2.7. I am using Spark Structured Streaming with Kafka, reading the stream from Kafka and saving it to HBase. I am including the logs from my application (after removing IPs and usernames). When the application starts, it prints this log. We can see it is creating the HDFS_DELEGATION_TOKEN (token id = 6972072) {quote}20/03/17 13:24:09 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 6972072 for on ha-hdfs:20/03/17 13:24:09 INFO HadoopFSDelegationTokenProvider: getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_631280530_1, ugi=@ (auth:KERBEROS)]]20/03/17 13:24:09 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 6972073 for on ha-hdfs:20/03/17 13:24:09 INFO HadoopFSDelegationTokenProvider: Renewal interval is 86400039 for token HDFS_DELEGATION_TOKEN20/03/17 13:24:10 DEBUG HadoopDelegationTokenManager: Service hive does not require a token. Check your configuration to see if security is disabled or not.20/03/17 13:24:11 DEBUG HBaseDelegationTokenProvider: Attempting to fetch HBase security token.20/03/17 13:24:12 INFO HBaseDelegationTokenProvider: Get token from HBase: Kind: HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: (org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@d1)20/03/17 13:24:12 INFO AMCredentialRenewer: Scheduling login from keytab in 18.0 h. {quote} After 18 hours, as mentioned in the log, it created new tokens as well; the token number increased (7041621). 
{quote}20/03/18 07:24:10 INFO AMCredentialRenewer: Attempting to login to KDC using principal: 20/03/18 07:24:10 INFO AMCredentialRenewer: Successfully logged into KDC.20/03/18 07:24:16 DEBUG HadoopFSDelegationTokenProvider: Delegation token renewer is: rm/@20/03/18 07:24:16 INFO HadoopFSDelegationTokenProvider: getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-2072296893_22, ugi=@ (auth:KERBEROS)]]20/03/18 07:24:16 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 7041621 for on ha-hdfs:20/03/18 07:24:16 DEBUG HadoopDelegationTokenManager: Service hive does not require a token. Check your configuration to see if security is disabled or not.20/03/18 07:24:16 DEBUG HBaseDelegationTokenProvider: Attempting to fetch HBase security token.20/03/18 07:24:16 INFO HBaseDelegationTokenProvider: Get token from HBase: Kind: HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: (org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102)20/03/18 07:24:16 INFO AMCredentialRenewer: Scheduling login from keytab in 18.0 h.20/03/18 07:24:16 INFO AMCredentialRenewer: Updating delegation tokens.20/03/18 07:24:17 INFO SparkHadoopUtil: Updating delegation tokens for current user.20/03/18 07:24:17 DEBUG SparkHadoopUtil: Adding/updating delegation tokens List(Kind: HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: (org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102); org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102, Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:, Ident: (HDFS_DELEGATION_TOKEN token 7041621 for ); HDFS_DELEGATION_TOKEN token 7041621 for ; Renewer: yarn; Issued: 3/18/20 7:24 AM; Max Date: 3/25/20 7:24 AM)20/03/18 07:24:17 INFO SparkHadoopUtil: Updating delegation tokens for current user.20/03/18 07:24:17 DEBUG SparkHadoopUtil: Adding/updating delegation tokens List(Kind: HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: (org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102); 
org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102, Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:, Ident: (HDFS_DELEGATION_TOKEN token 7041621 for ); HDFS_DELEGATION_TOKEN token 7041621 for ; Renewer: yarn; Issued: 3/18/20 7:24 AM; Max Date: 3/25/20 7:24 AM) {quote} Everything goes fine for 24 hours. After that I see a LeaseRenewer exception, but it is picking up the older token number (6972072). This behaviour is the same even if I use "--conf spark.hadoop.fs.hdfs.impl.disable.cache=true" {quote}20/03/18 13:24:28 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 6972072 for ) is expired20/03/18 13:24:28 WARN LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_631280530_1] for 30 seconds. Will retry shortly ...org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 6972072 for ) is expired at org.apache.hadoop.ipc.Client.call(Client.java:1475) at
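One detail worth noting in the logs above: the "Scheduling login from keytab in 18.0 h" line is consistent with Spark scheduling the next keytab login at a fraction of the reported token renewal interval (the `spark.security.credentials.renewalRatio` setting, which defaults to 0.75). A quick arithmetic check of that figure, purely illustrative:

```python
# Sanity check of the "18.0 h" figure in the log above, assuming the default
# credential renewal ratio of 0.75 (spark.security.credentials.renewalRatio).
renewal_interval_ms = 86_400_039          # "Renewal interval is 86400039" (~24 h)
renewal_ratio = 0.75                      # assumed default

next_login_h = renewal_interval_ms * renewal_ratio / (1000 * 60 * 60)
print(f"{next_login_h:.1f} h")            # -> 18.0 h
```

So the renewer behaves as configured; the failure described below is about the expired token still being picked up from the cache, not about the renewal schedule itself.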
[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache
[ https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088785#comment-17088785 ] Rajeev Kumar commented on SPARK-26385: -- I am also facing the same issue for spark-2.4.4-bin-hadoop2.7. I am using Spark Structured Streaming with Kafka, reading the stream from Kafka and saving it to HBase. I am including the logs from my application (after removing IPs and usernames). When the application starts, it prints this log. We can see it is creating the HDFS_DELEGATION_TOKEN (token id = 6972072) {quote} 20/03/17 13:24:09 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 6972072 for on ha-hdfs:20/03/17 13:24:09 INFO HadoopFSDelegationTokenProvider: getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_631280530_1, ugi=@ (auth:KERBEROS)]]20/03/17 13:24:09 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 6972073 for on ha-hdfs:20/03/17 13:24:09 INFO HadoopFSDelegationTokenProvider: Renewal interval is 86400039 for token HDFS_DELEGATION_TOKEN20/03/17 13:24:10 DEBUG HadoopDelegationTokenManager: Service hive does not require a token. Check your configuration to see if security is disabled or not.20/03/17 13:24:11 DEBUG HBaseDelegationTokenProvider: Attempting to fetch HBase security token.20/03/17 13:24:12 INFO HBaseDelegationTokenProvider: Get token from HBase: Kind: HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: (org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@d1)20/03/17 13:24:12 INFO AMCredentialRenewer: Scheduling login from keytab in 18.0 h. {quote} After 18 hours, as mentioned in the log, it created new tokens as well; the token number increased (7041621). 
{quote} 20/03/18 07:24:10 INFO AMCredentialRenewer: Attempting to login to KDC using principal: 20/03/18 07:24:10 INFO AMCredentialRenewer: Successfully logged into KDC.20/03/18 07:24:16 DEBUG HadoopFSDelegationTokenProvider: Delegation token renewer is: rm/@20/03/18 07:24:16 INFO HadoopFSDelegationTokenProvider: getting token for: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-2072296893_22, ugi=@ (auth:KERBEROS)]]20/03/18 07:24:16 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 7041621 for on ha-hdfs:20/03/18 07:24:16 DEBUG HadoopDelegationTokenManager: Service hive does not require a token. Check your configuration to see if security is disabled or not.20/03/18 07:24:16 DEBUG HBaseDelegationTokenProvider: Attempting to fetch HBase security token.20/03/18 07:24:16 INFO HBaseDelegationTokenProvider: Get token from HBase: Kind: HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: (org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102)20/03/18 07:24:16 INFO AMCredentialRenewer: Scheduling login from keytab in 18.0 h.20/03/18 07:24:16 INFO AMCredentialRenewer: Updating delegation tokens.20/03/18 07:24:17 INFO SparkHadoopUtil: Updating delegation tokens for current user.20/03/18 07:24:17 DEBUG SparkHadoopUtil: Adding/updating delegation tokens List(Kind: HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: (org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102); org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102, Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:, Ident: (HDFS_DELEGATION_TOKEN token 7041621 for ); HDFS_DELEGATION_TOKEN token 7041621 for ; Renewer: yarn; Issued: 3/18/20 7:24 AM; Max Date: 3/25/20 7:24 AM)20/03/18 07:24:17 INFO SparkHadoopUtil: Updating delegation tokens for current user.20/03/18 07:24:17 DEBUG SparkHadoopUtil: Adding/updating delegation tokens List(Kind: HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: (org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102); 
org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102, Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:, Ident: (HDFS_DELEGATION_TOKEN token 7041621 for ); HDFS_DELEGATION_TOKEN token 7041621 for ; Renewer: yarn; Issued: 3/18/20 7:24 AM; Max Date: 3/25/20 7:24 AM) {quote} Everything goes fine for 24 hours. After that I see a LeaseRenewer exception, but it is picking up the older token number (6972072). This behaviour is the same even if I use "--conf spark.hadoop.fs.hdfs.impl.disable.cache=true" {quote} 20/03/18 13:24:28 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 6972072 for ) is expired20/03/18 13:24:28 WARN LeaseRenewer: Failed to renew lease for [DFSClient_NONMAPREDUCE_631280530_1] for 30 seconds. Will retry shortly ...org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (HDFS_DELEGATION_TOKEN token 6972072 for
[jira] [Resolved] (SPARK-31361) Rebase datetime in parquet/avro according to file metadata
[ https://issues.apache.org/jira/browse/SPARK-31361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31361. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28137 [https://github.com/apache/spark/pull/28137] > Rebase datetime in parquet/avro according to file metadata > -- > > Key: SPARK-31361 > URL: https://issues.apache.org/jira/browse/SPARK-31361 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088714#comment-17088714 ] Kushal Yellam edited comment on SPARK-10925 at 4/21/20, 1:58 PM: - Hi Team, I'm facing the following issue when joining two DataFrames: in operator !Join LeftOuter, (((cast(Contract_ID#7212 as bigint) = contract_id#5029L) && (record_yearmonth#9867 = record_yearmonth#5103)) && (Admin_System#7659 = admin_system#5098)). Attribute(s) with the same name appear in the operation: record_yearmonth. Please check if the right attribute(s) are used.;; was (Author: kyellam): Hi Team > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. 
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at
[jira] [Issue Comment Deleted] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kushal Yellam updated SPARK-10925: -- Comment: was deleted (was: Hi Team ) > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. 
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520) > at TestCase2$.main(TestCase2.scala:51) > at TestCase2.main(TestCase2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088715#comment-17088715 ] Kushal Yellam commented on SPARK-10925: --- Hi Team > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. 
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520) > at TestCase2$.main(TestCase2.scala:51) > at TestCase2.main(TestCase2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at
[jira] [Commented] (SPARK-10925) Exception when joining DataFrames
[ https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088714#comment-17088714 ] Kushal Yellam commented on SPARK-10925: --- Hi Team > Exception when joining DataFrames > - > > Key: SPARK-10925 > URL: https://issues.apache.org/jira/browse/SPARK-10925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 > Environment: Tested with Spark 1.5.0 and Spark 1.5.1 >Reporter: Alexis Seigneurin >Priority: Major > Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, > TestCase2.scala > > > I get an exception when joining a DataFrame with another DataFrame. The > second DataFrame was created by performing an aggregation on the first > DataFrame. > My complete workflow is: > # read the DataFrame > # apply an UDF on column "name" > # apply an UDF on column "surname" > # apply an UDF on column "birthDate" > # aggregate on "name" and re-join with the DF > # aggregate on "surname" and re-join with the DF > If I remove one step, the process completes normally. 
> Here is the exception: > {code} > Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved > attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in > operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS > birthDate_cleaned#8]; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44) > at > org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:132) > at > org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553) > at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520) > at TestCase2$.main(TestCase2.scala:51) > at TestCase2.main(TestCase2.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at
[jira] [Commented] (SPARK-31474) Consistancy between dayofweek/dow in extract expression and dayofweek function
[ https://issues.apache.org/jira/browse/SPARK-31474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088704#comment-17088704 ] Jason Darrell Lowe commented on SPARK-31474: The branch-3.0 build appears to have been broken by this commit. {noformat} [INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @ spark-catalyst_2.12 --- [INFO] Using incremental compilation using Mixed compile order [INFO] Compiler bridge file: /home/user/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.10__52.0-1.3.1_20191012T045515.jar [INFO] Compiling 304 Scala sources and 98 Java sources to /home/user/src/spark/sql/catalyst/target/scala-2.12/classes ... [ERROR] [Error] /home/user/src/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:140: not found: type Timestamp [ERROR] one error found {noformat} The patch removes the {{Timestamp}} import from datetimeExpressions.scala but there are still references to that type within the file. [~cloud_fan] was this cherry-picked to branch-3.0 without building it? 
> Consistency between dayofweek/dow in extract expression and dayofweek function > -- > > Key: SPARK-31474 > URL: https://issues.apache.org/jira/browse/SPARK-31474 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:sql} > spark-sql> SELECT extract(dayofweek from '2009-07-26'); > 1 > spark-sql> SELECT extract(dow from '2009-07-26'); > 0 > spark-sql> SELECT extract(isodow from '2009-07-26'); > 7 > spark-sql> SELECT dayofweek('2009-07-26'); > 1 > spark-sql> SELECT weekday('2009-07-26'); > 6 > {code} > Currently, there are 4 different day-of-week ranges: > the function dayofweek(2.3.0) and extracting dayofweek(2.4.0) range from > Sunday(1) to Saturday(7) > extracting dow(3.0.0) ranges from Sunday(0) to Saturday(6) > extracting isodow (3.0.0) ranges from Monday(1) to Sunday(7) > the function weekday(2.4.0) ranges from Monday(0) to Sunday(6) > Actually, extracting dayofweek and dow are both derived from PostgreSQL but > have different meanings. > https://issues.apache.org/jira/browse/SPARK-23903 > https://issues.apache.org/jira/browse/SPARK-28623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
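For reference, the four day-of-week conventions above can be reproduced with plain Python's datetime module; a minimal sketch (the variable names are illustrative, not Spark APIs):

```python
from datetime import date

d = date(2009, 7, 26)  # a Sunday, matching the examples above

weekday_fn = d.weekday()             # weekday():   Monday(0) .. Sunday(6)
isodow     = d.isoweekday()          # isodow:      Monday(1) .. Sunday(7)
dow        = d.isoweekday() % 7      # dow:         Sunday(0) .. Saturday(6)
dayofweek  = d.isoweekday() % 7 + 1  # dayofweek(): Sunday(1) .. Saturday(7)

print(dayofweek, dow, isodow, weekday_fn)  # 1 0 7 6
```

The printed values match the spark-sql session quoted above.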
[jira] [Updated] (SPARK-31509) Recommend user to disable delay scheduling for barrier taskset
[ https://issues.apache.org/jira/browse/SPARK-31509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31509: - Description: Currently, a barrier taskset can only be scheduled when all tasks are launched at the same time. As a result, delay scheduling becomes the biggest obstacle for the barrier taskset to get scheduled. And the application can easily fail with a barrier taskset when delay scheduling is on. So, probably, we should recommend users to disable delay scheduling while using barrier execution. BTW, we could still support delay scheduling with barrier taskset if we find a way to resolve SPARK-24818 was: Currently, a barrier taskset can only be scheduled when all tasks are launched at the same time. As a result, delay scheduling becomes the biggest obstacle for the barrier taskset to get scheduled. And the application can easily fail with a barrier taskset when delay scheduling is on. So, probably, we should recommend users to disable delay scheduling while using barrier execution. BTW, we could still support delay scheduling with barrier taskset if we could resolve SPARK-24818 > Recommend user to disable delay scheduling for barrier taskset > -- > > Key: SPARK-31509 > URL: https://issues.apache.org/jira/browse/SPARK-31509 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Currently, a barrier taskset can only be scheduled when all tasks are launched > at the same time. As a result, delay scheduling becomes the biggest obstacle > for the barrier taskset to get scheduled. And the application can easily > fail with a barrier taskset when delay scheduling is on. So, probably, we > should recommend users to disable delay scheduling while using barrier > execution. 
> > BTW, we could still support delay scheduling with barrier taskset if we find > a way to resolve SPARK-24818 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31509) Recommend user to disable delay scheduling for barrier taskset
[ https://issues.apache.org/jira/browse/SPARK-31509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31509: - Description: Currently, a barrier taskset can only be scheduled when all tasks are launched at the same time. As a result, delay scheduling becomes the biggest obstacle for the barrier taskset to get scheduled. And the application can easily fail with a barrier taskset when delay scheduling is on. So, probably, we should recommend users to disable delay scheduling while using barrier execution. BTW, we could still support delay scheduling with barrier taskset if we could resolve SPARK-24818 was: Currently, a barrier taskset can only be scheduled when all tasks are launched at the same time. As a result, delay scheduling becomes the biggest obstacle for the barrier taskset to get scheduled. And the application can easily fail with a barrier taskset when delay scheduling is on. So, probably, we should recommend users to disable delay scheduling. BTW, we could still support delay scheduling with barrier taskset if we could resolve SPARK-24818 > Recommend user to disable delay scheduling for barrier taskset > -- > > Key: SPARK-31509 > URL: https://issues.apache.org/jira/browse/SPARK-31509 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Currently, a barrier taskset can only be scheduled when all tasks are launched > at the same time. As a result, delay scheduling becomes the biggest obstacle > for the barrier taskset to get scheduled. And the application can easily > fail with a barrier taskset when delay scheduling is on. So, probably, we > should recommend users to disable delay scheduling while using barrier > execution. 
> > BTW, we could still support delay scheduling with barrier taskset if we could > resolve SPARK-24818 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31509) Recommend user to disable delay scheduling for barrier taskset
[ https://issues.apache.org/jira/browse/SPARK-31509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31509: - Description: Currently, a barrier taskset can only be scheduled when all tasks are launched at the same time. As a result, delay scheduling becomes the biggest obstacle for the barrier taskset to get scheduled. And the application can easily fail with a barrier taskset when delay scheduling is on. So, probably, we should recommend users to disable delay scheduling. BTW, we could still support delay scheduling with barrier taskset if we could resolve SPARK-24818 was: Currently, a barrier taskset can only be scheduled if all tasks are launched at the same time. As a result, delay scheduling becomes the biggest obstacle for the barrier taskset to get scheduled. And the application can easily fail with a barrier taskset when delay scheduling is on. So, probably, we should recommend users to disable delay scheduling. BTW, we could still support delay scheduling with barrier taskset if we could resolve SPARK-24818 > Recommend user to disable delay scheduling for barrier taskset > -- > > Key: SPARK-31509 > URL: https://issues.apache.org/jira/browse/SPARK-31509 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Currently, a barrier taskset can only be scheduled when all tasks are launched > at the same time. As a result, delay scheduling becomes the biggest obstacle for > the barrier taskset to get scheduled. And the application can easily fail > with a barrier taskset when delay scheduling is on. So, probably, we should > recommend users to disable delay scheduling. > > BTW, we could still support delay scheduling with barrier taskset if we could > resolve SPARK-24818 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31509) Recommend user to disable delay scheduling for barrier taskset
wuyi created SPARK-31509: Summary: Recommend user to disable delay scheduling for barrier taskset Key: SPARK-31509 URL: https://issues.apache.org/jira/browse/SPARK-31509 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: wuyi Currently, a barrier taskset can only be scheduled if all tasks are launched at the same time. As a result, delay scheduling becomes the biggest obstacle for the barrier taskset to get scheduled. And the application can easily fail with a barrier taskset when delay scheduling is on. So, probably, we should recommend users to disable delay scheduling. BTW, we could still support delay scheduling with barrier taskset if we could resolve SPARK-24818 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
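The all-or-nothing constraint described above can be illustrated with a toy model (hypothetical code, not Spark's actual scheduler): with delay scheduling on, non-local resource offers are initially rejected, so a barrier taskset may never see enough usable slots in a single scheduling round.

```python
def can_launch_barrier(num_tasks, offered_slots, delay_scheduling, local_slots):
    """Toy model: a barrier taskset launches only if every task can get a
    slot in the same scheduling round. With delay scheduling on, non-local
    offers are rejected at first, so only local slots count toward the total."""
    usable = local_slots if delay_scheduling else offered_slots
    return usable >= num_tasks

# A 4-task barrier stage with 4 free slots, only 2 of them node-local:
assert can_launch_barrier(4, 4, delay_scheduling=True, local_slots=2) is False
assert can_launch_barrier(4, 4, delay_scheduling=False, local_slots=2) is True
```

In this simplified model, turning delay scheduling off is exactly what lets the barrier stage gather all its slots at once.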
[jira] [Resolved] (SPARK-31504) Output fields in formatted Explain should have a deterministic order.
[ https://issues.apache.org/jira/browse/SPARK-31504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31504. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28282 [https://github.com/apache/spark/pull/28282] > Output fields in formatted Explain should have a deterministic order. > > > Key: SPARK-31504 > URL: https://issues.apache.org/jira/browse/SPARK-31504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > Currently, formatted explain uses `producedAttributes`, which is an AttributeSet, > to generate the "Output" of a leaf node. As a result, it can lead to > different output of formatted explain for the same plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
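One common way to make such output deterministic is to sort the attributes by a stable key before rendering; a hedged Python sketch of the idea (illustrative only, not the actual Spark patch):

```python
def render_output(attributes):
    """Render an "Output" field in a stable order by sorting on the numeric
    expression id, i.e. the part after '#'. Iterating the set directly would
    give an arbitrary, run-dependent order."""
    return ", ".join(sorted(attributes, key=lambda a: int(a.split("#")[1])))

# The rendered string is the same no matter how the set iterates:
assert render_output({"b#3", "a#1", "c#2"}) == "a#1, c#2, b#3"
```

Sorting by expression id (rather than by name) keeps attributes with duplicate names distinguishable while still giving one canonical order.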
[jira] [Assigned] (SPARK-31504) Output fields in formatted Explain should have a deterministic order.
[ https://issues.apache.org/jira/browse/SPARK-31504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31504: --- Assignee: wuyi > Output fields in formatted Explain should have a deterministic order. > > > Key: SPARK-31504 > URL: https://issues.apache.org/jira/browse/SPARK-31504 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > Currently, formatted explain uses `producedAttributes`, which is an AttributeSet, > to generate the "Output" of a leaf node. As a result, it can lead to > different output of formatted explain for the same plan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31474) Consistency between dayofweek/dow in extract expression and dayofweek function
[ https://issues.apache.org/jira/browse/SPARK-31474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31474. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28248 [https://github.com/apache/spark/pull/28248] > Consistency between dayofweek/dow in extract expression and dayofweek function > -- > > Key: SPARK-31474 > URL: https://issues.apache.org/jira/browse/SPARK-31474 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:sql} > spark-sql> SELECT extract(dayofweek from '2009-07-26'); > 1 > spark-sql> SELECT extract(dow from '2009-07-26'); > 0 > spark-sql> SELECT extract(isodow from '2009-07-26'); > 7 > spark-sql> SELECT dayofweek('2009-07-26'); > 1 > spark-sql> SELECT weekday('2009-07-26'); > 6 > {code} > Currently, there are 4 different day-of-week ranges: > the function dayofweek(2.3.0) and extracting dayofweek(2.4.0) range from > Sunday(1) to Saturday(7) > extracting dow(3.0.0) ranges from Sunday(0) to Saturday(6) > extracting isodow (3.0.0) ranges from Monday(1) to Sunday(7) > the function weekday(2.4.0) ranges from Monday(0) to Sunday(6) > Actually, extracting dayofweek and dow are both derived from PostgreSQL but > have different meanings. > https://issues.apache.org/jira/browse/SPARK-23903 > https://issues.apache.org/jira/browse/SPARK-28623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31474) Consistency between dayofweek/dow in extract expression and dayofweek function
[ https://issues.apache.org/jira/browse/SPARK-31474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31474: --- Assignee: Kent Yao > Consistency between dayofweek/dow in extract expression and dayofweek function > -- > > Key: SPARK-31474 > URL: https://issues.apache.org/jira/browse/SPARK-31474 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > {code:sql} > spark-sql> SELECT extract(dayofweek from '2009-07-26'); > 1 > spark-sql> SELECT extract(dow from '2009-07-26'); > 0 > spark-sql> SELECT extract(isodow from '2009-07-26'); > 7 > spark-sql> SELECT dayofweek('2009-07-26'); > 1 > spark-sql> SELECT weekday('2009-07-26'); > 6 > {code} > Currently, there are 4 different day-of-week ranges: > the function dayofweek(2.3.0) and extracting dayofweek(2.4.0) range from > Sunday(1) to Saturday(7) > extracting dow(3.0.0) ranges from Sunday(0) to Saturday(6) > extracting isodow (3.0.0) ranges from Monday(1) to Sunday(7) > the function weekday(2.4.0) ranges from Monday(0) to Sunday(6) > Actually, extracting dayofweek and dow are both derived from PostgreSQL but > have different meanings. > https://issues.apache.org/jira/browse/SPARK-23903 > https://issues.apache.org/jira/browse/SPARK-28623 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31329) Modify executor monitor to allow for moving shuffle blocks
[ https://issues.apache.org/jira/browse/SPARK-31329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088569#comment-17088569 ] Dale Richardson edited comment on SPARK-31329 at 4/21/20, 11:36 AM: Hi Holden, I've been thinking about this issue as well. Blocks can't move, but they can be replicated. Any reason we can't just replicate the blocks out, allow the existing code paths to update the block locations with the block manager master, then unregister the current blocks? We should chat and coordinate efforts. was (Author: tigerquoll): Hi Holden, I've been thinking about this issue as well. Blocks can't move, but they can be replicated. Any reason we can't just replicate the blocks out, allow the existing code paths to update the block locations with the block manager master, then unregister the current blocks? > Modify executor monitor to allow for moving shuffle blocks > -- > > Key: SPARK-31329 > URL: https://issues.apache.org/jira/browse/SPARK-31329 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > To enable Spark-20629 we need to revisit code that assumes shuffle blocks > don't move. Currently, the executor monitor assumes that shuffle blocks are > immovable. We should modify this code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31329) Modify executor monitor to allow for moving shuffle blocks
[ https://issues.apache.org/jira/browse/SPARK-31329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088569#comment-17088569 ] Dale Richardson commented on SPARK-31329: - Hi Holden, I've been thinking about this issue as well. Blocks can't move, but they can be replicated. Any reason we can't just replicate the blocks out, allow the existing code paths to update the block locations with the block manager master, then unregister the current blocks? > Modify executor monitor to allow for moving shuffle blocks > -- > > Key: SPARK-31329 > URL: https://issues.apache.org/jira/browse/SPARK-31329 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > > To enable Spark-20629 we need to revisit code that assumes shuffle blocks > don't move. Currently, the executor monitor assumes that shuffle blocks are > immovable. We should modify this code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31508) string type compare with numeric cause data inaccurate
[ https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] philipse updated SPARK-31508: - Summary: string type compare with numeric cause data inaccurate (was: string type compare with numeric case data inaccurate) > string type compare with numeric cause data inaccurate > --- > > Key: SPARK-31508 > URL: https://issues.apache.org/jira/browse/SPARK-31508 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hadoop2.7 > spark2.4.5 >Reporter: philipse >Priority: Major > > Hi all > > Spark SQL should probably convert values to double when a string type is > compared with a number type. The cases are shown below. > 1. create table > create table test1(id string); > > 2. insert data into table > insert into test1 select 'avc'; > insert into test1 select '2'; > insert into test1 select '0a'; > insert into test1 select ''; > insert into test1 select > '22'; > 3. Let's check what's happening > select * from test_gf13871.test1 where id > 0 > the results are shown below > *2* > ** > Surprisingly, the big number 222...cannot be selected, > while when I check in Hive, the 222...shows normally. > 4. Explaining the command shows what happened: if the value is bigger > than max_int_value, it will not be selected; we may need to convert to > double instead. > !image-2020-04-21-18-49-58-850.png! > I want to know whether this has been fixed or is planned for 3.0 or a later > version. Please feel free to give any advice. > > Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
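The reported filtering difference can be emulated outside Spark; a hedged sketch assuming the string side is cast to a 32-bit int (yielding NULL on overflow) versus a double. The sample values are hypothetical stand-ins for the elided ones in the report, and this is not Spark's actual cast code:

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def cast_int32(s):
    # Emulate CAST(s AS INT): NULL (None) on non-numeric input or overflow.
    try:
        v = int(s)
    except ValueError:
        return None
    return v if INT32_MIN <= v <= INT32_MAX else None

def cast_double(s):
    # Emulate CAST(s AS DOUBLE): NULL (None) on non-numeric input.
    try:
        return float(s)
    except ValueError:
        return None

rows = ["avc", "2", "0a", "", "22222222222222222222"]
gt0_int = [r for r in rows if cast_int32(r) is not None and cast_int32(r) > 0]
gt0_dbl = [r for r in rows if cast_double(r) is not None and cast_double(r) > 0]

# With an int cast the huge value overflows to NULL and is filtered out;
# with a double cast it survives the comparison, matching the Hive behavior.
```

This illustrates why widening both sides of the comparison to double keeps large string-encoded numbers selectable.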
[jira] [Created] (SPARK-31508) string type compare with numeric case data inaccurate
philipse created SPARK-31508: Summary: string type compare with numeric case data inaccurate Key: SPARK-31508 URL: https://issues.apache.org/jira/browse/SPARK-31508 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5 Environment: hadoop2.7 spark2.4.5 Reporter: philipse Hi all Spark SQL should probably convert values to double when a string type is compared with a number type. The cases are shown below. 1. create table create table test1(id string); 2. insert data into table insert into test1 select 'avc'; insert into test1 select '2'; insert into test1 select '0a'; insert into test1 select ''; insert into test1 select '22'; 3. Let's check what's happening select * from test_gf13871.test1 where id > 0 the results are shown below *2* ** Surprisingly, the big number 222...cannot be selected, while when I check in Hive, the 222...shows normally. 4. Explaining the command shows what happened: if the value is bigger than max_int_value, it will not be selected; we may need to convert to double instead. !image-2020-04-21-18-49-58-850.png! I want to know whether this has been fixed or is planned for 3.0 or a later version. Please feel free to give any advice. Many Thanks -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28367) Kafka connector infinite wait because metadata never updated
[ https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088514#comment-17088514 ] Gabor Somogyi commented on SPARK-28367: --- I've taken a look at the possibilities given by the new API in [KIP-396|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=97551484]. I've found the following problems: * Consumer properties don't match 100% with AdminClient so Consumer properties can't be used for instantiation (at first glance I think adding this to the API would be overkill) * With the new API by using AdminClient Spark loses the ability to use the assign, subscribe and subscribePattern APIs (implementing this logic would be feasible since the Kafka consumer does this on the client side as well, but it would be ugly). My main conclusion is that adding AdminClient and using Consumer in a parallel way would be super hacky. I would use either Consumer (which doesn't provide metadata-only access at the moment) or AdminClient (where it must be checked whether all existing features can be filled + how to add properties). > Kafka connector infinite wait because metadata never updated > > > Key: SPARK-28367 > URL: https://issues.apache.org/jira/browse/SPARK-28367 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0 >Reporter: Gabor Somogyi >Priority: Critical > > Spark uses an old and deprecated API named poll(long) which never returns and > stays in live lock if metadata is not updated (for instance when broker > disappears at consumer creation). > I've created a small standalone application to test it and the alternatives: > https://github.com/gaborgsomogyi/kafka-get-assignment -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31361) Rebase datetime in parquet/avro according to file metadata
[ https://issues.apache.org/jira/browse/SPARK-31361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31361: Summary: Rebase datetime in parquet/avro according to file metadata (was: Rebase datetime in parquet/avro according to file written Spark version) > Rebase datetime in parquet/avro according to file metadata > -- > > Key: SPARK-31361 > URL: https://issues.apache.org/jira/browse/SPARK-31361 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31507) Remove millennium, century, decade, millisecond, microsecond and epoch from extract function
[ https://issues.apache.org/jira/browse/SPARK-31507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-31507: - Summary: Remove millennium, century, decade, millisecond, microsecond and epoch from extract function (was: Remove millennium, century, decade, and epoch from extract function) > Remove millennium, century, decade, millisecond, microsecond and epoch from > extract function > > > Key: SPARK-31507 > URL: https://issues.apache.org/jira/browse/SPARK-31507 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > > Extracting millennium, century, decade, millisecond, microsecond and epoch > from datetime is neither ANSI standard nor quite common in modern SQL > platforms. None of the systems listed below support these except PostgreSQL. > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF > https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm > https://prestodb.io/docs/current/functions/datetime.html > https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html > https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts > https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31507) Remove millennium, century, decade, and epoch from extract function
Kent Yao created SPARK-31507: Summary: Remove millennium, century, decade, and epoch from extract function Key: SPARK-31507 URL: https://issues.apache.org/jira/browse/SPARK-31507 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao Extracting millennium, century, decade, millisecond, microsecond and epoch from datetime is neither ANSI standard nor quite common in modern SQL platforms. None of the systems listed below support these except PostgreSQL. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm https://prestodb.io/docs/current/functions/datetime.html https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
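For reference, the PostgreSQL-style semantics of the fields being removed can be sketched in Python; a hedged, illustrative translation of PostgreSQL's documented definitions, not Spark code:

```python
from datetime import datetime, timezone

ts = datetime(2009, 7, 26, tzinfo=timezone.utc)

epoch       = ts.timestamp()            # seconds since 1970-01-01 00:00:00 UTC
decade      = ts.year // 10             # 2009 -> 200
century     = (ts.year + 99) // 100     # 2009 -> 21 (year 2000 is in the 20th)
millennium  = (ts.year + 999) // 1000   # 2009 -> 3
millisecond = ts.second * 1000 + ts.microsecond // 1000
microsecond = ts.second * 10**6 + ts.microsecond
```

As the comments suggest, these fields are either trivially derivable from year/second or niche enough that PostgreSQL is the only mainstream system exposing them in extract.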
[jira] [Updated] (SPARK-31506) Sparse BLAS native optimization
[ https://issues.apache.org/jira/browse/SPARK-31506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochang Wu updated SPARK-31506: - Description: BLAS routines are highly optimized by the industry. If we take advantage of them, we can leverage hardware vendors' low-level optimizations when available. [Sparse BLAS|https://math.nist.gov/spblas] also has [native optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines] , but the current BLAS in Spark only has native optimization for Dense and a naive Java implementation for Sparse. We would like to introduce native optimization support for Sparse BLAS operations related to machine learning and produce benchmarks. was: BLAS routines are highly optimized by the industry. If we take advantage of them, we can leverage hardware vendors' low-level optimizations when available. [Sparse BLAS|https://math.nist.gov/spblas] also has [native optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines] , but the current BLAS in Spark only has native optimization for dense and a naive Java implementation for sparse. We would like to introduce native optimization support for Sparse BLAS operations related to machine learning and produce benchmarks. > Sparse BLAS native optimization > --- > > Key: SPARK-31506 > URL: https://issues.apache.org/jira/browse/SPARK-31506 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Xiaochang Wu >Priority: Major > > BLAS routines are highly optimized by the industry. If we take advantage of them, > we can leverage hardware vendors' low-level optimizations when available. > [Sparse BLAS|https://math.nist.gov/spblas] also has [native > optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines] > , but the current BLAS in Spark only has native optimization for Dense and a naive > Java implementation for Sparse. 
> We would like to introduce native optimization support for Sparse BLAS > operations related to machine learning and produce benchmarks. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
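For context, the core operation such native libraries accelerate is sparse matrix–vector multiply; a plain-Python CSR sketch of what a Sparse BLAS level-2 routine computes (illustrative only — real implementations use vectorized native kernels):

```python
def csr_matvec(data, indices, indptr, x):
    """Compute y = A @ x for a CSR matrix: `data` holds the nonzero values,
    `indices` their column ids, and indptr[i]:indptr[i+1] spans row i."""
    y = []
    for row in range(len(indptr) - 1):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]] stored in CSR form:
data, indices, indptr = [1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3]
assert csr_matvec(data, indices, indptr, [1.0, 1.0, 1.0]) == [3.0, 3.0]
```

The CSR layout skips all zero entries, which is exactly where native Sparse BLAS implementations gain over dense code.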
[jira] [Updated] (SPARK-31506) Sparse BLAS native optimization
[ https://issues.apache.org/jira/browse/SPARK-31506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochang Wu updated SPARK-31506: - Description: BLAS routines are highly optimized by the industry. If we take advantage of them, we can leverage hardware vendors' low-level optimizations when available. [Sparse BLAS|https://math.nist.gov/spblas] also has [native optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines] , but the current BLAS in Spark only has native optimization for dense and a naive Java implementation for sparse. We would like to introduce native optimization support for Sparse BLAS operations related to machine learning and produce benchmarks. was: BLAS routines are highly optimized by the industry. If we take advantage of them, we can leverage hardware vendors' low-level optimizations when available. [Sparse BLAS|https://math.nist.gov/spblas]also has [native optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines] , but the current BLAS in Spark only has native optimization for dense and a naive Java implementation for sparse. We would like to introduce native optimization support for Sparse BLAS operations related to machine learning and produce benchmarks. > Sparse BLAS native optimization > --- > > Key: SPARK-31506 > URL: https://issues.apache.org/jira/browse/SPARK-31506 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Xiaochang Wu >Priority: Major > > BLAS routines are highly optimized by the industry. If we take advantage of them, > we can leverage hardware vendors' low-level optimizations when available. > [Sparse BLAS|https://math.nist.gov/spblas] also has [native > optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines] > , but the current BLAS in Spark only has native optimization for dense and a naive > Java implementation for sparse. 
> We would like to introduce native optimization support for Sparse BLAS > operations related to machine learning and produce benchmarks. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31506) Sparse BLAS native optimization
[ https://issues.apache.org/jira/browse/SPARK-31506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaochang Wu updated SPARK-31506: - Description: BLAS routines are highly optimized by the industry. If we take advantage of them, we can leverage hardware vendors' low-level optimizations when available. [Sparse BLAS|https://math.nist.gov/spblas]also has [native optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines] , but the current BLAS in Spark only has native optimization for dense and a naive Java implementation for sparse. We would like to introduce native optimization support for Sparse BLAS operations related to machine learning and produce benchmarks. was: BLAS routines are highly optimized by the industry. If we take advantage of them, we can leverage hardware vendors' low-level optimizations when available. [Sparse BLAS|[https://math.nist.gov/spblas/]] [link title|http://example.com]also has [native optimization|[https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]] , but the current BLAS in Spark only has native optimization for dense and a naive Java implementation for sparse. We would like to introduce native optimization support for Sparse BLAS operations related to machine learning and produce benchmarks. > Sparse BLAS native optimization > --- > > Key: SPARK-31506 > URL: https://issues.apache.org/jira/browse/SPARK-31506 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Xiaochang Wu >Priority: Major > > BLAS routines are highly optimized by the industry. If we take advantage of them, > we can leverage hardware vendors' low-level optimizations when available. 
> [Sparse BLAS|https://math.nist.gov/spblas] also has [native > optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines], > but the current BLAS in Spark only has native optimization for dense operations > and a naive Java implementation for sparse ones. > We would like to introduce native optimization support for Sparse BLAS > operations related to machine learning and produce benchmarks.
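As background for the proposal above, the kernel in question looks roughly like the following sketch. It is illustrative only (plain Python, not Spark's actual Java code): a matrix-vector product over a CSR-encoded sparse matrix, i.e. the kind of element-by-element loop a naive implementation runs and a native Sparse BLAS routine (e.g. Intel MKL's `mkl_sparse_d_mv`) can vectorize and parallelize.

```python
# Minimal sketch of a sparse matrix-vector product over a CSR matrix.
# This is the workload a "naive" loop implements one element at a time.

def csr_gemv(values, col_indices, row_ptr, x):
    """Return y = A @ x for a CSR matrix A given by (values, col_indices, row_ptr)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Only the stored (nonzero) entries of row i are visited.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_indices[k]]
        y[i] = acc
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]]  encoded in CSR form:
values = [1.0, 2.0, 3.0]
col_indices = [0, 2, 1]
row_ptr = [0, 2, 3]
print(csr_gemv(values, col_indices, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0]
```

A native Sparse BLAS backend would replace exactly this inner loop with a vendor-tuned routine while keeping the CSR arrays unchanged.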
[jira] [Created] (SPARK-31506) Sparse BLAS native optimization
Xiaochang Wu created SPARK-31506: Summary: Sparse BLAS native optimization Key: SPARK-31506 URL: https://issues.apache.org/jira/browse/SPARK-31506 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.1.0 Reporter: Xiaochang Wu BLAS routines are highly optimized by the industry. If we take advantage of them, we can leverage hardware vendors' low-level optimizations when available. [Sparse BLAS|https://math.nist.gov/spblas/] also has [native optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines], but the current BLAS in Spark only has a naive Java implementation for it. We would like to introduce native optimization support for Sparse BLAS operations related to machine learning and produce benchmarks.
[jira] [Created] (SPARK-31504) Output fields in formatted Explain should have determined order.
wuyi created SPARK-31504: Summary: Output fields in formatted Explain should have determined order. Key: SPARK-31504 URL: https://issues.apache.org/jira/browse/SPARK-31504 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: wuyi Currently, formatted explain uses `producedAttributes`, which is an AttributeSet, to generate the "Output" of a leaf node. As a result, it can produce different formatted explain output for the same plan.
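The nondeterminism described above is easy to reproduce outside Spark. The sketch below is illustrative Python, not Spark code (the attribute names are made up): it shows why formatting fields straight from a set is unstable, and how sorting by a stable key before formatting makes the text deterministic.

```python
# Formatting output fields directly from a set gives no guaranteed order,
# so the same plan can print differently from run to run.

def format_output_unstable(attributes):
    # Iteration order of a set is an implementation detail; do not rely on it.
    return "Output: [" + ", ".join(attributes) + "]"

def format_output_stable(attributes):
    # Sort by a stable key (here, the attribute name) before joining.
    return "Output: [" + ", ".join(sorted(attributes)) + "]"

attrs = {"v#32", "incdata_id#31"}
print(format_output_stable(attrs))  # Output: [incdata_id#31, v#32]
```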
[jira] [Created] (SPARK-31505) Single precision floating point support in machine learning
Xiaochang Wu created SPARK-31505: Summary: Single precision floating point support in machine learning Key: SPARK-31505 URL: https://issues.apache.org/jira/browse/SPARK-31505 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.1.0 Reporter: Xiaochang Wu MLlib assumes "double" as the data type in all its BLAS and algorithm code. In some situations there is no need for such high precision, and end users may want to trade precision for performance, given that machines have much more single-precision FP bandwidth than double-precision. This may require templating all FP data types and rewriting all existing code. We would like to fast-prototype some key algorithms and produce benchmarks using float instead of double. Feel free to comment on this if there is a need!
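The precision side of the trade-off can be seen with a small sketch (plain Python, independent of MLlib): rounding a double through IEEE-754 single precision shows which bits a float-based algorithm would give up.

```python
# Round-trip a double through single precision to see the digits that
# would be lost if an algorithm switched from double to float.
import struct

def to_float32(x):
    """Return x rounded to the nearest IEEE-754 single-precision value."""
    return struct.unpack('f', struct.pack('f', x))[0]

x = 1.0 + 2**-24          # representable in double, below float32 resolution near 1.0
print(x == 1.0)           # False: a double's 53-bit significand keeps the 2**-24 term
print(to_float32(x) == 1.0)  # True: float32's 24-bit significand rounds it away
```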
[jira] [Created] (SPARK-31503) fix the SQL string of the TRIM functions
Wenchen Fan created SPARK-31503: --- Summary: fix the SQL string of the TRIM functions Key: SPARK-31503 URL: https://issues.apache.org/jira/browse/SPARK-31503 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Comment Edited] (SPARK-29274) Should not coerce decimal type to double type when it's join column
[ https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088337#comment-17088337 ] jinwensc edited comment on SPARK-29274 at 4/21/20, 6:29 AM: When a bigdecimal is compared to a string, the string should be promoted to bigdecimal. This bug is fixed by [https://github.com/apache/spark/pull/28278] was (Author: fasheng): When a bigdecimal is compared to a string, the string should be promoted to bigdecimal. > Should not coerce decimal type to double type when it's join column > --- > > Key: SPARK-29274 > URL: https://issues.apache.org/jira/browse/SPARK-29274 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4, 3.0.0 >Reporter: Yuming Wang >Assignee: Pengfei Chang >Priority: Major > Attachments: image-2019-09-27-20-20-24-238.png > > > How to reproduce this issue: > {code:sql} > create table t1 (incdata_id decimal(21,0), v string) using parquet; > create table t2 (incdata_id string, v string) using parquet; > explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id); > == Physical Plan == > *(5) SortMergeJoin > [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as > double)))], > [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as > double)))], Inner > :- *(2) Sort > [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as > double))) ASC NULLS FIRST], false, 0 > : +- Exchange > hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 > as double))), 200), true, [id=#104] > : +- *(1) Filter isnotnull(incdata_id#31) > :+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > [incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB) > +- *(4) Sort > [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as > double))) ASC NULLS FIRST], false, 0 >+- Exchange > 
hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 > as double))), 200), true, [id=#112] > +- *(3) Filter isnotnull(incdata_id#33) > +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation > `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, > [incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB) > {code} > {code:sql} > select cast(v1 as double) as v3, cast(v2 as double) as v4, > cast(v1 as double) = cast(v2 as double), v1 = v2 > from (select cast('1001636981212' as decimal(21, 0)) as v1, > cast('1001636981213' as decimal(21, 0)) as v2) t; > 1.00163697E20 1.00163697E20 true false > {code} > > It's a real case in our production: > !image-2019-09-27-20-20-24-238.png|width=100%!
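The failure mode in this report can be reproduced without Spark. In the sketch below (plain Python; the 21-digit values are illustrative, not the ones from the report), two distinct decimal(21,0)-style values compare equal once coerced to double, which is exactly why coercing the join key to double is unsafe.

```python
# Two distinct 21-digit decimals collide when cast to double, so a join
# keyed on the double values would wrongly match the rows.
from decimal import Decimal

a = Decimal("100163698121200000001")  # 21 digits
b = Decimal("100163698121200000002")  # differs only in the last digit

print(a == b)                # False: decimal comparison is exact
print(float(a) == float(b))  # True: a double carries only ~15-17 significant
                             # digits, so both values round to the same double
```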