[jira] [Commented] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir

2020-04-21 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089274#comment-17089274
 ] 

Wenchen Fan commented on SPARK-31170:
-

Yea, it's a long-standing bug, but the fix is non-trivial and seems a bit risky 
for 2.4. cc [~dongjoon] [~holdenkarau]

> Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir
> 
>
> Key: SPARK-31170
> URL: https://issues.apache.org/jira/browse/SPARK-31170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> In the Spark CLI, we create a Hive CliSessionState that does not load 
> hive-site.xml. So the configurations in hive-site.xml do not take effect 
> the way they do in other Spark-Hive integration apps.
> Also, the warehouse directory is not picked up correctly. If the `default` 
> database does not exist, the CliSessionState will create one the first 
> time it talks to the metastore. The `Location` of the default DB will be 
> neither the value of spark.sql.warehouse.dir nor the user-specified value of 
> hive.metastore.warehouse.dir, but the default value of 
> hive.metastore.warehouse.dir, which is always `/user/hive/warehouse`.
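> Not from the original description, but a minimal sketch (assuming a Hive-enabled 
> build) of how to check which location the default database ends up with; in the 
> affected spark-sql CLI sessions the Location can remain the Hive default 
> /user/hive/warehouse rather than the configured value:
> {code}
> // Sketch only: check the warehouse config and the default database location.
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .config("spark.sql.warehouse.dir", "/tmp/my-warehouse")  // value that should win
>   .enableHiveSupport()
>   .getOrCreate()
>
> println(spark.conf.get("spark.sql.warehouse.dir"))
> spark.sql("DESCRIBE DATABASE default").show(truncate = false)
> {code}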



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31515) Canonicalize Cast should consider the value of needTimeZone

2020-04-21 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-31515:
---

 Summary: Canonicalize Cast should consider the value of 
needTimeZone
 Key: SPARK-31515
 URL: https://issues.apache.org/jira/browse/SPARK-31515
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
 Environment: If we don't need to use `timeZone` information when casting the 
`from` type to the `to` type, then the timeZoneId should not influence the 
canonicalized result.
Reporter: Yuanjian Li
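A minimal sketch of the expected behaviour described above (Catalyst internals, 
API names assumed from branch-3.0): a cast that needs no time zone should 
canonicalize to the same expression regardless of its timeZoneId.
{code}
// Sketch only, against internal Catalyst classes.
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.StringType

// Casting int -> string needs no time zone information.
val a = Cast(Literal(1), StringType, Some("UTC"))
val b = Cast(Literal(1), StringType, Some("America/Los_Angeles"))

// Expected after the fix: true. Before the fix the differing timeZoneId
// can make the canonicalized forms differ.
println(a.canonicalized == b.canonicalized)
{code}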






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache

2020-04-21 Thread Zhou Jiashuai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089245#comment-17089245
 ] 

Zhou Jiashuai commented on SPARK-26385:
---

I also use Spark Structured Streaming with Kafka. After I upgraded Spark to 
spark-2.4.4-bin-hadoop2.7, I started hitting this problem. Before that, I used 
SPARK2-2.3.0.cloudera3-1.cdh5.13.3.p0.458809 and there was no problem.

My problem is similar to [~rajeevkumar]'s.

With spark-2.4.4-bin-hadoop2.7 (run in yarn cluster mode), new tokens are 
created after the application has been running for 18 hours.

However, after 24 hours, the application throws an exception and stops. The 
following log shows the first error entry: when writing to the HDFS checkpoint 
location, it picks the older token (with sequenceNumber=11994) rather than the 
new token (with sequenceNumber=12010).
{code:java}
20/04/16 23:21:36 ERROR Utils: Aborting task
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (token for username: HDFS_DELEGATION_TOKEN owner=usern...@bdp.com, 
renewer=yarn, realUser=, issueDate=1586962952299, maxDate=1587567752299, 
sequenceNumber=11994, masterKeyId=484) can't be found in cache
at org.apache.hadoop.ipc.Client.call(Client.java:1475)
at org.apache.hadoop.ipc.Client.call(Client.java:1412)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy21.create(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:296)
at sun.reflect.GeneratedMethodAccessor65.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy22.create(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:1648)
at org.apache.hadoop.hdfs.DFSClient.primitiveCreate(DFSClient.java:1750)
at org.apache.hadoop.fs.Hdfs.createInternal(Hdfs.java:102)
at org.apache.hadoop.fs.Hdfs.createInternal(Hdfs.java:58)
at 
org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:584)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:686)
at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:682)
at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
at org.apache.hadoop.fs.FileContext.create(FileContext.java:688)
at 
org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createTempFile(CheckpointFileManager.scala:311)
at 
org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.(CheckpointFileManager.scala:133)
at 
org.apache.spark.sql.execution.streaming.CheckpointFileManager$RenameBasedFSDataOutputStream.(CheckpointFileManager.scala:136)
at 
org.apache.spark.sql.execution.streaming.FileContextBasedCheckpointFileManager.createAtomic(CheckpointFileManager.scala:318)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.deltaFileStream$lzycompute(HDFSBackedStateStoreProvider.scala:95)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.deltaFileStream(HDFSBackedStateStoreProvider.scala:95)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.compressedStream$lzycompute(HDFSBackedStateStoreProvider.scala:96)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.compressedStream(HDFSBackedStateStoreProvider.scala:96)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.put(HDFSBackedStateStoreProvider.scala:109)
at 
org.apache.spark.sql.execution.streaming.state.StreamingAggregationStateManagerImplV1.put(StreamingAggregationStateManager.scala:121)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3$$anon$2.getNext(statefulOperators.scala:381)
at 
org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3$$anon$2.getNext(statefulOperators.scala:370)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.agg_doAggregateWithKeys_0$(Unknown
 Source)
at 

[jira] [Updated] (SPARK-31514) Kerberos: Spark UGI credentials are not getting passed down to Hive

2020-04-21 Thread Sanchay Javeria (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanchay Javeria updated SPARK-31514:

Description: 
I'm using Spark 2.4 on a Kerberos-enabled cluster, where I'm trying to run 
a query via the {{spark-sql}} shell.

The simplified setup basically looks like this: spark-sql shell running on one 
host in a Yarn cluster -> external hive-metastore running on one host -> S3 to 
store table data.

When I launch the {{spark-sql}} shell with DEBUG logging enabled, this is what 
I see in the logs:
{code:java}
> bin/spark-sql --proxy-user proxy_user 

...
DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for proxy_user 
against hive/_h...@realm.com at thrift://hive-metastore:9083 

DEBUG UserGroupInformation: PrivilegedAction as:spark/spark_h...@realm.com 
(auth:KERBEROS) 
from:org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130){code}
This means that Spark made a call to fetch the delegation token from the Hive 
metastore and then added it to the list of credentials for the UGI. [This is 
the piece of 
code|https://github.com/apache/spark/blob/branch-2.4/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L129]
 that does that. I also verified in the metastore logs that the 
{{get_delegation_token()}} call was being made.

Now when I run a simple query like {{create table test_table (id int) location 
"s3://some/prefix";}} I get hit with an AWS credentials error. I modified the 
Hive metastore code and added the following right before the Hadoop file system 
is initialized ([org/apache/hadoop/hive/metastore/Warehouse.java|#L116]):
{code:java}
public static FileSystem getFs(Path f, Configuration conf) throws MetaException {
    try {
      UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
      LOG.info("UGI information: " + ugi);
      Collection<Token<? extends TokenIdentifier>> tokens =
          ugi.getCredentials().getAllTokens();
      for (Token<? extends TokenIdentifier> token : tokens) {
        LOG.info(token);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
    ...
{code}
In the metastore logs, this does print the correct UGI information:
{code:java}
UGI information: proxy_user (auth:PROXY) via hive/hive-metast...@realm.com 
(auth:KERBEROS){code}
but there are no tokens present in the UGI. Looks like [Spark 
code|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala#L101]
 adds it with the alias {{hive.server2.delegation.token}} but I don't see it in 
the UGI. This makes me suspect that somehow the UGI scope is isolated and not 
being shared between spark-sql and hive metastore. How do I go about solving 
this? Any help will be really appreciated!
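For comparison on the Spark side, a small sketch (not part of the original report, 
standard Hadoop APIs only) that lists the delegation tokens visible to the current 
UGI, e.g. from spark-shell on the same host:
{code}
// Sketch only: dump the tokens held by the current UGI on the Spark side,
// to compare with what the metastore logs above show.
import org.apache.hadoop.security.UserGroupInformation
import scala.collection.JavaConverters._

val ugi = UserGroupInformation.getCurrentUser
println(s"UGI information: $ugi")
ugi.getCredentials.getAllTokens.asScala.foreach { token =>
  println(s"${token.getKind} -> service=${token.getService}")
}
{code}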

  was:
I'm using Spark-2.4, I have a Kerberos enabled cluster where I'm trying to run 
a query via the {{spark-sql}} shell.

The simplified setup basically looks like this: spark-sql shell running on one 
host in a Yarn cluster -> external hive-metastore running one host -> S3 to 
store table data.

When I launch the {{spark-sql}} shell with DEBUG logging enabled, this is what 
I see in the logs:
{code:java}
> bin/spark-sql --proxy-user proxy_user 

...
DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for proxy_user 
against hive/_h...@realm.com at thrift://hive-metastore:9083 

DEBUG UserGroupInformation: PrivilegedAction as:spark/spark_h...@realm.com 
(auth:KERBEROS) 
from:org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130){code}
This means that Spark made a call to fetch the delegation token from the Hive 
metastore and then added it to the list of credentials for the UGI. [This is 
the piece of 
code|https://github.com/apache/spark/blob/branch-2.4/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L129]
 that does that. I also verified in the metastore logs that the 
{{get_delegation_token()}} call was being made.

Now when I run a simple query like {{create table test_table (id int) location 
"s3://some/prefix";}} I get hit with an AWS credentials error. I modified the 
hive metastore code and added this right before the file system in Hadoop is 
initialized ([org/apache/hadoop/hive/metastore/Warehouse.java|#L116]]):
{code:java}
 public static FileSystem getFs(Path f, Configuration conf) throws 
MetaException {
try {
  UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
  LOG.info("UGI information: " + ugi);
  Collection> tokens = 
ugi.getCredentials().getAllTokens();
  for(Token token : tokens) {
LOG.info(token);
  }
} catch (IOException e) {
  e.printStackTrace();
}
...
{code}
In the metastore logs, this does print the correct UGI information:
{code:java}
UGI information: proxy_user 

[jira] [Created] (SPARK-31514) Kerberos: Spark UGI credentials are not getting passed down to Hive

2020-04-21 Thread Sanchay Javeria (Jira)
Sanchay Javeria created SPARK-31514:
---

 Summary: Kerberos: Spark UGI credentials are not getting passed 
down to Hive
 Key: SPARK-31514
 URL: https://issues.apache.org/jira/browse/SPARK-31514
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.4.4
Reporter: Sanchay Javeria


I'm using Spark 2.4 on a Kerberos-enabled cluster, where I'm trying to run 
a query via the {{spark-sql}} shell.

The simplified setup basically looks like this: spark-sql shell running on one 
host in a Yarn cluster -> external hive-metastore running on one host -> S3 to 
store table data.

When I launch the {{spark-sql}} shell with DEBUG logging enabled, this is what 
I see in the logs:
{code:java}
> bin/spark-sql --proxy-user proxy_user 

...
DEBUG HiveDelegationTokenProvider: Getting Hive delegation token for proxy_user 
against hive/_h...@realm.com at thrift://hive-metastore:9083 

DEBUG UserGroupInformation: PrivilegedAction as:spark/spark_h...@realm.com 
(auth:KERBEROS) 
from:org.apache.spark.deploy.security.HiveDelegationTokenProvider.doAsRealUser(HiveDelegationTokenProvider.scala:130){code}
This means that Spark made a call to fetch the delegation token from the Hive 
metastore and then added it to the list of credentials for the UGI. [This is 
the piece of 
code|https://github.com/apache/spark/blob/branch-2.4/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L129]
 that does that. I also verified in the metastore logs that the 
{{get_delegation_token()}} call was being made.

Now when I run a simple query like {{create table test_table (id int) location 
"s3://some/prefix";}} I get hit with an AWS credentials error. I modified the 
Hive metastore code and added the following right before the Hadoop file system 
is initialized ([org/apache/hadoop/hive/metastore/Warehouse.java|#L116]):
{code:java}
public static FileSystem getFs(Path f, Configuration conf) throws MetaException {
    try {
      UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
      LOG.info("UGI information: " + ugi);
      Collection<Token<? extends TokenIdentifier>> tokens =
          ugi.getCredentials().getAllTokens();
      for (Token<? extends TokenIdentifier> token : tokens) {
        LOG.info(token);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
    ...
{code}
In the metastore logs, this does print the correct UGI information:
{code:java}
UGI information: proxy_user (auth:PROXY) via hive/hive-metast...@realm.com 
(auth:KERBEROS){code}
but there are no tokens present in the UGI. Looks like [Spark 
code|https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/deploy/security/HiveDelegationTokenProvider.scala#L101]
 adds it with the alias {{hive.server2.delegation.token}} but I don't see it in 
the UGI. This makes me suspect that somehow the UGI scope is isolated and not 
being shared between spark-sql and hive metastore. How do I go about solving 
this? Any help will be really appreciated!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31509) Recommend user to disable delay scheduling for barrier taskset

2020-04-21 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-31509:
-
Affects Version/s: (was: 3.1.0)
   2.4.0

> Recommend user to disable delay scheduling for barrier taskset
> --
>
> Key: SPARK-31509
> URL: https://issues.apache.org/jira/browse/SPARK-31509
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: wuyi
>Priority: Major
>
> Currently, a barrier taskset can only be scheduled when all of its tasks are 
> launched at the same time. As a result, delay scheduling becomes the biggest 
> obstacle for a barrier taskset to get scheduled, and an application can easily 
> fail with a barrier taskset when delay scheduling is on. So we should probably 
> recommend that users disable delay scheduling while using barrier execution.
>  
> BTW, we could still support delay scheduling with barrier tasksets if we find 
> a way to resolve SPARK-24818.
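> As a rough illustration (not from the original description), disabling delay 
> scheduling for a barrier job comes down to setting spark.locality.wait to 0:
> {code}
> // Sketch only: a barrier stage with delay scheduling effectively disabled.
> // Needs at least 4 concurrent task slots for the 4 barrier tasks.
> import org.apache.spark.BarrierTaskContext
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .appName("barrier-example")
>   .config("spark.locality.wait", "0")  // turn off delay scheduling
>   .getOrCreate()
>
> spark.sparkContext
>   .parallelize(1 to 4, 4)
>   .barrier()
>   .mapPartitions { iter =>
>     BarrierTaskContext.get().barrier()  // all tasks must launch together
>     iter
>   }
>   .collect()
> {code}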



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31507) Remove millennium, century, decade, millisecond, microsecond and epoch from extract function

2020-04-21 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31507:

Description: 
Extracting millennium, century, decade, millisecond, microsecond and epoch from 
a datetime is neither ANSI standard nor particularly common in modern SQL 
platforms. None of the systems listed below support these except PostgreSQL; a 
sketch of the affected syntax follows the links.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm

https://prestodb.io/docs/current/functions/datetime.html

https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html

https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts

https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT

https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/SIkE2wnHyQBnU4AGWRZSRw
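A short sketch of the affected syntax (not part of the original description; 
assumes an active {{spark}} session on a build where these fields still parse):
{code}
// Sketch only: the extract fields proposed for removal...
spark.sql("SELECT EXTRACT(EPOCH FROM TIMESTAMP '2020-04-21 00:00:00')").show()
spark.sql("SELECT EXTRACT(MILLENNIUM FROM DATE '2020-04-21')").show()

// ...while the ANSI fields (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND, ...) stay.
spark.sql("SELECT EXTRACT(YEAR FROM DATE '2020-04-21')").show()
{code}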


  was:
Extracting millennium, century, decade, millisecond, microsecond and epoch from 
datetime is neither ANSI standard nor quite common in modern SQL platforms. 
None of the systems listing below support these except PostgreSQL.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm

https://prestodb.io/docs/current/functions/datetime.html

https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html

https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts

https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT



> Remove millennium, century, decade, millisecond, microsecond and epoch from 
> extract function
> 
>
> Key: SPARK-31507
> URL: https://issues.apache.org/jira/browse/SPARK-31507
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> Extracting millennium, century, decade, millisecond, microsecond and epoch 
> from a datetime is neither ANSI standard nor particularly common in modern SQL 
> platforms. None of the systems listed below support these except PostgreSQL.
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm
> https://prestodb.io/docs/current/functions/datetime.html
> https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html
> https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts
> https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
> https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/SIkE2wnHyQBnU4AGWRZSRw



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe

2020-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31511.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28286
[https://github.com/apache/spark/pull/28286]

> Make BytesToBytesMap iterator() thread-safe
> ---
>
> Key: SPARK-31511
> URL: https://issues.apache.org/jira/browse/SPARK-31511
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.1.0
>
>
> BytesToBytesMap currently has both a thread-safe and an unsafe iterator 
> method, which is somewhat confusing. We should make iterator() thread-safe and 
> remove the safeIterator() function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2020-04-21 Thread Marcelo Masiero Vanzin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089190#comment-17089190
 ] 

Marcelo Masiero Vanzin commented on SPARK-1537:
---

Well, the only thing to start with is the existing SHS code. 
EventLoggingListener + FsHistoryProvider.

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Masiero Vanzin
>Priority: Major
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation

2020-04-21 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089179#comment-17089179
 ] 

Erik Krogen commented on SPARK-31418:
-

PR was posted by [~vsowrirajan] here: https://github.com/apache/spark/pull/28287

> Blacklisting feature aborts Spark job without retrying for max num retries in 
> case of Dynamic allocation
> 
>
> Key: SPARK-31418
> URL: https://issues.apache.org/jira/browse/SPARK-31418
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.5
>Reporter: Venkata krishnan Sowrirajan
>Priority: Major
>
> With Spark blacklisting, if a task fails on an executor, the executor gets 
> blacklisted for the task. In order to retry the task, Spark checks whether 
> there is an idle blacklisted executor that can be killed and replaced to retry 
> the task; if not, it aborts the job without doing the max retries.
> In the context of dynamic allocation this could be handled better: instead of 
> killing an idle blacklisted executor (it is possible there are no idle 
> blacklisted executors), request an additional executor and retry the task.
> This can easily be reproduced with a simple job like the one below (the example 
> is expected to fail eventually; it just shows that the task is not retried 
> spark.task.maxFailures times):
> {code:java}
> def test(a: Int) = { a.asInstanceOf[String] }
> sc.parallelize(1 to 10, 10).map(x => test(x)).collect 
> {code}
> with dynamic allocation enabled and min executors set to 1. But there are 
> various other cases where this can fail as well.
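> A hedged configuration sketch (not part of the original report) of the setup 
> described above, using standard Spark 2.4 config keys:
> {code}
> // Sketch only: blacklisting plus dynamic allocation with a small minimum,
> // the combination under which the early abort is observed.
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder()
>   .config("spark.blacklist.enabled", "true")
>   .config("spark.dynamicAllocation.enabled", "true")
>   .config("spark.dynamicAllocation.minExecutors", "1")
>   .config("spark.shuffle.service.enabled", "true")  // needed for dynamic allocation on YARN
>   .config("spark.task.maxFailures", "4")            // retries expected before aborting
>   .getOrCreate()
> {code}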



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31349) Document builtin aggregate function

2020-04-21 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31349.
--
Fix Version/s: 3.0.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Fixed in SPARK-31429

> Document builtin aggregate function
> ---
>
> Key: SPARK-31349
> URL: https://issues.apache.org/jira/browse/SPARK-31349
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: kevin yu
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2020-04-21 Thread Daniel Templeton (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089159#comment-17089159
 ] 

Daniel Templeton commented on SPARK-1537:
-

Thanks for the response, [~vanzin].  Yeah, I think we would be interested in 
exploring the work involved to do the integration.  We're in the process of 
introducing Spark into Hadoop clusters that primarily run Scalding today.  
We're using ATSv2 as the store for all of the Scalding metrics, so it would 
make sense for us to do the same with Spark.

Any required reading that we should do as we decide how best to tackle this?  
Pointers, tips, tricks, potholes, or any other info would be welcome.  Thanks!

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Masiero Vanzin
>Priority: Major
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31369) Add Documentation for JSON functions

2020-04-21 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31369.
--
Fix Version/s: 3.0.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Fixed in SPARK-31429.

> Add Documentation for JSON functions
> 
>
> Key: SPARK-31369
> URL: https://issues.apache.org/jira/browse/SPARK-31369
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31513) Improve the structure of auto-generated built-in function pages in SQL references

2020-04-21 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-31513:


 Summary: Improve the structure of auto-generated built-in function 
pages in SQL references
 Key: SPARK-31513
 URL: https://issues.apache.org/jira/browse/SPARK-31513
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.1.0
Reporter: Takeshi Yamamuro


We currently put the examples for each group in one section of the built-in 
function documents, like this (see SPARK-31429):
{code}
### Aggregate Functions

#### Examples


### Window Functions

#### Examples

...
{code}
But it might be better to keep everything about one function (signature, 
description, examples, etc.) together. So we need more discussion to improve the 
structure of the page.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31512) make window function using order by optional

2020-04-21 Thread philipse (Jira)
philipse created SPARK-31512:


 Summary: make window function using order by optional
 Key: SPARK-31512
 URL: https://issues.apache.org/jira/browse/SPARK-31512
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5
 Environment: spark2.4.5

hadoop2.7.7
Reporter: philipse


Hi all,

In other SQL dialects, ORDER BY is not mandatory when using a window function, 
so we may want to make it optional here as well.

The below is the case:

*select row_number() over() from test1*

Error: org.apache.spark.sql.AnalysisException: Window function row_number() 
requires window to be ordered, please add ORDER BY clause. For example SELECT 
row_number()(value_expr) OVER (PARTITION BY window_partition ORDER BY 
window_ordering) from table; (state=,code=0)

 

So I suggest making this optional; otherwise we will hit this error when 
migrating SQL from other dialects, such as Hive.
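As a possible workaround in the meantime (illustration only, assuming the 
{{test1}} table from the example above and an active {{spark}} session), an 
arbitrary but explicit ordering satisfies the analyzer:
{code}
// Sketch only: give the window an explicit, if arbitrary, ordering.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{monotonically_increasing_id, row_number}

val w = Window.orderBy(monotonically_increasing_id())
spark.table("test1")
  .withColumn("rn", row_number().over(w))
  .show()
{code}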



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1537) Add integration with Yarn's Application Timeline Server

2020-04-21 Thread Marcelo Masiero Vanzin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-1537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089137#comment-17089137
 ] 

Marcelo Masiero Vanzin commented on SPARK-1537:
---

[~templedf] sorry, I forgot to reply.

ATSv1 wasn't a good match for this, and by the time ATSv2 was developed, 
interest in this feature had long since faded in the Spark community. So this 
was closed.

Also you probably can do this without requiring the code to live in Spark.

But if you actually want to contribute the integration, there's nothing 
preventing you from opening a new bug and posting a PR.

> Add integration with Yarn's Application Timeline Server
> ---
>
> Key: SPARK-1537
> URL: https://issues.apache.org/jira/browse/SPARK-1537
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Marcelo Masiero Vanzin
>Priority: Major
> Attachments: SPARK-1537.txt, spark-1573.patch
>
>
> It would be nice to have Spark integrate with Yarn's Application Timeline 
> Server (see YARN-321, YARN-1530). This would allow users running Spark on 
> Yarn to have a single place to go for all their history needs, and avoid 
> having to manage a separate service (Spark's built-in server).
> At the moment, there's a working version of the ATS in the Hadoop 2.4 branch, 
> although there is still some ongoing work. But the basics are there, and I 
> wouldn't expect them to change (much) at this point.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31233) Enhance RpcTimeoutException Log Message

2020-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31233:
--
Summary: Enhance RpcTimeoutException Log Message  (was: TimeoutException 
contains null remoteAddr)

> Enhance RpcTimeoutException Log Message
> ---
>
> Key: SPARK-31233
> URL: https://issues.apache.org/jira/browse/SPARK-31233
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Yi Huang
>Assignee: Yi Huang
>Priority: Minor
> Fix For: 3.1.0
>
>
> Application log: 
> {code:java}
> Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures 
> timed out after [800 seconds]. This timeout is controlled by 
> spark.network.timeout:
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
> {code}
> Driver log:
> {code:java}
> [block-manager-ask-thread-pool-149] WARN 
> org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
> Cannot receive any reply from null in 800 seconds. This timeout is controlled 
> by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
> receive any reply from null in 800 seconds. This timeout is controlled by 
> spark.network.timeout{code}
> The log message does not provide RpcAddress of the destination RpcEndpoint. 
> It is due to 
> {noformat}
> * The `rpcAddress` may be null, in which case the endpoint is registered via 
> a client-only
> * connection and can only be reached via the client that sent the endpoint 
> reference.{noformat}
>  Solution:
> use the rpcAddress from the client of the NettyRpcEndpointRef once such an 
> endpoint resides in client mode.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31233) TimeoutException contains null remoteAddr

2020-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31233:
-

Assignee: Yi Huang

> TimeoutException contains null remoteAddr
> -
>
> Key: SPARK-31233
> URL: https://issues.apache.org/jira/browse/SPARK-31233
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Yi Huang
>Assignee: Yi Huang
>Priority: Minor
>
> Application log: 
> {code:java}
> Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures 
> timed out after [800 seconds]. This timeout is controlled by 
> spark.network.timeout:
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
> {code}
> Driver log:
> {code:java}
> [block-manager-ask-thread-pool-149] WARN 
> org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
> Cannot receive any reply from null in 800 seconds. This timeout is controlled 
> by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
> receive any reply from null in 800 seconds. This timeout is controlled by 
> spark.network.timeout{code}
> The log message does not provide RpcAddress of the destination RpcEndpoint. 
> It is due to 
> {noformat}
> * The `rpcAddress` may be null, in which case the endpoint is registered via 
> a client-only
> * connection and can only be reached via the client that sent the endpoint 
> reference.{noformat}
>  Solution:
> use the rpcAddress from the client of the NettyRpcEndpointRef once such an 
> endpoint resides in client mode.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31233) TimeoutException contains null remoteAddr

2020-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31233.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28002
[https://github.com/apache/spark/pull/28002]

> TimeoutException contains null remoteAddr
> -
>
> Key: SPARK-31233
> URL: https://issues.apache.org/jira/browse/SPARK-31233
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Yi Huang
>Assignee: Yi Huang
>Priority: Minor
> Fix For: 3.1.0
>
>
> Application log: 
> {code:java}
> Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures 
> timed out after [800 seconds]. This timeout is controlled by 
> spark.network.timeout:
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
> {code}
> Driver log:
> {code:java}
> [block-manager-ask-thread-pool-149] WARN 
> org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
> Cannot receive any reply from null in 800 seconds. This timeout is controlled 
> by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
> receive any reply from null in 800 seconds. This timeout is controlled by 
> spark.network.timeout{code}
> The log message does not provide RpcAddress of the destination RpcEndpoint. 
> It is due to 
> {noformat}
> * The `rpcAddress` may be null, in which case the endpoint is registered via 
> a client-only
> * connection and can only be reached via the client that sent the endpoint 
> reference.{noformat}
>  Solution:
> use the rpcAddress from the client of the NettyRpcEndpointRef once such an 
> endpoint resides in client mode.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31233) TimeoutException contains null remoteAddr

2020-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31233:
--
Issue Type: Improvement  (was: Bug)

> TimeoutException contains null remoteAddr
> -
>
> Key: SPARK-31233
> URL: https://issues.apache.org/jira/browse/SPARK-31233
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.4
>Reporter: Yi Huang
>Priority: Minor
>
> Application log: 
> {code:java}
> Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures 
> timed out after [800 seconds]. This timeout is controlled by 
> spark.network.timeout:
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
> {code}
> Driver log:
> {code:java}
> [block-manager-ask-thread-pool-149] WARN 
> org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
> Cannot receive any reply from null in 800 seconds. This timeout is controlled 
> by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
> receive any reply from null in 800 seconds. This timeout is controlled by 
> spark.network.timeout{code}
> The log message does not provide RpcAddress of the destination RpcEndpoint. 
> It is due to 
> {noformat}
> * The `rpcAddress` may be null, in which case the endpoint is registered via 
> a client-only
> * connection and can only be reached via the client that sent the endpoint 
> reference.{noformat}
>  Solution:
> use the rpcAddress from the client of the NettyRpcEndpointRef once such an 
> endpoint resides in client mode.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31233) TimeoutException contains null remoteAddr

2020-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31233:
--
Affects Version/s: (was: 2.3.4)
   3.1.0

> TimeoutException contains null remoteAddr
> -
>
> Key: SPARK-31233
> URL: https://issues.apache.org/jira/browse/SPARK-31233
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Yi Huang
>Priority: Minor
>
> Application log: 
> {code:java}
> Failed to process batch org.apache.spark.rpc.RpcTimeoutException: Futures 
> timed out after [800 seconds]. This timeout is controlled by 
> spark.network.timeout:
> org.apache.spark.rpc.RpcTimeout.org
> $apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
> {code}
> Driver log:
> {code:java}
> [block-manager-ask-thread-pool-149] WARN 
> org.apache.spark.storage.BlockManagerMaster - Failed to remove RDD 25344 - 
> Cannot receive any reply from null in 800 seconds. This timeout is controlled 
> by spark.network.timeout org.apache.spark.rpc.RpcTimeoutException: Cannot 
> receive any reply from null in 800 seconds. This timeout is controlled by 
> spark.network.timeout{code}
> The log message does not provide RpcAddress of the destination RpcEndpoint. 
> It is due to 
> {noformat}
> * The `rpcAddress` may be null, in which case the endpoint is registered via 
> a client-only
> * connection and can only be reached via the client that sent the endpoint 
> reference.{noformat}
>  Solution:
> use the rpcAddress from the client of the NettyRpcEndpointRef once such an 
> endpoint resides in client mode.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation

2020-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30162.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Please feel free to reopen this if you are facing a bug with 3.0.0-RC1. For me, 
it looks correct.

> Add PushedFilters to metadata in Parquet DSv2 implementation
> 
>
> Key: SPARK-30162
> URL: https://issues.apache.org/jira/browse/SPARK-30162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: pyspark 3.0 preview
> Ubuntu/Centos
> pyarrow 0.14.1 
>Reporter: Nasir Ali
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from 
> 2020-01-01 21-01-32.png, partition_pruning.png
>
>
> Filters are not pushed down in Spark 3.0 preview. Also the output of 
> "explain" method is different. It is hard to debug in 3.0 whether filters 
> were pushed down or not. Below code could reproduce the bug:
>  
> {code:java}
> // code placeholder
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
> ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-16T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = 
> spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True)
> {code}
> {code:java}
> // Spark 2.4 output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#40 = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan ==
> Filter (isnotnull(user#40) && (user#40 = usr2))
> +- Relation[id#38,ts#39,user#40] parquet== Physical Plan ==
> *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, 
> PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], 
> ReadSchema: struct{code}
> {code:java}
> // Spark 3.0.0-preview output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed 
> Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#2 = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized 
> Logical Plan ==
> Filter (isnotnull(user#2) AND (user#2 = usr2))
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical 
> Plan ==
> *(1) Project [id#0, ts#1, user#2]
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2))
>+- *(1) ColumnarToRow
>   +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: 
> InMemoryFileIndex[file:/home/cnali/data], ReadSchema: 
> struct
> {code}
> I have tested it on much larger dataset. Spark 3.0 tries to load whole data 
> and then apply filter. Whereas Spark 2.4 push down the filter. Above output 
> shows that Spark 2.4 applied partition filter but not the Spark 3.0 preview.
>  
> Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and 
> it's hard to debug.  spark.sql.orc.cache.stripe.details.size=1 doesn't 
> work.
>  
> {code:java}
> // pyspark 3 shell output
> $ pyspark
> Python 3.6.8 (default, Aug  7 2019, 17:28:10) 
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Ignoring non-spark config property: 
> java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8
> 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be 
> 

[jira] [Commented] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation

2020-04-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089041#comment-17089041
 ] 

Dongjoon Hyun commented on SPARK-30162:
---

In 3.0.0-RC1, I can see `number of partitions read: 1`.
!partition_pruning.png!

> Add PushedFilters to metadata in Parquet DSv2 implementation
> 
>
> Key: SPARK-30162
> URL: https://issues.apache.org/jira/browse/SPARK-30162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: pyspark 3.0 preview
> Ubuntu/Centos
> pyarrow 0.14.1 
>Reporter: Nasir Ali
>Assignee: Hyukjin Kwon
>Priority: Minor
> Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from 
> 2020-01-01 21-01-32.png, partition_pruning.png
>
>
> Filters are not pushed down in Spark 3.0 preview. Also the output of 
> "explain" method is different. It is hard to debug in 3.0 whether filters 
> were pushed down or not. Below code could reproduce the bug:
>  
> {code:java}
> // code placeholder
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
> ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-16T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = 
> spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True)
> {code}
> {code:java}
> // Spark 2.4 output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#40 = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan ==
> Filter (isnotnull(user#40) && (user#40 = usr2))
> +- Relation[id#38,ts#39,user#40] parquet== Physical Plan ==
> *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, 
> PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], 
> ReadSchema: struct{code}
> {code:java}
> // Spark 3.0.0-preview output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed 
> Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#2 = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized 
> Logical Plan ==
> Filter (isnotnull(user#2) AND (user#2 = usr2))
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical 
> Plan ==
> *(1) Project [id#0, ts#1, user#2]
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2))
>+- *(1) ColumnarToRow
>   +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: 
> InMemoryFileIndex[file:/home/cnali/data], ReadSchema: 
> struct
> {code}
> I have tested it on much larger dataset. Spark 3.0 tries to load whole data 
> and then apply filter. Whereas Spark 2.4 push down the filter. Above output 
> shows that Spark 2.4 applied partition filter but not the Spark 3.0 preview.
>  
> Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and 
> it's hard to debug.  spark.sql.orc.cache.stripe.details.size=1 doesn't 
> work.
>  
> {code:java}
> // pyspark 3 shell output
> $ pyspark
> Python 3.6.8 (default, Aug  7 2019, 17:28:10) 
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Ignoring non-spark config property: 
> java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8
> 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be 
> overridden by the value set by the cluster manager (via 

[jira] [Updated] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation

2020-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30162:
--
Attachment: partition_pruning.png

> Add PushedFilters to metadata in Parquet DSv2 implementation
> 
>
> Key: SPARK-30162
> URL: https://issues.apache.org/jira/browse/SPARK-30162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: pyspark 3.0 preview
> Ubuntu/Centos
> pyarrow 0.14.1 
>Reporter: Nasir Ali
>Assignee: Hyukjin Kwon
>Priority: Minor
> Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from 
> 2020-01-01 21-01-32.png, partition_pruning.png
>
>
> Filters are not pushed down in Spark 3.0 preview. Also the output of 
> "explain" method is different. It is hard to debug in 3.0 whether filters 
> were pushed down or not. Below code could reproduce the bug:
>  
> {code:java}
> // code placeholder
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
> ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-16T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = 
> spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True)
> {code}
> {code:java}
> // Spark 2.4 output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#40 = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan ==
> Filter (isnotnull(user#40) && (user#40 = usr2))
> +- Relation[id#38,ts#39,user#40] parquet== Physical Plan ==
> *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, 
> PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], 
> ReadSchema: struct{code}
> {code:java}
> // Spark 3.0.0-preview output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed 
> Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#2 = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized 
> Logical Plan ==
> Filter (isnotnull(user#2) AND (user#2 = usr2))
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical 
> Plan ==
> *(1) Project [id#0, ts#1, user#2]
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2))
>+- *(1) ColumnarToRow
>   +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: 
> InMemoryFileIndex[file:/home/cnali/data], ReadSchema: 
> struct
> {code}
> I have tested it on much larger dataset. Spark 3.0 tries to load whole data 
> and then apply filter. Whereas Spark 2.4 push down the filter. Above output 
> shows that Spark 2.4 applied partition filter but not the Spark 3.0 preview.
>  
> Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and 
> it's hard to debug.  spark.sql.orc.cache.stripe.details.size=1 doesn't 
> work.
>  
> {code:java}
> // pyspark 3 shell output
> $ pyspark
> Python 3.6.8 (default, Aug  7 2019, 17:28:10) 
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Ignoring non-spark config property: 
> java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8
> 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be 
> overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in 
> mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
> Welcome to
> 

[jira] [Comment Edited] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation

2020-04-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089031#comment-17089031
 ] 

Dongjoon Hyun edited comment on SPARK-30162 at 4/21/20, 8:25 PM:
-

Hi, [~sowen].
Is there a reason for you to remove the fixed version? It's fixed since 
`3.0.0-preview2`, isn't it?

**Apache Spark 3.0.0-preview2**
{code}
>>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
>>> spark.read.parquet("/tmp/foo").filter("5 > id").explain()
== Physical Plan ==
*(1) Project [id#3L]
+- *(1) Filter (isnotnull(id#3L) AND (5 > id#3L))
   +- *(1) ColumnarToRow
  +- BatchScan[id#3L] ParquetScan Location: 
InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct, PushedFilters: 
[IsNotNull(id), LessThan(id,5)]
{code}

**Apache Spark 3.0.0-RC1**
{code}
>>> sql("set spark.sql.sources.useV1SourceList=''")
DataFrame[key: string, value: string]
>>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
>>> spark.read.parquet("/tmp/foo").filter("5 > id").explain()
== Physical Plan ==
*(1) Project [id#53L]
+- *(1) Filter (isnotnull(id#53L) AND (5 > id#53L))
   +- *(1) ColumnarToRow
  +- BatchScan[id#53L] ParquetScan DataFilters: [isnotnull(id#53L), (5 > 
id#53L)], Location: InMemoryFileIndex[file:/tmp/foo], PartitionFilters: [], 
ReadSchema: struct, PushedFilters: [IsNotNull(id), LessThan(id,5)]
{code}


was (Author: dongjoon):
Hi, [~sowen].
Is there a reason for you to remove the fixed version? It's fixed since 
`3.0.0-preview2`, isn't it?

**Apache Spark 3.0.0-preview2**
{code}
>>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
>>> spark.read.parquet("/tmp/foo").filter("5 > id").explain()
== Physical Plan ==
*(1) Project [id#3L]
+- *(1) Filter (isnotnull(id#3L) AND (5 > id#3L))
   +- *(1) ColumnarToRow
  +- BatchScan[id#3L] ParquetScan Location: 
InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct, PushedFilters: 
[IsNotNull(id), LessThan(id,5)]
{code}

**Apache Spark 3.0.0-RC1**
{code}
>>> sql("set spark.sql.sources.useV1SourceList")
DataFrame[key: string, value: string]
>>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
>>> spark.read.parquet("/tmp/foo").filter("5 > id").explain()
== Physical Plan ==
*(1) Project [id#13L]
+- *(1) Filter (isnotnull(id#13L) AND (5 > id#13L))
   +- *(1) ColumnarToRow
  +- FileScan parquet [id#13L] Batched: true, DataFilters: 
[isnotnull(id#13L), (5 > id#13L)], Format: Parquet, Location: 
InMemoryFileIndex[file:/tmp/foo], PartitionFilters: [], PushedFilters: 
[IsNotNull(id), LessThan(id,5)], ReadSchema: struct
{code}

> Add PushedFilters to metadata in Parquet DSv2 implementation
> 
>
> Key: SPARK-30162
> URL: https://issues.apache.org/jira/browse/SPARK-30162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: pyspark 3.0 preview
> Ubuntu/Centos
> pyarrow 0.14.1 
>Reporter: Nasir Ali
>Assignee: Hyukjin Kwon
>Priority: Minor
> Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from 
> 2020-01-01 21-01-32.png
>
>
> Filters are not pushed down in Spark 3.0 preview. Also the output of 
> "explain" method is different. It is hard to debug in 3.0 whether filters 
> were pushed down or not. Below code could reproduce the bug:
>  
> {code:java}
> // code placeholder
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
> ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-16T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = 
> spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True)
> {code}
> {code:java}
> // Spark 2.4 output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#40 = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan ==
> Filter (isnotnull(user#40) && (user#40 = usr2))
> +- Relation[id#38,ts#39,user#40] parquet== Physical Plan ==
> *(1) 

[jira] [Comment Edited] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation

2020-04-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089037#comment-17089037
 ] 

Dongjoon Hyun edited comment on SPARK-30162 at 4/21/20, 8:18 PM:
-

Could you try 3.0.0-RC1, [~nasirali]?
- https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/


was (Author: dongjoon):
Could you try 3.0.0-RC1, [~nasirali]?

> Add PushedFilters to metadata in Parquet DSv2 implementation
> 
>
> Key: SPARK-30162
> URL: https://issues.apache.org/jira/browse/SPARK-30162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: pyspark 3.0 preview
> Ubuntu/Centos
> pyarrow 0.14.1 
>Reporter: Nasir Ali
>Assignee: Hyukjin Kwon
>Priority: Minor
> Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from 
> 2020-01-01 21-01-32.png
>
>
> Filters are not pushed down in Spark 3.0 preview. Also the output of 
> "explain" method is different. It is hard to debug in 3.0 whether filters 
> were pushed down or not. Below code could reproduce the bug:
>  
> {code:java}
> // code placeholder
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
> ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-16T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = 
> spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True)
> {code}
> {code:java}
> // Spark 2.4 output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#40 = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan ==
> Filter (isnotnull(user#40) && (user#40 = usr2))
> +- Relation[id#38,ts#39,user#40] parquet== Physical Plan ==
> *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, 
> PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], 
> ReadSchema: struct{code}
> {code:java}
> // Spark 3.0.0-preview output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed 
> Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#2 = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized 
> Logical Plan ==
> Filter (isnotnull(user#2) AND (user#2 = usr2))
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical 
> Plan ==
> *(1) Project [id#0, ts#1, user#2]
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2))
>+- *(1) ColumnarToRow
>   +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: 
> InMemoryFileIndex[file:/home/cnali/data], ReadSchema: 
> struct
> {code}
> I have tested it on a much larger dataset. Spark 3.0 tries to load the whole data 
> and then apply the filter, whereas Spark 2.4 pushes the filter down. The above 
> output shows that Spark 2.4 applied the partition filter but the Spark 3.0 preview did not.
>  
> Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and 
> it's hard to debug.  spark.sql.orc.cache.stripe.details.size=1 doesn't 
> work.
>  
> {code:java}
> // pyspark 3 shell output
> $ pyspark
> Python 3.6.8 (default, Aug  7 2019, 17:28:10) 
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Ignoring non-spark config property: 
> java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8
> 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/12/09 07:05:36 

[jira] [Commented] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation

2020-04-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089037#comment-17089037
 ] 

Dongjoon Hyun commented on SPARK-30162:
---

Could you try 3.0.0-RC1, [~nasirali]?

> Add PushedFilters to metadata in Parquet DSv2 implementation
> 
>
> Key: SPARK-30162
> URL: https://issues.apache.org/jira/browse/SPARK-30162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: pyspark 3.0 preview
> Ubuntu/Centos
> pyarrow 0.14.1 
>Reporter: Nasir Ali
>Assignee: Hyukjin Kwon
>Priority: Minor
> Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from 
> 2020-01-01 21-01-32.png
>
>
> Filters are not pushed down in Spark 3.0 preview. Also the output of 
> "explain" method is different. It is hard to debug in 3.0 whether filters 
> were pushed down or not. Below code could reproduce the bug:
>  
> {code:java}
> // code placeholder
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
> ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-16T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = 
> spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True)
> {code}
> {code:java}
> // Spark 2.4 output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#40 = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan ==
> Filter (isnotnull(user#40) && (user#40 = usr2))
> +- Relation[id#38,ts#39,user#40] parquet== Physical Plan ==
> *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, 
> PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], 
> ReadSchema: struct{code}
> {code:java}
> // Spark 3.0.0-preview output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed 
> Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#2 = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized 
> Logical Plan ==
> Filter (isnotnull(user#2) AND (user#2 = usr2))
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical 
> Plan ==
> *(1) Project [id#0, ts#1, user#2]
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2))
>+- *(1) ColumnarToRow
>   +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: 
> InMemoryFileIndex[file:/home/cnali/data], ReadSchema: 
> struct
> {code}
> I have tested it on a much larger dataset. Spark 3.0 tries to load the whole data 
> and then apply the filter, whereas Spark 2.4 pushes the filter down. The above 
> output shows that Spark 2.4 applied the partition filter but the Spark 3.0 preview did not.
>  
> Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and 
> it's hard to debug.  spark.sql.orc.cache.stripe.details.size=1 doesn't 
> work.
>  
> {code:java}
> // pyspark 3 shell output
> $ pyspark
> Python 3.6.8 (default, Aug  7 2019, 17:28:10) 
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Ignoring non-spark config property: 
> java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8
> 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be 
> overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in 
> mesos/standalone/kubernetes and LOCAL_DIRS in 

[jira] [Comment Edited] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation

2020-04-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089031#comment-17089031
 ] 

Dongjoon Hyun edited comment on SPARK-30162 at 4/21/20, 8:15 PM:
-

Hi, [~sowen].
Is there a reason for you to remove the fixed version? It's fixed since 
`3.0.0-preview2`, isn't it?

**Apache Spark 3.0.0-preview2**
{code}
>>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
>>> spark.read.parquet("/tmp/foo").filter("5 > id").explain()
== Physical Plan ==
*(1) Project [id#3L]
+- *(1) Filter (isnotnull(id#3L) AND (5 > id#3L))
   +- *(1) ColumnarToRow
  +- BatchScan[id#3L] ParquetScan Location: 
InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct, PushedFilters: 
[IsNotNull(id), LessThan(id,5)]
{code}

**Apache Spark 3.0.0-RC1**
{code}
>>> sql("set spark.sql.sources.useV1SourceList")
DataFrame[key: string, value: string]
>>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
>>> spark.read.parquet("/tmp/foo").filter("5 > id").explain()
== Physical Plan ==
*(1) Project [id#13L]
+- *(1) Filter (isnotnull(id#13L) AND (5 > id#13L))
   +- *(1) ColumnarToRow
  +- FileScan parquet [id#13L] Batched: true, DataFilters: 
[isnotnull(id#13L), (5 > id#13L)], Format: Parquet, Location: 
InMemoryFileIndex[file:/tmp/foo], PartitionFilters: [], PushedFilters: 
[IsNotNull(id), LessThan(id,5)], ReadSchema: struct
{code}


was (Author: dongjoon):
Hi, [~sowen].
Is there a reason for you to remove the fixed version? It's fixed since 
`3.0.0-preview2`, isn't it?
{code}
>>> spark.version
'3.0.0-preview2'
>>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
>>> spark.read.parquet("/tmp/foo").filter("5 > id").explain()
== Physical Plan ==
*(1) Project [id#3L]
+- *(1) Filter (isnotnull(id#3L) AND (5 > id#3L))
   +- *(1) ColumnarToRow
  +- BatchScan[id#3L] ParquetScan Location: 
InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct, PushedFilters: 
[IsNotNull(id), LessThan(id,5)]
{code}

> Add PushedFilters to metadata in Parquet DSv2 implementation
> 
>
> Key: SPARK-30162
> URL: https://issues.apache.org/jira/browse/SPARK-30162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: pyspark 3.0 preview
> Ubuntu/Centos
> pyarrow 0.14.1 
>Reporter: Nasir Ali
>Assignee: Hyukjin Kwon
>Priority: Minor
> Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from 
> 2020-01-01 21-01-32.png
>
>
> Filters are not pushed down in Spark 3.0 preview. Also the output of 
> "explain" method is different. It is hard to debug in 3.0 whether filters 
> were pushed down or not. Below code could reproduce the bug:
>  
> {code:java}
> // code placeholder
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
> ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-16T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = 
> spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True)
> {code}
> {code:java}
> // Spark 2.4 output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#40 = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan ==
> Filter (isnotnull(user#40) && (user#40 = usr2))
> +- Relation[id#38,ts#39,user#40] parquet== Physical Plan ==
> *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, 
> PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], 
> ReadSchema: struct{code}
> {code:java}
> // Spark 3.0.0-preview output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed 
> Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#2 = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized 
> 

[jira] [Commented] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation

2020-04-21 Thread Nasir Ali (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089034#comment-17089034
 ] 

Nasir Ali commented on SPARK-30162:
---

[~dongjoon] This bug has not been fixed. Please have a look at my previous 
comment and the uploaded screenshots. Yes, you added logs to debug, but it doesn't 
perform filtering on the partition key yet.
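
For what it's worth, a quick way to double-check whether the partition filter is applied is to inspect the physical plan and cross-check row counts against the partition directory. A minimal sketch, reusing the paths and columns from the reproduction above (everything else is illustrative, not part of the fix):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-pruning-check")
  .master("local[*]")
  .getOrCreate()

// Dataset written with df.write.partitionBy("user").parquet(...) as in the reproduction.
val df = spark.read.parquet("/home/cnali/data")

// When pruning kicks in, the scan node of the physical plan lists the partition filter,
// e.g. PartitionFilters: [isnotnull(user), (user = usr2)].
df.filter("user = 'usr2'").explain(true)

// Cross-check: reading the partition directory directly should return the same
// number of rows as the filtered scan.
val viaFilter    = df.filter("user = 'usr2'").count()
val viaDirectory = spark.read.parquet("/home/cnali/data/user=usr2").count()
println(s"filtered count = $viaFilter, single-partition count = $viaDirectory")
{code}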

> Add PushedFilters to metadata in Parquet DSv2 implementation
> 
>
> Key: SPARK-30162
> URL: https://issues.apache.org/jira/browse/SPARK-30162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: pyspark 3.0 preview
> Ubuntu/Centos
> pyarrow 0.14.1 
>Reporter: Nasir Ali
>Assignee: Hyukjin Kwon
>Priority: Minor
> Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from 
> 2020-01-01 21-01-32.png
>
>
> Filters are not pushed down in Spark 3.0 preview. Also the output of 
> "explain" method is different. It is hard to debug in 3.0 whether filters 
> were pushed down or not. Below code could reproduce the bug:
>  
> {code:java}
> // code placeholder
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
> ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-16T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = 
> spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True)
> {code}
> {code:java}
> // Spark 2.4 output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#40 = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan ==
> Filter (isnotnull(user#40) && (user#40 = usr2))
> +- Relation[id#38,ts#39,user#40] parquet== Physical Plan ==
> *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, 
> PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], 
> ReadSchema: struct{code}
> {code:java}
> // Spark 3.0.0-preview output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed 
> Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#2 = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized 
> Logical Plan ==
> Filter (isnotnull(user#2) AND (user#2 = usr2))
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical 
> Plan ==
> *(1) Project [id#0, ts#1, user#2]
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2))
>+- *(1) ColumnarToRow
>   +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: 
> InMemoryFileIndex[file:/home/cnali/data], ReadSchema: 
> struct
> {code}
> I have tested it on a much larger dataset. Spark 3.0 tries to load the whole data 
> and then apply the filter, whereas Spark 2.4 pushes the filter down. The above 
> output shows that Spark 2.4 applied the partition filter but the Spark 3.0 preview did not.
>  
> Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and 
> it's hard to debug.  spark.sql.orc.cache.stripe.details.size=1 doesn't 
> work.
>  
> {code:java}
> // pyspark 3 shell output
> $ pyspark
> Python 3.6.8 (default, Aug  7 2019, 17:28:10) 
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Ignoring non-spark config property: 
> java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8
> 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/12/09 07:05:36 WARN SparkConf: Note that 

[jira] [Comment Edited] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation

2020-04-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089031#comment-17089031
 ] 

Dongjoon Hyun edited comment on SPARK-30162 at 4/21/20, 8:10 PM:
-

Hi, [~sowen].
Is there a reason for you to remove the fixed version? It's fixed since 
`3.0.0-preview2`, isn't it?
{code}
>>> spark.version
'3.0.0-preview2'
>>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
>>> spark.read.parquet("/tmp/foo").filter("5 > id").explain()
== Physical Plan ==
*(1) Project [id#3L]
+- *(1) Filter (isnotnull(id#3L) AND (5 > id#3L))
   +- *(1) ColumnarToRow
  +- BatchScan[id#3L] ParquetScan Location: 
InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct, PushedFilters: 
[IsNotNull(id), LessThan(id,5)]
{code}


was (Author: dongjoon):
Hi, [~sowen].
Is there a reason for you to remove the fixed version?
{code}
>>> spark.version
'3.0.0'
>>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
>>> spark.read.parquet("/tmp/foo").filter("5 > id").explain()
== Physical Plan ==
*(1) Project [id#32L]
+- *(1) Filter (isnotnull(id#32L) AND (5 > id#32L))
   +- *(1) ColumnarToRow
  +- FileScan parquet [id#32L] Batched: true, DataFilters: 
[isnotnull(id#32L), (5 > id#32L)], Format: Parquet, Location: 
InMemoryFileIndex[file:/tmp/foo], PartitionFilters: [], PushedFilters: 
[IsNotNull(id), LessThan(id,5)], ReadSchema: struct
{code}

> Add PushedFilters to metadata in Parquet DSv2 implementation
> 
>
> Key: SPARK-30162
> URL: https://issues.apache.org/jira/browse/SPARK-30162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: pyspark 3.0 preview
> Ubuntu/Centos
> pyarrow 0.14.1 
>Reporter: Nasir Ali
>Assignee: Hyukjin Kwon
>Priority: Minor
> Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from 
> 2020-01-01 21-01-32.png
>
>
> Filters are not pushed down in Spark 3.0 preview. Also the output of 
> "explain" method is different. It is hard to debug in 3.0 whether filters 
> were pushed down or not. Below code could reproduce the bug:
>  
> {code:java}
> // code placeholder
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
> ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-16T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = 
> spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True)
> {code}
> {code:java}
> // Spark 2.4 output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#40 = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan ==
> Filter (isnotnull(user#40) && (user#40 = usr2))
> +- Relation[id#38,ts#39,user#40] parquet== Physical Plan ==
> *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, 
> PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], 
> ReadSchema: struct{code}
> {code:java}
> // Spark 3.0.0-preview output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed 
> Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#2 = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized 
> Logical Plan ==
> Filter (isnotnull(user#2) AND (user#2 = usr2))
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical 
> Plan ==
> *(1) Project [id#0, ts#1, user#2]
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2))
>+- *(1) ColumnarToRow
>   +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: 
> InMemoryFileIndex[file:/home/cnali/data], ReadSchema: 
> struct
> {code}
> I have tested it on much larger dataset. Spark 3.0 tries to load whole data 
> and then apply filter. Whereas Spark 2.4 push down the filter. Above output 
> shows 

[jira] [Commented] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation

2020-04-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089031#comment-17089031
 ] 

Dongjoon Hyun commented on SPARK-30162:
---

Hi, [~sowen].
Is there a reason for you to remove the fixed version?
{code}
>>> spark.version
'3.0.0'
>>> spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
>>> spark.read.parquet("/tmp/foo").filter("5 > id").explain()
== Physical Plan ==
*(1) Project [id#32L]
+- *(1) Filter (isnotnull(id#32L) AND (5 > id#32L))
   +- *(1) ColumnarToRow
  +- FileScan parquet [id#32L] Batched: true, DataFilters: 
[isnotnull(id#32L), (5 > id#32L)], Format: Parquet, Location: 
InMemoryFileIndex[file:/tmp/foo], PartitionFilters: [], PushedFilters: 
[IsNotNull(id), LessThan(id,5)], ReadSchema: struct
{code}

> Add PushedFilters to metadata in Parquet DSv2 implementation
> 
>
> Key: SPARK-30162
> URL: https://issues.apache.org/jira/browse/SPARK-30162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: pyspark 3.0 preview
> Ubuntu/Centos
> pyarrow 0.14.1 
>Reporter: Nasir Ali
>Assignee: Hyukjin Kwon
>Priority: Minor
> Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from 
> 2020-01-01 21-01-32.png
>
>
> Filters are not pushed down in Spark 3.0 preview. Also the output of 
> "explain" method is different. It is hard to debug in 3.0 whether filters 
> were pushed down or not. Below code could reproduce the bug:
>  
> {code:java}
> // code placeholder
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
> ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-16T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.write.partitionBy("user").parquet("/home/cnali/data/")df2 = 
> spark.read.load("/home/cnali/data/")df2.filter("user=='usr2'").explain(True)
> {code}
> {code:java}
> // Spark 2.4 output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#40 = usr2)
> +- Relation[id#38,ts#39,user#40] parquet== Optimized Logical Plan ==
> Filter (isnotnull(user#40) && (user#40 = usr2))
> +- Relation[id#38,ts#39,user#40] parquet== Physical Plan ==
> *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, 
> PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], 
> ReadSchema: struct{code}
> {code:java}
> // Spark 3.0.0-preview output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Analyzed 
> Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#2 = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Optimized 
> Logical Plan ==
> Filter (isnotnull(user#2) AND (user#2 = usr2))
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data== Physical 
> Plan ==
> *(1) Project [id#0, ts#1, user#2]
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2))
>+- *(1) ColumnarToRow
>   +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: 
> InMemoryFileIndex[file:/home/cnali/data], ReadSchema: 
> struct
> {code}
> I have tested it on a much larger dataset. Spark 3.0 tries to load the whole data 
> and then apply the filter, whereas Spark 2.4 pushes the filter down. The above 
> output shows that Spark 2.4 applied the partition filter but the Spark 3.0 preview did not.
>  
> Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and 
> it's hard to debug.  spark.sql.orc.cache.stripe.details.size=1 doesn't 
> work.
>  
> {code:java}
> // pyspark 3 shell output
> $ pyspark
> Python 3.6.8 (default, Aug  7 2019, 17:28:10) 
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Ignoring non-spark config property: 
> java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8
> 

[jira] [Created] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe

2020-04-21 Thread Jira
Herman van Hövell created SPARK-31511:
-

 Summary: Make BytesToBytesMap iterator() thread-safe
 Key: SPARK-31511
 URL: https://issues.apache.org/jira/browse/SPARK-31511
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5, 3.0.0
Reporter: Herman van Hövell
Assignee: Herman van Hövell


BytesToBytesMap currently has a thread safe and unsafe iterator method. This is 
somewhat confusing. We should make iterator() thread safe and remove the 
safeIterator() function.
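
For illustration, a simplified, self-contained sketch of the intended shape (this is not the actual BytesToBytesMap code; the class, names and locking strategy below are placeholders):

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Before: iterator() (no locking) plus a separate safeIterator().
// After: a single iterator() that is safe to call concurrently, so callers
// no longer have to choose between the two methods.
final class RecordMap[K, V] {
  private val records = ArrayBuffer.empty[(K, V)]

  def put(key: K, value: V): Unit = this.synchronized {
    records += ((key, value))
  }

  // Snapshots the contents under the map's lock and iterates over the snapshot,
  // so concurrent writers cannot corrupt the iteration.
  def iterator: Iterator[(K, V)] = this.synchronized {
    records.toList.iterator
  }
}

// Usage: writers and readers can share the map without extra coordination.
val m = new RecordMap[String, Int]
m.put("a", 1)
m.put("b", 2)
m.iterator.foreach { case (k, v) => println(s"$k -> $v") }
{code}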



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir

2020-04-21 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088997#comment-17088997
 ] 

Nicholas Chammas commented on SPARK-31170:
--

Isn't this also an issue in Spark 2.4.5?
{code:java}
$ spark-sql
...
20/04/21 15:12:26 INFO metastore: Mestastore configuration 
hive.metastore.warehouse.dir changed from /user/hive/warehouse to 
file:/Users/myusername/spark-warehouse
...
spark-sql> create table test(a int);
...
20/04/21 15:12:48 WARN HiveMetaStore: Location: file:/user/hive/warehouse/test 
specified for non-external table:test
20/04/21 15:12:48 INFO FileUtils: Creating directory if it doesn't exist: 
file:/user/hive/warehouse/test
Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:file:/user/hive/warehouse/test is not a directory or 
unable to create one);{code}
If I understood the problem correctly, then the fix should perhaps also be 
backported to 2.4.x.
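
In the meantime, the usual application-side workaround is to pin the warehouse location when building the session. A minimal sketch (illustrative only; the path is a placeholder, spark-hive must be on the classpath, and this does not touch the spark-sql CLI code path this ticket fixes):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-dir-example")
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "/tmp/my-spark-warehouse")  // placeholder path
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS test (a INT)")
// The table location should land under the configured warehouse directory.
spark.sql("DESCRIBE FORMATTED test").show(100, truncate = false)
{code}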

> Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir
> 
>
> Key: SPARK-31170
> URL: https://issues.apache.org/jira/browse/SPARK-31170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> In Spark CLI, we create a hive CliSessionState and it does not load the 
> hive-site.xml. So the configurations in hive-site.xml will not take effect 
> as they do in other spark-hive integration apps.
> Also, the warehouse directory is not correctly picked. If the `default` 
> database does not exist, the CliSessionState will create one during the first 
> time it talks to the metastore. The `Location` of the default DB will be 
> neither the value of spark.sql.warehouse.dir nor the user-specified value of 
> hive.metastore.warehouse.dir, but the default value of 
> hive.metastore.warehouse.dir, which will always be `/user/hive/warehouse`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31381) Upgrade built-in Hive 2.3.6 to 2.3.7

2020-04-21 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-31381:

Parent: SPARK-28684
Issue Type: Sub-task  (was: Improvement)

> Upgrade built-in Hive 2.3.6 to 2.3.7
> 
>
> Key: SPARK-31381
> URL: https://issues.apache.org/jira/browse/SPARK-31381
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Hive 2.3.7 fixed these issues:
> HIVE-21508: ClassCastException when initializing HiveMetaStoreClient on JDK10 
> or newer
> HIVE-21980:Parsing time can be high in case of deeply nested subqueries
> HIVE-22249: Support Parquet through HCatalog



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31503) fix the SQL string of the TRIM functions

2020-04-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31503.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28281
[https://github.com/apache/spark/pull/28281]

> fix the SQL string of the TRIM functions
> 
>
> Key: SPARK-31503
> URL: https://issues.apache.org/jira/browse/SPARK-31503
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2020-04-21 Thread Kushal Yellam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088868#comment-17088868
 ] 

Kushal Yellam commented on SPARK-10925:
---

[~aseigneurin] please help me with the above issue.

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, 
> TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)

[jira] [Commented] (SPARK-31474) Consistency between dayofweek/dow in extract expression and dayofweek function

2020-04-21 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088820#comment-17088820
 ] 

Wenchen Fan commented on SPARK-31474:
-

Thanks for catching! I've pushed to fix to branch-3.0.

> Consistency between dayofweek/dow in extract expression and dayofweek function
> --
>
> Key: SPARK-31474
> URL: https://issues.apache.org/jira/browse/SPARK-31474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> spark-sql> SELECT extract(dayofweek from '2009-07-26');
> 1
> spark-sql> SELECT extract(dow from '2009-07-26');
> 0
> spark-sql> SELECT extract(isodow from '2009-07-26');
> 7
> spark-sql> SELECT dayofweek('2009-07-26');
> 1
> spark-sql> SELECT weekday('2009-07-26');
> 6
> {code}
> Currently, there are 4 types of day-of-week ranges: 
> the function dayofweek (2.3.0) and extracting dayofweek (2.4.0) return 
> Sunday(1) to Saturday(7)
> extracting dow (3.0.0) returns Sunday(0) to Saturday(6)
> extracting isodow (3.0.0) returns Monday(1) to Sunday(7)
> the function weekday (2.4.0) returns Monday(0) to Sunday(6)
> Actually, extracting dayofweek and dow are both derived from PostgreSQL but 
> have different meanings.
> https://issues.apache.org/jira/browse/SPARK-23903
> https://issues.apache.org/jira/browse/SPARK-28623



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31474) Consistency between dayofweek/dow in extract expression and dayofweek function

2020-04-21 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088820#comment-17088820
 ] 

Wenchen Fan edited comment on SPARK-31474 at 4/21/20, 4:02 PM:
---

Thanks for catching! I've pushed a fix to branch-3.0.


was (Author: cloud_fan):
Thanks for catching! I've pushed to fix to branch-3.0.

> Consistency between dayofweek/dow in extract expression and dayofweek function
> --
>
> Key: SPARK-31474
> URL: https://issues.apache.org/jira/browse/SPARK-31474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> spark-sql> SELECT extract(dayofweek from '2009-07-26');
> 1
> spark-sql> SELECT extract(dow from '2009-07-26');
> 0
> spark-sql> SELECT extract(isodow from '2009-07-26');
> 7
> spark-sql> SELECT dayofweek('2009-07-26');
> 1
> spark-sql> SELECT weekday('2009-07-26');
> 6
> {code}
> Currently, there are 4 types of day-of-week ranges: 
> the function dayofweek (2.3.0) and extracting dayofweek (2.4.0) return 
> Sunday(1) to Saturday(7)
> extracting dow (3.0.0) returns Sunday(0) to Saturday(6)
> extracting isodow (3.0.0) returns Monday(1) to Sunday(7)
> the function weekday (2.4.0) returns Monday(0) to Sunday(6)
> Actually, extracting dayofweek and dow are both derived from PostgreSQL but 
> have different meanings.
> https://issues.apache.org/jira/browse/SPARK-23903
> https://issues.apache.org/jira/browse/SPARK-28623



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31510) Set setwd in R documentation build

2020-04-21 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-31510:


 Summary: Set setwd in R documentation build
 Key: SPARK-31510
 URL: https://issues.apache.org/jira/browse/SPARK-31510
 Project: Spark
  Issue Type: Improvement
  Components: R
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon
Assignee: Hyukjin Kwon


{code}
> library(devtools); devtools::document(pkg="./pkg", roclets=c("rd"))
Loading required package: usethis
Error: Could not find package root, is your working directory inside a package?
{code}

Seems like in some environments it fails as above. 

https://stackoverflow.com/questions/52670051/how-to-troubleshoot-error-could-not-find-package-root
https://groups.google.com/forum/#!topic/rdevtools/79jjjdc_wjg





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31510) Set setwd in R documentation build

2020-04-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31510:
-
Priority: Trivial  (was: Major)

> Set setwd in R documentation build
> --
>
> Key: SPARK-31510
> URL: https://issues.apache.org/jira/browse/SPARK-31510
> Project: Spark
>  Issue Type: Improvement
>  Components: R
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Trivial
>
> {code}
> > library(devtools); devtools::document(pkg="./pkg", roclets=c("rd"))
> Loading required package: usethis
> Error: Could not find package root, is your working directory inside a 
> package?
> {code}
> Seems like in some environments it fails as above. 
> https://stackoverflow.com/questions/52670051/how-to-troubleshoot-error-could-not-find-package-root
> https://groups.google.com/forum/#!topic/rdevtools/79jjjdc_wjg



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache

2020-04-21 Thread Rajeev Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088785#comment-17088785
 ] 

Rajeev Kumar edited comment on SPARK-26385 at 4/21/20, 3:31 PM:


I am also facing the same issue for spark-2.4.4-bin-hadoop2.7. I am using spark 
structured streaming with Kafka. Reading the stream from Kafka and saving it to 
HBase.


 I am putting the logs from my application (after removing ip and username).
 When application starts it prints this log. We can see it is creating the 
HDFS_DELEGATION_TOKEN (token id = 6972072)
{quote}20/03/17 13:24:09 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 
6972072 for  on ha-hdfs:20/03/17 13:24:09 INFO 
HadoopFSDelegationTokenProvider: getting token for: 
DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_631280530_1, 
ugi=@ (auth:KERBEROS)]]20/03/17 13:24:09 INFO DFSClient: Created 
HDFS_DELEGATION_TOKEN token 6972073 for  on ha-hdfs:20/03/17 
13:24:09 INFO HadoopFSDelegationTokenProvider: Renewal interval is 86400039 for 
token HDFS_DELEGATION_TOKEN20/03/17 13:24:10 DEBUG 
HadoopDelegationTokenManager: Service hive does not require a token. Check your 
configuration to see if security is disabled or not.20/03/17 13:24:11 DEBUG 
HBaseDelegationTokenProvider: Attempting to fetch HBase security token.20/03/17 
13:24:12 INFO HBaseDelegationTokenProvider: Get token from HBase: Kind: 
HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: 
(org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@d1)20/03/17
 13:24:12 INFO AMCredentialRenewer: Scheduling login from keytab in 18.0 h.
{quote}
After 18 hours as mentioned in log, it created new tokens also. Token number is 
increased (7041621).
{quote}20/03/18 07:24:10 INFO AMCredentialRenewer: Attempting to login to KDC 
using principal: 20/03/18 07:24:10 INFO AMCredentialRenewer: Successfully 
logged into KDC.20/03/18 07:24:16 DEBUG HadoopFSDelegationTokenProvider: 
Delegation token renewer is: rm/@20/03/18 07:24:16 INFO 
HadoopFSDelegationTokenProvider: getting token for: 
DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-2072296893_22, 
ugi=@  (auth:KERBEROS)]]20/03/18 07:24:16 INFO DFSClient: 
Created HDFS_DELEGATION_TOKEN token 7041621 for  on 
ha-hdfs:20/03/18 07:24:16 DEBUG HadoopDelegationTokenManager: Service 
hive does not require a token. Check your configuration to see if security is 
disabled or not.20/03/18 07:24:16 DEBUG HBaseDelegationTokenProvider: 
Attempting to fetch HBase security token.20/03/18 07:24:16 INFO 
HBaseDelegationTokenProvider: Get token from HBase: Kind: HBASE_AUTH_TOKEN, 
Service: 812777c29d67, Ident: 
(org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102)20/03/18
 07:24:16 INFO AMCredentialRenewer: Scheduling login from keytab in 18.0 
h.20/03/18 07:24:16 INFO AMCredentialRenewer: Updating delegation 
tokens.20/03/18 07:24:17 INFO SparkHadoopUtil: Updating delegation tokens for 
current user.20/03/18 07:24:17 DEBUG SparkHadoopUtil: Adding/updating 
delegation tokens List(Kind: HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: 
(org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102); 
org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102, Kind: 
HDFS_DELEGATION_TOKEN, Service: ha-hdfs:, Ident: (HDFS_DELEGATION_TOKEN 
token 7041621 for ); HDFS_DELEGATION_TOKEN token 7041621 for ; 
Renewer: yarn; Issued: 3/18/20 7:24 AM; Max Date: 3/25/20 7:24 AM)20/03/18 
07:24:17 INFO SparkHadoopUtil: Updating delegation tokens for current 
user.20/03/18 07:24:17 DEBUG SparkHadoopUtil: Adding/updating delegation tokens 
List(Kind: HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: 
(org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102); 
org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102, Kind: 
HDFS_DELEGATION_TOKEN, Service: ha-hdfs:, Ident: (HDFS_DELEGATION_TOKEN 
token 7041621 for ); HDFS_DELEGATION_TOKEN token 7041621 for ; 
Renewer: yarn; Issued: 3/18/20 7:24 AM; Max Date: 3/25/20 7:24 AM)
{quote}
Everything goes fine till 24 hours. After that I see LeaseRenewer exception. 
But it is picking the older token number (6972072). This behaviour is the same even 
if I use "--conf spark.hadoop.fs.hdfs.impl.disable.cache=true"
{quote}20/03/18 13:24:28 WARN Client: Exception encountered while connecting to 
the server : 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (HDFS_DELEGATION_TOKEN token 6972072 for ) is expired20/03/18 
13:24:28 WARN LeaseRenewer: Failed to renew lease for 
[DFSClient_NONMAPREDUCE_631280530_1] for 30 seconds.  Will retry shortly 
...org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (HDFS_DELEGATION_TOKEN token 6972072 for ) is expired        at 
org.apache.hadoop.ipc.Client.call(Client.java:1475)        at 

[jira] [Commented] (SPARK-26385) YARN - Spark Stateful Structured streaming HDFS_DELEGATION_TOKEN not found in cache

2020-04-21 Thread Rajeev Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088785#comment-17088785
 ] 

Rajeev Kumar commented on SPARK-26385:
--

I am also facing the same issue for spark-2.4.4-bin-hadoop2.7. I am using spark 
structured streaming with Kafka. Reading the stream from Kafka and saving it to 
HBase.
I am putting the logs from my application (after removing ip and username).
When application starts it prints this log. We can see it is creating the 
HDFS_DELEGATION_TOKEN (token id = 6972072)


{quote}

20/03/17 13:24:09 INFO DFSClient: Created HDFS_DELEGATION_TOKEN token 6972072 
for  on ha-hdfs:20/03/17 13:24:09 INFO 
HadoopFSDelegationTokenProvider: getting token for: 
DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_631280530_1, 
ugi=@ (auth:KERBEROS)]]20/03/17 13:24:09 INFO DFSClient: Created 
HDFS_DELEGATION_TOKEN token 6972073 for  on ha-hdfs:20/03/17 
13:24:09 INFO HadoopFSDelegationTokenProvider: Renewal interval is 86400039 for 
token HDFS_DELEGATION_TOKEN20/03/17 13:24:10 DEBUG 
HadoopDelegationTokenManager: Service hive does not require a token. Check your 
configuration to see if security is disabled or not.20/03/17 13:24:11 DEBUG 
HBaseDelegationTokenProvider: Attempting to fetch HBase security token.20/03/17 
13:24:12 INFO HBaseDelegationTokenProvider: Get token from HBase: Kind: 
HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: 
(org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@d1)20/03/17
 13:24:12 INFO AMCredentialRenewer: Scheduling login from keytab in 18.0 h.

{quote}


After 18 hours as mentioned in log, it created new tokens also. Token number is 
increased (7041621).

{quote}

20/03/18 07:24:10 INFO AMCredentialRenewer: Attempting to login to KDC using 
principal: 20/03/18 07:24:10 INFO AMCredentialRenewer: Successfully 
logged into KDC.20/03/18 07:24:16 DEBUG HadoopFSDelegationTokenProvider: 
Delegation token renewer is: rm/@20/03/18 07:24:16 INFO 
HadoopFSDelegationTokenProvider: getting token for: 
DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-2072296893_22, 
ugi=@  (auth:KERBEROS)]]20/03/18 07:24:16 INFO DFSClient: 
Created HDFS_DELEGATION_TOKEN token 7041621 for  on 
ha-hdfs:20/03/18 07:24:16 DEBUG HadoopDelegationTokenManager: Service 
hive does not require a token. Check your configuration to see if security is 
disabled or not.20/03/18 07:24:16 DEBUG HBaseDelegationTokenProvider: 
Attempting to fetch HBase security token.20/03/18 07:24:16 INFO 
HBaseDelegationTokenProvider: Get token from HBase: Kind: HBASE_AUTH_TOKEN, 
Service: 812777c29d67, Ident: 
(org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102)20/03/18
 07:24:16 INFO AMCredentialRenewer: Scheduling login from keytab in 18.0 
h.20/03/18 07:24:16 INFO AMCredentialRenewer: Updating delegation 
tokens.20/03/18 07:24:17 INFO SparkHadoopUtil: Updating delegation tokens for 
current user.20/03/18 07:24:17 DEBUG SparkHadoopUtil: Adding/updating 
delegation tokens List(Kind: HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: 
(org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102); 
org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102, Kind: 
HDFS_DELEGATION_TOKEN, Service: ha-hdfs:, Ident: (HDFS_DELEGATION_TOKEN 
token 7041621 for ); HDFS_DELEGATION_TOKEN token 7041621 for ; 
Renewer: yarn; Issued: 3/18/20 7:24 AM; Max Date: 3/25/20 7:24 AM)20/03/18 
07:24:17 INFO SparkHadoopUtil: Updating delegation tokens for current 
user.20/03/18 07:24:17 DEBUG SparkHadoopUtil: Adding/updating delegation tokens 
List(Kind: HBASE_AUTH_TOKEN, Service: 812777c29d67, Ident: 
(org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102); 
org.apache.hadoop.hbase.security.token.AuthenticationTokenIdentifier@102, Kind: 
HDFS_DELEGATION_TOKEN, Service: ha-hdfs:, Ident: (HDFS_DELEGATION_TOKEN 
token 7041621 for ); HDFS_DELEGATION_TOKEN token 7041621 for ; 
Renewer: yarn; Issued: 3/18/20 7:24 AM; Max Date: 3/25/20 7:24 AM)

{quote}


Everything goes fine till 24 hours. After that I see LeaseRenewer exception. 
But it is picking the older token number (6972072). This behaviour is the same even 
if I use "--conf spark.hadoop.fs.hdfs.impl.disable.cache=true"


{quote}

20/03/18 13:24:28 WARN Client: Exception encountered while connecting to the 
server : 
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (HDFS_DELEGATION_TOKEN token 6972072 for ) is expired20/03/18 
13:24:28 WARN LeaseRenewer: Failed to renew lease for 
[DFSClient_NONMAPREDUCE_631280530_1] for 30 seconds.  Will retry shortly 
...org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
 token (HDFS_DELEGATION_TOKEN token 6972072 for 

[jira] [Resolved] (SPARK-31361) Rebase datetime in parquet/avro according to file metadata

2020-04-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31361.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28137
[https://github.com/apache/spark/pull/28137]

> Rebase datetime in parquet/avro according to file metadata
> --
>
> Key: SPARK-31361
> URL: https://issues.apache.org/jira/browse/SPARK-31361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10925) Exception when joining DataFrames

2020-04-21 Thread Kushal Yellam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088714#comment-17088714
 ] 

Kushal Yellam edited comment on SPARK-10925 at 4/21/20, 1:58 PM:
-

Hi Team,

I'm facing the following issue when joining two DataFrames:

{code}
in operator !Join LeftOuter,
(((cast(Contract_ID#7212 as bigint) = contract_id#5029L) &&
(record_yearmonth#9867 = record_yearmonth#5103)) && (Admin_System#7659 = admin_system#5098)).
Attribute(s) with the same name appear in the operation: record_yearmonth.
Please check if the right attribute(s) are used.;;
{code}
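
A minimal sketch of the usual workaround, assuming the duplicate attribute comes from re-joining a DataFrame derived from the same source (the column names below only mirror the error message and are otherwise made up):

{code:scala}
// Hypothetical workaround sketch: rename the columns on the derived/aggregated side so the
// analyzer sees fresh attributes instead of two attributes both named record_yearmonth.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("duplicate-attribute-workaround").getOrCreate()
import spark.implicits._

val base = Seq((1L, "A1", "202001"), (2L, "A2", "202001"))
  .toDF("contract_id", "admin_system", "record_yearmonth")

// Derived side: aggregate, then rename the join keys before joining back to the base.
val derived = base.groupBy("record_yearmonth", "admin_system")
  .agg(count(lit(1)).as("cnt"))
  .withColumnRenamed("record_yearmonth", "agg_record_yearmonth")
  .withColumnRenamed("admin_system", "agg_admin_system")

val joined = base.join(
  derived,
  base("record_yearmonth") === derived("agg_record_yearmonth") &&
    base("admin_system") === derived("agg_admin_system"),
  "left_outer")

joined.show()
{code}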


was (Author: kyellam):
Hi Team 

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, 
> TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at 

[jira] [Issue Comment Deleted] (SPARK-10925) Exception when joining DataFrames

2020-04-21 Thread Kushal Yellam (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kushal Yellam updated SPARK-10925:
--
Comment: was deleted

(was: Hi Team )

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, 
> TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 

[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2020-04-21 Thread Kushal Yellam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088715#comment-17088715
 ] 

Kushal Yellam commented on SPARK-10925:
---

Hi Team 

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, 
> TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 

[jira] [Commented] (SPARK-10925) Exception when joining DataFrames

2020-04-21 Thread Kushal Yellam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-10925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088714#comment-17088714
 ] 

Kushal Yellam commented on SPARK-10925:
---

Hi Team 

> Exception when joining DataFrames
> -
>
> Key: SPARK-10925
> URL: https://issues.apache.org/jira/browse/SPARK-10925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
> Environment: Tested with Spark 1.5.0 and Spark 1.5.1
>Reporter: Alexis Seigneurin
>Priority: Major
> Attachments: Photo 05-10-2015 14 31 16.jpg, TestCase.scala, 
> TestCase2.scala
>
>
> I get an exception when joining a DataFrame with another DataFrame. The 
> second DataFrame was created by performing an aggregation on the first 
> DataFrame.
> My complete workflow is:
> # read the DataFrame
> # apply an UDF on column "name"
> # apply an UDF on column "surname"
> # apply an UDF on column "birthDate"
> # aggregate on "name" and re-join with the DF
> # aggregate on "surname" and re-join with the DF
> If I remove one step, the process completes normally.
> Here is the exception:
> {code}
> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved 
> attribute(s) surname#20 missing from id#0,birthDate#3,name#10,surname#7 in 
> operator !Project [id#0,birthDate#3,name#10,surname#20,UDF(birthDate#3) AS 
> birthDate_cleaned#8];
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:154)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:102)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:102)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:49)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:44)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:132)
>   at 
> org.apache.spark.sql.DataFrame.org$apache$spark$sql$DataFrame$$logicalPlanToDataFrame(DataFrame.scala:154)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:553)
>   at org.apache.spark.sql.DataFrame.join(DataFrame.scala:520)
>   at TestCase2$.main(TestCase2.scala:51)
>   at TestCase2.main(TestCase2.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 

[jira] [Commented] (SPARK-31474) Consistancy between dayofweek/dow in extract expression and dayofweek function

2020-04-21 Thread Jason Darrell Lowe (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088704#comment-17088704
 ] 

Jason Darrell Lowe commented on SPARK-31474:


The branch-3.0 build appears to have been broken by this commit.
{noformat}
[INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @ 
spark-catalyst_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiler bridge file: 
/home/user/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.10__52.0-1.3.1_20191012T045515.jar
[INFO] Compiling 304 Scala sources and 98 Java sources to 
/home/user/src/spark/sql/catalyst/target/scala-2.12/classes ...
[ERROR] [Error] 
/home/user/src/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:140:
 not found: type Timestamp
[ERROR] one error found
{noformat}

The patch removes the {{Timestamp}} import from datetimeExpressions.scala but 
there are still references to that type within the file.  [~cloud_fan] was this 
cherry-picked to branch-3.0 without building it?
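
Purely as an illustration of the fix implied above (a guess at the shape of the change, not the actual patch):

{code:scala}
// Hypothetical sketch: restoring the removed import so the remaining references to
// Timestamp in datetimeExpressions.scala compile again.
import java.sql.Timestamp

// With the import back in place, a reference such as this resolves:
def toMillis(t: Timestamp): Long = t.getTime
{code}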

> Consistancy between dayofweek/dow in extract expression and dayofweek function
> --
>
> Key: SPARK-31474
> URL: https://issues.apache.org/jira/browse/SPARK-31474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> spark-sql> SELECT extract(dayofweek from '2009-07-26');
> 1
> spark-sql> SELECT extract(dow from '2009-07-26');
> 0
> spark-sql> SELECT extract(isodow from '2009-07-26');
> 7
> spark-sql> SELECT dayofweek('2009-07-26');
> 1
> spark-sql> SELECT weekday('2009-07-26');
> 6
> {code}
> Currently, there are 4 types of day-of-week range: 
> the function dayofweek(2.3.0) and extracting dayofweek(2.4.0) result as of 
> Sunday(1) to Saturday(7)
> extracting dow(3.0.0) results as of Sunday(0) to Saturday(6)
> extracting isodow (3.0.0)  results as of Monday(1) to Sunday(7)
> the function weekday(2.4.0) results as of Monday(0) to Sunday(6)
> Actually, extracting dayofweek and dow are both derived from PostgreSQL but 
> have different meanings.
> https://issues.apache.org/jira/browse/SPARK-23903
> https://issues.apache.org/jira/browse/SPARK-28623



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31509) Recommend user to disable delay scheduling for barrier taskset

2020-04-21 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-31509:
-
Description: 
Currently, a barrier taskset can only be scheduled when all of its tasks can be launched 
at the same time. As a result, delay scheduling becomes the biggest obstacle for a 
barrier taskset to get scheduled, and an application can easily fail with a barrier 
taskset when delay scheduling is on. So we should probably recommend that users disable 
delay scheduling while using barrier execution.

 

BTW, we could still support delay scheduling with barrier taskset if we find a 
way to resolve SPARK-24818

  was:
Currently, barrier taskset can only be scheduled when all tasks are launched at 
the same time. As a result, delay scheduling becomes the biggest obstacle for 
the barrier taskset  to get scheduled. And the application can be easily fail 
with barrier taskset when delay scheduling is on. So, probably, we should 
recommend user to disable delay scheduling while using barrier execution.

 

BTW, we could still support delay scheduling with barrier taskset if we could 
resolve SPARK-24818


> Recommend user to disable delay scheduling for barrier taskset
> --
>
> Key: SPARK-31509
> URL: https://issues.apache.org/jira/browse/SPARK-31509
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, a barrier taskset can only be scheduled when all of its tasks can be 
> launched at the same time. As a result, delay scheduling becomes the biggest obstacle 
> for a barrier taskset to get scheduled, and an application can easily fail with a 
> barrier taskset when delay scheduling is on. So we should probably recommend that 
> users disable delay scheduling while using barrier 
> execution.
>  
> BTW, we could still support delay scheduling with barrier taskset if we find 
> a way to resolve SPARK-24818
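
For readers looking for the concrete knob, a minimal sketch, assuming the delay-scheduling behaviour referred to here is the one governed by spark.locality.wait:

{code:scala}
// Hedged sketch: zeroing the locality wait effectively disables delay scheduling,
// so a barrier stage is not held back waiting for locality-preferred slots.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("barrier-job")
  .config("spark.locality.wait", "0s")  // assumption: this is the setting meant above
  .getOrCreate()
{code}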



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31509) Recommend user to disable delay scheduling for barrier taskset

2020-04-21 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-31509:
-
Description: 
Currently, barrier taskset can only be scheduled when all tasks are launched at 
the same time. As a result, delay scheduling becomes the biggest obstacle for 
the barrier taskset  to get scheduled. And the application can be easily fail 
with barrier taskset when delay scheduling is on. So, probably, we should 
recommend user to disable delay scheduling while using barrier execution.

 

BTW, we could still support delay scheduling with barrier taskset if we could 
resolve SPARK-24818

  was:
Currently, barrier taskset can only be scheduled when all tasks are launched at 
same time. As a result, delay scheduling becomes the biggest obstacle for the 
barrier taskset  to get scheduled. And the application can be easily fail with 
barrier taskset when delay scheduling is on. So, probably, we should recommend 
user to disable delay scheduling.

 

BTW, we could still support delay scheduling with barrier taskset if we could 
resolve SPARK-24818


> Recommend user to disable delay scheduling for barrier taskset
> --
>
> Key: SPARK-31509
> URL: https://issues.apache.org/jira/browse/SPARK-31509
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, barrier taskset can only be scheduled when all tasks are launched 
> at the same time. As a result, delay scheduling becomes the biggest obstacle 
> for the barrier taskset  to get scheduled. And the application can be easily 
> fail with barrier taskset when delay scheduling is on. So, probably, we 
> should recommend user to disable delay scheduling while using barrier 
> execution.
>  
> BTW, we could still support delay scheduling with barrier taskset if we could 
> resolve SPARK-24818



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31509) Recommend user to disable delay scheduling for barrier taskset

2020-04-21 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-31509:
-
Description: 
Currently, barrier taskset can only be scheduled when all tasks are launched at 
same time. As a result, delay scheduling becomes the biggest obstacle for the 
barrier taskset  to get scheduled. And the application can be easily fail with 
barrier taskset when delay scheduling is on. So, probably, we should recommend 
user to disable delay scheduling.

 

BTW, we could still support delay scheduling with barrier taskset if we could 
resolve SPARK-24818

  was:
Currently, barrier taskset can only be scheduled if all tasks are launched at 
same time. As a result, delay scheduling becomes the biggest obstacle for the 
barrier taskset  to get scheduled. And the application can be easily fail with 
barrier taskset when delay scheduling is on. So, probably, we should recommend 
user to disable delay scheduling.

 

BTW, we could still support delay scheduling with barrier taskset if we could 
resolve SPARK-24818


> Recommend user to disable delay scheduling for barrier taskset
> --
>
> Key: SPARK-31509
> URL: https://issues.apache.org/jira/browse/SPARK-31509
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, barrier taskset can only be scheduled when all tasks are launched 
> at same time. As a result, delay scheduling becomes the biggest obstacle for 
> the barrier taskset  to get scheduled. And the application can be easily fail 
> with barrier taskset when delay scheduling is on. So, probably, we should 
> recommend user to disable delay scheduling.
>  
> BTW, we could still support delay scheduling with barrier taskset if we could 
> resolve SPARK-24818



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31509) Recommend user to disable delay scheduling for barrier taskset

2020-04-21 Thread wuyi (Jira)
wuyi created SPARK-31509:


 Summary: Recommend user to disable delay scheduling for barrier 
taskset
 Key: SPARK-31509
 URL: https://issues.apache.org/jira/browse/SPARK-31509
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: wuyi


Currently, barrier taskset can only be scheduled if all tasks are launched at 
same time. As a result, delay scheduling becomes the biggest obstacle for the 
barrier taskset  to get scheduled. And the application can be easily fail with 
barrier taskset when delay scheduling is on. So, probably, we should recommend 
user to disable delay scheduling.

 

BTW, we could still support delay scheduling with barrier taskset if we could 
resolve SPARK-24818



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31504) Output fields in formatted Explain should have determined order.

2020-04-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31504.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28282
[https://github.com/apache/spark/pull/28282]

> Output fields in formatted Explain should have determined order.
> 
>
> Key: SPARK-31504
> URL: https://issues.apache.org/jira/browse/SPARK-31504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, formatted explain uses `producedAttributes`, which is an AttributeSet, 
> to generate the "Output" field of a leaf node. As a result, the same plan can produce 
> different formatted explain output from run to run.
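
An illustration of the underlying problem (plain Scala, not Spark internals): iterating a hash-based set gives no guaranteed order, so anything rendered from it has to be sorted first.

{code:scala}
// Not Spark code: just a sketch of why rendering a set directly is non-deterministic
// and how pinning an explicit order makes the rendered "Output" field stable.
val attrs: Set[String] = Set("c#3", "a#1", "b#2")

// Rendering the set directly depends on iteration order, which is not part of the contract:
val unstable = attrs.mkString("Output [", ", ", "]")

// Sorting before rendering gives a deterministic result:
val stable = attrs.toSeq.sorted.mkString("Output [", ", ", "]")
println(stable)   // always: Output [a#1, b#2, c#3]
{code}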



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31504) Output fields in formatted Explain should have determined order.

2020-04-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31504:
---

Assignee: wuyi

> Output fields in formatted Explain should have determined order.
> 
>
> Key: SPARK-31504
> URL: https://issues.apache.org/jira/browse/SPARK-31504
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> Currently, formatted explain uses `producedAttributes`, which is an AttributeSet, 
> to generate the "Output" field of a leaf node. As a result, the same plan can produce 
> different formatted explain output from run to run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31474) Consistancy between dayofweek/dow in extract expression and dayofweek function

2020-04-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31474.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28248
[https://github.com/apache/spark/pull/28248]

> Consistancy between dayofweek/dow in extract expression and dayofweek function
> --
>
> Key: SPARK-31474
> URL: https://issues.apache.org/jira/browse/SPARK-31474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:sql}
> spark-sql> SELECT extract(dayofweek from '2009-07-26');
> 1
> spark-sql> SELECT extract(dow from '2009-07-26');
> 0
> spark-sql> SELECT extract(isodow from '2009-07-26');
> 7
> spark-sql> SELECT dayofweek('2009-07-26');
> 1
> spark-sql> SELECT weekday('2009-07-26');
> 6
> {code}
> Currently, there are 4 types of day-of-week range: 
> the function dayofweek (2.3.0) and extracting dayofweek (2.4.0) return values from 
> Sunday(1) to Saturday(7)
> extracting dow (3.0.0) returns values from Sunday(0) to Saturday(6)
> extracting isodow (3.0.0) returns values from Monday(1) to Sunday(7)
> the function weekday (2.4.0) returns values from Monday(0) to Sunday(6)
> Actually, extracting dayofweek and dow are both derived from PostgreSQL but 
> have different meanings.
> https://issues.apache.org/jira/browse/SPARK-23903
> https://issues.apache.org/jira/browse/SPARK-28623
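
A small worked sketch of how the four conventions relate, using only arithmetic on the built-in dayofweek (Sunday=1 .. Saturday=7) as the starting point:

{code:scala}
// Hedged sketch: deriving the other day-of-week conventions from dayofweek.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("dow-conventions").getOrCreate()

val df = spark.sql("SELECT date'2009-07-26' AS d")   // a Sunday
df.select(
  dayofweek(col("d")).as("dayofweek_1_to_7"),                // 1 (Sunday=1 .. Saturday=7)
  (dayofweek(col("d")) - 1).as("dow_0_to_6"),                // 0 (Sunday=0 .. Saturday=6)
  (((dayofweek(col("d")) + 5) % 7) + 1).as("isodow_1_to_7"), // 7 (Monday=1 .. Sunday=7)
  ((dayofweek(col("d")) + 5) % 7).as("weekday_0_to_6")       // 6 (Monday=0 .. Sunday=6)
).show()
{code}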



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31474) Consistancy between dayofweek/dow in extract expression and dayofweek function

2020-04-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31474:
---

Assignee: Kent Yao

> Consistancy between dayofweek/dow in extract expression and dayofweek function
> --
>
> Key: SPARK-31474
> URL: https://issues.apache.org/jira/browse/SPARK-31474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> {code:sql}
> spark-sql> SELECT extract(dayofweek from '2009-07-26');
> 1
> spark-sql> SELECT extract(dow from '2009-07-26');
> 0
> spark-sql> SELECT extract(isodow from '2009-07-26');
> 7
> spark-sql> SELECT dayofweek('2009-07-26');
> 1
> spark-sql> SELECT weekday('2009-07-26');
> 6
> {code}
> Currently, there are 4 types of day-of-week range: 
> the function dayofweek(2.3.0) and extracting dayofweek(2.4.0) result as of 
> Sunday(1) to Saturday(7)
> extracting dow(3.0.0) results as of Sunday(0) to Saturday(6)
> extracting isodow (3.0.0)  results as of Monday(1) to Sunday(7)
> the function weekday(2.4.0) results as of Monday(0) to Sunday(6)
> Actually, extracting dayofweek and dow are both derived from PostgreSQL but 
> have different meanings.
> https://issues.apache.org/jira/browse/SPARK-23903
> https://issues.apache.org/jira/browse/SPARK-28623



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31329) Modify executor monitor to allow for moving shuffle blocks

2020-04-21 Thread Dale Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088569#comment-17088569
 ] 

Dale Richardson edited comment on SPARK-31329 at 4/21/20, 11:36 AM:


Hi Holden,

I've been thinking about this issue as well.

Blocks can't move, but they can be replicated.  Any reason we can't just 
replicate the blocks out, allow the existing code paths to update the block 
locations with the block manager master, then unregister the current blocks?

We should chat and coordinate efforts.


was (Author: tigerquoll):
Hi Holden,

I've been thinking about this issue as well.

Blocks can't move, but they can be replicated.  Any reason we can't just 
replicate the blocks out, allow the existing code paths to update the block 
locations with the block manager master, then unregister the current blocks?

> Modify executor monitor to allow for moving shuffle blocks
> --
>
> Key: SPARK-31329
> URL: https://issues.apache.org/jira/browse/SPARK-31329
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
>
> To enable Spark-20629 we need to revisit code that assumes shuffle blocks 
> don't move. Currently, the executor monitor assumes that shuffle blocks are 
> immovable. We should modify this code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31329) Modify executor monitor to allow for moving shuffle blocks

2020-04-21 Thread Dale Richardson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088569#comment-17088569
 ] 

Dale Richardson commented on SPARK-31329:
-

Hi Holden,

I've been thinking about this issue as well.

Blocks can't move, but they can be replicated.  Any reason we can't just 
replicate the blocks out, allow the existing code paths to update the block 
locations with the block manager master, then unregister the current blocks?

> Modify executor monitor to allow for moving shuffle blocks
> --
>
> Key: SPARK-31329
> URL: https://issues.apache.org/jira/browse/SPARK-31329
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
>
> To enable Spark-20629 we need to revisit code that assumes shuffle blocks 
> don't move. Currently, the executor monitor assumes that shuffle blocks are 
> immovable. We should modify this code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31508) string type compare with numberic cause data inaccurate

2020-04-21 Thread philipse (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

philipse updated SPARK-31508:
-
Summary: string type compare with numberic cause data inaccurate  (was: 
string type compare with numberic case data inaccurate)

> string type compare with numberic cause data inaccurate
> ---
>
> Key: SPARK-31508
> URL: https://issues.apache.org/jira/browse/SPARK-31508
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: hadoop2.7
> spark2.4.5
>Reporter: philipse
>Priority: Major
>
> Hi all,
>  
> Spark SQL should probably convert the values to double when a string column is 
> compared with a numeric type. The case is shown below.
> 1. Create a table:
> create table test1(id string);
> 2. Insert data into the table:
> insert into test1 select 'avc';
> insert into test1 select '2';
> insert into test1 select '0a';
> insert into test1 select '';
> insert into test1 select 
> '22';
> 3. Let's check what happens:
> select * from test_gf13871.test1 where id > 0
> The results are shown below:
> *2*
> **
> Surprisingly, the big number 222... cannot be selected, while when I check in Hive, 
> the 222... shows up normally.
> 4. Explaining the command shows what happened: if the value is bigger than 
> max_int_value, the row is not selected; we may need to convert to double instead.
> !image-2020-04-21-18-49-58-850.png!
> I want to know whether this has been fixed or is planned for 3.0 or a later version. 
> Please feel free to give any advice.
>  
> Many Thanks
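
A hedged sketch of the behaviour described above and the usual explicit-cast workaround, reusing the example table name (whether the implicit cast should widen to double is exactly what this report questions):

{code:scala}
// Sketch: when a string column is compared with an integer literal, values that do not
// fit the narrower numeric type can silently fall out of the result. Casting explicitly
// to a wide type (double or decimal) keeps them.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("string-vs-numeric").getOrCreate()

// Rows such as the very large number may be dropped here:
spark.sql("SELECT * FROM test1 WHERE id > 0").show()

// Workaround: make the comparison type explicit.
spark.sql("SELECT * FROM test1 WHERE CAST(id AS double) > 0").show()
{code}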



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31508) string type compare with numberic case data inaccurate

2020-04-21 Thread philipse (Jira)
philipse created SPARK-31508:


 Summary: string type compare with numberic case data inaccurate
 Key: SPARK-31508
 URL: https://issues.apache.org/jira/browse/SPARK-31508
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5
 Environment: hadoop2.7

spark2.4.5
Reporter: philipse


Hi all,

 

Spark SQL should probably convert the values to double when a string column is compared 
with a numeric type. The case is shown below.
1. Create a table:
create table test1(id string);
 
2. Insert data into the table:
insert into test1 select 'avc';
insert into test1 select '2';
insert into test1 select '0a';
insert into test1 select '';
insert into test1 select 
'22';
3. Let's check what happens:
select * from test_gf13871.test1 where id > 0
The results are shown below:
*2*
**
Surprisingly, the big number 222... cannot be selected, while when I check in Hive, the 
222... shows up normally.
4. Explaining the command shows what happened: if the value is bigger than 
max_int_value, the row is not selected; we may need to convert to double instead.
!image-2020-04-21-18-49-58-850.png!
I want to know whether this has been fixed or is planned for 3.0 or a later version. 
Please feel free to give any advice.
 
Many Thanks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-04-21 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088514#comment-17088514
 ] 

Gabor Somogyi commented on SPARK-28367:
---

I've taken a look at the possibilities given by the new API in 
[KIP-396|https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=97551484].
 I've found the following problems:
* Consumer properties don't match 100% with AdminClient properties, so the Consumer 
properties can't be used for instantiation (at first glance I think adding this to the 
API would be overkill).
* With the new API, by using AdminClient Spark loses the ability to use the assign, 
subscribe and subscribePattern APIs (implementing this logic would be feasible, since 
the Kafka consumer does it on the client side as well, but it would be 
ugly).

My main conclusion is that adding AdminClient and using the Consumer in parallel 
would be super hacky. I would use either the Consumer (which at the moment doesn't 
provide a metadata-only call) or the AdminClient (where it must be checked whether 
all existing features can be covered, plus how to pass properties).


> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> Spark uses an old and deprecated API named poll(long) which never returns and 
> stays in live lock if metadata is not updated (for instance when broker 
> disappears at consumer creation).
> I've created a small standalone application to test it and the alternatives: 
> https://github.com/gaborgsomogyi/kafka-get-assignment
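
For reference, a standalone sketch (not the connector code) showing the non-deprecated consumer call, which takes a bounded timeout instead of waiting on metadata forever:

{code:scala}
// Hedged sketch: KafkaConsumer.poll(Duration) returns after the timeout even when the
// broker is unreachable at consumer creation, unlike the deprecated poll(long).
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")  // placeholder address
props.put("group.id", "metadata-probe")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("some-topic"))
val records = consumer.poll(Duration.ofSeconds(10))  // bounded wait
println(s"fetched ${records.count()} records")
consumer.close()
{code}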



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31361) Rebase datetime in parquet/avro according to file metadata

2020-04-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31361:

Summary: Rebase datetime in parquet/avro according to file metadata  (was: 
Rebase datetime in parquet/avro according to file written Spark version)

> Rebase datetime in parquet/avro according to file metadata
> --
>
> Key: SPARK-31361
> URL: https://issues.apache.org/jira/browse/SPARK-31361
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31507) Remove millennium, century, decade, millisecond, microsecond and epoch from extract fucntion

2020-04-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-31507:
-
Summary: Remove millennium, century, decade, millisecond, microsecond and 
epoch from extract fucntion  (was: Remove millennium, century, decade, and 
epoch from extract fucntion)

> Remove millennium, century, decade, millisecond, microsecond and epoch from 
> extract fucntion
> 
>
> Key: SPARK-31507
> URL: https://issues.apache.org/jira/browse/SPARK-31507
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> Extracting millennium, century, decade, millisecond, microsecond and epoch 
> from datetime is neither ANSI standard nor very common in modern SQL 
> platforms. None of the systems listed below support these except PostgreSQL.
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
> https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm
> https://prestodb.io/docs/current/functions/datetime.html
> https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html
> https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts
> https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31507) Remove millennium, century, decade, and epoch from extract fucntion

2020-04-21 Thread Kent Yao (Jira)
Kent Yao created SPARK-31507:


 Summary: Remove millennium, century, decade, and epoch from 
extract fucntion
 Key: SPARK-31507
 URL: https://issues.apache.org/jira/browse/SPARK-31507
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


Extracting millennium, century, decade, millisecond, microsecond and epoch from 
datetime is neither ANSI standard nor very common in modern SQL platforms. 
None of the systems listed below support these except PostgreSQL.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm

https://prestodb.io/docs/current/functions/datetime.html

https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html

https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts

https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
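
For users who rely on the removed fields, rough equivalents can be spelled out with existing functions; a hedged sketch (the exact PostgreSQL semantics, e.g. for century/decade boundaries, are only approximated here):

{code:scala}
// Hedged sketch of approximate replacements built from existing functions:
// epoch via unix_timestamp, century/decade via year arithmetic, sub-second parts via date_format.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("extract-replacements").getOrCreate()

spark.sql("""
  SELECT unix_timestamp(ts)     AS epoch_seconds,
         ceil(year(ts) / 100.0) AS century,
         floor(year(ts) / 10)   AS decade,
         date_format(ts, 'SSS') AS millisecond_part
  FROM VALUES (timestamp'2009-07-26 10:30:00') AS t(ts)
""").show()
{code}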




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31506) Sparse BLAS native optimization

2020-04-21 Thread Xiaochang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochang Wu updated SPARK-31506:
-
Description: 
BLAS routines are highly optimized by the industry. If we take advantage of them, we 
can leverage hardware vendors' low-level optimizations when available. 

[Sparse BLAS|https://math.nist.gov/spblas] also has [native 
optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines],
 but the current BLAS in Spark only has native optimization for dense operations and a 
naive Java implementation for sparse ones.

We would like to introduce native optimization support for the Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 

  was:
BLAS are highly optimized routines by the industry. If take advantage of it, we 
can leverage hardware vendors' low-level optimization when available. 

[Sparse BLAS|https://math.nist.gov/spblas] also has [native 
optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]
 , but current BLAS in Spark only has native optimization for dense and naive 
Java implementation for sparse.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 


> Sparse BLAS native optimization
> ---
>
> Key: SPARK-31506
> URL: https://issues.apache.org/jira/browse/SPARK-31506
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Xiaochang Wu
>Priority: Major
>
> BLAS routines are highly optimized by the industry. If we take advantage of them, 
> we can leverage hardware vendors' low-level optimizations when available. 
> [Sparse BLAS|https://math.nist.gov/spblas] also has [native 
> optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]
> , but the current BLAS in Spark only has native optimization for dense operations and 
> a naive Java implementation for sparse ones.
> We would like to introduce native optimization support for the Sparse BLAS 
> operations related to machine learning and produce benchmarks. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31506) Sparse BLAS native optimization

2020-04-21 Thread Xiaochang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochang Wu updated SPARK-31506:
-
Description: 
BLAS are highly optimized routines by the industry. If take advantage of it, we 
can leverage hardware vendors' low-level optimization when available. 

[Sparse BLAS|https://math.nist.gov/spblas] also has [native 
optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]
 , but current BLAS in Spark only has native optimization for dense and naive 
Java implementation for sparse.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 

  was:
BLAS are highly optimized routines by the industry. If take advantage of it, we 
can leverage hardware vendors' low-level optimization when available. 

[Sparse BLAS|https://math.nist.gov/spblas]also has [native 
optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]
 , but current BLAS in Spark only has native optimization for dense and naive 
Java implementation for sparse.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 


> Sparse BLAS native optimization
> ---
>
> Key: SPARK-31506
> URL: https://issues.apache.org/jira/browse/SPARK-31506
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Xiaochang Wu
>Priority: Major
>
> BLAS are highly optimized routines by the industry. If take advantage of it, 
> we can leverage hardware vendors' low-level optimization when available. 
> [Sparse BLAS|https://math.nist.gov/spblas] also has [native 
> optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]
>  , but current BLAS in Spark only has native optimization for dense and naive 
> Java implementation for sparse.
> We would like to introduce native optimization support for Sparse BLAS 
> operations related to machine learning and produce benchmarks. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31506) Sparse BLAS native optimization

2020-04-21 Thread Xiaochang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochang Wu updated SPARK-31506:
-
Description: 
BLAS are highly optimized routines by the industry. If take advantage of it, we 
can leverage hardware vendors' low-level optimization when available. 

[Sparse BLAS|https://math.nist.gov/spblas]also has [native 
optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]
 , but current BLAS in Spark only has native optimization for dense and naive 
Java implementation for sparse.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 

  was:
BLAS are highly optimized routines by the industry. If take advantage of it, we 
can leverage hardware vendors' low-level optimization when available. 

[Sparse BLAS|[https://math.nist.gov/spblas/]]

[link title|http://example.com]also has [native 
optimization|[https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]]
 , but current BLAS in Spark only has native optimization for dense and naive 
Java implementation for sparse.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 


> Sparse BLAS native optimization
> ---
>
> Key: SPARK-31506
> URL: https://issues.apache.org/jira/browse/SPARK-31506
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Xiaochang Wu
>Priority: Major
>
> BLAS are highly optimized routines by the industry. If take advantage of it, 
> we can leverage hardware vendors' low-level optimization when available. 
> [Sparse BLAS|https://math.nist.gov/spblas]also has [native 
> optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]
>  , but current BLAS in Spark only has native optimization for dense and naive 
> Java implementation for sparse.
> We would like to introduce native optimization support for Sparse BLAS 
> operations related to machine learning and produce benchmarks. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31506) Sparse BLAS native optimization

2020-04-21 Thread Xiaochang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochang Wu updated SPARK-31506:
-
Description: 
BLAS are highly optimized routines by the industry. If take advantage of it, we 
can leverage hardware vendors' low-level optimization when available. 

[Sparse BLAS|[https://math.nist.gov/spblas/]] also has [native 
optimization|[https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]]
 , but current BLAS in Spark only has native optimization for dense and naive 
Java implementation for sparse.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 

  was:
BLAS are highly optimized routines by the industry. If take advantage of it, we 
can leverage hardware vendors' low-level optimization when available. 

[Sparse BLAS|[https://math.nist.gov/spblas/]] also has [native 
optimization|[https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]],
 but current BLAS in Spark only has native optimization for dense and naive 
Java implementation for sparse.

 

[link title|http://example.com]

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 


> Sparse BLAS native optimization
> ---
>
> Key: SPARK-31506
> URL: https://issues.apache.org/jira/browse/SPARK-31506
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Xiaochang Wu
>Priority: Major
>
> BLAS routines are highly optimized by the industry. If we take advantage of 
> them, we can leverage hardware vendors' low-level optimizations when available. 
> [Sparse BLAS|https://math.nist.gov/spblas/] also has [native 
> optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines],
> but the current BLAS in Spark only has native optimization for dense and a 
> naive Java implementation for sparse.
> We would like to introduce native optimization support for Sparse BLAS 
> operations related to machine learning and produce benchmarks. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31506) Sparse BLAS native optimization

2020-04-21 Thread Xiaochang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochang Wu updated SPARK-31506:
-
Description: 
BLAS routines are highly optimized by the industry. If we take advantage of 
them, we can leverage hardware vendors' low-level optimizations when available. 

[Sparse BLAS|https://math.nist.gov/spblas/] also has [native 
optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines],
but the current BLAS in Spark only has native optimization for dense and a naive 
Java implementation for sparse.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 

  was:
BLAS are highly optimized routines by the industry. If take advantage of it, we 
can leverage hardware vendors' low-level optimization when available. 

[Sparse BLAS|[https://math.nist.gov/spblas/]] also has [native 
optimization|[https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]]
 , but current BLAS in Spark only has native optimization for dense and naive 
Java implementation for sparse.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 


> Sparse BLAS native optimization
> ---
>
> Key: SPARK-31506
> URL: https://issues.apache.org/jira/browse/SPARK-31506
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Xiaochang Wu
>Priority: Major
>
> BLAS routines are highly optimized by the industry. If we take advantage of 
> them, we can leverage hardware vendors' low-level optimizations when available. 
> [Sparse BLAS|https://math.nist.gov/spblas/] also has [native 
> optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines],
> but the current BLAS in Spark only has native optimization for dense and a 
> naive Java implementation for sparse.
> We would like to introduce native optimization support for Sparse BLAS 
> operations related to machine learning and produce benchmarks. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31506) Sparse BLAS native optimization

2020-04-21 Thread Xiaochang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochang Wu updated SPARK-31506:
-
Description: 
BLAS routines are highly optimized by the industry. If we take advantage of 
them, we can leverage hardware vendors' low-level optimizations when available. 

[Sparse BLAS|https://math.nist.gov/spblas/] also has [native 
optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines],
but the current BLAS in Spark only has native optimization for dense and a naive 
Java implementation for sparse.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 

  was:
BLAS are highly optimized routines by the industry. If take advantage of it, we 
can leverage hardware vendors' low-level optimization when available. 

[Sparse BLAS|[https://math.nist.gov/spblas/]] also has [native 
optimization|[https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]],
 but current BLAS in Spark only has native optimization for dense and naive 
Java implementation for sparse.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 


> Sparse BLAS native optimization
> ---
>
> Key: SPARK-31506
> URL: https://issues.apache.org/jira/browse/SPARK-31506
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Xiaochang Wu
>Priority: Major
>
> BLAS routines are highly optimized by the industry. If we take advantage of 
> them, we can leverage hardware vendors' low-level optimizations when available. 
> [Sparse BLAS|https://math.nist.gov/spblas/] also has [native 
> optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines],
> but the current BLAS in Spark only has native optimization for dense and a 
> naive Java implementation for sparse.
> We would like to introduce native optimization support for Sparse BLAS 
> operations related to machine learning and produce benchmarks. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31506) Sparse BLAS native optimization

2020-04-21 Thread Xiaochang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiaochang Wu updated SPARK-31506:
-
Description: 
BLAS are highly optimized routines by the industry. If take advantage of it, we 
can leverage hardware vendors' low-level optimization when available. 

[Sparse BLAS|[https://math.nist.gov/spblas/]] also has [native 
optimization|[https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]],
 but current BLAS in Spark only has native optimization for dense and naive 
Java implementation for sparse.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 

  was:
BLAS routines are highly optimized routine by the industry. If take advantage 
of it, we can leverage hardware vendors' low-level optimization when available. 

[Sparse BLAS|[https://math.nist.gov/spblas/]] also has [native 
optimization|[https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]],
 but current BLAS in Spark only has naive Java implementation for it.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 


> Sparse BLAS native optimization
> ---
>
> Key: SPARK-31506
> URL: https://issues.apache.org/jira/browse/SPARK-31506
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Xiaochang Wu
>Priority: Major
>
> BLAS are highly optimized routines by the industry. If take advantage of it, 
> we can leverage hardware vendors' low-level optimization when available. 
> [Sparse BLAS|[https://math.nist.gov/spblas/]] also has [native 
> optimization|[https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines]],
>  but current BLAS in Spark only has native optimization for dense and naive 
> Java implementation for sparse.
> We would like to introduce native optimization support for Sparse BLAS 
> operations related to machine learning and produce benchmarks. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31506) Sparse BLAS native optimization

2020-04-21 Thread Xiaochang Wu (Jira)
Xiaochang Wu created SPARK-31506:


 Summary: Sparse BLAS native optimization
 Key: SPARK-31506
 URL: https://issues.apache.org/jira/browse/SPARK-31506
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.1.0
Reporter: Xiaochang Wu


BLAS routines are highly optimized by the industry. If we take advantage of 
them, we can leverage hardware vendors' low-level optimizations when available. 

[Sparse BLAS|https://math.nist.gov/spblas/] also has [native 
optimization|https://software.intel.com/en-us/mkl-developer-reference-c-sparse-blas-level-2-and-level-3-routines],
but the current BLAS in Spark only has a naive Java implementation for it.

We would like to introduce native optimization support for Sparse BLAS 
operations related to machine learning and produce benchmarks. 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31504) Output fields in formatted Explain should have determined order.

2020-04-21 Thread wuyi (Jira)
wuyi created SPARK-31504:


 Summary: Output fields in formatted Explain should have determined 
order.
 Key: SPARK-31504
 URL: https://issues.apache.org/jira/browse/SPARK-31504
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: wuyi


Currently, formatted explain uses `producedAttributes`, which is an AttributeSet, 
to generate the "Output" fields of a leaf node. As a result, the formatted 
explain output can differ across runs for the same plan.
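
A minimal sketch of the underlying problem, using plain Scala collections as a 
stand-in for Spark's internal AttributeSet (the `Attr` case class and its fields 
are made up for illustration): rendering the output fields from an unordered set 
exposes hash-iteration order, while rendering from a sequence, or sorting by a 
stable key, is deterministic.

{code:scala}
// Illustrative only: an unordered set gives no stable rendering order.
object ExplainOutputOrder {
  final case class Attr(name: String, exprId: Long)

  def main(args: Array[String]): Unit = {
    val attrs = Seq(Attr("id", 1L), Attr("name", 2L), Attr("value", 3L))

    // Set-based rendering: iteration order is an implementation detail of the hash set.
    val fromSet = scala.collection.immutable.HashSet(attrs: _*)
      .map(_.name).mkString("Output [", ", ", "]")

    // Deterministic rendering: keep the declared order (or sort by a stable key).
    val fromSeq = attrs.map(_.name).mkString("Output [", ", ", "]")

    println(fromSet)  // names may appear in any order
    println(fromSeq)  // always: Output [id, name, value]
  }
}
{code}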



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31505) Single precision floating point support in machine learning

2020-04-21 Thread Xiaochang Wu (Jira)
Xiaochang Wu created SPARK-31505:


 Summary: Single precision floating point support in machine 
learning
 Key: SPARK-31505
 URL: https://issues.apache.org/jira/browse/SPARK-31505
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.1.0
Reporter: Xiaochang Wu


MLlib assumes the algorithm data type is "double" in all its BLAS and algorithm 
code. In some situations there is no need for such high precision; end users may 
want to trade precision for performance, given that machines have much more 
single-precision FP bandwidth than double-precision bandwidth.

It may require templating all FP data types and rewriting all existing code. We 
would like to quickly prototype some key algorithms and produce benchmarks using 
float instead of double. Feel free to comment on this if there is a need!
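
As a rough illustration of the templating idea (a sketch assuming a 
type-class-based design, not MLlib's actual code), the same kernel can be written 
once over a generic floating-point type and instantiated for both Float and 
Double; using Float halves the bytes moved per element, which is the bandwidth 
advantage referred to above.

{code:scala}
import scala.math.Numeric

// Illustrative only: one dot-product kernel, usable with Float or Double.
object GenericDot {
  def dot[@specialized(Float, Double) T](xs: Array[T], ys: Array[T])
                                        (implicit num: Numeric[T]): T = {
    require(xs.length == ys.length, "vectors must have the same length")
    var acc = num.zero
    var i = 0
    while (i < xs.length) {
      acc = num.plus(acc, num.times(xs(i), ys(i)))
      i += 1
    }
    acc
  }

  def main(args: Array[String]): Unit = {
    println(dot(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0)))       // 32.0 (Double path)
    println(dot(Array(1.0f, 2.0f, 3.0f), Array(4.0f, 5.0f, 6.0f))) // 32.0 (Float path)
  }
}
{code}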

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31503) fix the SQL string of the TRIM functions

2020-04-21 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31503:
---

 Summary: fix the SQL string of the TRIM functions
 Key: SPARK-31503
 URL: https://issues.apache.org/jira/browse/SPARK-31503
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29274) Should not coerce decimal type to double type when it's join column

2020-04-21 Thread jinwensc (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088337#comment-17088337
 ] 

jinwensc edited comment on SPARK-29274 at 4/21/20, 6:29 AM:


When a bigdecimal is compared to a string, the string should be promoted to 
bigdecimal.

This bug is fixed by [https://github.com/apache/spark/pull/28278]


was (Author: fasheng):
bigdecimal compare to string, string should be promote to bigdecimal.

> Should not coerce decimal type to double type when it's join column
> ---
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Pengfei Chang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double))) ASC NULLS FIRST], false, 0
> :  +- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
>  as double))), 200), true, [id=#104]
> : +- *(1) Filter isnotnull(incdata_id#31)
> :+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double))) ASC NULLS FIRST], false, 0
>+- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
>  as double))), 200), true, [id=#112]
>   +- *(3) Filter isnotnull(incdata_id#33)
>  +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
> {code}
> {code:sql}
> select cast(v1 as double) as v3, cast(v2 as double) as v4,
>   cast(v1 as double) = cast(v2 as double), v1 = v2 
> from (select cast('1001636981212' as decimal(21, 0)) as v1,
>   cast('1001636981213' as decimal(21, 0)) as v2) t;
> 1.00163697E20   1.00163697E20   true   false
> {code}
>  
> It's a real case in our production:
> !image-2019-09-27-20-20-24-238.png|width=100%!
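
A minimal sketch of why the coercion is unsafe, in plain Scala with hypothetical 
20-digit values standing in for the production keys: at this magnitude (well 
above 2^53) neighboring integers round to the same double, so distinct decimal 
keys compare equal after the cast, which is exactly the wrong join match shown 
above.

{code:scala}
// Illustrative only: the two decimals differ, but both map to the same double.
object DecimalToDoubleLoss {
  def main(args: Array[String]): Unit = {
    val v1 = BigDecimal("10016369812120000001")  // hypothetical decimal(21,0) key
    val v2 = BigDecimal("10016369812120000002")

    println(v1 == v2)                    // false: the decimals differ
    println(v1.toDouble == v2.toDouble)  // true: both round to the same double
  }
}
{code}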



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29274) Should not coerce decimal type to double type when it's join column

2020-04-21 Thread jinwensc (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088337#comment-17088337
 ] 

jinwensc commented on SPARK-29274:
--

When a bigdecimal is compared to a string, the string should be promoted to 
bigdecimal.

> Should not coerce decimal type to double type when it's join column
> ---
>
> Key: SPARK-29274
> URL: https://issues.apache.org/jira/browse/SPARK-29274
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Assignee: Pengfei Chang
>Priority: Major
> Attachments: image-2019-09-27-20-20-24-238.png
>
>
> How to reproduce this issue:
> {code:sql}
> create table t1 (incdata_id decimal(21,0), v string) using parquet;
> create table t2 (incdata_id string, v string) using parquet;
> explain select * from t1 join t2 on (t1.incdata_id = t2.incdata_id);
> == Physical Plan ==
> *(5) SortMergeJoin 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double)))], 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double)))], Inner
> :- *(2) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31 as 
> double))) ASC NULLS FIRST], false, 0
> :  +- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#31
>  as double))), 200), true, [id=#104]
> : +- *(1) Filter isnotnull(incdata_id#31)
> :+- Scan hive default.t1 [incdata_id#31, v#32], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#31, v#32], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort 
> [knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33 as 
> double))) ASC NULLS FIRST], false, 0
>+- Exchange 
> hashpartitioning(knownfloatingpointnormalized(normalizenanandzero(cast(incdata_id#33
>  as double))), 200), true, [id=#112]
>   +- *(3) Filter isnotnull(incdata_id#33)
>  +- Scan hive default.t2 [incdata_id#33, v#34], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> [incdata_id#33, v#34], Statistics(sizeInBytes=8.0 EiB)
> {code}
> {code:sql}
> select cast(v1 as double) as v3, cast(v2 as double) as v4,
>   cast(v1 as double) = cast(v2 as double), v1 = v2 
> from (select cast('1001636981212' as decimal(21, 0)) as v1,
>   cast('1001636981213' as decimal(21, 0)) as v2) t;
> 1.00163697E20   1.00163697E20   true   false
> {code}
>  
> It's a real case in our production:
> !image-2019-09-27-20-20-24-238.png|width=100%!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org