[jira] [Commented] (SPARK-29235) CrossValidatorModel.avgMetrics disappears after model is written/read again

2019-09-24 Thread shahid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937427#comment-16937427
 ] 

shahid commented on SPARK-29235:


I would like to analyze the issue.

> CrossValidatorModel.avgMetrics disappears after model is written/read again
> ---
>
> Key: SPARK-29235
> URL: https://issues.apache.org/jira/browse/SPARK-29235
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.1, 2.4.3
> Environment: Databricks cluster:
> {
>     "num_workers": 4,
>     "cluster_name": "mabedfor-test-classfix",
>     "spark_version": "5.3.x-cpu-ml-scala2.11",
>     "spark_conf": {
>     "spark.databricks.delta.preview.enabled": "true"
>     },
>     "node_type_id": "Standard_DS12_v2",
>     "driver_node_type_id": "Standard_DS12_v2",
>     "ssh_public_keys": [],
>     "custom_tags": {},
>     "spark_env_vars": {
>     "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
>     },
>     "autotermination_minutes": 120,
>     "enable_elastic_disk": true,
>     "cluster_source": "UI",
>     "init_scripts": [],
>     "cluster_id": "0722-165622-calls746"
> }
>Reporter: Matthew Bedford
>Priority: Minor
>
>  
>  Right after a CrossValidatorModel is trained, it has avgMetrics.  After the 
> model is written to disk and read later, it no longer has avgMetrics.  To 
> reproduce:
> {{from pyspark.ml.tuning import CrossValidator, CrossValidatorModel}}
> {{cv = CrossValidator(...) #fill with params}}
> {{cvModel = cv.fit(trainDF) #given dataframe with training data}}
> {{print(cvModel.avgMetrics) #prints a nonempty list as expected}}
> {{cvModel.write().save("/tmp/model")}}
> {{cvModel2 = CrossValidatorModel.read().load("/tmp/model")}}
> {{print(cvModel2.avgMetrics) #BUG - prints an empty list}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-29172) Fix some exception issue of explain commands

2019-09-24 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta reopened SPARK-29172:


This was accidentally closed because the issue number included in the title of 
PR #25848 was wrong. What should have been closed is SPARK-29178.

So I'll reopen this.

> Fix some exception issue of explain commands
> 
>
> Key: SPARK-29172
> URL: https://issues.apache.org/jira/browse/SPARK-29172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
> Attachments: cost.png, extemded.png
>
>
> The behavior of running commands during exception handling differs depending 
> on the explain command used.
> See the attachments. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29226) Upgrade jackson-databind to 2.9.10 and fix vulnerabilities.

2019-09-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29226:
-

Assignee: jiaan.geng

> Upgrade jackson-databind to 2.9.10 and fix vulnerabilities.
> ---
>
> Key: SPARK-29226
> URL: https://issues.apache.org/jira/browse/SPARK-29226
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> The current code uses com.fasterxml.jackson.core:jackson-databind:jar:2.9.9.3, 
> which has known security vulnerabilities. Details are available at 
> https://www.tenable.com/cve/CVE-2019-16335
> This reference recommends upgrading `jackson-databind` to 2.9.10 or later.
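For downstream projects that pull in jackson-databind transitively through Spark, one interim mitigation is to pin the patched version in their own build. A build.sbt sketch (the sbt setup is an assumption about the consumer's build, not part of the Spark change itself):

{code:scala}
// Force the patched jackson-databind until the Spark dependency upgrade lands.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.10"
{code}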



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29226) Upgrade jackson-databind to 2.9.10 and fix vulnerabilities.

2019-09-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29226.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25912
[https://github.com/apache/spark/pull/25912]

> Upgrade jackson-databind to 2.9.10 and fix vulnerabilities.
> ---
>
> Key: SPARK-29226
> URL: https://issues.apache.org/jira/browse/SPARK-29226
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.0.0
>
>
> The current code uses com.fasterxml.jackson.core:jackson-databind:jar:2.9.9.3, 
> which has known security vulnerabilities. Details are available at 
> https://www.tenable.com/cve/CVE-2019-16335
> This reference recommends upgrading `jackson-databind` to 2.9.10 or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937408#comment-16937408
 ] 

Suchintak Patnaik commented on SPARK-29234:
---

Is it possible to backport the PRs to version 2.3 as well?

> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, its input and output formats are 
> displayed as the SequenceFile format, but the files are physically created in 
> HDFS in the format specified by the user, e.g. orc, parquet, etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive gives an error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937396#comment-16937396
 ] 

Dongjoon Hyun commented on SPARK-29234:
---

I'm fine with that. But we need to backport two PRs, right?
In that case, please leave a comment on those two PRs to ping the relevant 
people.

> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, its input and output formats are 
> displayed as the SequenceFile format, but the files are physically created in 
> HDFS in the format specified by the user, e.g. orc, parquet, etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive gives an error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937392#comment-16937392
 ] 

Yuming Wang commented on SPARK-29234:
-

It's 3.0. [~dongjoon] Do you think we need to backport SPARK-27592 to 
branch-2.4?

> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, its input and output formats are 
> displayed as the SequenceFile format, but the files are physically created in 
> HDFS in the format specified by the user, e.g. orc, parquet, etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive gives an error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28625) Shuffle Writer API: Indeterminate shuffle support in ShuffleMapOutputWriter

2019-09-24 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li resolved SPARK-28625.
-
Resolution: Resolved

Done in SPARK-25341, which changes the writer creation API to use the map task 
attempt id as the map id.

> Shuffle Writer API: Indeterminate shuffle support in ShuffleMapOutputWriter
> ---
>
> Key: SPARK-28625
> URL: https://issues.apache.org/jira/browse/SPARK-28625
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Related to SPARK-25341.
> In order to support indeterminate stage rerun, we need corresponding support 
> in the shuffle writer API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937378#comment-16937378
 ] 

Suchintak Patnaik commented on SPARK-29234:
---

[~yumwang], in which Spark version is this fixed? Has it also been fixed in 
older versions?

> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, its input and output formats are 
> displayed as the SequenceFile format, but the files are physically created in 
> HDFS in the format specified by the user, e.g. orc, parquet, etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive gives an error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29233) Add regex expression checks for executorEnv in K8S mode

2019-09-24 Thread merrily01 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

merrily01 updated SPARK-29233:
--
Description: 
In K8s mode, the keys of pod environment variables must satisfy naming rules that 
are checked with regular expressions, such as:
 * In k8s v1.10.5, pod env names must match the regular expression 
[-._a-zA-Z][-._a-zA-Z0-9]*
 * In k8s v1.6, pod env names must match the regular expression 
[A-Za-z_][A-Za-z0-9_]*

Therefore, in Spark on K8s mode, Spark should apply the corresponding regex checks 
when building the executor env, using strict validation to filter out illegal keys 
that do not satisfy the naming rules k8s enforces when creating a pod. Otherwise, 
the pod cannot be created and the tasks hang.

To solve the problem above, a regex validation of executorEnv is added and 
committed.

  was:
In K8s mode, the keys of pod environment variables must satisfy naming rules that 
are checked with regular expressions, such as:
 * In k8s v1.10.5, pod env names must match the regular expression 
[-._a-zA-Z][-._a-zA-Z0-9]*
 * In k8s v1.6, pod env names must match the regular expression 
[A-Za-z_][A-Za-z0-9_]*

Therefore, in Spark on K8s mode, Spark should apply the corresponding regex checks 
when building the executor env, using strict validation to filter out illegal keys 
that do not satisfy the naming rules k8s enforces when creating a pod. Otherwise, 
the pod cannot be created and the tasks hang.


> Add regex expression checks for executorEnv in K8S mode
> ---
>
> Key: SPARK-29233
> URL: https://issues.apache.org/jira/browse/SPARK-29233
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: merrily01
>Priority: Blocker
> Attachments: error.jpeg, podEnv.jpeg
>
>
> In K8s mode, the keys of pod environment variables must satisfy naming rules 
> that are checked with regular expressions, such as:
>  * In k8s v1.10.5, pod env names must match the regular expression 
> [-._a-zA-Z][-._a-zA-Z0-9]*
>  * In k8s v1.6, pod env names must match the regular expression 
> [A-Za-z_][A-Za-z0-9_]*
> Therefore, in Spark on K8s mode, Spark should apply the corresponding regex 
> checks when building the executor env, using strict validation to filter out 
> illegal keys that do not satisfy the naming rules k8s enforces when creating a 
> pod. Otherwise, the pod cannot be created and the tasks hang.
>  
> To solve the problem above, a regex validation of executorEnv is added and 
> committed.
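To make the intended check concrete, here is a minimal standalone sketch of such a filter (illustrative only, not the committed Spark patch; the pattern is the k8s v1.10.5 rule quoted above, and the helper name is made up):

{code:scala}
// Drop executor env entries whose keys k8s would reject when creating the pod.
val envVarNamePattern = "[-._a-zA-Z][-._a-zA-Z0-9]*".r

def filterIllegalEnvKeys(env: Map[String, String]): Map[String, String] =
  env.filter { case (key, _) =>
    val ok = envVarNamePattern.pattern.matcher(key).matches()
    if (!ok) println(s"Ignoring executor env key '$key': not a valid pod env name")
    ok
  }

// Example: the key with a space and '!' would otherwise make pod creation fail.
println(filterIllegalEnvKeys(Map("SPARK_LOCAL_DIRS" -> "/tmp", "bad key!" -> "x")))
{code}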



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29236) Access 'executorDataMap' out of 'DriverEndpoint' should be protected by lock

2019-09-24 Thread Xianyang Liu (Jira)
Xianyang Liu created SPARK-29236:


 Summary: Access 'executorDataMap' out of 'DriverEndpoint' should 
be protected by lock
 Key: SPARK-29236
 URL: https://issues.apache.org/jira/browse/SPARK-29236
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Xianyang Liu


Just as the comments say:

// Accessing `executorDataMap` in `DriverEndpoint.receive/receiveAndReply` doesn't need any
// protection. But accessing `executorDataMap` out of `DriverEndpoint.receive/receiveAndReply`
// must be protected by `CoarseGrainedSchedulerBackend.this`. Besides, `executorDataMap` should
// only be modified in `DriverEndpoint.receive/receiveAndReply` with protection by
// `CoarseGrainedSchedulerBackend.this`.

`executorDataMap` is not thread-safe, so it should be protected by a lock when 
accessed outside `DriverEndpoint`.
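As a self-contained illustration of that locking rule (not the actual CoarseGrainedSchedulerBackend code; the class and method names below are made up):

{code:scala}
import scala.collection.mutable

class BackendSketch {
  // Stand-in for executorDataMap: mutated only by the endpoint's message loop.
  private val executorDataMap = new mutable.HashMap[String, Int]() // executorId -> cores

  // Writes happen on the endpoint thread, but are still done under the backend
  // lock so that readers on other threads see a consistent view.
  def registerExecutor(execId: String, cores: Int): Unit = synchronized {
    executorDataMap(execId) = cores
  }

  // Called "out of DriverEndpoint", i.e. from arbitrary threads: must take the
  // same lock, otherwise it races with the registration above.
  def totalCores(): Int = synchronized {
    executorDataMap.values.sum
  }
}
{code}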



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-28915) The new keyword is not used when instantiating the WorkerOffer.

2019-09-24 Thread Jiaqi Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaqi Li closed SPARK-28915.


> The new keyword is not used when instantiating the WorkerOffer.
> ---
>
> Key: SPARK-28915
> URL: https://issues.apache.org/jira/browse/SPARK-28915
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Jiaqi Li
>Priority: Minor
>
> WorkerOffer is a case class, so we don't need to use new when instantiating 
> it, which makes code more concise.
> {code:java}
> private[spark]
> case class WorkerOffer(
>     executorId: String,
>     host: String,
>     cores: Int,
>     address: Option[String] = None,
>     resources: Map[String, Buffer[String]] = Map.empty)
> {code}
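For example, with the case class above in scope, both of the following compile; the first is the idiomatic form (the argument values are made up):

{code:scala}
// Assumes the WorkerOffer definition shown above is in scope
// (it needs scala.collection.mutable.Buffer imported where it is defined).
val offer = WorkerOffer("exec-1", "host-1", 4)          // synthesized apply, idiomatic
val sameOffer = new WorkerOffer("exec-1", "host-1", 4)  // compiles too, but `new` is redundant
{code}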



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-29234.
-
Resolution: Duplicate

Issue resolved by pull request 24486
[https://github.com/apache/spark/pull/24486]

> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, its input and output formats are 
> displayed as the SequenceFile format, but the files are physically created in 
> HDFS in the format specified by the user, e.g. orc, parquet, etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive gives an error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28502) Error with struct conversion while using pandas_udf

2019-09-24 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937233#comment-16937233
 ] 

Bryan Cutler commented on SPARK-28502:
--

I was able to reproduce this in Spark 2.4.3. The problem is that introducing a 
window in a PandasUDFType.GROUPED_MAP UDF adds an Arrow column that is a 
StructType to the UDF input. From the above example it is:

{noformat}
pyarrow.Table
window: struct<start: timestamp[us, tz=America/Los_Angeles], end: timestamp[us, tz=America/Los_Angeles]> not null
  child 0, start: timestamp[us, tz=America/Los_Angeles]
  child 1, end: timestamp[us, tz=America/Los_Angeles]
id: double
ts: timestamp[us, tz=America/Los_Angeles]
{noformat}

And Spark 2.4.3 is not able to handle this column.

In current master, it seems to work and the column is treated as part of the 
key. The example above will just ignore it by default, but you can get it as 
part of the UDF input like so:

{code}
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def some_udf(key, df):
    print("Key: {}".format(key))
    print("DF:\n{}".format(df))
    # some computation
    return df
{code}

This would print out for id 13.0:

{noformat}
Key: (13.0, {'start': datetime.datetime(2018, 3, 5, 0, 0), 'end': 
datetime.datetime(2018, 3, 20, 0, 0)})
DF:
 id  ts
0  13.0 2018-03-11 05:27:18
{noformat}

Someone should verify that this is the correct way to handle it and that the 
result is right. Can you do this, [~blakec] or [~nasirali]? If so, we can close 
this but should follow up and add proper testing.

> Error with struct conversion while using pandas_udf
> ---
>
> Key: SPARK-28502
> URL: https://issues.apache.org/jira/browse/SPARK-28502
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
> Environment: OS: Ubuntu
> Python: 3.6
>Reporter: Nasir Ali
>Priority: Minor
>
> What I am trying to do: group data based on time intervals (e.g., a 15-day 
> window) and perform some operations on the dataframe using (pandas) UDFs. I 
> don't know if there is a better/cleaner way to do it.
> Below is the sample code I tried and the error message I am getting.
>  
> {code:java}
> # imports and SparkSession setup (implicit in the original report):
> from pyspark.sql import SparkSession, functions as F
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType
> sparkSession = SparkSession.builder.getOrCreate()
>
> df = sparkSession.createDataFrame([(17.00, "2018-03-10T15:27:18+00:00"),
> (13.00, "2018-03-11T12:27:18+00:00"),
> (25.00, "2018-03-12T11:27:18+00:00"),
> (20.00, "2018-03-13T15:27:18+00:00"),
> (17.00, "2018-03-14T12:27:18+00:00"),
> (99.00, "2018-03-15T11:27:18+00:00"),
> (156.00, "2018-03-22T11:27:18+00:00"),
> (17.00, "2018-03-31T11:27:18+00:00"),
> (25.00, "2018-03-15T11:27:18+00:00"),
> (25.00, "2018-03-16T11:27:18+00:00")
> ],
>["id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> schema = StructType([
> StructField("id", IntegerType()),
> StructField("ts", TimestampType())
> ])
> @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
> def some_udf(df):
> # some computation
> return df
> df.groupby('id', F.window("ts", "15 days")).apply(some_udf).show()
> {code}
> This throws following exception:
> {code:java}
> TypeError: Unsupported type in conversion from Arrow: struct<start: timestamp[us, tz=America/Chicago], end: timestamp[us, tz=America/Chicago]>
> {code}
>  
> However, if I use builtin agg method then it works all fine. For example,
> {code:java}
> df.groupby('id', F.window("ts", "15 days")).mean().show(truncate=False)
> {code}
> Output
> {code:java}
> +-----+------------------------------------------+-------+
> |id   |window                                    |avg(id)|
> +-----+------------------------------------------+-------+
> |13.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|13.0   |
> |17.0 |[2018-03-20 00:00:00, 2018-04-04 00:00:00]|17.0   |
> |156.0|[2018-03-20 00:00:00, 2018-04-04 00:00:00]|156.0  |
> |99.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|99.0   |
> |20.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|20.0   |
> |17.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|17.0   |
> |25.0 |[2018-03-05 00:00:00, 2018-03-20 00:00:00]|25.0   |
> +-----+------------------------------------------+-------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29235) CrossValidatorModel.avgMetrics disappears after model is written/read again

2019-09-24 Thread Matthew Bedford (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Bedford updated SPARK-29235:

Description: 
 
 Right after a CrossValidatorModel is trained, it has avgMetrics.  After the 
model is written to disk and read later, it no longer has avgMetrics.  To 
reproduce:

{{from pyspark.ml.tuning import CrossValidator, CrossValidatorModel}}

{{cv = CrossValidator(...) #fill with params}}

{{cvModel = cv.fit(trainDF) #given dataframe with training data}}

{{print(cvModel.avgMetrics) #prints a nonempty list as expected}}

{{cvModel.write().save("/tmp/model")}}

{{cvModel2 = CrossValidatorModel.read().load("/tmp/model")}}

{{print(cvModel2.avgMetrics) #BUG - prints an empty list}}

  was:
 
 Right after a CrossValidatorModel is trained, it has avgMetrics.  After the 
model is written to disk and read later, it no longer has avgMetrics.  To 
reproduce:

{{from pyspark.ml.tuning import CrossValidator, CrossValidatorModel}}

{{cv = CrossValidator(...) #fill with params}}

{{cvModel = cv.fit(trainDF) #given dataframe with training data}}

{{print(cvModel.avgMetrics) #prints a nonempty list as expected}}

{{cvModel.write().save("/tmp/model")}}

{{cvModel2 = CrossValidatorModel.read().load("/tmp/model")}}

{{print(cvModel2.avgMetrics) #BUG - prints an empty list}}


> CrossValidatorModel.avgMetrics disappears after model is written/read again
> ---
>
> Key: SPARK-29235
> URL: https://issues.apache.org/jira/browse/SPARK-29235
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.1, 2.4.3
> Environment: Databricks cluster:
> {
>     "num_workers": 4,
>     "cluster_name": "mabedfor-test-classfix",
>     "spark_version": "5.3.x-cpu-ml-scala2.11",
>     "spark_conf": {
>     "spark.databricks.delta.preview.enabled": "true"
>     },
>     "node_type_id": "Standard_DS12_v2",
>     "driver_node_type_id": "Standard_DS12_v2",
>     "ssh_public_keys": [],
>     "custom_tags": {},
>     "spark_env_vars": {
>     "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
>     },
>     "autotermination_minutes": 120,
>     "enable_elastic_disk": true,
>     "cluster_source": "UI",
>     "init_scripts": [],
>     "cluster_id": "0722-165622-calls746"
> }
>Reporter: Matthew Bedford
>Priority: Minor
>
>  
>  Right after a CrossValidatorModel is trained, it has avgMetrics.  After the 
> model is written to disk and read later, it no longer has avgMetrics.  To 
> reproduce:
> {{from pyspark.ml.tuning import CrossValidator, CrossValidatorModel}}
> {{cv = CrossValidator(...) #fill with params}}
> {{cvModel = cv.fit(trainDF) #given dataframe with training data}}
> {{print(cvModel.avgMetrics) #prints a nonempty list as expected}}
> {{cvModel.write().save("/tmp/model")}}
> {{cvModel2 = CrossValidatorModel.read().load("/tmp/model")}}
> {{print(cvModel2.avgMetrics) #BUG - prints an empty list}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29206) Number of shuffle Netty server threads should be a multiple of number of chunk fetch handler threads

2019-09-24 Thread Min Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937219#comment-16937219
 ] 

Min Shen commented on SPARK-29206:
--

[~irashid],

Would appreciate your thoughts on this ticket as well.

> Number of shuffle Netty server threads should be a multiple of number of 
> chunk fetch handler threads
> 
>
> Key: SPARK-29206
> URL: https://issues.apache.org/jira/browse/SPARK-29206
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Min Shen
>Priority: Major
>
> In SPARK-24355, we proposed to use a separate chunk fetch handler thread pool 
> to handle the slow-to-process chunk fetch requests in order to improve the 
> responsiveness of shuffle service for RPC requests.
> Initially, we thought by making the number of Netty server threads larger 
> than the number of chunk fetch handler threads, it would reserve some threads 
> for RPC requests thus resolving the various RPC request timeout issues we 
> experienced previously. The solution worked in our cluster initially. 
> However, as the number of Spark applications in our cluster continues to 
> increase, we saw the RPC request (SASL authentication specifically) timeout 
> issue again:
> {noformat}
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout 
> waiting for task.
>   at 
> org.spark-project.guava.base.Throwables.propagate(Throwables.java:160)
>   at 
> org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:278)
>   at 
> org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:80)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>   at 
> org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:181)
>   at 
> org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:141)
>   at 
> org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:218)
>  {noformat}
> After further investigation, we realized that as the number of concurrent 
> clients connecting to a shuffle service increases, it becomes _VERY_ 
> important to configure the number of Netty server threads and number of chunk 
> fetch handler threads correctly. Specifically, the number of Netty server 
> threads needs to be a multiple of the number of chunk fetch handler threads. 
> The reason is explained in details below:
> When a channel is established on the Netty server, it is registered with both 
> the Netty server default EventLoopGroup and the chunk fetch handler 
> EventLoopGroup. Once registered, this channel sticks with a given thread in 
> both EventLoopGroups, i.e. all requests from this channel are going to be 
> handled by the same thread. Right now, Spark shuffle Netty server uses the 
> default Netty strategy to select a thread from a EventLoopGroup to be 
> associated with a new channel, which is simply round-robin (Netty's 
> DefaultEventExecutorChooserFactory).
> In SPARK-24355, with the introduced chunk fetch handler thread pool, all 
> chunk fetch requests from a given channel will be first added to the task 
> queue of the chunk fetch handler thread associated with that channel. When 
> the requests get processed, the chunk fetch request handler thread will 
> submit a task to the task queue of the Netty server thread that's also 
> associated with this channel. If the number of Netty server threads is not a 
> multiple of the number of chunk fetch handler threads, it would become a 
> problem when the server has a large number of concurrent connections.
> Assume we configure the number of Netty server threads as 40 and the 
> percentage of chunk fetch handler threads as 87, which leads to 35 chunk 
> fetch handler threads. Then according to the round-robin policy, channel 0, 
> 40, 80, 120, 160, 200, 240, and 280 will all be associated with the 1st Netty 
> server thread in the default EventLoopGroup. However, since the chunk fetch 
> handler thread pool only has 35 threads, out of these 8 channels, only 
> channel 0 and 280 will be associated with the same chunk fetch handler 
> thread. Thus, channel 0, 40, 80, 120, 160, 200, 240 will all be associated 
> with different chunk fetch handler threads but associated with the same Netty 
> server thread. This means, the 7 different chunk fetch handler threads 
> associated with these channels could potentially submit tasks to the task 
> queue of the same Netty server thread at the same time. This would lead to 7 
> slow-to-process requests sitting in the task queue. 
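The mismatch is easy to see with a quick back-of-the-envelope computation (a standalone sketch, not Spark code; 320 channels is just an arbitrary illustration):

{code:scala}
// With 40 Netty server threads and 35 chunk fetch handler threads, look at the
// channels that round-robin onto server thread 0 and which handler thread each gets.
val serverThreads = 40
val handlerThreads = 35
val channels = 0 until 320
val onServerThread0 = channels.filter(_ % serverThreads == 0)
onServerThread0.foreach(c => println(s"channel $c -> handler thread ${c % handlerThreads}"))
// channel 0 -> 0, 40 -> 5, 80 -> 10, ..., 240 -> 30, 280 -> 0:
// seven distinct handler threads all feeding server thread 0's task queue.
{code}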

[jira] [Updated] (SPARK-29235) CrossValidatorModel.avgMetrics disappears after model is written/read again

2019-09-24 Thread Matthew Bedford (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Bedford updated SPARK-29235:

Affects Version/s: 2.4.3

> CrossValidatorModel.avgMetrics disappears after model is written/read again
> ---
>
> Key: SPARK-29235
> URL: https://issues.apache.org/jira/browse/SPARK-29235
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.1, 2.4.3
> Environment: Databricks cluster:
> {
>     "num_workers": 4,
>     "cluster_name": "mabedfor-test-classfix",
>     "spark_version": "5.3.x-cpu-ml-scala2.11",
>     "spark_conf": {
>     "spark.databricks.delta.preview.enabled": "true"
>     },
>     "node_type_id": "Standard_DS12_v2",
>     "driver_node_type_id": "Standard_DS12_v2",
>     "ssh_public_keys": [],
>     "custom_tags": {},
>     "spark_env_vars": {
>     "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
>     },
>     "autotermination_minutes": 120,
>     "enable_elastic_disk": true,
>     "cluster_source": "UI",
>     "init_scripts": [],
>     "cluster_id": "0722-165622-calls746"
> }
>Reporter: Matthew Bedford
>Priority: Minor
>
>  
>  Right after a CrossValidatorModel is trained, it has avgMetrics.  After the 
> model is written to disk and read later, it no longer has avgMetrics.  To 
> reproduce:
> {{from pyspark.ml.tuning import CrossValidator, CrossValidatorModel}}
> {{cv = CrossValidator(...) #fill with params}}
> {{cvModel = cv.fit(trainDF) #given dataframe with training data}}
> {{print(cvModel.avgMetrics) #prints a nonempty list as expected}}
> {{cvModel.write().save("/tmp/model")}}
> {{cvModel2 = CrossValidatorModel.read().load("/tmp/model")}}
> {{print(cvModel2.avgMetrics) #BUG - prints an empty list}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29230) Fix NullPointerException in the test class ProcfsMetricsGetterSuite

2019-09-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29230:
--
Component/s: Tests

> Fix NullPointerException in the test class ProcfsMetricsGetterSuite
> ---
>
> Key: SPARK-29230
> URL: https://issues.apache.org/jira/browse/SPARK-29230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.0.0
>Reporter: Jiaqi Li
>Assignee: Jiaqi Li
>Priority: Minor
> Fix For: 3.0.0
>
>
> When I run `ProcfsMetricsGetterSuite`, it always throws 
> `java.lang.NullPointerException`. 
> I think there is a problem with where `new ProcfsMetricsGetter` is placed, 
> which means `SparkEnv` is not yet initialized at that point. 
> This leads to a `java.lang.NullPointerException` when the method is executed.
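A generic sketch of that initialization-order problem (illustrative only, not the actual ProcfsMetricsGetter or SparkEnv code; all names below are made up):

{code:scala}
object GlobalEnv { var conf: String = _ }  // stands in for SparkEnv.get: null until set up

// An eagerly evaluated field dereferences the global at construction time, so
// building this object before the test sets GlobalEnv.conf throws an NPE.
class EagerGetter { val confLength: Int = GlobalEnv.conf.length }

// Deferring the access (here via lazy val) avoids the NPE as long as the
// global is initialized before first use.
class LazyGetter { lazy val confLength: Int = GlobalEnv.conf.length }
{code}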



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29230) Fix NullPointerException in the test class ProcfsMetricsGetterSuite

2019-09-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29230:
-

Assignee: Jiaqi Li

> Fix NullPointerException in the test class ProcfsMetricsGetterSuite
> ---
>
> Key: SPARK-29230
> URL: https://issues.apache.org/jira/browse/SPARK-29230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jiaqi Li
>Assignee: Jiaqi Li
>Priority: Minor
>
> When I run `ProcfsMetricsGetterSuite`, it always throws 
> `java.lang.NullPointerException`. 
> I think there is a problem with where `new ProcfsMetricsGetter` is placed, 
> which means `SparkEnv` is not yet initialized at that point. 
> This leads to a `java.lang.NullPointerException` when the method is executed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29230) Fix NullPointerException in the test class ProcfsMetricsGetterSuite

2019-09-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29230.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/25918

> Fix NullPointerException in the test class ProcfsMetricsGetterSuite
> ---
>
> Key: SPARK-29230
> URL: https://issues.apache.org/jira/browse/SPARK-29230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jiaqi Li
>Assignee: Jiaqi Li
>Priority: Minor
> Fix For: 3.0.0
>
>
> When I run `ProcfsMetricsGetterSuite`, it always throws 
> `java.lang.NullPointerException`. 
> I think there is a problem with where `new ProcfsMetricsGetter` is placed, 
> which means `SparkEnv` is not yet initialized at that point. 
> This leads to a `java.lang.NullPointerException` when the method is executed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]

2019-09-24 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937170#comment-16937170
 ] 

Jungtaek Lim commented on SPARK-29220:
--

Could we expose a way to store the unit-test.log files from all modules 
somewhere, like build artifacts? For example, see the files under "container 0" 
on this page: 
[https://circleci.com/gh/HeartSaVioR/spark-state-tools/68#artifacts/containers/0]

I understand the Amplab CI runs plenty of builds, so space might matter - the 
files would mostly only be needed for failed builds. A zipped file would also be 
OK - even better than nothing.

> Flaky test: 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large 
> number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
> --
>
> Key: SPARK-29220
> URL: https://issues.apache.org/jira/browse/SPARK-29220
> Project: Spark
>  Issue Type: Test
>  Components: Tests, YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/]
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: 
> java.lang.StackOverflowError did not equal 
> nullStacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1147)
>   at org.scalatest.Suite.run$(Suite.scala:1129)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
>   

[jira] [Updated] (SPARK-29235) CrossValidatorModel.avgMetrics disappears after model is written/read again

2019-09-24 Thread Matthew Bedford (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Bedford updated SPARK-29235:

Description: 
 
 Right after a CrossValidatorModel is trained, it has avgMetrics.  After the 
model is written to disk and read later, it no longer has avgMetrics.  To 
reproduce:

{{from pyspark.ml.tuning import CrossValidator, CrossValidatorModel}}

{{cv = CrossValidator(...) #fill with params}}

{{cvModel = cv.fit(trainDF) #given dataframe with training data}}

{{print(cvModel.avgMetrics) #prints a nonempty list as expected}}

{{cvModel.write().save("/tmp/model")}}

{{cvModel2 = CrossValidatorModel.read().load("/tmp/model")}}

{{print(cvModel2.avgMetrics) #BUG - prints an empty list}}

  was:
 
Right after a CrossValidatorModel is trained, it has avgMetrics.  After the 
model is written to disk and read later, it no longer has avgMetrics.  To 
reproduce:
{{from pyspark.ml.tuning import CrossValidator, CrossValidatorModel}}

{{cv = CrossValidator(...) #fill with params}}

{{cvModel = cv.fit(trainDF) #given dataframe with training data}}

{{print(cvModel.avgMetrics) #prints a nonempty list as expected}}

{{cvModel.write().save("/tmp/model")}}

{{cvModel2 = CrossValidatorModel.read().load("/tmp/model")}}

{{print(cvModel2.avgMetrics) #BUG - prints an empty list}}


> CrossValidatorModel.avgMetrics disappears after model is written/read again
> ---
>
> Key: SPARK-29235
> URL: https://issues.apache.org/jira/browse/SPARK-29235
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.1
> Environment: Databricks cluster:
> {
>     "num_workers": 4,
>     "cluster_name": "mabedfor-test-classfix",
>     "spark_version": "5.3.x-cpu-ml-scala2.11",
>     "spark_conf": {
>     "spark.databricks.delta.preview.enabled": "true"
>     },
>     "node_type_id": "Standard_DS12_v2",
>     "driver_node_type_id": "Standard_DS12_v2",
>     "ssh_public_keys": [],
>     "custom_tags": {},
>     "spark_env_vars": {
>     "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
>     },
>     "autotermination_minutes": 120,
>     "enable_elastic_disk": true,
>     "cluster_source": "UI",
>     "init_scripts": [],
>     "cluster_id": "0722-165622-calls746"
> }
>Reporter: Matthew Bedford
>Priority: Minor
>
>  
>  Right after a CrossValidatorModel is trained, it has avgMetrics.  After the 
> model is written to disk and read later, it no longer has avgMetrics.  To 
> reproduce:
> {{from pyspark.ml.tuning import CrossValidator, CrossValidatorModel}}
> {{cv = CrossValidator(...) #fill with params}}
> {{cvModel = cv.fit(trainDF) #given dataframe with training data}}
> {{print(cvModel.avgMetrics) #prints a nonempty list as expected}}
> {{cvModel.write().save("/tmp/model")}}
> {{cvModel2 = CrossValidatorModel.read().load("/tmp/model")}}
> {{print(cvModel2.avgMetrics) #BUG - prints an empty list}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29235) CrossValidatorModel.avgMetrics disappears after model is written/read again

2019-09-24 Thread Matthew Bedford (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Bedford updated SPARK-29235:

Description: 
 
 Right after a CrossValidatorModel is trained, it has avgMetrics.  After the 
model is written to disk and read later, it no longer has avgMetrics.  To 
reproduce:

{{from pyspark.ml.tuning import CrossValidator, CrossValidatorModel}}

{{cv = CrossValidator(...) #fill with params}}

{{cvModel = cv.fit(trainDF) #given dataframe with training data}}

{{print(cvModel.avgMetrics) #prints a nonempty list as expected}}

{{cvModel.write().save("/tmp/model")}}

{{cvModel2 = CrossValidatorModel.read().load("/tmp/model")}}

{{print(cvModel2.avgMetrics) #BUG - prints an empty list}}

  was:
 
 Right after a CrossValidatorModel is trained, it has avgMetrics.  After the 
model is written to disk and read later, it no longer has avgMetrics.  To 
reproduce:

{{from pyspark.ml.tuning import CrossValidator, CrossValidatorModel}}

{{cv = CrossValidator(...) #fill with params}}

{{cvModel = cv.fit(trainDF) #given dataframe with training data}}

{{print(cvModel.avgMetrics) #prints a nonempty list as expected}}

{{cvModel.write().save("/tmp/model")}}

{{cvModel2 = CrossValidatorModel.read().load("/tmp/model")}}

{{print(cvModel2.avgMetrics) #BUG - prints an empty list}}


> CrossValidatorModel.avgMetrics disappears after model is written/read again
> ---
>
> Key: SPARK-29235
> URL: https://issues.apache.org/jira/browse/SPARK-29235
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.1
> Environment: Databricks cluster:
> {
>     "num_workers": 4,
>     "cluster_name": "mabedfor-test-classfix",
>     "spark_version": "5.3.x-cpu-ml-scala2.11",
>     "spark_conf": {
>     "spark.databricks.delta.preview.enabled": "true"
>     },
>     "node_type_id": "Standard_DS12_v2",
>     "driver_node_type_id": "Standard_DS12_v2",
>     "ssh_public_keys": [],
>     "custom_tags": {},
>     "spark_env_vars": {
>     "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
>     },
>     "autotermination_minutes": 120,
>     "enable_elastic_disk": true,
>     "cluster_source": "UI",
>     "init_scripts": [],
>     "cluster_id": "0722-165622-calls746"
> }
>Reporter: Matthew Bedford
>Priority: Minor
>
>  
>  Right after a CrossValidatorModel is trained, it has avgMetrics.  After the 
> model is written to disk and read later, it no longer has avgMetrics.  To 
> reproduce:
> {{from pyspark.ml.tuning import CrossValidator, CrossValidatorModel}}
> {{cv = CrossValidator(...) #fill with params}}
> {{cvModel = cv.fit(trainDF) #given dataframe with training data}}
> {{print(cvModel.avgMetrics) #prints a nonempty list as expected}}
> {{cvModel.write().save("/tmp/model")}}
> {{cvModel2 = CrossValidatorModel.read().load("/tmp/model")}}
> {{print(cvModel2.avgMetrics) #BUG - prints an empty list}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29235) CrossValidatorModel.avgMetrics disappears after model is written/read again

2019-09-24 Thread Matthew Bedford (Jira)
Matthew Bedford created SPARK-29235:
---

 Summary: CrossValidatorModel.avgMetrics disappears after model is 
written/read again
 Key: SPARK-29235
 URL: https://issues.apache.org/jira/browse/SPARK-29235
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.1
 Environment: Databricks cluster:

{
    "num_workers": 4,
    "cluster_name": "mabedfor-test-classfix",
    "spark_version": "5.3.x-cpu-ml-scala2.11",
    "spark_conf": {
    "spark.databricks.delta.preview.enabled": "true"
    },
    "node_type_id": "Standard_DS12_v2",
    "driver_node_type_id": "Standard_DS12_v2",
    "ssh_public_keys": [],
    "custom_tags": {},
    "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
    },
    "autotermination_minutes": 120,
    "enable_elastic_disk": true,
    "cluster_source": "UI",
    "init_scripts": [],
    "cluster_id": "0722-165622-calls746"
}
Reporter: Matthew Bedford


 
Right after a CrossValidatorModel is trained, it has avgMetrics.  After the 
model is written to disk and read later, it no longer has avgMetrics.  To 
reproduce:
{{from pyspark.ml.tuning import CrossValidator, CrossValidatorModel}}

{{cv = CrossValidator(...) #fill with params}}

{{cvModel = cv.fit(trainDF) #given dataframe with training data}}

{{print(cvModel.avgMetrics) #prints a nonempty list as expected}}

{{cvModel.write().save("/tmp/model")}}

{{cvModel2 = CrossValidatorModel.read().load("/tmp/model")}}

{{print(cvModel2.avgMetrics) #BUG - prints an empty list}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]

2019-09-24 Thread shane knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937155#comment-16937155
 ] 

shane knapp commented on SPARK-29220:
-

it's ok...  let's try and keep an eye on these tests for other PRs and see if 
it pops up again.

> Flaky test: 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large 
> number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
> --
>
> Key: SPARK-29220
> URL: https://issues.apache.org/jira/browse/SPARK-29220
> Project: Spark
>  Issue Type: Test
>  Components: Tests, YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/]
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: 
> java.lang.StackOverflowError did not equal 
> nullStacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1147)
>   at org.scalatest.Suite.run$(Suite.scala:1129)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at 

[jira] [Commented] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]

2019-09-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937154#comment-16937154
 ] 

Dongjoon Hyun commented on SPARK-29220:
---

Since I'm monitoring Jenkins jobs, I'll ping you if I see this failure again.

> Flaky test: 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large 
> number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
> --
>
> Key: SPARK-29220
> URL: https://issues.apache.org/jira/browse/SPARK-29220
> Project: Spark
>  Issue Type: Test
>  Components: Tests, YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/]
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: 
> java.lang.StackOverflowError did not equal 
> nullStacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1147)
>   at org.scalatest.Suite.run$(Suite.scala:1129)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at 

[jira] [Commented] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]

2019-09-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937152#comment-16937152
 ] 

Dongjoon Hyun commented on SPARK-29220:
---

Thank you for the investigation, [~shaneknapp]. Yes, I merged that PR because I 
thought it was unrelated to this flakiness. Do you want to re-trigger the build 
on that PR? If so, sorry about that~

> Flaky test: 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large 
> number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
> --
>
> Key: SPARK-29220
> URL: https://issues.apache.org/jira/browse/SPARK-29220
> Project: Spark
>  Issue Type: Test
>  Components: Tests, YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/]
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: 
> java.lang.StackOverflowError did not equal 
> nullStacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1147)
>   at org.scalatest.Suite.run$(Suite.scala:1129)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at 

[jira] [Updated] (SPARK-29211) Second invocation of custom UDF results in exception (when invoked from shell)

2019-09-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29211:
--
Affects Version/s: 2.3.4

> Second invocation of custom UDF results in exception (when invoked from shell)
> --
>
> Key: SPARK-29211
> URL: https://issues.apache.org/jira/browse/SPARK-29211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Dilip Biswal
>Priority: Major
>
> I encountered this while writing documentation for SQL reference. Here is the 
> small repro:
> UDF:
>  =
> {code:java}
> import org.apache.hadoop.hive.ql.exec.UDF;
>   
> public class SimpleUdf extends UDF {
>   public int evaluate(int value) {
> return value + 10;
>   }
> }
> {code}
> {code:java}
> spark.sql("CREATE FUNCTION simple_udf AS 'SimpleUdf' USING JAR 
> '/tmp/SimpleUdf.jar'").show
> spark.sql("SELECT simple_udf(c1) AS function_return_value FROM t1").show
> +-+   
>   
> |function_return_value|
> +-+
> |   11|
> |   12|
> +-+
> spark.sql("SELECT simple_udf(c1) AS function_return_value FROM t1").show
> scala> spark.sql("SELECT simple_udf(c1) AS function_return_value FROM 
> t1").show
> 19/09/23 00:43:18 WARN HiveConf: HiveConf of name 
> hive.internal.ss.authz.settings.applied.marker does not exist
> 19/09/23 00:43:18 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout 
> does not exist
> 19/09/23 00:43:18 WARN HiveConf: HiveConf of name hive.stats.retries.wait 
> does not exist
> org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 
> 'SimpleUdf': java.lang.ClassNotFoundException: SimpleUdf; line 1 pos 7
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:245)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:57)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:57)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:61)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:60)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:78)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:78)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$makeFunctionExpression$2(HiveSessionCatalog.scala:78)
>   at scala.util.Failure.getOrElse(Try.scala:222)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:70)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1176)
>   at 
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1344)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.super$lookupFunction(HiveSessionCatalog.scala:132)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$lookupFunction0$2(HiveSessionCatalog.scala:132)
> {code}
> Please note that the problem does not happen if we try it from a test suite. 
> So far I have only seen it when I try it from the shell. I also tried it in 
> 2.4.4 and observed the same behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29211) Second invocation of custom UDF results in exception (when invoked from shell)

2019-09-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937149#comment-16937149
 ] 

Dongjoon Hyun commented on SPARK-29211:
---

Thank you, [~dkbiswal]. That's enough.

> Second invocation of custom UDF results in exception (when invoked from shell)
> --
>
> Key: SPARK-29211
> URL: https://issues.apache.org/jira/browse/SPARK-29211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Dilip Biswal
>Priority: Major
>
> I encountered this while writing documentation for SQL reference. Here is the 
> small repro:
> UDF:
>  =
> {code:java}
> import org.apache.hadoop.hive.ql.exec.UDF;
>   
> public class SimpleUdf extends UDF {
>   public int evaluate(int value) {
> return value + 10;
>   }
> }
> {code}
> {code:java}
> spark.sql("CREATE FUNCTION simple_udf AS 'SimpleUdf' USING JAR 
> '/tmp/SimpleUdf.jar'").show
> spark.sql("SELECT simple_udf(c1) AS function_return_value FROM t1").show
> +-+   
>   
> |function_return_value|
> +-+
> |   11|
> |   12|
> +-+
> spark.sql("SELECT simple_udf(c1) AS function_return_value FROM t1").show
> scala> spark.sql("SELECT simple_udf(c1) AS function_return_value FROM 
> t1").show
> 19/09/23 00:43:18 WARN HiveConf: HiveConf of name 
> hive.internal.ss.authz.settings.applied.marker does not exist
> 19/09/23 00:43:18 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout 
> does not exist
> 19/09/23 00:43:18 WARN HiveConf: HiveConf of name hive.stats.retries.wait 
> does not exist
> org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 
> 'SimpleUdf': java.lang.ClassNotFoundException: SimpleUdf; line 1 pos 7
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:245)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:57)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:57)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:61)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:60)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:78)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:78)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$makeFunctionExpression$2(HiveSessionCatalog.scala:78)
>   at scala.util.Failure.getOrElse(Try.scala:222)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:70)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1176)
>   at 
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1344)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.super$lookupFunction(HiveSessionCatalog.scala:132)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$lookupFunction0$2(HiveSessionCatalog.scala:132)
> {code}
> Please note that the problem does not happen if we try it from a test suite. 
> So far I have only seen it when I try it from the shell. I also tried it in 
> 2.4.4 and observed the same behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]

2019-09-24 Thread shane knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937126#comment-16937126
 ] 

shane knapp commented on SPARK-29220:
-

hmm, i see that https://github.com/apache/spark/pull/25901 has been merged in.

> Flaky test: 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large 
> number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
> --
>
> Key: SPARK-29220
> URL: https://issues.apache.org/jira/browse/SPARK-29220
> Project: Spark
>  Issue Type: Test
>  Components: Tests, YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/]
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: 
> java.lang.StackOverflowError did not equal 
> nullStacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1147)
>   at org.scalatest.Suite.run$(Suite.scala:1129)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:507)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at 

[jira] [Commented] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]

2019-09-24 Thread shane knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937124#comment-16937124
 ] 

shane knapp commented on SPARK-29220:
-

ARGH i just messed up and accidentally overwrote all of these logs.  :(

i'll retrigger a build and if it happens again let me know here and we can 
investigate fully.  i'm not really sure what's happening, as i'm not a 
java/scala expert, and other pull request builds are successfully running on 
this worker.  the system configs for these machines haven't changed in a long 
time, either.

> Flaky test: 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large 
> number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
> --
>
> Key: SPARK-29220
> URL: https://issues.apache.org/jira/browse/SPARK-29220
> Project: Spark
>  Issue Type: Test
>  Components: Tests, YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/]
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: 
> java.lang.StackOverflowError did not equal 
> nullStacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1147)
>   at org.scalatest.Suite.run$(Suite.scala:1129)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
>   at 
> 

[jira] [Issue Comment Deleted] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]

2019-09-24 Thread shane knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-29220:

Comment: was deleted

(was: for 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/

would any of the following logs be useful?


{noformat}
-bash-4.1$ find . | grep "unit-tests\.log"|egrep -i "yarn|sql"
./sql/hive/target/unit-tests.log
./sql/core/target/unit-tests.log
./sql/catalyst/target/unit-tests.log
./sql/hive-thriftserver/target/unit-tests.log
./external/kafka-0-10-sql/target/unit-tests.log
./resource-managers/yarn/target/unit-tests.log
{noformat}
 
let me know ASAP!)

> Flaky test: 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large 
> number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
> --
>
> Key: SPARK-29220
> URL: https://issues.apache.org/jira/browse/SPARK-29220
> Project: Spark
>  Issue Type: Test
>  Components: Tests, YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/]
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: 
> java.lang.StackOverflowError did not equal 
> nullStacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1147)
>   at org.scalatest.Suite.run$(Suite.scala:1129)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at 

[jira] [Commented] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]

2019-09-24 Thread shane knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937120#comment-16937120
 ] 

shane knapp commented on SPARK-29220:
-

for 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/

would any of the following logs be useful?


{noformat}
-bash-4.1$ find . | grep "unit-tests\.log"|egrep -i "yarn|sql"
./sql/hive/target/unit-tests.log
./sql/core/target/unit-tests.log
./sql/catalyst/target/unit-tests.log
./sql/hive-thriftserver/target/unit-tests.log
./external/kafka-0-10-sql/target/unit-tests.log
./resource-managers/yarn/target/unit-tests.log
{noformat}
 
let me know ASAP!

> Flaky test: 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large 
> number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
> --
>
> Key: SPARK-29220
> URL: https://issues.apache.org/jira/browse/SPARK-29220
> Project: Spark
>  Issue Type: Test
>  Components: Tests, YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/]
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: 
> java.lang.StackOverflowError did not equal 
> nullStacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1147)
>   at org.scalatest.Suite.run$(Suite.scala:1129)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at 

[jira] [Updated] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Suchintak Patnaik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suchintak Patnaik updated SPARK-29234:
--
Description: 
When we create a bucketed table as follows, its input and output formats are 
displayed as SequenceFile. But the files are physically created in HDFS in the 
format specified by the user, e.g. ORC, Parquet, etc.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

in Hive, DESCRIBE FORMATTED OrdersExample;

describe formatted ordersExample;
OK
# col_name  data_type   comment
col array   from deserializer

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive gives an error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: 
hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
 not a SequenceFile




  was:
When we create a bucketed table as follows, its input and output formats are 
displayed as SequenceFile. But the files are physically created in HDFS in the 
format specified by the user, e.g. ORC, Parquet, etc.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

in Hive, DESCRIBE FORMATTED OrdersExample;

describe formatted ordersExample;
OK
# col_name  data_type   comment
col array   from deserializer

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive gives an error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: 
hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
 not a SequenceFile

Reading the same table in Spark also gives an error.

df = spark.



> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, its input and output formats are 
> displayed as SequenceFile. But the files are physically created in HDFS in 
> the format specified by the user, e.g. ORC, Parquet, etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive gives an error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29233) Add regex expression checks for executorEnv in K8S mode

2019-09-24 Thread merrily01 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

merrily01 updated SPARK-29233:
--
Attachment: error.jpeg

> Add regex expression checks for executorEnv in K8S mode
> ---
>
> Key: SPARK-29233
> URL: https://issues.apache.org/jira/browse/SPARK-29233
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: merrily01
>Priority: Blocker
> Attachments: error.jpeg, podEnv.jpeg
>
>
> In K8s mode, there are naming requirements, expressed as regular expressions, 
> that the keys of pod environment variables must satisfy. For example:
>  * In K8s v1.10.5, pod env names must match the regular expression 
> [-._a-zA-Z][-._a-zA-Z0-9]*
>  * In K8s v1.6, pod env names must match the regular expression 
> [A-Za-z_][A-Za-z0-9_]*
> Therefore, in Spark on K8s mode, Spark should apply a matching regex check 
> when building the executor env, and use the stricter validation rule to filter 
> out illegal keys that do not satisfy the K8s env naming rules for pod creation. 
> Otherwise the pod cannot be created and the tasks hang.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Suchintak Patnaik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suchintak Patnaik updated SPARK-29234:
--
Description: 
When we create a bucketed table as follows, its input and output formats are 
displayed as SequenceFile. But the files are physically created in HDFS in the 
format specified by the user, e.g. ORC, Parquet, etc.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

in Hive, DESCRIBE FORMATTED OrdersExample;

describe formatted ordersExample;
OK
# col_name  data_type   comment
col array   from deserializer

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive gives an error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: 
hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
 not a SequenceFile

Reading the same table in Spark also gives an error.

df = spark.


  was:
When we create a bucketed table as follows, its input and output formats are 
displayed as SequenceFile. But the files are physically created in HDFS in the 
format specified by the user, e.g. ORC, Parquet, etc.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

in Hive, DESCRIBE FORMATTED OrdersExample;

describe formatted ordersExample;
OK
# col_name  data_type   comment
col array   from deserializer

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive gives an error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: 
hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
 not a SequenceFile



> bucketed table created by Spark SQL DataFrame is in SequenceFile format
> ---
>
> Key: SPARK-29234
> URL: https://issues.apache.org/jira/browse/SPARK-29234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Major
>
> When we create a bucketed table as follows, its input and output formats are 
> displayed as SequenceFile. But the files are physically created in HDFS in 
> the format specified by the user, e.g. ORC, Parquet, etc.
> df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")
> in Hive, DESCRIBE FORMATTED OrdersExample;
> describe formatted ordersExample;
> OK
> # col_name  data_type   comment
> col array   from deserializer
> # Storage Information
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> Querying the same table in Hive gives an error.
> select * from OrdersExample;
> OK
> Failed with exception java.io.IOException:java.io.IOException: 
> hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
>  not a SequenceFile
> Reading the same table in Spark also gives an error.
> df = spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29233) Add regex expression checks for executorEnv in K8S mode

2019-09-24 Thread merrily01 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

merrily01 updated SPARK-29233:
--
Attachment: podEnv.jpeg

> Add regex expression checks for executorEnv in K8S mode
> ---
>
> Key: SPARK-29233
> URL: https://issues.apache.org/jira/browse/SPARK-29233
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.3
>Reporter: merrily01
>Priority: Blocker
> Attachments: error.jpeg, podEnv.jpeg
>
>
> In K8s mode, there are naming requirements, expressed as regular expressions, 
> that the keys of pod environment variables must satisfy. For example:
>  * In K8s v1.10.5, pod env names must match the regular expression 
> [-._a-zA-Z][-._a-zA-Z0-9]*
>  * In K8s v1.6, pod env names must match the regular expression 
> [A-Za-z_][A-Za-z0-9_]*
> Therefore, in Spark on K8s mode, Spark should apply a matching regex check 
> when building the executor env, and use the stricter validation rule to filter 
> out illegal keys that do not satisfy the K8s env naming rules for pod creation. 
> Otherwise the pod cannot be created and the tasks hang.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29234) bucketed table created by Spark SQL DataFrame is in SequenceFile format

2019-09-24 Thread Suchintak Patnaik (Jira)
Suchintak Patnaik created SPARK-29234:
-

 Summary: bucketed table created by Spark SQL DataFrame is in 
SequenceFile format
 Key: SPARK-29234
 URL: https://issues.apache.org/jira/browse/SPARK-29234
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Suchintak Patnaik


When we create a bucketed table as follows, its input and output formats are 
displayed as SequenceFile. But the files are physically created in HDFS in the 
format specified by the user, e.g. ORC, Parquet, etc.

df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

in Hive, DESCRIBE FORMATTED OrdersExample;

describe formatted ordersExample;
OK
# col_name  data_type   comment
col array   from deserializer

# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

Querying the same table in Hive gives an error.

select * from OrdersExample;
OK
Failed with exception java.io.IOException:java.io.IOException: 
hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-0-55920574-eeb5-48b7-856d-e5c27e85ba12_0.c000.snappy.orc
 not a SequenceFile
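
For reference, a minimal spark-shell sketch of the repro (the DataFrame contents and table name here are illustrative, and {{spark}} is the shell's session):

{code:scala}
// Illustrative repro sketch for a bucketed datasource table written as ORC.
import spark.implicits._

val orders = Seq((1, "COMPLETE"), (2, "PENDING"), (3, "CLOSED"))
  .toDF("order_id", "order_status")

orders.write
  .format("orc")
  .bucketBy(4, "order_status")
  .saveAsTable("OrdersExample")

// Spark-side view of the table metadata; the real provider ("orc") shows up
// in the provider/table properties rather than in the Hive storage descriptor.
spark.sql("DESCRIBE EXTENDED OrdersExample").show(100, false)
{code}

The SequenceFile input/output formats reported by Hive appear to be the placeholder storage descriptor Spark uses for datasource tables, with the real provider kept only in the table properties, which would also explain why Hive cannot read the ORC files directly.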




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29233) Add regex expression checks for executorEnv in K8S mode

2019-09-24 Thread merrily01 (Jira)
merrily01 created SPARK-29233:
-

 Summary: Add regex expression checks for executorEnv in K8S mode
 Key: SPARK-29233
 URL: https://issues.apache.org/jira/browse/SPARK-29233
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 2.4.3
Reporter: merrily01


In K8s mode, there are naming requirements, expressed as regular expressions, 
that the keys of pod environment variables must satisfy. For example:
 * In K8s v1.10.5, pod env names must match the regular expression 
[-._a-zA-Z][-._a-zA-Z0-9]*
 * In K8s v1.6, pod env names must match the regular expression 
[A-Za-z_][A-Za-z0-9_]*

Therefore, in Spark on K8s mode, Spark should apply a matching regex check when 
building the executor env, and use the stricter validation rule to filter out 
illegal keys that do not satisfy the K8s env naming rules for pod creation. 
Otherwise the pod cannot be created and the tasks hang.
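
A minimal sketch of the kind of key check being proposed (illustrative only, not Spark's actual code; it uses the stricter of the two rules above so the result stays valid on older K8s versions as well):

{code:scala}
// Illustrative executor-env key validation (not Spark's implementation).
object ExecutorEnvKeyCheck {
  // Stricter rule above (K8s v1.6 style C identifier).
  private val EnvKeyPattern = "[A-Za-z_][A-Za-z0-9_]*".r.pattern

  def isValidKey(key: String): Boolean = EnvKeyPattern.matcher(key).matches()

  def main(args: Array[String]): Unit = {
    // Hypothetical user-supplied spark.executorEnv.* entries.
    val executorEnv = Map("SPARK_LOG_DIR" -> "/tmp/logs", "bad key!" -> "x", "1BAD" -> "y")
    val (valid, invalid) = executorEnv.partition { case (k, _) => isValidKey(k) }
    invalid.keys.foreach(k => println(s"Skipping invalid executor env key: $k"))
    // Only `valid` would be used when building the executor pod spec,
    // instead of letting pod creation fail on the K8s side.
    println(valid)
  }
}
{code}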



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29232) RandomForestRegressionModel does not update the parameter maps of the DecisionTreeRegressionModels underneath

2019-09-24 Thread Jiaqi Guo (Jira)
Jiaqi Guo created SPARK-29232:
-

 Summary: RandomForestRegressionModel does not update the parameter 
maps of the DecisionTreeRegressionModels underneath
 Key: SPARK-29232
 URL: https://issues.apache.org/jira/browse/SPARK-29232
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.4.0
Reporter: Jiaqi Guo


We trained a RandomForestRegressionModel and tried to access its trees. Even 
though each DecisionTreeRegressionModel is correctly built with the parameters 
from the random forest, its parameter map is not updated and still contains 
only the default values.

For example, if a RandomForestRegressor is trained with maxDepth=12, calling 
extractParamMap on one of its trees still returns the default values, with 
maxDepth=5, even though the depth of that DecisionTreeRegressionModel is 
correctly reported as 12.

This creates issues when we want to inspect each individual tree and rebuild 
the trees for the random forest estimator.
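
A minimal spark-shell sketch of the reported behaviour (the data and parameter values are illustrative; the comments on the last two lines describe what the report observes):

{code:scala}
// Illustrative sketch: train a forest, then look at one of its trees.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.RandomForestRegressor

val raw = spark.range(0, 2000)
  .selectExpr("cast(id % 97 as double) as x", "cast(id % 7 as double) as label")
val training = new VectorAssembler()
  .setInputCols(Array("x"))
  .setOutputCol("features")
  .transform(raw)

val model = new RandomForestRegressor()
  .setMaxDepth(12)
  .setNumTrees(20)
  .fit(training)

val firstTree = model.trees(0)
println(firstTree.depth)             // actual depth of this tree (bounded by maxDepth = 12)
println(firstTree.extractParamMap()) // per the report, still lists the default maxDepth = 5
{code}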



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29082) Spark driver cannot start with only delegation tokens

2019-09-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29082.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25901
[https://github.com/apache/spark/pull/25901]

> Spark driver cannot start with only delegation tokens
> -
>
> Key: SPARK-29082
> URL: https://issues.apache.org/jira/browse/SPARK-29082
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 3.0.0
>
>
> If you start a Spark application with just delegation tokens, it fails. For 
> example, from an Oozie launch, you see things like this (line numbers may be 
> different):
> {noformat}
> No child hadoop job is executed.
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.oozie.action.hadoop.LauncherAM.runActionMain(LauncherAM.java:410)
> at 
> org.apache.oozie.action.hadoop.LauncherAM.access$300(LauncherAM.java:55)
> at 
> org.apache.oozie.action.hadoop.LauncherAM$2.run(LauncherAM.java:223)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.oozie.action.hadoop.LauncherAM.run(LauncherAM.java:217)
> at 
> org.apache.oozie.action.hadoop.LauncherAM$1.run(LauncherAM.java:153)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.oozie.action.hadoop.LauncherAM.main(LauncherAM.java:141)
> Caused by: org.apache.hadoop.security.KerberosAuthException: failure to 
> login: for principal: hrt_qa javax.security.auth.login.LoginException: Unable 
> to obtain password from user
> at 
> org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
> at 
> org.apache.hadoop.security.UserGroupInformation.getUGIFromTicketCache(UserGroupInformation.java:616)
> at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager.doLogin(HadoopDelegationTokenManager.scala:276)
> at 
> org.apache.spark.deploy.security.HadoopDelegationTokenManager.obtainDelegationTokens(HadoopDelegationTokenManager.scala:140)
> at 
> org.apache.spark.deploy.yarn.Client.setupSecurityToken(Client.scala:305)
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:1057)
> at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:179)
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1178)
> at 
> org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1584)
> at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:860)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29229) Change the additional remote repository in IsolatedClientLoader to google minor

2019-09-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16937057#comment-16937057
 ] 

Dongjoon Hyun commented on SPARK-29229:
---

This is backported to `branch-2.4` via 
https://github.com/apache/spark/commit/5267e6e5bdc189a5ae876c726d96038b3ac3db66

> Change the additional remote repository in IsolatedClientLoader to google 
> minor
> ---
>
> Key: SPARK-29229
> URL: https://issues.apache.org/jira/browse/SPARK-29229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> The repository currently used is 
> "[http://www.datanucleus.org/downloads/maven2]", which is no longer 
> maintained. This will sometimes cause download failures and make Hive test 
> cases flaky.
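
As a hedged illustration of the intended effect, the sketch below points Hive metastore jar
downloads at a Maven Central mirror instead of the defunct DataNucleus repository; the
configuration key and mirror URL are assumptions based on related work, not taken from this
ticket.

{code:scala}
// Sketch only: the property name and mirror URL are assumptions.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-client-repo-example")
  .config("spark.sql.maven.additionalRemoteRepositories",
    "https://maven-central.storage-download.googleapis.com/maven2/")
  .enableHiveSupport()
  .getOrCreate()
{code}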



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29229) Change the additional remote repository in IsolatedClientLoader to google minor

2019-09-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29229:
--
Fix Version/s: 2.4.5

> Change the additional remote repository in IsolatedClientLoader to google 
> minor
> ---
>
> Key: SPARK-29229
> URL: https://issues.apache.org/jira/browse/SPARK-29229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> The repository currently used is 
> "[http://www.datanucleus.org/downloads/maven2]", which is no longer 
> maintained. This will sometimes cause download failures and make Hive test 
> cases flaky.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29229) Change the additional remote repository in IsolatedClientLoader to google minor

2019-09-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29229:
--
Issue Type: Bug  (was: Improvement)

> Change the additional remote repository in IsolatedClientLoader to google 
> minor
> ---
>
> Key: SPARK-29229
> URL: https://issues.apache.org/jira/browse/SPARK-29229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> The repository currently used is 
> "[http://www.datanucleus.org/downloads/maven2]", which is no longer 
> maintained. This will sometimes cause download failures and make Hive test 
> cases flaky.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29229) Change the additional remote repository in IsolatedClientLoader to google minor

2019-09-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29229.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25915
[https://github.com/apache/spark/pull/25915]

> Change the additional remote repository in IsolatedClientLoader to google 
> minor
> ---
>
> Key: SPARK-29229
> URL: https://issues.apache.org/jira/browse/SPARK-29229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> The repository currently used is 
> "[http://www.datanucleus.org/downloads/maven2]", which is no longer 
> maintained. This will sometimes cause download failures and make Hive test 
> cases flaky.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29229) Change the additional remote repository in IsolatedClientLoader to google minor

2019-09-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29229:
---

Assignee: Yuanjian Li

> Change the additional remote repository in IsolatedClientLoader to google 
> minor
> ---
>
> Key: SPARK-29229
> URL: https://issues.apache.org/jira/browse/SPARK-29229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> The repository currently used is 
> "[http://www.datanucleus.org/downloads/maven2]", which is no longer 
> maintained. This will sometimes cause download failures and make Hive test 
> cases flaky.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29228) Adding the ability to add headers to the KafkaRecords published to Kafka from spark structured streaming

2019-09-24 Thread Randal Boyle (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randal Boyle resolved SPARK-29228.
--
Resolution: Duplicate

> Adding the ability to add headers to the KafkaRecords published to Kafka from 
> spark structured streaming
> 
>
> Key: SPARK-29228
> URL: https://issues.apache.org/jira/browse/SPARK-29228
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Randal Boyle
>Priority: Major
>
> Currently there is no way to add headers to the records that will be 
> published to Kafka through spark structured streaming. 
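
For reference, a hedged sketch of what writing record headers could look like once
supported; the `headers` column name and its array-of-struct shape are assumptions tied to
the duplicate ticket, and `df` is assumed to be a DataFrame that already has `key` and
`value` columns.

{code:scala}
// Sketch only: the "headers" column name and shape are assumptions.
import org.apache.spark.sql.functions._

df.select(
    col("key").cast("string"),
    col("value").cast("string"),
    array(struct(
      lit("traceId").as("key"),
      lit("abc123").cast("binary").as("value"))).as("headers"))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "events")
  .save()
{code}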



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29231) Can not infer an additional set of constraints if contains CAST

2019-09-24 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-29231:
---

 Summary: Can not infer an additional set of constraints if 
contains CAST
 Key: SPARK-29231
 URL: https://issues.apache.org/jira/browse/SPARK-29231
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


How to reproduce:
{code:scala}
scala> spark.sql("create table t1(c11 int, c12 decimal) ")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create table t2(c21 bigint, c22 decimal) ")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select t1.*, t2.* from t1 left join t2 on t1.c11=t2.c21 where 
t1.c11=1").explain
== Physical Plan ==
SortMergeJoin [cast(c11#0 as bigint)], [c21#2L], LeftOuter
:- *(2) Sort [cast(c11#0 as bigint) ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(cast(c11#0 as bigint), 200), true, [id=#30]
: +- *(1) Filter (isnotnull(c11#0) AND (c11#0 = 1))
:+- Scan hive default.t1 [c11#0, c12#1], HiveTableRelation 
`default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c11#0, 
c12#1], Statistics(sizeInBytes=8.0 EiB)
+- *(4) Sort [c21#2L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(c21#2L, 200), true, [id=#37]
  +- *(3) Filter isnotnull(c21#2L)
 +- Scan hive default.t2 [c21#2L, c22#3], HiveTableRelation 
`default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c21#2L, 
c22#3], Statistics(sizeInBytes=8.0 EiB)

{code}

PostgreSQL supports this feature:
{code:sql}
postgres=# create table t1(c11 int4, c12 decimal);
CREATE TABLE
postgres=# create table t2(c21 int8, c22 decimal);
CREATE TABLE
postgres=# explain select t1.*, t2.* from t1 left join t2 on t1.c11=t2.c21 
where t1.c11=1;
   QUERY PLAN

 Nested Loop Left Join  (cost=0.00..51.43 rows=36 width=76)
   Join Filter: (t1.c11 = t2.c21)
   ->  Seq Scan on t1  (cost=0.00..25.88 rows=6 width=36)
 Filter: (c11 = 1)
   ->  Materialize  (cost=0.00..25.03 rows=6 width=40)
 ->  Seq Scan on t2  (cost=0.00..25.00 rows=6 width=40)
   Filter: (c21 = 1)
(7 rows)
{code}
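
A hedged sketch of the plan one would hope for once the constraint is inferred through the
cast, mirroring the PostgreSQL behaviour above; the operator fragment in the comments is
illustrative, not output from a patched build.

{code:scala}
// Illustrative only: what the t2 side could look like if `t1.c11 = 1` were
// propagated through the join condition `cast(c11 as bigint) = c21`.
spark.sql(
  "select t1.*, t2.* from t1 left join t2 on t1.c11 = t2.c21 where t1.c11 = 1"
).explain()
// hoped-for fragment on the right side of the join:
//   +- *(3) Filter (isnotnull(c21#2L) AND (c21#2L = 1))
//      +- Scan hive default.t2 [c21#2L, c22#3], ...
{code}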




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29231) Can not infer an additional set of constraints if contains CAST

2019-09-24 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936978#comment-16936978
 ] 

Yuming Wang commented on SPARK-29231:
-

I'm working on it.

> Can not infer an additional set of constraints if contains CAST
> ---
>
> Key: SPARK-29231
> URL: https://issues.apache.org/jira/browse/SPARK-29231
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:scala}
> scala> spark.sql("create table t1(c11 int, c12 decimal) ")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("create table t2(c21 bigint, c22 decimal) ")
> res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("select t1.*, t2.* from t1 left join t2 on t1.c11=t2.c21 
> where t1.c11=1").explain
> == Physical Plan ==
> SortMergeJoin [cast(c11#0 as bigint)], [c21#2L], LeftOuter
> :- *(2) Sort [cast(c11#0 as bigint) ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(cast(c11#0 as bigint), 200), true, [id=#30]
> : +- *(1) Filter (isnotnull(c11#0) AND (c11#0 = 1))
> :+- Scan hive default.t1 [c11#0, c12#1], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c11#0, 
> c12#1], Statistics(sizeInBytes=8.0 EiB)
> +- *(4) Sort [c21#2L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(c21#2L, 200), true, [id=#37]
>   +- *(3) Filter isnotnull(c21#2L)
>  +- Scan hive default.t2 [c21#2L, c22#3], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c21#2L, 
> c22#3], Statistics(sizeInBytes=8.0 EiB)
> {code}
> PostgreSQL supports this feature:
> {code:sql}
> postgres=# create table t1(c11 int4, c12 decimal);
> CREATE TABLE
> postgres=# create table t2(c21 int8, c22 decimal);
> CREATE TABLE
> postgres=# explain select t1.*, t2.* from t1 left join t2 on t1.c11=t2.c21 
> where t1.c11=1;
>QUERY PLAN
> 
>  Nested Loop Left Join  (cost=0.00..51.43 rows=36 width=76)
>Join Filter: (t1.c11 = t2.c21)
>->  Seq Scan on t1  (cost=0.00..25.88 rows=6 width=36)
>  Filter: (c11 = 1)
>->  Materialize  (cost=0.00..25.03 rows=6 width=40)
>  ->  Seq Scan on t2  (cost=0.00..25.00 rows=6 width=40)
>Filter: (c21 = 1)
> (7 rows)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29230) Fix NullPointerException in the test class ProcfsMetricsGetterSuite

2019-09-24 Thread Jiaqi Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaqi Li updated SPARK-29230:
-
Description: 
When I run `ProcfsMetricsGetterSuite`, it always throws a 
`java.lang.NullPointerException`. 

I think the problem is where `new ProcfsMetricsGetter` is instantiated, which 
means `SparkEnv` has not been initialized in time. 

This leads to the `java.lang.NullPointerException` when the method is executed.

  was:When I use `ProcfsMetricsGetterSuite for` testing, always throw out 
`java.lang.NullPointerException`. I think there is a problem with locating `new 
ProcfsMetricsGetter`, which will lead to `SparkEnv` not being initialized in 
time. This leads to `java.lang.NullPointerException` when the method is 
executed.


> Fix NullPointerException in the test class ProcfsMetricsGetterSuite
> ---
>
> Key: SPARK-29230
> URL: https://issues.apache.org/jira/browse/SPARK-29230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jiaqi Li
>Priority: Minor
>
> When I run `ProcfsMetricsGetterSuite`, it always throws a 
> `java.lang.NullPointerException`. 
> I think the problem is where `new ProcfsMetricsGetter` is instantiated, which 
> means `SparkEnv` has not been initialized in time. 
> This leads to the `java.lang.NullPointerException` when the method is executed.
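
A generic, self-contained sketch of the initialization-order problem described above (plain
Scala, not the actual suite): eagerly dereferencing a value that is only set later fails
with an NPE, while deferring the access does not.

{code:scala}
// Stand-alone illustration; `Env.sparkEnv` stands in for SparkEnv.get.
object Env { var sparkEnv: String = _ }

class Eager    { val setting: String = Env.sparkEnv.trim }      // NPE if built before Env is set
class Deferred { lazy val setting: String = Env.sparkEnv.trim } // safe if used after Env is set

object Demo extends App {
  val d = new Deferred          // nothing dereferenced yet
  Env.sparkEnv = " local "      // the "SparkEnv" only becomes available later
  println(d.setting)            // prints "local"
  // `new Eager` before the assignment above would have thrown java.lang.NullPointerException
}
{code}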



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29230) Fix NullPointerException in the test class ProcfsMetricsGetterSuite

2019-09-24 Thread Jiaqi Li (Jira)
Jiaqi Li created SPARK-29230:


 Summary: Fix NullPointerException in the test class 
ProcfsMetricsGetterSuite
 Key: SPARK-29230
 URL: https://issues.apache.org/jira/browse/SPARK-29230
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jiaqi Li


When I run `ProcfsMetricsGetterSuite`, it always throws a 
`java.lang.NullPointerException`. I think the problem is where `new 
ProcfsMetricsGetter` is instantiated, which means `SparkEnv` has not been 
initialized in time. This leads to the `java.lang.NullPointerException` when the 
method is executed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29095) add extractInstances

2019-09-24 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-29095.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25802
[https://github.com/apache/spark/pull/25802]

> add extractInstances
> 
>
> Key: SPARK-29095
> URL: https://issues.apache.org/jira/browse/SPARK-29095
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> There was a method extractLabeledPoints for ML algorithms to transform a dataset 
> into an RDD of LabeledPoints.
> Now more and more algorithms support sample weighting and extractLabeledPoints is 
> used less, so we should support extracting the weight in the common methods.
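
A rough sketch of the shape such a shared helper could take; the name and signature below
are assumptions rather than the merged API, and `Instance` refers to Spark ML's internal
(label, weight, features) case class, so this only compiles inside the ML module.

{code:scala}
// Sketch only: pairs each row's label and features with its sample weight,
// defaulting the weight to 1.0 when no weight column is set.
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{col, lit}

def extractInstances(
    dataset: Dataset[_],
    labelCol: String,
    weightCol: Option[String],
    featuresCol: String): RDD[Instance] = {
  val w = weightCol.map(col).getOrElse(lit(1.0))
  dataset
    .select(col(labelCol).cast("double"), w.cast("double"), col(featuresCol))
    .rdd
    .map { case Row(label: Double, weight: Double, features: Vector) =>
      Instance(label, weight, features)
    }
}
{code}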



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29095) add extractInstances

2019-09-24 Thread Sean Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-29095:
-

Assignee: zhengruifeng

> add extractInstances
> 
>
> Key: SPARK-29095
> URL: https://issues.apache.org/jira/browse/SPARK-29095
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> There was a method extractLabeledPoints for ML algorithms to transform a dataset 
> into an RDD of LabeledPoints.
> Now more and more algorithms support sample weighting and extractLabeledPoints is 
> used less, so we should support extracting the weight in the common methods.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-09-24 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29225:
--
Description: 
Current `DESC FORMATTED TABLE` shows a different table desc format; this problem 
causes HUE to fail to parse column information correctly:

*spark*

!current saprk.jpg!  

*HIVE*

!Screen Shot 2019-09-24 at 9.26.14 PM.png!

*Spark SQL* *expected*

  !Screen Shot 2019-09-24 at 9.14.39 PM.png!

  was:
Current `DESC FORMATTED TABLE` show different table desc format, this problem 
cause HUE can't parser column information corretltly :

**spark**

!current saprk.jpg!  

**HIVE **

!Screen Shot 2019-09-24 at 9.26.14 PM.png!

**expected** 

  !Screen Shot 2019-09-24 at 9.14.39 PM.png!


> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, Screen Shot 
> 2019-09-24 at 9.26.14 PM.png, current saprk.jpg
>
>
> Current `DESC FORMATTED TABLE` shows a different table desc format; this problem 
> causes HUE to fail to parse column information correctly:
> *spark*
> !current saprk.jpg!  
> *HIVE*
> !Screen Shot 2019-09-24 at 9.26.14 PM.png!
> *Spark SQL* *expected*
>   !Screen Shot 2019-09-24 at 9.14.39 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-09-24 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29225:
--
Description: 
Current `DESC FORMATTED TABLE` show different table desc format, this problem 
cause HUE can't parser column information corretltly :

**spark**

!current saprk.jpg!  

**HIVE **

!Screen Shot 2019-09-24 at 9.26.14 PM.png!

**expected** 

  !Screen Shot 2019-09-24 at 9.14.39 PM.png!

  was:
Current `DESC FORMATTED TABLE` show different table desc format, this problem 
cause HUE can't parser column information corretltly :

**spark**

!current saprk.jpg!  

**expected** 

  !Screen Shot 2019-09-24 at 9.14.39 PM.png!


> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, Screen Shot 
> 2019-09-24 at 9.26.14 PM.png, current saprk.jpg
>
>
> Current `DESC FORMATTED TABLE` show different table desc format, this problem 
> cause HUE can't parser column information corretltly :
> **spark**
> !current saprk.jpg!  
> **HIVE **
> !Screen Shot 2019-09-24 at 9.26.14 PM.png!
> **expected** 
>   !Screen Shot 2019-09-24 at 9.14.39 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-09-24 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29225:
--
Attachment: Screen Shot 2019-09-24 at 9.26.14 PM.png

> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, Screen Shot 
> 2019-09-24 at 9.26.14 PM.png, current saprk.jpg
>
>
> Current `DESC FORMATTED TABLE` show different table desc format, this problem 
> cause HUE can't parser column information corretltly :
> **spark**
> !current saprk.jpg!  
> **expected** 
>   !Screen Shot 2019-09-24 at 9.14.39 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-09-24 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936781#comment-16936781
 ] 

angerszhu commented on SPARK-29225:
---

[~hyukjin.kwon]

Sorry, I forgot to add the description.

There are 4 main differences:
 # Spark SQL `desc formatted` doesn't print the header for the non-partition columns
 # Spark SQL `desc formatted` doesn't print an empty line between the header and the columns
 # Spark SQL `desc formatted` doesn't print an empty line between the normal columns and 
`# Partition Information`
 # Spark SQL `desc formatted` shows the partition columns twice

This causes HUE to be unable to show table columns correctly when it connects to Spark 
Thrift Server, since HUE fetches column info by running `DESC FORMATTED table_name` and 
then parses the column information according to Hive's format; see the sketch below.
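
A rough sketch of the Hive-style layout HUE expects; the spacing and sample columns in the
comments are illustrative, not authoritative output.

{code:scala}
spark.sql("DESC FORMATTED db.tbl").show(truncate = false)
// Hive-style layout HUE parses (illustrative):
// # col_name            data_type            comment
//
// id                    int
// name                  string
//
// # Partition Information
// # col_name            data_type            comment
//
// dt                    string
{code}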

> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, current 
> saprk.jpg
>
>
> Current `DESC FORMATTED TABLE` shows a different table desc format; this problem 
> causes HUE to fail to parse column information correctly:
> **spark**
> !current saprk.jpg!  
> **expected** 
>   !Screen Shot 2019-09-24 at 9.14.39 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28990) SparkSQL invalid call to toAttribute on unresolved object, tree: *

2019-09-24 Thread fengchaoge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936778#comment-16936778
 ] 

fengchaoge commented on SPARK-28990:


@[~726575...@qq.com] Hello daile, can you send a link? Thanks.

> SparkSQL invalid call to toAttribute on unresolved object, tree: *
> --
>
> Key: SPARK-28990
> URL: https://issues.apache.org/jira/browse/SPARK-28990
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: fengchaoge
>Priority: Major
>
> SparkSQL create table as select from a table which may not exist throws 
> exceptions like:
> {code}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree:
> {code}
> This is not friendly; a Spark user may have no idea what's wrong.
> A simple SQL statement can reproduce it, like this:
> {code}
> spark-sql (default)> create table default.spark as select * from default.dual;
> {code}
> {code}
> 2019-09-05 16:27:24,127 INFO (main) [Logging.scala:logInfo(54)] - Parsing 
> command: create table default.spark as select * from default.dual
> 2019-09-05 16:27:24,772 ERROR (main) [Logging.scala:logError(91)] - Failed in 
> [create table default.spark as select * from default.dual]
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> toAttribute on unresolved object, tree: *
> at 
> org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:245)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.immutable.List.map(List.scala:296)
> at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:52)
> at 
> org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:160)
> at 
> org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:148)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
> at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperators(AnalysisHelper.scala:73)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
> at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:148)
> at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:147)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84)
> at 
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
> at 
> scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
> at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76)
> at scala.collection.immutable.List.foreach(List.scala:392)
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127)
> at 

[jira] [Resolved] (SPARK-29156) Hive has appending data as part of cdc, In write mode we should be able to write only changes captured to teradata or datasource.

2019-09-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29156.
--
Resolution: Invalid

no feedback

> Hive has appending data as part of cdc, In write mode we should be able to 
> write only changes captured to teradata or datasource.
> -
>
> Key: SPARK-29156
> URL: https://issues.apache.org/jira/browse/SPARK-29156
> Project: Spark
>  Issue Type: New Feature
>  Components: Tests
>Affects Versions: 2.4.3
> Environment: spark 2.3.2
> dataiku
> aws emr
>Reporter: raju
>Priority: Major
>  Labels: patch
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> In general, change data captures are appended to Hive tables. We have a scenario 
> where we connect to Teradata or another data source. Only the changes captured as 
> updates should be written to the data source. We are unable to do the same with 
> the overwrite and append modes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-09-24 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29225:
--
Description: 
Current `DESC FORMATTED TABLE` show different table desc format, this problem 
cause HUE can't parser column information corretltly :

**spark**

!current saprk.jpg!  

**expected** 

 

  was:
Current `DESC FORMATTED TABLE` show different table desc format, this problem 
cause HUE can't parser column information corretltly :

 

### Spark Current 

 

 


> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, current 
> saprk.jpg
>
>
> Current `DESC FORMATTED TABLE` show different table desc format, this problem 
> cause HUE can't parser column information corretltly :
> **spark**
> !current saprk.jpg!  
> **expected** 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-09-24 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29225:
--
Description: 
Current `DESC FORMATTED TABLE` show different table desc format, this problem 
cause HUE can't parser column information corretltly :

**spark**

!current saprk.jpg!  

**expected** 

  !Screen Shot 2019-09-24 at 9.14.39 PM.png!

  was:
Current `DESC FORMATTED TABLE` show different table desc format, this problem 
cause HUE can't parser column information corretltly :

**spark**

!current saprk.jpg!  

**expected** 

 


> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, current 
> saprk.jpg
>
>
> Current `DESC FORMATTED TABLE` show different table desc format, this problem 
> cause HUE can't parser column information corretltly :
> **spark**
> !current saprk.jpg!  
> **expected** 
>   !Screen Shot 2019-09-24 at 9.14.39 PM.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-09-24 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29225:
--
Attachment: Screen Shot 2019-09-24 at 9.14.39 PM.png

> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, current 
> saprk.jpg
>
>
> Current `DESC FORMATTED TABLE` show different table desc format, this problem 
> cause HUE can't parser column information corretltly :
>  
> ### Spark Current 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29174) LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source

2019-09-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29174:
-
Description: 
*using does not work for insert overwrite when in local  but works when insert 
overwrite in HDFS directory*

 ** 

 

{code}

0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite directory 
'/user/trash2/' using parquet select * from trash1 a where a.country='PAK';

+-+--+
| Result  |
+-+--+
+-+--+

No rows selected (0.448 seconds)

0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory 
'/opt/trash2/' using parquet select * from trash1 a where a.country='PAK';

Error: org.apache.spark.sql.catalyst.parser.ParseException:

LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source(line 1, pos 
0)

 

== SQL ==

insert overwrite local directory '/opt/trash2/' using parquet select * from 
trash1 a where a.country='PAK'

^^^ (state=,code=0)

0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory 
'/opt/trash2/' stored as parquet select * from trash1 a where a.country='PAK';

+-+--+
| Result  |
+-+--+
| | |
{code}

 

  was:
*using does not work for insert overwrite when in local  but works when insert 
overwrite in HDFS directory*

 ** 

 

0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite directory 
'/user/trash2/' using parquet select * from trash1 a where a.country='PAK';

+-+--+

| Result  |

+-+--+

+-+--+

No rows selected (0.448 seconds)

0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory 
'/opt/trash2/' using parquet select * from trash1 a where a.country='PAK';

Error: org.apache.spark.sql.catalyst.parser.ParseException:

LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source(line 1, pos 
0)

 

== SQL ==

insert overwrite local directory '/opt/trash2/' using parquet select * from 
trash1 a where a.country='PAK'

^^^ (state=,code=0)

0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory 
'/opt/trash2/' stored as parquet select * from trash1 a where a.country='PAK';

+-+--+

| Result  |

+-+--+
| | |

 


> LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source
> ---
>
> Key: SPARK-29174
> URL: https://issues.apache.org/jira/browse/SPARK-29174
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> *using does not work for insert overwrite when in local  but works when 
> insert overwrite in HDFS directory*
>  ** 
>  
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite directory 
> '/user/trash2/' using parquet select * from trash1 a where a.country='PAK';
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.448 seconds)
> 0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory 
> '/opt/trash2/' using parquet select * from trash1 a where a.country='PAK';
> Error: org.apache.spark.sql.catalyst.parser.ParseException:
> LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source(line 1, 
> pos 0)
>  
> == SQL ==
> insert overwrite local directory '/opt/trash2/' using parquet select * from 
> trash1 a where a.country='PAK'
> ^^^ (state=,code=0)
> 0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory 
> '/opt/trash2/' stored as parquet select * from trash1 a where a.country='PAK';
> +-+--+
> | Result  |
> +-+--+
> | | |
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29174) LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source

2019-09-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29174:
-
Description: 
*`using` does not work for INSERT OVERWRITE LOCAL DIRECTORY but works for INSERT 
OVERWRITE into an HDFS directory*

{code}

0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite directory 
'/user/trash2/' using parquet select * from trash1 a where a.country='PAK';

+-+--+
| Result  |
+-+--+
+-+--+

No rows selected (0.448 seconds)

0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory 
'/opt/trash2/' using parquet select * from trash1 a where a.country='PAK';

Error: org.apache.spark.sql.catalyst.parser.ParseException:

LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source(line 1, pos 
0)

 

== SQL ==

insert overwrite local directory '/opt/trash2/' using parquet select * from 
trash1 a where a.country='PAK'

^^^ (state=,code=0)

0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory 
'/opt/trash2/' stored as parquet select * from trash1 a where a.country='PAK';

+-+--+
| Result  |
+-+--+
| | |
{code}

 

  was:
*using does not work for insert overwrite when in local  but works when insert 
overwrite in HDFS directory*

 ** 

 

{code}

0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite directory 
'/user/trash2/' using parquet select * from trash1 a where a.country='PAK';

+-+--+
| Result  |
+-+--+
+-+--+

No rows selected (0.448 seconds)

0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory 
'/opt/trash2/' using parquet select * from trash1 a where a.country='PAK';

Error: org.apache.spark.sql.catalyst.parser.ParseException:

LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source(line 1, pos 
0)

 

== SQL ==

insert overwrite local directory '/opt/trash2/' using parquet select * from 
trash1 a where a.country='PAK'

^^^ (state=,code=0)

0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory 
'/opt/trash2/' stored as parquet select * from trash1 a where a.country='PAK';

+-+--+
| Result  |
+-+--+
| | |
{code}

 


> LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source
> ---
>
> Key: SPARK-29174
> URL: https://issues.apache.org/jira/browse/SPARK-29174
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> *`using` does not work for INSERT OVERWRITE LOCAL DIRECTORY but works for INSERT 
> OVERWRITE into an HDFS directory*
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite directory 
> '/user/trash2/' using parquet select * from trash1 a where a.country='PAK';
> +-+--+
> | Result  |
> +-+--+
> +-+--+
> No rows selected (0.448 seconds)
> 0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory 
> '/opt/trash2/' using parquet select * from trash1 a where a.country='PAK';
> Error: org.apache.spark.sql.catalyst.parser.ParseException:
> LOCAL is not supported in INSERT OVERWRITE DIRECTORY to data source(line 1, 
> pos 0)
>  
> == SQL ==
> insert overwrite local directory '/opt/trash2/' using parquet select * from 
> trash1 a where a.country='PAK'
> ^^^ (state=,code=0)
> 0: jdbc:hive2://10.18.18.214:23040/default> insert overwrite local directory 
> '/opt/trash2/' stored as parquet select * from trash1 a where a.country='PAK';
> +-+--+
> | Result  |
> +-+--+
> | | |
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29178) Improve shown message on sql show()

2019-09-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29178.
--
Resolution: Won't Fix

> Improve shown message on sql show()
> ---
>
> Key: SPARK-29178
> URL: https://issues.apache.org/jira/browse/SPARK-29178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
>
> Add showing the total number of rows for a DataFrame table in 
> Dataset.scala (sql/core/src/main/scala/org/apache/spark/sql/).
>  
> [ Before ]
> {code:java}
> |2019/8/18|  sunny|   27.2|
> |2019/8/19|  sunny|   28.6|
> |2019/8/20|  rainy|   27.6|
> +-+---+---+
> only showing top 20 rows
> {code}
>  
> [ After ]
> {code:java}
> |2019/8/18|  sunny|   27.2|
> |2019/8/19|  sunny|   28.6|
> |2019/8/20|  rainy|   27.6|
> +-+---+---+
> only showing top 20 / 31 rows
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-09-24 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29225:
--
Attachment: current saprk.jpg

> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: current saprk.jpg
>
>
> Current `DESC FORMATTED TABLE` show different table desc format, this problem 
> cause HUE can't parser column information corretltly 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-09-24 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29225:
--
Description: 
Current `DESC FORMATTED TABLE` show different table desc format, this problem 
cause HUE can't parser column information corretltly :

 

### Spark Current 

 

 

  was:
Current `DESC FORMATTED TABLE` show different table desc format, this problem 
cause HUE can't parser column information corretltly 

 

 


> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
> Attachments: current saprk.jpg
>
>
> Current `DESC FORMATTED TABLE` show different table desc format, this problem 
> cause HUE can't parser column information corretltly :
>  
> ### Spark Current 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29178) Improve shown message on sql show()

2019-09-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936772#comment-16936772
 ] 

Hyukjin Kwon commented on SPARK-29178:
--

In order to do that, we would have to count all the records, whereas the current 
implementation only needs to read the first 20 rows. I don't think we should do it.
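
To make the cost concrete, a minimal sketch of what a caller can already do today; the
count is a separate full pass over the data, which is exactly the extra work the proposal
would push into every show().

{code:scala}
val df = spark.range(0, 31).toDF("id")
val total = df.count()   // full scan of the dataset
df.show(20)              // only needs to materialize ~20 rows
println(s"only showing top 20 / $total rows")
{code}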

> Improve shown message on sql show()
> ---
>
> Key: SPARK-29178
> URL: https://issues.apache.org/jira/browse/SPARK-29178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Tomoko Komiyama
>Priority: Minor
>
> Add showing the total number of rows for a DataFrame table in 
> Dataset.scala (sql/core/src/main/scala/org/apache/spark/sql/).
>  
> [ Before ]
> {code:java}
> |2019/8/18|  sunny|   27.2|
> |2019/8/19|  sunny|   28.6|
> |2019/8/20|  rainy|   27.6|
> +-+---+---+
> only showing top 20 rows
> {code}
>  
> [ After ]
> {code:java}
> |2019/8/18|  sunny|   27.2|
> |2019/8/19|  sunny|   28.6|
> |2019/8/20|  rainy|   27.6|
> +-+---+---+
> only showing top 20 / 31 rows
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29184) I have installed pyspark and created code in notebook but when I am running it.its throwing error-Py4JJavaError: An error occurred while calling z:org.apache.spark.api.

2019-09-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29184.
--
Resolution: Invalid

Please ask questions on the mailing list or Stack Overflow.

> I have installed pyspark and created code in notebook but when I am running 
> it.its throwing error-Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> -
>
> Key: SPARK-29184
> URL: https://issues.apache.org/jira/browse/SPARK-29184
> Project: Spark
>  Issue Type: IT Help
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Asha Tambulker
>Priority: Major
>  Labels: newbie
>
> I have installed Scala, Spark and Python3 on Ubuntu OS. I am running below 
> code in nootbook.
> Could you please help me to resolve this issue?
> from pyspark import SparkConf, SparkContext
> import sys
> inputs = sys.argv[1]
> output = sys.argv[2]
> def words_once(line):
>  #for w in line.split():
>  words = line.split()
>  words[3] = int(words[3])
>  yield tuple.(words)
> def word_filter(wiki_page):
>  if wiki_page[1]== "en" and wiki_page[2] == "Main Page" and not 
> wiki_page[2].startWith("Special:"):
>  return wiki_page
> def max_val(x, y):
>  return max(x,y)
> def get_key(kv):
>  return kv[0]
> def map_key_value(wiki_page)
>  return (wiki_page[0],wiki_page[3])
> def output_format(kv):
>  k, v = kv
>  return '%s %i' % (k, v)
> text = sc.textFile(inputs)
> words = text.flatMap(words_once)
> word_filter = words.filter(word_filter)
> key_pair = word_filter.map(map_key_value)
> wordcount = key_pair.reduceByKey(mak_val)
> outdata = wordcount.sortBy(get_key).map(output_format)
> outdata.saveAsTextFile(output)
>  
> But I am getting error in line - outdata = 
> wordcount.sortBy(get_key).map(output_format)
>  
> Error Is -
>  
> --
> Py4JJavaError Traceback (most recent call last)
>  in 
> > 1 outdata = wordcount.sortBy(get_key).map(output_format)
> ~/Downloads/spark-2.4.4-bin-hadoop2.7/python/pyspark/rdd.py in sortBy(self, 
> keyfunc, ascending, numPartitions)
>  697 [('a', 1), ('b', 2), ('1', 3), ('d', 4), ('2', 5)]
>  698 """
> --> 699 return self.keyBy(keyfunc).sortByKey(ascending, 
> numPartitions).values()
>  700 
>  701 def glom(self):
> ~/Downloads/spark-2.4.4-bin-hadoop2.7/python/pyspark/rdd.py in 
> sortByKey(self, ascending, numPartitions, keyfunc)
>  665 # the key-space into bins such that the bins have roughly the same
>  666 # number of (key, value) pairs falling into them
> --> 667 rddSize = self.count()
>  668 if not rddSize:
>  669 return self # empty RDD
> ~/Downloads/spark-2.4.4-bin-hadoop2.7/python/pyspark/rdd.py in count(self)
>  1053 3
>  1054 """
> -> 1055 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>  1056 
>  1057 def stats(self):
> ~/Downloads/spark-2.4.4-bin-hadoop2.7/python/pyspark/rdd.py in sum(self)
>  1044 6.0
>  1045 """
> -> 1046 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
>  1047 
>  1048 def count(self):
> ~/Downloads/spark-2.4.4-bin-hadoop2.7/python/pyspark/rdd.py in fold(self, 
> zeroValue, op)
>  915 # zeroValue provided to each partition is unique from the one provided
>  916 # to the final reduce call
> --> 917 vals = self.mapPartitions(func).collect()
>  918 return reduce(op, vals, zeroValue)
>  919
> ~/Downloads/spark-2.4.4-bin-hadoop2.7/python/pyspark/rdd.py in collect(self)
>  814 """
>  815 with SCCallSiteSync(self.context) as css:
> --> 816 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
>  817 return list(_load_from_socket(sock_info, self._jrdd_deserializer))
>  818
> ~/Downloads/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)
>  1255 answer = self.gateway_client.send_command(command)
>  1256 return_value = get_return_value(
> -> 1257 answer, self.gateway_client, self.target_id, self.name)
>  1258 
>  1259 for temp_arg in temp_args:
> ~/Downloads/spark-2.4.4-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, 
> **kw)
>  61 def deco(*a, **kw):
>  62 try:
> ---> 63 return f(*a, **kw)
>  64 except py4j.protocol.Py4JJavaError as e:
>  65 s = e.java_exception.toString()
> ~/Downloads/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)
>  326 raise Py4JJavaError(
>  327 "An error occurred while calling \{0}{1}\{2}.\n".
> --> 328 format(target_id, ".", name), value)
>  329 else:
>  330 raise Py4JError(
> Py4JJavaError: An error occurred while calling 
> 

[jira] [Updated] (SPARK-29185) Add new SaveMode types for Spark SQL jdbc datasource

2019-09-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29185:
-
Component/s: (was: Input/Output)
 SQL

> Add new SaveMode types for Spark SQL jdbc datasource
> 
>
> Key: SPARK-29185
> URL: https://issues.apache.org/jira/browse/SPARK-29185
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Timothy Zhang
>Priority: Major
>
>  It is necessary to add new SaveMode for Delete, Update, and Upsert, such as:
>  * SaveMode.Delete
>  * SaveMode.Update
>  * SaveMode.Upsert
> So that Spark SQL could support legacy RDBMSes much better, e.g. Oracle, DB2, 
> MySQL, etc. Actually, the code implementation of the current SaveMode.Append type 
> is very flexible. All types could share the same savePartition function and only 
> add new getStatement functions for Delete, Update, and Upsert with the SQL 
> statements DELETE FROM, UPDATE, and MERGE INTO respectively. We have an initial 
> implementation for them:
> {code:scala}
> def getDeleteStatement(table: String, rddSchema: StructType, dialect: 
> JdbcDialect): String = {
> val columns = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name) + 
> "=?").mkString(" AND ")
> s"DELETE FROM ${table.toUpperCase} WHERE $columns"
>   }
>   def getUpdateStatement(table: String, rddSchema: StructType, priKeys: 
> Seq[String], dialect: JdbcDialect): String = {
> val fullCols = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name))
> val priCols = priKeys.map(dialect.quoteIdentifier(_))
> val columns = (fullCols diff priCols).map(_ + "=?").mkString(",")
> val cnditns = priCols.map(_ + "=?").mkString(" AND ")
> s"UPDATE ${table.toUpperCase} SET $columns WHERE $cnditns"
>   }
>   def getMergeStatement(table: String, rddSchema: StructType, priKeys: 
> Seq[String], dialect: JdbcDialect): String = {
> val fullCols = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name))
> val priCols = priKeys.map(dialect.quoteIdentifier(_))
> val nrmCols = fullCols diff priCols
> val fullPart = fullCols.map(c => 
> s"${dialect.quoteIdentifier("SRC")}.$c").mkString(",")
> val priPart = priCols.map(c => 
> s"${dialect.quoteIdentifier("TGT")}.$c=${dialect.quoteIdentifier("SRC")}.$c").mkString("
>  AND ")
> val nrmPart = nrmCols.map(c => 
> s"$c=${dialect.quoteIdentifier("SRC")}.$c").mkString(",")
> val columns = fullCols.mkString(",")
> val placeholders = fullCols.map(_ => "?").mkString(",")
> s"MERGE INTO ${table.toUpperCase} AS ${dialect.quoteIdentifier("TGT")} " +
>   s"USING TABLE(VALUES($placeholders)) " +
>   s"AS ${dialect.quoteIdentifier("SRC")}($columns) " +
>   s"ON $priPart " +
>   s"WHEN NOT MATCHED THEN INSERT ($columns) VALUES ($fullPart) " +
>   s"WHEN MATCHED THEN UPDATE SET $nrmPart"
>   }
> {code}
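
A quick, hedged check of the proposal above, assuming the helper functions and a
JdbcDialect instance named `dialect` are in scope; it only prints the SQL that would be
generated and touches no database.

{code:scala}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("ID", IntegerType),
  StructField("NAME", StringType)))

println(getDeleteStatement("people", schema, dialect))
// DELETE FROM PEOPLE WHERE "ID"=? AND "NAME"=?
println(getMergeStatement("people", schema, Seq("ID"), dialect))
// MERGE INTO PEOPLE AS "TGT" USING TABLE(VALUES(?,?)) AS "SRC"("ID","NAME") ON ...
{code}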



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29186) SubqueryAlias name value is null in Spark 2.4.3 Logical plan.

2019-09-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936770#comment-16936770
 ] 

Hyukjin Kwon commented on SPARK-29186:
--

Can you reformat the description with \{code\} ... \{code\} with correct 
indentation? Looks impossible to read.

> SubqueryAlias name value is null in Spark 2.4.3 Logical plan.
> -
>
> Key: SPARK-29186
> URL: https://issues.apache.org/jira/browse/SPARK-29186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: I have tried this on AWS Glue with Spark 2.4.3
> and on windows 10 with 2.4.4
> at both of them facing same issue
>Reporter: Tarun Khaneja
>Priority: Blocker
> Fix For: 2.2.1
>
>
> I am writing a program to analyze SQL queries, so I am using the Spark logical 
> plan.
> Below is the code which I am using
>      object QueryAnalyzer
> {    val LOG = LoggerFactory.getLogger(this.getClass)     //Spark Conf    val 
> conf = new     
> SparkConf().setMaster("local[2]").setAppName("LocalEdlExecutor")     //Spark 
> Context    val sc = new SparkContext(conf)     //sql Context    val 
> sqlContext = new SQLContext(sc)     //Spark Session    val sparkSession = 
> SparkSession      .builder()      .appName("Spark User Data")      
> .config("spark.app.name", "LocalEdl")      .getOrCreate()     def main(args: 
> Array[String])
> {          var inputDfColumns = Map[String,List[String]]()       val 
> dfSession =  sparkSession.      read.      format("csv").      
> option("header", EdlConstants.TRUE).      option("inferschema", 
> EdlConstants.TRUE).      option("delimiter", ",").      option("decoding", 
> EdlConstants.UTF8).      option("multiline", true)        var oDF = 
> dfSession.      load("C:\\Users\\tarun.khaneja\\data\\order.csv")        
> println("smaple data in oDF>")      oDF.show()           var cusDF = 
> dfSession.        load("C:\\Users\\tarun.khaneja\\data\\customer.csv")        
>   println("smaple data in cusDF>")      cusDF.show()              
> oDF.createOrReplaceTempView("orderTempView")      
> cusDF.createOrReplaceTempView("customerTempView")            //get input 
> columns from all dataframe      inputDfColumns += 
> ("orderTempView"->oDF.columns.toList)      inputDfColumns += 
> ("customerTempView"->cusDF.columns.toList)            val res = 
> sqlContext.sql("""select OID, max(MID+CID) as MID_new,ROW_NUMBER() OVER (     
>                               ORDER BY CID) as rn from                        
>            (select OID_1 as OID, CID_1 as CID, OID_1+CID_1 as MID from        
>                           (select min(ot.OrderID) as OID_1, ct.CustomerID as 
> CID_1                                    from orderTempView as ot inner join 
> customerTempView as ct                                    on ot.CustomerID = 
> ct.CustomerID group by CID_1)) group by OID,CID""")            
> println(res.show(false))                                    val analyzedPlan 
> = res.queryExecution.analyzed      println(analyzedPlan.prettyJson)           
>  }
>  
>  Now problem is, with *Spark 2.2.1*, I am getting below json. where I have 
> SubqueryAlias which provide important information of alias name for table 
> which we used in query, as shown below.
>       ...
> [ { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
>     "num-children" : 0, "name" : "OrderDate", "dataType" : "string",
>     "nullable" : true, "metadata" : { },
>     "exprId" : { "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
>                  "id" : 2, "jvmId" : "acefe6e6-e469-4c9a-8a36-5694f054dc0a" },
>     "isGenerated" : false } ] ] },
> { "class" : "org.apache.spark.sql.catalyst.plans.logical.*SubqueryAlias*",
>   "num-children" : 1, "alias" : "ct", "child" : 0 },
> { "class" : "org.apache.spark.sql.catalyst.plans.logical.*SubqueryAlias*",
>   "num-children" : 1, "alias" : "customertempview", "child" : 0 },
> { "class" : "org.apache.spark.sql.execution.datasources.LogicalRelation",
>   "num-children" : 0, "relation" : null, "output" :
>     ...
>  But with *Spark 2.4.3*, the SubqueryAlias name comes back as null, as shown 
> in the JSON below.
>     ...
> { "class": "org.apache.spark.sql.catalyst.expressions.AttributeReference",
>   "num-children": 0, "name": "CustomerID", "dataType": "integer",
>   "nullable": true, "metadata": {},
>   "exprId": { "product-class": "org.apache.spark.sql.catalyst.expressions.ExprId",
>               "id": 19, "jvmId": "3b0dde0c-0b8f-4c63-a3ed-4dba526f8331" },
>   "qualifier": "[ct]" }] },
> { "class": "org.apache.spark.sql.catalyst.plans.logical.*SubqueryAlias*", 
> 

[jira] [Updated] (SPARK-29186) SubqueryAlias name value is null in Spark 2.4.3 Logical plan.

2019-09-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29186:
-
Target Version/s:   (was: 2.4.3)

> SubqueryAlias name value is null in Spark 2.4.3 Logical plan.
> -
>
> Key: SPARK-29186
> URL: https://issues.apache.org/jira/browse/SPARK-29186
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: I have tried this on AWS Glue with Spark 2.4.3
> and on windows 10 with 2.4.4
> at both of them facing same issue
>Reporter: Tarun Khaneja
>Priority: Blocker
> Fix For: 2.2.1
>
>
> I am writing a program to analyze SQL queries, so I am using the Spark
> logical plan.
> Below is the code I am using:
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.sql.{SQLContext, SparkSession}
> import org.slf4j.LoggerFactory
>
> object QueryAnalyzer {
>   val LOG = LoggerFactory.getLogger(this.getClass)
>   //Spark Conf
>   val conf = new SparkConf().setMaster("local[2]").setAppName("LocalEdlExecutor")
>   //Spark Context
>   val sc = new SparkContext(conf)
>   //sql Context
>   val sqlContext = new SQLContext(sc)
>   //Spark Session
>   val sparkSession = SparkSession
>     .builder()
>     .appName("Spark User Data")
>     .config("spark.app.name", "LocalEdl")
>     .getOrCreate()
>
>   def main(args: Array[String]) {
>     var inputDfColumns = Map[String, List[String]]()
>
>     val dfSession = sparkSession.
>       read.
>       format("csv").
>       option("header", EdlConstants.TRUE).
>       option("inferschema", EdlConstants.TRUE).
>       option("delimiter", ",").
>       option("decoding", EdlConstants.UTF8).
>       option("multiline", true)
>
>     var oDF = dfSession.load("C:\\Users\\tarun.khaneja\\data\\order.csv")
>     println("sample data in oDF>")
>     oDF.show()
>
>     var cusDF = dfSession.load("C:\\Users\\tarun.khaneja\\data\\customer.csv")
>     println("sample data in cusDF>")
>     cusDF.show()
>
>     oDF.createOrReplaceTempView("orderTempView")
>     cusDF.createOrReplaceTempView("customerTempView")
>
>     //get input columns from all dataframes
>     inputDfColumns += ("orderTempView" -> oDF.columns.toList)
>     inputDfColumns += ("customerTempView" -> cusDF.columns.toList)
>
>     val res = sqlContext.sql("""select OID, max(MID+CID) as MID_new,
>         ROW_NUMBER() OVER (ORDER BY CID) as rn
>       from (select OID_1 as OID, CID_1 as CID, OID_1+CID_1 as MID
>             from (select min(ot.OrderID) as OID_1, ct.CustomerID as CID_1
>                   from orderTempView as ot inner join customerTempView as ct
>                   on ot.CustomerID = ct.CustomerID group by CID_1))
>       group by OID, CID""")
>     println(res.show(false))
>
>     val analyzedPlan = res.queryExecution.analyzed
>     println(analyzedPlan.prettyJson)
>   }
> }
>  
>  Now the problem is: with *Spark 2.2.1* I get the JSON below, where 
> SubqueryAlias provides the important alias-name information for the tables 
> used in the query, as shown below.
>       ...
> [ { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
>     "num-children" : 0, "name" : "OrderDate", "dataType" : "string",
>     "nullable" : true, "metadata" : { },
>     "exprId" : { "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
>                  "id" : 2, "jvmId" : "acefe6e6-e469-4c9a-8a36-5694f054dc0a" },
>     "isGenerated" : false } ] ] },
> { "class" : "org.apache.spark.sql.catalyst.plans.logical.*SubqueryAlias*",
>   "num-children" : 1, "alias" : "ct", "child" : 0 },
> { "class" : "org.apache.spark.sql.catalyst.plans.logical.*SubqueryAlias*",
>   "num-children" : 1, "alias" : "customertempview", "child" : 0 },
> { "class" : "org.apache.spark.sql.execution.datasources.LogicalRelation",
>   "num-children" : 0, "relation" : null, "output" :
>     ...
>  But with *Spark 2.4.3*, the SubqueryAlias name comes back as null, as shown 
> in the JSON below.
>     ...
> { "class": "org.apache.spark.sql.catalyst.expressions.AttributeReference",
>   "num-children": 0, "name": "CustomerID", "dataType": "integer",
>   "nullable": true, "metadata": {},
>   "exprId": { "product-class": "org.apache.spark.sql.catalyst.expressions.ExprId",
>               "id": 19, "jvmId": "3b0dde0c-0b8f-4c63-a3ed-4dba526f8331" },
>   "qualifier": "[ct]" }] },
> { "class": "org.apache.spark.sql.catalyst.plans.logical.*SubqueryAlias*",
>   "num-children": 1, "name": null, "child": 0 },
> { "class": "org.apache.spark.sql.catalyst.plans.logical.*SubqueryAlias*", 
> 

[jira] [Commented] (SPARK-29195) Can't config orc.compress.size option for native ORC writer

2019-09-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936767#comment-16936767
 ] 

Hyukjin Kwon commented on SPARK-29195:
--

Can you narrow down if this is a bug in ORC or Spark?
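One way to narrow it down (a rough sketch, not a verified fix): if ORC's WriterImpl picks the orc.* keys up from the Hadoop Configuration, then setting the key there, rather than as a Spark SQL conf, shows which side drops the value. The session setup, value and output path below are only illustrative.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-buffer-size-check")
  .master("local[2]")
  .config("spark.sql.orc.impl", "native")
  .getOrCreate()

// Put the ORC key on the Hadoop configuration that Spark hands to the writer.
spark.sparkContext.hadoopConfiguration.set("orc.compress.size", (512 * 1024).toString)

// Write a small file and re-check "Compression size" with orcfiledump.
spark.range(0, 1000000)
  .write
  .mode("overwrite")
  .format("orc")
  .save("/tmp/orc-buffer-size-check")   // illustrative path
{code}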

> Can't config orc.compress.size option for native ORC writer
> ---
>
> Key: SPARK-29195
> URL: https://issues.apache.org/jira/browse/SPARK-29195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
> Environment: Linux
> Java 1.8.0
>Reporter: Eric Sun
>Priority: Minor
>  Labels: ORC
>
>  Only the codec can be effectively configured via code; "orc.compress.size" 
> and "orc.row.index.stride" cannot.
>  
> {code:java}
> // try
>   val spark = SparkSession
> .builder()
> .appName(appName)
> .enableHiveSupport()
> .config("spark.sql.orc.impl", "native")
> .config("orc.compress.size", 512 * 1024)
> .config("spark.sql.orc.compress.size", 512 * 1024)
> .config("hive.exec.orc.default.buffer.size", 512 * 1024)
> .config("spark.hadoop.io.file.buffer.size", 512 * 1024)
> .getOrCreate()
> {code}
> orcfiledump still shows:
>  
> {code:java}
> File Version: 0.12 with FUTURE
> Compression: ZLIB
> Compression size: 65536
> {code}
>  
> Executor Log:
> {code}
> impl.WriterImpl: ORC writer created for path: 
> hdfs://name_node_host:9000/foo/bar/_temporary/0/_temporary/attempt_20190920222359_0001_m_000127_0/part-00127-2a9a9287-54bf-441c-b3cf-718b122d9c2f_00127.c000.zlib.orc
>  with stripeSize: 67108864 blockSize: 268435456 compression: ZLIB bufferSize: 
> 65536
> File Output Committer Algorithm version is 2
> {code}
> According to [SPARK-23342], the other ORC options should be configurable. Is 
> there anything missing here?
> Is there any other way to affect "orc.compress.size"?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29205) Pyspark tests failed for suspected performance problem on ARM

2019-09-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936763#comment-16936763
 ] 

Hyukjin Kwon commented on SPARK-29205:
--

It seems to be a flaky test, rather. And yes, the pointer seems correct.
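For context, the _eventually frames in the tracebacks below poll a condition until a timeout; a rough Scala sketch of that pattern (not the actual test helper) shows why simply extending the wait makes the tests pass on a slow ARM VM.

{code:scala}
import scala.concurrent.duration._

// Re-evaluate an assertion until it passes or the deadline expires; on a slow
// VM fewer streaming batches finish before the deadline, so the last
// assertion (e.g. "6 != 10" batches) is the one that surfaces.
def eventually[T](timeout: FiniteDuration = 30.seconds,
                  interval: FiniteDuration = 100.millis)(condition: => T): T = {
  val deadline = timeout.fromNow
  var lastError: Option[Throwable] = None
  while (deadline.hasTimeLeft()) {
    try {
      return condition
    } catch {
      case e: AssertionError =>
        lastError = Some(e)
        Thread.sleep(interval.toMillis)
    }
  }
  throw lastError.getOrElse(new AssertionError(s"condition not met within $timeout"))
}
{code}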

> Pyspark tests failed for suspected performance problem on ARM
> -
>
> Key: SPARK-29205
> URL: https://issues.apache.org/jira/browse/SPARK-29205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: OS: Ubuntu16.04
> Arch: aarch64
> Host: Virtual Machine
>Reporter: zhao bo
>Priority: Major
>
> We test PySpark on an ARM VM and found some test failures; once we change the 
> source code to extend the wait time, making sure those test tasks have 
> finished, the tests pass.
>  
> The affected test cases include:
> pyspark.mllib.tests.test_streaming_algorithms:StreamingLinearRegressionWithTests.test_parameter_convergence
> pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_convergence
> pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
> pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
> The error messages for the above test failures:
> ==
> FAIL: test_parameter_convergence 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that the model parameters improve with streaming data.
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 429, in test_parameter_convergence
>     self._eventually(condition, catch_assertions=True)
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
>     raise lastValue
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
>     lastValue = condition()
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 425, in condition
>     self.assertEqual(len(model_weights), len(batches))
> AssertionError: 6 != 10
>  
>  
> ==
> FAIL: test_convergence 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 292, in test_convergence
>     self._eventually(condition, 60.0, catch_assertions=True)
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
>     raise lastValue
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
>     lastValue = condition()
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 288, in condition
>     self.assertEqual(len(models), len(input_batches))
> AssertionError: 19 != 20
>  
> ==
> FAIL: test_parameter_accuracy 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 266, in test_parameter_accuracy
>     self._eventually(condition, catch_assertions=True)
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
>     raise lastValue
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
>     lastValue = condition()
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 263, in condition
>     self.assertAlmostEqual(rel, 0.1, 1)
> AssertionError: 0.21309223935797794 != 0.1 within 1 places
>  
> ==
> FAIL: test_training_and_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent 

[jira] [Resolved] (SPARK-29205) Pyspark tests failed for suspected performance problem on ARM

2019-09-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29205.
--
Resolution: Duplicate

Unfortunately, it seems it's still flaky (although less so after the fix).

> Pyspark tests failed for suspected performance problem on ARM
> -
>
> Key: SPARK-29205
> URL: https://issues.apache.org/jira/browse/SPARK-29205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: OS: Ubuntu16.04
> Arch: aarch64
> Host: Virtual Machine
>Reporter: zhao bo
>Priority: Major
>
> We test PySpark on an ARM VM and found some test failures; once we change the 
> source code to extend the wait time, making sure those test tasks have 
> finished, the tests pass.
>  
> The affected test cases include:
> pyspark.mllib.tests.test_streaming_algorithms:StreamingLinearRegressionWithTests.test_parameter_convergence
> pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_convergence
> pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_parameter_accuracy
> pyspark.mllib.tests.test_streaming_algorithms:StreamingLogisticRegressionWithSGDTests.test_training_and_prediction
> The error messages for the above test failures:
> ==
> FAIL: test_parameter_convergence 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests)
> Test that the model parameters improve with streaming data.
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 429, in test_parameter_convergence
>     self._eventually(condition, catch_assertions=True)
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
>     raise lastValue
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
>     lastValue = condition()
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 425, in condition
>     self.assertEqual(len(model_weights), len(batches))
> AssertionError: 6 != 10
>  
>  
> ==
> FAIL: test_convergence 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 292, in test_convergence
>     self._eventually(condition, 60.0, catch_assertions=True)
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
>     raise lastValue
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
>     lastValue = condition()
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 288, in condition
>     self.assertEqual(len(models), len(input_batches))
> AssertionError: 19 != 20
>  
> ==
> FAIL: test_parameter_accuracy 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> --
> Traceback (most recent call last):
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 266, in test_parameter_accuracy
>     self._eventually(condition, catch_assertions=True)
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
>     raise lastValue
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
>     lastValue = condition()
>   File 
> "/usr/local/src/hth/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 263, in condition
>     self.assertAlmostEqual(rel, 0.1, 1)
> AssertionError: 0.21309223935797794 != 0.1 within 1 places
>  
> ==
> FAIL: test_training_and_prediction 
> (pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent call 

[jira] [Resolved] (SPARK-29210) Spark 3.0 Migration guide should contain the details of to_utc_timestamp function and from_utc_timestamp

2019-09-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29210.
--
Resolution: Won't Fix

> Spark 3.0 Migration guide should contain the details of to_utc_timestamp 
> function and from_utc_timestamp
> 
>
> Key: SPARK-29210
> URL: https://issues.apache.org/jira/browse/SPARK-29210
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> to_utc_timestamp and from_utc_timestamp are deprecated in Spark 3.0. Details 
> should be contained in the Spark 3.0 migration guide.
>  
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> from_utc_timestamp('2016-08-31', 'Asia/Seoul');
> Error: org.apache.spark.sql.AnalysisException: The from_utc_timestamp 
> function has been disabled since Spark 3.0. Set 
> spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.;; 
> line 1 pos 7 (state=,code=0)
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> to_utc_timestamp('2016-08-31', 'Asia/Seoul');
> Error: org.apache.spark.sql.AnalysisException: The to_utc_timestamp function 
> has been disabled since Spark 3.0. Set 
> spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.;; 
> line 1 pos 7 (state=,code=0)
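For reference, a minimal sketch of the re-enabling step that the error messages above point to, assuming a running SparkSession named spark (as in spark-shell):

{code:scala}
// Re-enable the legacy functions, exactly as the AnalysisException text above suggests.
spark.conf.set("spark.sql.legacy.utcTimestampFunc.enabled", "true")
spark.sql("SELECT from_utc_timestamp('2016-08-31', 'Asia/Seoul')").show()
spark.sql("SELECT to_utc_timestamp('2016-08-31', 'Asia/Seoul')").show()
{code}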



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29210) Spark 3.0 Migration guide should contain the details of to_utc_timestamp function and from_utc_timestamp

2019-09-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936760#comment-16936760
 ] 

Hyukjin Kwon commented on SPARK-29210:
--

Deprecations don't have to be documented in the migration guide. I think we 
only do that when they are actually removed.

> Spark 3.0 Migration guide should contain the details of to_utc_timestamp 
> function and from_utc_timestamp
> 
>
> Key: SPARK-29210
> URL: https://issues.apache.org/jira/browse/SPARK-29210
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> to_utc_timestamp and from_utc_timestamp are deprecated in Spark 3.0. Details 
> should be contained in the Spark 3.0 migration guide.
>  
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> from_utc_timestamp('2016-08-31', 'Asia/Seoul');
> Error: org.apache.spark.sql.AnalysisException: The from_utc_timestamp 
> function has been disabled since Spark 3.0. Set 
> spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.;; 
> line 1 pos 7 (state=,code=0)
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> to_utc_timestamp('2016-08-31', 'Asia/Seoul');
> Error: org.apache.spark.sql.AnalysisException: The to_utc_timestamp function 
> has been disabled since Spark 3.0. Set 
> spark.sql.legacy.utcTimestampFunc.enabled to true to enable this function.;; 
> line 1 pos 7 (state=,code=0)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29217) How to read streaming output path by ignoring metadata log files

2019-09-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29217.
--
Resolution: Invalid

Please ask questions on the mailing list or Stack Overflow.

> How to read streaming output path by ignoring metadata log files
> 
>
> Key: SPARK-29217
> URL: https://issues.apache.org/jira/browse/SPARK-29217
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Thanida
>Priority: Minor
>
> As the output path of Spark streaming contains a `_spark_metadata` directory, 
> reading by
> {code:java}
> spark.read.format("parquet").load(filepath)
> {code}
> always depends on the file listing in the metadata log.
> Moving some files in the output while streaming caused reads to fail. 
> So, how can data in the streaming output path be read while ignoring the 
> metadata log files?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29222) Flaky test: pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence

2019-09-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936756#comment-16936756
 ] 

Hyukjin Kwon commented on SPARK-29222:
--

cc [~bryanc]

> Flaky test: 
> pyspark.mllib.tests.test_streaming_algorithms.StreamingLinearRegressionWithTests.test_parameter_convergence
> ---
>
> Key: SPARK-29222
> URL: https://issues.apache.org/jira/browse/SPARK-29222
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, Tests
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111237/testReport/]
> {code:java}
> Error Message
> 7 != 10
> StacktraceTraceback (most recent call last):
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 429, in test_parameter_convergence
> self._eventually(condition, catch_assertions=True)
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 74, in _eventually
> raise lastValue
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 65, in _eventually
> lastValue = condition()
>   File 
> "/home/jenkins/workspace/SparkPullRequestBuilder@2/python/pyspark/mllib/tests/test_streaming_algorithms.py",
>  line 425, in condition
> self.assertEqual(len(model_weights), len(batches))
> AssertionError: 7 != 10
>{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-09-24 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936755#comment-16936755
 ] 

Hyukjin Kwon commented on SPARK-29225:
--

Can you show current & expected input/output?

> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Currently, `DESC FORMATTED TABLE` shows a table description format different 
> from Hive's; this causes HUE to fail to parse column information correctly.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29175) Make maven central repository in IsolatedClientLoader configurable

2019-09-24 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-29175:

Description: We need to connect to a central repository in 
IsolatedClientLoader for downloading Hive jars. Here we added a new config, 
`spark.sql.additionalRemoteRepositories`, a comma-delimited string config of 
optional additional remote maven mirror repositories; it can be used as the 
set of additional remote repositories for the default maven central repo.  
(was: We need to connect to a central repository in IsolatedClientLoader for 
downloading Hive jars. The repository currently used is 
"http://www.datanucleus.org/downloads/maven2", which is no longer maintained. 
Here we added a new config with the maven central repository 
"https://repo1.maven.org/maven2" as its default value.)

> Make maven central repository in IsolatedClientLoader configurable
> --
>
> Key: SPARK-29175
> URL: https://issues.apache.org/jira/browse/SPARK-29175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> We need to connect to a central repository in IsolatedClientLoader for 
> downloading Hive jars. Here we added a new config, 
> `spark.sql.additionalRemoteRepositories`, a comma-delimited string config of 
> optional additional remote maven mirror repositories; it can be used as the 
> set of additional remote repositories for the default maven central repo.
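A minimal sketch of how the config described above might be set; the config name comes from the description, while the master, app name and repository URL (the default Maven Central address) are only illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession

// Supply an extra repository for IsolatedClientLoader to try when it
// downloads Hive client jars; multiple URLs would be comma-separated.
val spark = SparkSession.builder()
  .appName("hive-client-download")
  .master("local[2]")
  .config("spark.sql.additionalRemoteRepositories", "https://repo1.maven.org/maven2")
  .enableHiveSupport()
  .getOrCreate()
{code}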



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29166) Add parameters to limit the number of dynamic partitions for data source table

2019-09-24 Thread Ido Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936732#comment-16936732
 ] 

Ido Michael commented on SPARK-29166:
-

No problem [~cltlfcjin], great job!

> Add parameters to limit the number of dynamic partitions for data source table
> --
>
> Key: SPARK-29166
> URL: https://issues.apache.org/jira/browse/SPARK-29166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Dynamic partitioning in Hive tables has some restrictions that limit the max 
> number of partitions. See 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-DynamicPartitionInserts
> This is very useful to prevent creating mistaken partitions (for example, 
> partitioning by an ID column). It also protects the NameNode from a mass of 
> creation RPC calls. Data source tables need a similar limitation.
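For reference, a sketch of the Hive-side limits being referred to; the setting names come from the linked Hive documentation rather than from this issue, and a Hive-enabled SparkSession named spark is assumed:

{code:scala}
// Hive's existing guards against runaway dynamic partitioning; the issue asks
// for an equivalent limit when writing data source tables.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("SET hive.exec.max.dynamic.partitions=1000")          // total per query
spark.sql("SET hive.exec.max.dynamic.partitions.pernode=100")   // per node
{code}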



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29197) Remove saveModeForDSV2 in DataFrameWriter

2019-09-24 Thread Ido Michael (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936731#comment-16936731
 ] 

Ido Michael commented on SPARK-29197:
-

Great job Burak!

And thanks for the quick reply!

> Remove saveModeForDSV2 in DataFrameWriter
> -
>
> Key: SPARK-29197
> URL: https://issues.apache.org/jira/browse/SPARK-29197
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> It is very confusing that the default save mode differs depending on the 
> internal implementation of a data source. The reason we had to have 
> saveModeForDSV2 was that there was no easy way to check the existence of a 
> table in DataSource V2. Now we have catalogs for that, so we should 
> be able to remove the different save modes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29229) Change the additional remote repository in IsolatedClientLoader to google mirror

2019-09-24 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-29229:
---

 Summary: Change the additional remote repository in 
IsolatedClientLoader to google mirror
 Key: SPARK-29229
 URL: https://issues.apache.org/jira/browse/SPARK-29229
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Yuanjian Li


The repository currently used is 
[http://www.datanucleus.org/downloads/maven2], which is no longer maintained. 
This sometimes causes download failures and makes Hive test cases flaky.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28558) DatasetWriter partitionBy is changing the group file permissions in 2.4 for parquets

2019-09-24 Thread Nicolas Laduguie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936719#comment-16936719
 ] 

Nicolas Laduguie commented on SPARK-28558:
--

I debugged Spark and my understanding is that:
 * When you "partitionBy" a data set in Spark, it generates a path in which 
to store the file (parquet in my case), including the partition folders,
 * When Spark saves the file, it writes the file directly and does not create 
the intermediate partition folders explicitly if they don't exist,
 * The permissions are calculated for the file (so 666 by default) and the 
umask is applied to those permissions (for example 664 if the umask is 002),
 * The file is saved with those permissions, and the partition folders are 
also created with those permissions.

This is probably why we can't apply the right permissions on the partition 
folders.
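A sketch of the reporter's workaround (from the issue description below) applied from code, assuming a SparkSession named spark as in spark-shell; the output path is illustrative:

{code:scala}
// Apply the committer-algorithm workaround from the issue description, then
// write the same partitioned parquet as in the reproduction.
import spark.implicits._

spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

Seq(("H", 1), ("I", 2))
  .toDF("Letter", "Number")
  .write
  .partitionBy("Letter")
  .parquet("/tmp/letter_testing")   // illustrative path
{code}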

> DatasetWriter partitionBy is changing the group file permissions in 2.4 for 
> parquets
> 
>
> Key: SPARK-28558
> URL: https://issues.apache.org/jira/browse/SPARK-28558
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
> Environment: Hadoop 2.7
> Scala 2.11
> Tested:
>  * Spark 2.3.3 - Works
>  * Spark 2.4.x - All have the same issue
>Reporter: Stephen Pearson
>Priority: Minor
> Attachments: image-2019-09-24-09-20-07-225.png
>
>
> When writing a parquet using partitionBy the group file permissions are being 
> changed as shown below. This causes members of the group to get 
> "org.apache.hadoop.security.AccessControlException: Open failed for file 
> error: Permission denied (13)"
> This worked in 2.3. I found a workaround which was to set 
> "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2" which gives 
> the correct behaviour
>  
> Code I used to reproduce issue:
> {quote}Seq(("H", 1), ("I", 2))
>  .toDF("Letter", "Number")
>  .write
>  .partitionBy("Letter")
>  .parquet(...){quote}
>  
> {quote}sparktesting$ tree -dp
> ├── [drwxrws---]  letter_testing2.3-defaults
> │   ├── [drwxrws---]  Letter=H
> │   └── [drwxrws---]  Letter=I
> ├── [drwxrws---]  letter_testing2.4-defaults
> │   ├── [drwxrwS---]  Letter=H
> │   └── [drwxrwS---]  Letter=I
> └── [drwxrws---]  letter_testing2.4-file-writer2
>     ├── [drwxrws---]  Letter=H
>     └── [drwxrws---]  Letter=I
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29228) Adding the ability to add headers to the KafkaRecords published to Kafka from spark structured streaming

2019-09-24 Thread Randal Boyle (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936715#comment-16936715
 ] 

Randal Boyle commented on SPARK-29228:
--

I [~rjhsb1989] am working on this :)

> Adding the ability to add headers to the KafkaRecords published to Kafka from 
> spark structured streaming
> 
>
> Key: SPARK-29228
> URL: https://issues.apache.org/jira/browse/SPARK-29228
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Randal Boyle
>Priority: Major
>
> Currently there is no way to add headers to the records that will be 
> published to Kafka through spark structured streaming. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29221) Flaky test: SQLQueryTestSuite.sql (subquery/scalar-subquery/scalar-subquery-select.sql)

2019-09-24 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936713#comment-16936713
 ] 

Yuming Wang commented on SPARK-29221:
-

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/6538/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/sql/

{noformat}
Error Message
org.scalatest.exceptions.TestFailedException: 
subquery/scalar-subquery/scalar-subquery-select.sql Expected 
"struct<[min_t3d:bigint,max_t2h:timestamp]>", but got "struct<[]>" Schema did 
not match for query #3 SELECT (SELECT min(t3d) FROM t3) min_t3d,(SELECT 
max(t2h) FROM t2) max_t2h FROM   t1 WHERE  t1a = 'val1c': QueryOutput(SELECT 
(SELECT min(t3d) FROM t3) min_t3d,(SELECT max(t2h) FROM t2) max_t2h 
FROM   t1 WHERE  t1a = 'val1c',struct<>,java.lang.NullPointerException null)
Stacktrace
sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
subquery/scalar-subquery/scalar-subquery-select.sql
Expected "struct<[min_t3d:bigint,max_t2h:timestamp]>", but got "struct<[]>" 
Schema did not match for query #3
SELECT (SELECT min(t3d) FROM t3) min_t3d,
   (SELECT max(t2h) FROM t2) max_t2h
FROM   t1
WHERE  t1a = 'val1c': QueryOutput(SELECT (SELECT min(t3d) FROM t3) min_t3d,
   (SELECT max(t2h) FROM t2) max_t2h
FROM   t1
WHERE  t1a = 'val1c',struct<>,java.lang.NullPointerException
null)
at 
org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
at 
org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
at 
org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
at org.scalatest.Assertions.assertResult(Assertions.scala:1003)
at org.scalatest.Assertions.assertResult$(Assertions.scala:998)
at org.scalatest.FunSuite.assertResult(FunSuite.scala:1560)
at 
org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runQueries$11(SQLQueryTestSuite.scala:382)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at 
org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runQueries$9(SQLQueryTestSuite.scala:377)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.scalatest.Assertions.withClue(Assertions.scala:1221)
at org.scalatest.Assertions.withClue$(Assertions.scala:1208)
at org.scalatest.FunSuite.withClue(FunSuite.scala:1560)
at 
org.apache.spark.sql.SQLQueryTestSuite.runQueries(SQLQueryTestSuite.scala:353)
at 
org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$15(SQLQueryTestSuite.scala:276)
at 
org.apache.spark.sql.SQLQueryTestSuite.$anonfun$runTest$15$adapted(SQLQueryTestSuite.scala:274)
at 
scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at 
scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at 
org.apache.spark.sql.SQLQueryTestSuite.runTest(SQLQueryTestSuite.scala:274)
at 
org.apache.spark.sql.SQLQueryTestSuite.$anonfun$createScalaTestCase$5(SQLQueryTestSuite.scala:223)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
at 
org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
at 
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
at 
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
at 
org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
at 
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
at scala.collection.immutable.List.foreach(List.scala:392)
at 

[jira] [Created] (SPARK-29228) Adding the ability to add headers to the KafkaRecords published to Kafka from spark structured streaming

2019-09-24 Thread R JHS B (Jira)
R JHS B created SPARK-29228:
---

 Summary: Adding the ability to add headers to the KafkaRecords 
published to Kafka from spark structured streaming
 Key: SPARK-29228
 URL: https://issues.apache.org/jira/browse/SPARK-29228
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.4, 3.0.0
Reporter: R JHS B


Currently there is no way to add headers to the records that will be published 
to Kafka through spark structured streaming. 
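For context, a sketch of the header support that already exists in the plain kafka-clients producer API (since Kafka 0.11), which is roughly what this issue asks the structured streaming Kafka sink to expose; the topic, key and header values are placeholders:

{code:scala}
import java.nio.charset.StandardCharsets
import java.util.Collections
import org.apache.kafka.common.header.Header
import org.apache.kafka.common.header.internals.RecordHeader
import org.apache.kafka.clients.producer.ProducerRecord

// A header on a plain ProducerRecord -- the capability this issue asks the
// structured streaming Kafka sink to expose.
val headers = Collections.singletonList[Header](
  new RecordHeader("trace-id", "abc-123".getBytes(StandardCharsets.UTF_8)))

val record = new ProducerRecord[String, String](
  "events",     // topic (placeholder)
  null,         // partition: let the partitioner decide
  "some-key",   // key (placeholder)
  "some-value", // value (placeholder)
  headers)
{code}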



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28292) Enable inject user-defined Hint

2019-09-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28292:
---

Assignee: Xiao Li

> Enable inject user-defined Hint
> ---
>
> Key: SPARK-28292
> URL: https://issues.apache.org/jira/browse/SPARK-28292
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: Xiao Li
>Priority: Major
> Fix For: 3.0.0
>
>
> We can't inject hints into the Analyzer; we hope to add an extension entry 
> point for injecting user-defined hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28292) Enable inject user-defined Hint

2019-09-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28292.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25746
[https://github.com/apache/spark/pull/25746]

> Enable inject user-defined Hint
> ---
>
> Key: SPARK-28292
> URL: https://issues.apache.org/jira/browse/SPARK-28292
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
> Fix For: 3.0.0
>
>
> We can't inject hints into the Analyzer; we hope to add an extension entry 
> point for injecting user-defined hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28678) Specify that start index is 1-based in docstring of pyspark.sql.functions.slice

2019-09-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28678.
--
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 25704
[https://github.com/apache/spark/pull/25704]

> Specify that start index is 1-based in docstring of 
> pyspark.sql.functions.slice
> ---
>
> Key: SPARK-28678
> URL: https://issues.apache.org/jira/browse/SPARK-28678
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Sivam Pasupathipillai
>Assignee: Ting Yang
>Priority: Trivial
> Fix For: 2.4.5, 3.0.0
>
>
> The start index parameter in pyspark.sql.functions.slice should be 1-based, 
> otherwise the method fails with an exception.
> This is not documented anywhere. Fix the docstring.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-28678) Specify that start index is 1-based in docstring of pyspark.sql.functions.slice

2019-09-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28678:


Assignee: Ting Yang

> Specify that start index is 1-based in docstring of 
> pyspark.sql.functions.slice
> ---
>
> Key: SPARK-28678
> URL: https://issues.apache.org/jira/browse/SPARK-28678
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Sivam Pasupathipillai
>Assignee: Ting Yang
>Priority: Trivial
>
> The start index parameter in pyspark.sql.functions.slice should be 1-based, 
> otherwise the method fails with an exception.
> This is not documented anywhere. Fix the docstring.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29119) DEFAULT option is not supported in Spark

2019-09-24 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936596#comment-16936596
 ] 

jiaan.geng commented on SPARK-29119:


This is a duplicate of https://issues.apache.org/jira/browse/SPARK-27953

> DEFAULT option is not supported in Spark
> 
>
> Key: SPARK-29119
> URL: https://issues.apache.org/jira/browse/SPARK-29119
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> PostgreSQL supports the *default* option, as below:
> CREATE TABLE update_test (
>  a INT *DEFAULT* 10,
>  b INT
> );
> INSERT INTO update_test VALUES (5, 10);
> INSERT INTO update_test(b) VALUES (15);
> SELECT * FROM update_test;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29227) Track rule info in optimization phase

2019-09-24 Thread wenxuanguan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wenxuanguan updated SPARK-29227:

Description: Track timing info for each rule in optimization phase using 
QueryPlanningTracker in Structured Streaming  (was: tracks timing info for each 
rule in optimization phase using QueryPlanningTracker in Structured Streaming)

> Track rule info in optimization phase 
> --
>
> Key: SPARK-29227
> URL: https://issues.apache.org/jira/browse/SPARK-29227
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: wenxuanguan
>Priority: Major
>
> Track timing info for each rule in optimization phase using 
> QueryPlanningTracker in Structured Streaming



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29227) Track rule info in optimization phase

2019-09-24 Thread wenxuanguan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wenxuanguan updated SPARK-29227:

Summary: Track rule info in optimization phase   (was: track rule info in 
optimization phase )

> Track rule info in optimization phase 
> --
>
> Key: SPARK-29227
> URL: https://issues.apache.org/jira/browse/SPARK-29227
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: wenxuanguan
>Priority: Major
>
> tracks timing info for each rule in optimization phase using 
> QueryPlanningTracker in Structured Streaming



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29227) track rule info in optimization phase

2019-09-24 Thread wenxuanguan (Jira)
wenxuanguan created SPARK-29227:
---

 Summary: track rule info in optimization phase 
 Key: SPARK-29227
 URL: https://issues.apache.org/jira/browse/SPARK-29227
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: wenxuanguan


tracks timing info for each rule in optimization phase using 
QueryPlanningTracker in Structured Streaming



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29226) Upgrade jackson-databind to 2.9.10 and fix vulnerabilities.

2019-09-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29226:

Component/s: (was: Spark Core)
 Build

> Upgrade jackson-databind to 2.9.10 and fix vulnerabilities.
> ---
>
> Key: SPARK-29226
> URL: https://issues.apache.org/jira/browse/SPARK-29226
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current code uses com.fasterxml.jackson.core:jackson-databind:jar:2.9.9.3, 
> which has known security vulnerabilities. We can get some security info 
> from https://www.tenable.com/cve/CVE-2019-16335
> This reference recommends upgrading the version of `jackson-databind` to 
> 2.9.10 or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29226) Upgrade jackson-databind to 2.9.10 and fix vulnerabilities.

2019-09-24 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-29226:

Affects Version/s: (was: 2.4.5)

> Upgrade jackson-databind to 2.9.10 and fix vulnerabilities.
> ---
>
> Key: SPARK-29226
> URL: https://issues.apache.org/jira/browse/SPARK-29226
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> The current code uses com.fasterxml.jackson.core:jackson-databind:jar:2.9.9.3, 
> which has known security vulnerabilities. We can get some security info 
> from https://www.tenable.com/cve/CVE-2019-16335
> This reference recommends upgrading the version of `jackson-databind` to 
> 2.9.10 or later.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29211) Second invocation of custom UDF results in exception (when invoked from shell)

2019-09-24 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936538#comment-16936538
 ] 

angerszhu commented on SPARK-29211:
---

[~dkbiswal]

Set up a local repo following apache/spark, check out an old version, then build and test.

> Second invocation of custom UDF results in exception (when invoked from shell)
> --
>
> Key: SPARK-29211
> URL: https://issues.apache.org/jira/browse/SPARK-29211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dilip Biswal
>Priority: Major
>
> I encountered this while writing documentation for SQL reference. Here is the 
> small repro:
> UDF:
>  =
> {code:java}
> import org.apache.hadoop.hive.ql.exec.UDF;
>   
> public class SimpleUdf extends UDF {
>   public int evaluate(int value) {
> return value + 10;
>   }
> }
> {code}
> {code:java}
> spark.sql("CREATE FUNCTION simple_udf AS 'SimpleUdf' USING JAR 
> '/tmp/SimpleUdf.jar'").show
> spark.sql("SELECT simple_udf(c1) AS function_return_value FROM t1").show
> +---------------------+
> |function_return_value|
> +---------------------+
> |                   11|
> |                   12|
> +---------------------+
> spark.sql("SELECT simple_udf(c1) AS function_return_value FROM t1").show
> scala> spark.sql("SELECT simple_udf(c1) AS function_return_value FROM 
> t1").show
> 19/09/23 00:43:18 WARN HiveConf: HiveConf of name 
> hive.internal.ss.authz.settings.applied.marker does not exist
> 19/09/23 00:43:18 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout 
> does not exist
> 19/09/23 00:43:18 WARN HiveConf: HiveConf of name hive.stats.retries.wait 
> does not exist
> org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 
> 'SimpleUdf': java.lang.ClassNotFoundException: SimpleUdf; line 1 pos 7
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:245)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:57)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:57)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:61)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:60)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:78)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:78)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$makeFunctionExpression$2(HiveSessionCatalog.scala:78)
>   at scala.util.Failure.getOrElse(Try.scala:222)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:70)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1176)
>   at 
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1344)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.super$lookupFunction(HiveSessionCatalog.scala:132)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$lookupFunction0$2(HiveSessionCatalog.scala:132)
> {code}
> Please note that the problem does not happen if we try it from a test suite. 
> So far I have only seen it when I try it from the shell. Also, I tried it in 
> 2.4.4 and observed the same behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29211) Second invocation of custom UDF results in exception (when invoked from shell)

2019-09-24 Thread Dilip Biswal (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936529#comment-16936529
 ] 

Dilip Biswal commented on SPARK-29211:
--

[~dongjoon] I tried with 2.3.4 and saw the same issue. The Spark download 
site only has two versions, so I am unable to go back below 2.3.4.

> Second invocation of custom UDF results in exception (when invoked from shell)
> --
>
> Key: SPARK-29211
> URL: https://issues.apache.org/jira/browse/SPARK-29211
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dilip Biswal
>Priority: Major
>
> I encountered this while writing documentation for SQL reference. Here is the 
> small repro:
> UDF:
>  =
> {code:java}
> import org.apache.hadoop.hive.ql.exec.UDF;
>   
> public class SimpleUdf extends UDF {
>   public int evaluate(int value) {
> return value + 10;
>   }
> }
> {code}
> {code:java}
> spark.sql("CREATE FUNCTION simple_udf AS 'SimpleUdf' USING JAR 
> '/tmp/SimpleUdf.jar'").show
> spark.sql("SELECT simple_udf(c1) AS function_return_value FROM t1").show
> +---------------------+
> |function_return_value|
> +---------------------+
> |                   11|
> |                   12|
> +---------------------+
> spark.sql("SELECT simple_udf(c1) AS function_return_value FROM t1").show
> scala> spark.sql("SELECT simple_udf(c1) AS function_return_value FROM 
> t1").show
> 19/09/23 00:43:18 WARN HiveConf: HiveConf of name 
> hive.internal.ss.authz.settings.applied.marker does not exist
> 19/09/23 00:43:18 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout 
> does not exist
> 19/09/23 00:43:18 WARN HiveConf: HiveConf of name hive.stats.retries.wait 
> does not exist
> org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 
> 'SimpleUdf': java.lang.ClassNotFoundException: SimpleUdf; line 1 pos 7
>   at 
> scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:72)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>   at 
> org.apache.spark.sql.hive.HiveShim$HiveFunctionWrapper.createFunction(HiveShim.scala:245)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUDF.function$lzycompute(hiveUDFs.scala:57)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.function(hiveUDFs.scala:57)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUDF.method$lzycompute(hiveUDFs.scala:61)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.method(hiveUDFs.scala:60)
>   at 
> org.apache.spark.sql.hive.HiveSimpleUDF.dataType$lzycompute(hiveUDFs.scala:78)
>   at org.apache.spark.sql.hive.HiveSimpleUDF.dataType(hiveUDFs.scala:78)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$makeFunctionExpression$2(HiveSessionCatalog.scala:78)
>   at scala.util.Failure.getOrElse(Try.scala:222)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.makeFunctionExpression(HiveSessionCatalog.scala:70)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.$anonfun$makeFunctionBuilder$1(SessionCatalog.scala:1176)
>   at 
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1344)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.super$lookupFunction(HiveSessionCatalog.scala:132)
>   at 
> org.apache.spark.sql.hive.HiveSessionCatalog.$anonfun$lookupFunction0$2(HiveSessionCatalog.scala:132)
> {code}
> Please note that the problem does not happen if we try it from a test suite. 
> So far I have only seen it when I try it from the shell. Also, I tried it in 
> 2.4.4 and observed the same behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


