[jira] [Created] (SPARK-27090) Removing old LEGACY_DRIVER_IDENTIFIER ("")

2019-03-07 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-27090:
--

 Summary: Removing old LEGACY_DRIVER_IDENTIFIER ("")
 Key: SPARK-27090
 URL: https://issues.apache.org/jira/browse/SPARK-27090
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


For legacy reasons LEGACY_DRIVER_IDENTIFIER is checked in a few places along 
with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver or an 
executor is running.

The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4, so I 
think we have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER.
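
For illustration, the kind of dual check this issue is about looks roughly like 
the sketch below (a minimal sketch, not the exact Spark source; the legacy 
identifier's value is an assumption based on the description):

{code:scala}
// Minimal sketch of the dual driver/executor check (illustrative, not the Spark source).
object DriverIdentifierCheck {
  val DRIVER_IDENTIFIER = "driver"          // the identifier introduced in Spark 1.4
  val LEGACY_DRIVER_IDENTIFIER = "<driver>" // assumed legacy value, kept for backward compatibility

  // Deciding whether an endpoint belongs to the driver or to an executor:
  def isDriver(executorId: String): Boolean =
    executorId == DRIVER_IDENTIFIER || executorId == LEGACY_DRIVER_IDENTIFIER
  // The proposal is to drop the second comparison once the legacy identifier can no longer occur.
}
{code}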






[jira] [Commented] (SPARK-27090) Removing old LEGACY_DRIVER_IDENTIFIER ("")

2019-03-08 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787822#comment-16787822
 ] 

Attila Zsolt Piros commented on SPARK-27090:


I would have waited a little bit to find out what others think about this, but 
it is fine for me.

> Removing old LEGACY_DRIVER_IDENTIFIER ("")
> --
>
> Key: SPARK-27090
> URL: https://issues.apache.org/jira/browse/SPARK-27090
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> For legacy reasons LEGACY_DRIVER_IDENTIFIER is checked in a few places 
> along with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver 
> or an executor is running.
> The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4, so 
> I think we have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER.






[jira] [Commented] (SPARK-25888) Service requests for persist() blocks via external service after dynamic deallocation

2019-04-24 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825248#comment-16825248
 ] 

Attila Zsolt Piros commented on SPARK-25888:


I am working on this.

> Service requests for persist() blocks via external service after dynamic 
> deallocation
> -
>
> Key: SPARK-25888
> URL: https://issues.apache.org/jira/browse/SPARK-25888
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager, Shuffle, YARN
>Affects Versions: 2.3.2
>Reporter: Adam Kennedy
>Priority: Major
>
> Large and highly multi-tenant Spark on YARN clusters with diverse job 
> execution often display terrible utilization rates (we have observed as low 
> as 3-7% CPU at max container allocation, but 50% CPU utilization on even a 
> well policed cluster is not uncommon).
> As a sizing example, consider a scenario with 1,000 nodes, 50,000 cores, 250 
> users and 50,000 runs of 1,000 distinct applications per week, with 
> predominantly Spark including a mixture of ETL, Ad Hoc tasks and PySpark 
> Notebook jobs (no streaming).
> Utilization problems appear to be due in large part to difficulties with 
> persist() blocks (DISK or DISK+MEMORY) preventing dynamic deallocation.
> In situations where an external shuffle service is present (which is typical 
> on clusters of this type) we already solve this for the shuffle block case by 
> offloading the IO handling of shuffle blocks to the external service, 
> allowing dynamic deallocation to proceed.
> Allowing Executors to transfer persist() blocks to some external "shuffle" 
> service in a similar manner would be an enormous win for Spark multi-tenancy 
> as it would limit deallocation blocking scenarios to only MEMORY-only cache() 
> scenarios.
> I'm not sure if I'm correct, but I seem to recall seeing in the original 
> external shuffle service commits that this may have been considered at the 
> time, but getting shuffle blocks moved to the external shuffle service was 
> the first priority.
> With support for external persist() DISK blocks in place, we could also then 
> handle deallocation of DISK+MEMORY, as the memory instance could first be 
> dropped, changing the block to DISK only, and then further transferred to the 
> shuffle service.
> We have tried to resolve the persist() issue via extensive user training, but 
> that has typically only allowed us to improve utilization of the worst 
> offenders (10% utilization) up to around 40-60% utilization, as the need for 
> persist() is often legitimate and occurs during the middle stages of a job.
> In a healthy multi-tenant scenario, a large job might spool up to say 10,000 
> cores, persist() data, release executors across a long tail down to 100 
> cores, and then spool back up to 10,000 cores for the following stage without 
> impact on the persist() data.
> In an ideal world, if a new executor started up on a node on which blocks 
> had been transferred to the shuffle service, the new executor might even be 
> able to "recapture" control of those blocks (if that would help with 
> performance in some way).
> And the behavior of gradually expanding up and down several times over the 
> course of a job would not just improve utilization, but would allow resources 
> to more easily be redistributed to other jobs which start on the cluster 
> during the long-tail periods, which would improve multi-tenancy and bring us 
> closer to optimal "envy free" YARN scheduling.
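
To make the setup described above concrete, here is a minimal sketch (the 
configuration keys are standard Spark settings; the tiny job itself is an 
illustrative assumption, not part of the report):

{code:scala}
// Dynamic allocation with the external shuffle service: shuffle blocks are already
// served externally, but disk-persisted RDD blocks live only in the executors'
// local block managers, so the executors holding them stay pinned
// (spark.dynamicAllocation.cachedExecutorIdleTimeout defaults to infinity).
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("persist-vs-dynamic-allocation")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()

val df = spark.range(0L, 10000000L).toDF("id")
df.persist(StorageLevel.DISK_ONLY)  // these blocks are what the proposal would offload
df.count()
{code}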






[jira] [Created] (SPARK-27622) Avoiding network communication when block managers are running on the host

2019-05-02 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-27622:
--

 Summary: Avoiding network communication when block managers are 
running on the host 
 Key: SPARK-27622
 URL: https://issues.apache.org/jira/browse/SPARK-27622
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


Currently fetching blocks always uses the network even when the two block 
managers are running on the same host.






[jira] [Commented] (SPARK-27622) Avoiding network communication when block managers are running on the host

2019-05-02 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831553#comment-16831553
 ] 

Attila Zsolt Piros commented on SPARK-27622:


I am already working on this. A working prototype for RDD blocks is ready and 
working.

> Avoiding network communication when block managers are running on the host 
> --
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.






[jira] [Updated] (SPARK-27622) Avoiding network communication when block managers are running on the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoiding network communication when block managers are running on 
the same host   (was: Avoiding network communication when block managers are 
running on the host )

> Avoiding network communication when block managers are running on the same 
> host 
> ---
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.






[jira] [Updated] (SPARK-27622) Avoiding network communication when block manager fetching from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoiding network communication when block manager fetching from the 
same host  (was: Avoiding network communication when block managers are running 
on the same host )

> Avoiding network communication when block manager fetching from the same host
> 
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.






[jira] [Updated] (SPARK-27622) Avoid network communication when block manager fetches from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoid network communication when block manager fetches from the 
same host  (was: Avoiding network communication when block manager fetching from 
the same host)

> Avoid network communication when block manager fetches from the same host
> 
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.






[jira] [Comment Edited] (SPARK-27622) Avoid network communication when block manager fetches from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831553#comment-16831553
 ] 

Attila Zsolt Piros edited comment on SPARK-27622 at 5/6/19 6:22 PM:


I am already working on this. There is already a working prototype for RDD 
blocks.


was (Author: attilapiros):
I am already working on this. A working prototype for RDD blocks is ready and 
working.

> Avoid network communication when block manager fetches from the same host
> 
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.






[jira] [Updated] (SPARK-27622) Avoid the network when block manager fetches disk persisted RDD blocks from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoid the network when block manager fetches disk persisted RDD 
blocks from the same host  (was: Avoid network communication when block manager 
fetches disk persisted RDD blocks from the same host)

> Avoid the network when block manager fetches disk persisted RDD blocks from 
> the same host
> -
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.






[jira] [Updated] (SPARK-27622) Avoid network communication when block manager fetches disk persisted RDD blocks from the same host

2019-05-06 Thread Attila Zsolt Piros (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-27622:
---
Summary: Avoid network communication when block manager fetches disk 
persisted RDD blocks from the same host  (was: Avoid network communication when 
block manager fetches from the same host)

> Avoid network communication when block manager fetches disk persisted RDD 
> blocks from the same host
> ---
>
> Key: SPARK-27622
> URL: https://issues.apache.org/jira/browse/SPARK-27622
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently fetching blocks always uses the network even when the two block 
> managers are running on the same host.






[jira] [Created] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2019-05-07 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-27651:
--

 Summary: Avoid the network when block manager fetches shuffle 
blocks from the same host
 Key: SPARK-27651
 URL: https://issues.apache.org/jira/browse/SPARK-27651
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


When a shuffle block (its content) is fetched, the network is always used even 
when it is fetched from an executor (or the external shuffle service) running 
on the same host.






[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2019-05-07 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834832#comment-16834832
 ] 

Attila Zsolt Piros commented on SPARK-27651:


I am already working on this.

> Avoid the network when block manager fetches shuffle blocks from the same host
> --
>
> Key: SPARK-27651
> URL: https://issues.apache.org/jira/browse/SPARK-27651
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> When a shuffle block (its content) is fetched, the network is always used even 
> when it is fetched from an executor (or the external shuffle service) running 
> on the same host.






[jira] [Commented] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-21 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269182#comment-17269182
 ] 

Attila Zsolt Piros commented on SPARK-34167:


[~razajafri] could you share with us how the parquet files are created?

I tried to reproduce this issue in the following way but I had no luck:

{noformat}
Spark context Web UI available at http://192.168.0.17:4045
Spark context available as 'sc' (master = local, app id = local-1611221568779).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.math.BigDecimal
import java.math.BigDecimal

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

scala> val schema = StructType(Array(StructField("num", DecimalType(8,2),true)))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(num,DecimalType(8,2),true))

scala> val rdd = sc.parallelize((0 to 9).map(v => new 
BigDecimal(s"123456.7$v")))
rdd: org.apache.spark.rdd.RDD[java.math.BigDecimal] = ParallelCollectionRDD[0] 
at parallelize at <console>:27

scala> val df = spark.createDataFrame(rdd.map(Row(_)), schema)
df: org.apache.spark.sql.DataFrame = [num: decimal(8,2)]

scala> df.show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+


scala> df.write.parquet("num.parquet")

scala> spark.read.parquet("num.parquet").show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+

{noformat}



> Reading parquet with Decimal(8,2) written as a Decimal64 blows up
> -
>
> Key: SPARK-34167
> URL: https://issues.apache.org/jira/browse/SPARK-34167
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Raza Jafri
>Priority: Major
> Attachments: 
> part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, 
> part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
>
>
> When reading a parquet file written with Decimals with precision < 10 as a 
> 64-bit representation, Spark tries to read it as an INT and fails
>  
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> spark.read.parquet("/tmp/pyspark_tests/936454/PARQUET_DATA").show
> ...
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.putLong(OnHeapColumnVector.java:327)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedRleValuesReader.readLongs(VectorizedRleValuesReader.java:370)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readLongBatch(VectorizedColumnReader.java:514)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:256)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:273)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:497)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:756)
>   at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByt

[jira] [Comment Edited] (SPARK-34167) Reading parquet with Decimal(8,2) written as a Decimal64 blows up

2021-01-21 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17269182#comment-17269182
 ] 

Attila Zsolt Piros edited comment on SPARK-34167 at 1/21/21, 9:49 AM:
--

[~razajafri] could you please share with us how the parquet files are created?

I tried to reproduce this issue in the following way but I had no luck:

{noformat}
Spark context Web UI available at http://192.168.0.17:4045
Spark context available as 'sc' (master = local, app id = local-1611221568779).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.math.BigDecimal
import java.math.BigDecimal

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

scala> val schema = StructType(Array(StructField("num", DecimalType(8,2),true)))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(num,DecimalType(8,2),true))

scala> val rdd = sc.parallelize((0 to 9).map(v => new 
BigDecimal(s"123456.7$v")))
rdd: org.apache.spark.rdd.RDD[java.math.BigDecimal] = ParallelCollectionRDD[0] 
at parallelize at <console>:27

scala> val df = spark.createDataFrame(rdd.map(Row(_)), schema)
df: org.apache.spark.sql.DataFrame = [num: decimal(8,2)]

scala> df.show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+


scala> df.write.parquet("num.parquet")

scala> spark.read.parquet("num.parquet").show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+

{noformat}




was (Author: attilapiros):
[~razajafri] could you share with us how the parquet files are created?

I tried to reproduce this issue in the following way but I had no luck:

{noformat}
Spark context Web UI available at http://192.168.0.17:4045
Spark context available as 'sc' (master = local, app id = local-1611221568779).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.math.BigDecimal
import java.math.BigDecimal

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
import org.apache.spark.sql.types.{DecimalType, StructField, StructType}

scala> val schema = StructType(Array(StructField("num", DecimalType(8,2),true)))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(num,DecimalType(8,2),true))

scala> val rdd = sc.parallelize((0 to 9).map(v => new 
BigDecimal(s"123456.7$v")))
rdd: org.apache.spark.rdd.RDD[java.math.BigDecimal] = ParallelCollectionRDD[0] 
at parallelize at <console>:27

scala> val df = spark.createDataFrame(rdd.map(Row(_)), schema)
df: org.apache.spark.sql.DataFrame = [num: decimal(8,2)]

scala> df.show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+


scala> df.write.parquet("num.parquet")

scala> spark.read.parquet("num.parquet").show()
+---------+
|      num|
+---------+
|123456.70|
|123456.71|
|123456.72|
|123456.73|
|123456.74|
|123456.75|
|123456.76|
|123456.77|
|123456.78|
|123456.79|
+---------+

{noformat}



> Reading parquet with Decimal(8,2) written as a Decimal64 blows up
> -
>
> Key: SPARK-34167
> URL: https://issues.apache.org/jira/browse/SPARK-34167
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.1
>Reporter: Raza Jafri
>Priority: Major
> Attachments: 
> part-0-7fecd321-b247-4f7e-bff5-c2e4d8facaa0-c000.snappy.parquet, 
> part-0-940f44f1-f323-4a5e-b828-1e65d87895aa-c000.snappy.parquet
>
>
> When reading a parquet file written with Decimals with precision < 10 as a 
> 64-bit representation, Spark tries to read it as an INT and fails
>  
> Steps to reproduce:
> Read the attached file that has a single Decimal(8,2) column with 10 values
> {code:java}
> scala> spark.read.parquet("/tmp/pyspark_t

[jira] [Commented] (SPARK-34154) Flaky Test: LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750)

2021-01-27 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272867#comment-17272867
 ] 

Attila Zsolt Piros commented on SPARK-34154:


I tried to reproduce this issue by running it locally 100 times, but all of the 
runs were successful.
So my next idea is to reproduce it with the PR builder (and get the stack 
trace of the thread which got stuck):
https://github.com/apache/spark/pull/31363



> Flaky Test: LocalityPlacementStrategySuite.handle large number of containers 
> and tasks (SPARK-18750)
> 
>
> Key: SPARK-34154
> URL: https://issues.apache.org/jira/browse/SPARK-34154
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `LocalityPlacementStrategySuite` sometimes hangs like the following. We can 
> retrigger it, but it takes our resources significantly because it hangs until 
> the timeout (6 hours) occurs.
> [https://github.com/apache/spark/runs/1719480243]
> [https://github.com/apache/spark/runs/1724459002]
> [https://github.com/apache/spark/runs/1717958874]
> [https://github.com/apache/spark/runs/1731673955] (branch-3.0)
> {code:java}
> [info] LocalityPlacementStrategySuite:
> 17299[info] *** Test still running after 3 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17300[info] *** Test still running after 8 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17301[info] *** Test still running after 13 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17302[info] *** Test still running after 18 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17303[info] *** Test still running after 23 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17304[info] *** Test still running after 28 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17305[info] *** Test still running after 33 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17306[info] *** Test still running after 38 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17307[info] *** Test still running after 43 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17308[info] *** Test still running after 48 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17309[info] *** Test still running after 53 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17310[info] *** Test still running after 58 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17311[info] *** Test still running after 1 hour, 3 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17312[info] *** Test still running after 1 hour, 8 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17313[info] *** Test still running after 1 hour, 13 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17314[info] *** Test still running after 1 hour, 18 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17315[info] *** Test still running after 1 hour, 23 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17316[info] *** Test still running after 1 hour, 28 minutes, 6 seconds: suite 
> name: LocalityPlacementStrategySuite, test name: handle large number of 
> containers and tasks (SPARK-18750). 
> 17317[info] *** Test still running after 1 hour, 33 minutes, 6 seconds: suite 
> n

[jira] [Commented] (SPARK-34154) Flaky Test: LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750)

2021-01-27 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272990#comment-17272990
 ] 

Attila Zsolt Piros commented on SPARK-34154:


In my PR the bug has not surfaced even after 200 runs.

I think the root cause could be something with the hostname lookup; I have 
checked and it is called from here:

{code:java}
at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java)
at org.apache.hadoop.net.NetUtils.normalizeHostNames(NetUtils.java:585)
at 
org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:109)
at 
org.apache.spark.deploy.yarn.SparkRackResolver.coreResolve(SparkRackResolver.scala:75)
at 
org.apache.spark.deploy.yarn.SparkRackResolver.resolve(SparkRackResolver.scala:66)
at 
org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.$anonfun$localityOfRequestedContainers$3(LocalityPreferredContainerPlacementStrategy.scala:142)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
at 
org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.localityOfRequestedContainers(LocalityPreferredContainerPlacementStrategy.scala:138)
at 
org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.org$apache$spark$deploy$yarn$LocalityPlacementStrategySuite$$runTest(LocalityPlacementStrategySuite.scala:94)
at 
org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite$$anon$1.run(LocalityPlacementStrategySuite.scala:40)
at java.lang.Thread.run(Thread.java:748)
{code}

To get more information I will modify my code a bit and make it mergeable. This 
way, when it fails again in somebody's PR, we will at least have the stack trace. 
In addition, it will fail fast: it won't run for hour(s), just for 30 seconds.
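
The fail-fast idea can be sketched like this (a minimal illustration, not the 
actual PR; the helper name and the 30-second limit are assumptions):

{code:scala}
// Run the test body in a separate thread; if it is still running after the timeout,
// fail immediately and report the stuck thread's stack trace instead of hanging for hours.
import java.util.concurrent.TimeUnit
import java.util.concurrent.atomic.AtomicReference

def runWithTimeout(timeoutSeconds: Long)(body: => Unit): Unit = {
  val error = new AtomicReference[Throwable]()
  val worker = new Thread(() => {
    try body catch { case t: Throwable => error.set(t) }
  })
  worker.setDaemon(true)
  worker.start()
  worker.join(TimeUnit.SECONDS.toMillis(timeoutSeconds))
  if (worker.isAlive) {
    val trace = worker.getStackTrace.mkString("\n  at ")
    throw new AssertionError(
      s"Test is still running after $timeoutSeconds seconds, stuck at:\n  at $trace")
  }
  val failure = error.get()
  if (failure != null) throw failure
}

// hypothetical usage: runWithTimeout(30) { runTheLargeContainerPlacementTest() }
{code}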


> Flaky Test: LocalityPlacementStrategySuite.handle large number of containers 
> and tasks (SPARK-18750)
> 
>
> Key: SPARK-34154
> URL: https://issues.apache.org/jira/browse/SPARK-34154
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `LocalityPlacementStrategySuite` sometimes hangs like the following. We can 
> retrigger it, but it takes our resources significantly because it hangs until 
> the timeout (6 hours) occurs.
> [https://github.com/apache/spark/runs/1719480243]
> [https://github.com/apache/spark/runs/1724459002]
> [https://github.com/apache/spark/runs/1717958874]
> [https://github.com/apache/spark/runs/1731673955] (branch-3.0)
> {code:java}
> [info] LocalityPlacementStrategySuite:
> 17299[info] *** Test still running after 3 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17300[info] *** Test still running after 8 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17301[info] *** Test still running after 13 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17302[info] *** Test still running after 18 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17303[info] *** Test still running after 23 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17304[info] *** Test still running after 28 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17305[info] *** Test still running after 33 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17306[info] *** Test still running after 38 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17307[info] *** Test still running after 43 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17308[info] *** Test still running after 48 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17309[info] *** Test still running after 53 minutes, 6 seconds: suite name: 
> LocalityPlacementStrategySuite, test name: handle large number of containers 
> and tasks (SPARK-18750). 
> 17310[info] *** Test still runn

[jira] [Comment Edited] (SPARK-34179) examples provided in https://spark.apache.org/docs/latest/api/sql/index.html link not working

2021-01-27 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273348#comment-17273348
 ] 

Attila Zsolt Piros edited comment on SPARK-34179 at 1/28/21, 6:09 AM:
--

[~chetdb] The "https://spark.apache.org/docs/latest/api/sql/index.html" is a 
link to the "latest" version, which is NOT 2.4.5 but 3.0.1.

For 2.4.5 you should check the functions in 
https://spark.apache.org/docs/2.4.5/api/sql/index.html, where there is no 
overlay function at all.

Regarding array_sort, the only problem is the text "Since: 2.4.0", which was 
valid for the old example: """SELECT array_sort(array('b', 'd', null, 'c', 
'a'));""". If something needs to be corrected it might be this text, as it is a 
little bit misleading even though it is true (the function itself was introduced 
in 2.4.0). 




was (Author: attilapiros):
[~chetdb] The "https://spark.apache.org/docs/latest/api/sql/index.html" is a 
link to the "latest" version, which is NOT 2.4.5 but 3.0.1.

For 2.4.5 you should check the functions in 
https://spark.apache.org/docs/2.4.5/api/sql/index.html, where there is no 
overlay function at all.

Regarding array_sort, the only problem is the text "Since: 2.4.0", which was 
valid for the old example: """SELECT array_sort(array('b', 'd', null, 'c', 
'a'));""". If something needs to be corrected it might be this text, as it is a 
little bit misleading even though it is true (the function itself was introduced 
in 2.4.0). 



> examples provided in https://spark.apache.org/docs/latest/api/sql/index.html  
>  link not working
> ---
>
> Key: SPARK-34179
> URL: https://issues.apache.org/jira/browse/SPARK-34179
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
>Reporter: Chetan Bhat
>Priority: Minor
>
> *Issue 1 :*
> *array_sort examples provided in 
> [https://spark.apache.org/docs/latest/api/sql/index.html#array_sort] link not 
> working.*
>  
> SELECT array_sort(array(5, 6, 1), (left, right) -> case when left < right 
> then -1 when left > right then 1 else 0 end); –> *this example when executed 
> in spark-sql fails with below error*
> SELECT array_sort(array(5, 6, 1), (left, right) -> case when left < right 
> then -1 when left > right then 1 else 0 end);
>  Error in query:
>  extraneous input '->' expecting \{')', ','}(line 1, pos 48)
> == SQL ==
>  SELECT array_sort(array(5, 6, 1), (left, right) -> case when left < right 
> then -1 when left > right then 1 else 0 end)
>  ^^^
> spark-sql>
>  
> SELECT array_sort(array('bc', 'ab', 'dc'), (left, right) -> case when left is 
> null and right is null then 0 when left is null then -1 when right is null 
> then 1 when left < right then 1 when left > right then -1 else 0 end);  --> 
> *This example when executed fails with below error*
>  
> spark-sql>
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  > SELECT array_sort(array('bc', 'ab', 'dc'), (left, right) -> case when left 
> is null and right is null then 0 when left is null then -1 when right is null 
> then 1 when left < right then 1 when left > right then -1 else 0 end);
>  Error in query:
>  extraneous input '->' expecting \{')', ','}(line 1, pos 57)
> == SQL ==
>  SELECT array_sort(array('bc', 'ab', 'dc'), (left, right) -> case when left 
> is null and right is null then 0 when left is null then -1 when right is null 
> then 1 when left < right then 1 when left > right then -1 else 0 end)
>  -^^^
> spark-sql>
>  
> *Issue 2 :-*
> *Examples for overlay functions are not working in link - 
> https://spark.apache.org/docs/latest/api/sql/index.html*
>  
> spark-sql> SELECT overlay('Spark SQL' PLACING '_' FROM 6);
> Error in query:
> mismatched input 'PLACING' expecting \{')', ','}(line 1, pos 27)
> == SQL ==
> SELECT overlay('Spark SQL' PLACING '_' FROM 6)
> ---^^^
> spark-sql> SELECT overlay('Spark SQL' PLACING 'CORE' FROM 7);
> Error in query:
> mismatched input 'PLACING' expecting \{')', ','}(line 1, pos 27)
> == SQL ==
> SELECT overlay('Spark SQL' PLACING 'CORE' FROM 7)
> ---^^^
> spark-sql> SELECT overlay('Spark SQL' PLACING 'ANSI ' FROM 7 FOR 0);
> Error in query:
> mismatched input 'PLACING' expecting \{')', ','}(line 1, pos 27)
> == SQL ==
> SELECT overlay('Spark SQL' PLACING 'ANSI ' FROM 7 FOR 0)
> ---^^^
> spark-sql> SELECT overlay('Spark SQL' PLACING 'tructured' FROM 2 FOR 4);
> Error in query:
> mismatched input 'PLACING' expecting \{')', ','}(line 1, pos 27)
> == SQL ==
> SELECT overlay('Spark SQL' 

[jira] [Commented] (SPARK-34179) examples provided in https://spark.apache.org/docs/latest/api/sql/index.html link not working

2021-01-27 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273348#comment-17273348
 ] 

Attila Zsolt Piros commented on SPARK-34179:


[~chetdb] The "https://spark.apache.org/docs/latest/api/sql/index.html" is a 
link to the "latest" version, which is NOT 2.4.5 but 3.0.1.

For 2.4.5 you should check the functions in 
https://spark.apache.org/docs/2.4.5/api/sql/index.html, where there is no 
overlay function at all.

Regarding array_sort, the only problem is the text "Since: 2.4.0", which was 
valid for the old example: """SELECT array_sort(array('b', 'd', null, 'c', 
'a'));""". If something needs to be corrected it might be this text, as it is a 
little bit misleading even though it is true (the function itself was introduced 
in 2.4.0). 



> examples provided in https://spark.apache.org/docs/latest/api/sql/index.html  
>  link not working
> ---
>
> Key: SPARK-34179
> URL: https://issues.apache.org/jira/browse/SPARK-34179
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
>Reporter: Chetan Bhat
>Priority: Minor
>
> *Issue 1 :*
> *array_sort examples provided in 
> [https://spark.apache.org/docs/latest/api/sql/index.html#array_sort] link not 
> working.*
>  
> SELECT array_sort(array(5, 6, 1), (left, right) -> case when left < right 
> then -1 when left > right then 1 else 0 end); –> *this example when executed 
> in spark-sql fails with below error*
> SELECT array_sort(array(5, 6, 1), (left, right) -> case when left < right 
> then -1 when left > right then 1 else 0 end);
>  Error in query:
>  extraneous input '->' expecting \{')', ','}(line 1, pos 48)
> == SQL ==
>  SELECT array_sort(array(5, 6, 1), (left, right) -> case when left < right 
> then -1 when left > right then 1 else 0 end)
>  ^^^
> spark-sql>
>  
> SELECT array_sort(array('bc', 'ab', 'dc'), (left, right) -> case when left is 
> null and right is null then 0 when left is null then -1 when right is null 
> then 1 when left < right then 1 when left > right then -1 else 0 end);  --> 
> *This example when executed fails with below error*
>  
> spark-sql>
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  >
>  > SELECT array_sort(array('bc', 'ab', 'dc'), (left, right) -> case when left 
> is null and right is null then 0 when left is null then -1 when right is null 
> then 1 when left < right then 1 when left > right then -1 else 0 end);
>  Error in query:
>  extraneous input '->' expecting \{')', ','}(line 1, pos 57)
> == SQL ==
>  SELECT array_sort(array('bc', 'ab', 'dc'), (left, right) -> case when left 
> is null and right is null then 0 when left is null then -1 when right is null 
> then 1 when left < right then 1 when left > right then -1 else 0 end)
>  -^^^
> spark-sql>
>  
> *Issue 2 :-*
> *Examples for overlay functions are not working in link - 
> https://spark.apache.org/docs/latest/api/sql/index.html*
>  
> spark-sql> SELECT overlay('Spark SQL' PLACING '_' FROM 6);
> Error in query:
> mismatched input 'PLACING' expecting \{')', ','}(line 1, pos 27)
> == SQL ==
> SELECT overlay('Spark SQL' PLACING '_' FROM 6)
> ---^^^
> spark-sql> SELECT overlay('Spark SQL' PLACING 'CORE' FROM 7);
> Error in query:
> mismatched input 'PLACING' expecting \{')', ','}(line 1, pos 27)
> == SQL ==
> SELECT overlay('Spark SQL' PLACING 'CORE' FROM 7)
> ---^^^
> spark-sql> SELECT overlay('Spark SQL' PLACING 'ANSI ' FROM 7 FOR 0);
> Error in query:
> mismatched input 'PLACING' expecting \{')', ','}(line 1, pos 27)
> == SQL ==
> SELECT overlay('Spark SQL' PLACING 'ANSI ' FROM 7 FOR 0)
> ---^^^
> spark-sql> SELECT overlay('Spark SQL' PLACING 'tructured' FROM 2 FOR 4);
> Error in query:
> mismatched input 'PLACING' expecting \{')', ','}(line 1, pos 27)
> == SQL ==
> SELECT overlay('Spark SQL' PLACING 'tructured' FROM 2 FOR 4)
> ---^^^
> spark-sql> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 
> 'utf-8') FROM 6);
> Error in query:
> mismatched input 'PLACING' expecting \{')', ','}(line 1, pos 44)
> == SQL ==
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 
> 6)
> ^^^
> spark-sql> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('CORE', 
> 'utf-8') FROM 7);
> Error in query:
> mismatched input 'PLACING' expecting \{')', ','}(line 1, pos 44)
> == SQL ==
> SELECT overlay(encode('Spark SQL', 'utf-8') PLACING encode('CORE', 'utf-8') 
> FROM 7)
> --

[jira] [Commented] (SPARK-34126) SQL running error, spark does not exit, resulting in data quality problems

2021-01-28 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273761#comment-17273761
 ] 

Attila Zsolt Piros commented on SPARK-34126:


[~shikui] are you using the "-f" argument of spark-sql?

If yes, please make sure "hive.cli.errors.ignore" is set to false.

The relevant code is:

https://github.com/apache/spark/blob/b350258c88dfd3eddb9392be5b85a3390a2d39e5/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala#L504-L508

As you can see, it must stop at the first error when "ignoreErrors" is false.
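
A paraphrased sketch of that check (not the verbatim Spark source; the method 
name and signature below are illustrative assumptions):

{code:scala}
// The statements of a "-f" file are processed one by one; a non-zero return code
// stops the whole run unless "hive.cli.errors.ignore" (ignoreErrors) is true.
def processStatements(statements: Seq[String], ignoreErrors: Boolean)(run: String => Int): Int = {
  var ret = 0
  for (stmt <- statements if ret == 0 || ignoreErrors) {
    ret = run(stmt)  // 0 on success, non-zero on failure
  }
  ret                // a non-zero code makes spark-sql exit with an error
}
{code}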


> SQL running error, spark does not exit, resulting in data quality problems
> --
>
> Key: SPARK-34126
> URL: https://issues.apache.org/jira/browse/SPARK-34126
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
> Environment: spark3.0.1 on yarn 
>Reporter: shikui ye
>Priority: Major
>
> Spark SQL executes a SQL file containing multiple SQL segments. When one of 
> the SQL segments fails to run, the Spark driver / Spark context does not 
> exit, so an error occurs: the table written by that segment is left empty or 
> with old data. Since the subsequent SQL depends on this problematic table, it 
> will have data quality problems even if it runs successfully.






[jira] [Commented] (SPARK-33763) Add metrics for better tracking of dynamic allocation

2021-01-28 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17273923#comment-17273923
 ] 

Attila Zsolt Piros commented on SPARK-33763:


I started working on this.


> Add metrics for better tracking of dynamic allocation
> -
>
> Key: SPARK-33763
> URL: https://issues.apache.org/jira/browse/SPARK-33763
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> We should add metrics to track the following:
> 1- Graceful decommissions & DA scheduled deletes
> 2- Jobs resubmitted
> 3- Fetch failures
> 4- Unexpected (e.g. non-Spark triggered) executor removals.






[jira] [Commented] (SPARK-34285) Implement Parquet StringEndsWith、StringContains Filter

2021-01-29 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274458#comment-17274458
 ] 

Attila Zsolt Piros commented on SPARK-34285:


[~Xudingyu] predicate pushdown is extremely useful when a column group can be 
dropped altogether.

To support this, statistics containing the min and max value are stored in the 
Parquet file for each group.
In case of "StringStartsWith" you can see that dropping a column group is an 
easy decision (let's say the min is "BBB" and the max is "EEE" in the current 
column group):
- when the pattern is after the max (e.g. "F.*") or
- when the pattern is before the min (e.g. "A.*")
you can safely drop the whole column group (see the sketch below).

Regarding "StringEndsWith" and "StringContains" you cannot make any 
decision based on the min and max value. 
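
To make the decision concrete, here is a small sketch (illustration only, not the 
Spark/Parquet implementation; the helper name is made up):

{code:scala}
// Deciding from the per-group min/max statistics whether the whole group can be
// skipped for a StartsWith(prefix) filter.
def canDropForStartsWith(prefix: String, min: String, max: String): Boolean =
  max < prefix ||                   // every value is <= max, so nothing can start with e.g. "F"
  min.take(prefix.length) > prefix  // every value is >= min, so nothing can start with e.g. "A"

canDropForStartsWith("F", "BBB", "EEE")  // true  -> the group can be dropped
canDropForStartsWith("A", "BBB", "EEE")  // true  -> the group can be dropped
canDropForStartsWith("C", "BBB", "EEE")  // false -> the group must be read

// For EndsWith/Contains there is no analogous test: any value within [min, max]
// may end with or contain the pattern, so the min/max statistics cannot rule it out.
{code}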


> Implement Parquet StringEndsWith、StringContains Filter
> --
>
> Key: SPARK-34285
> URL: https://issues.apache.org/jira/browse/SPARK-34285
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xudingyu
>Priority: Major
>
> When creating parquetFilters, currently only
> {code:java}
> case sources.StringStartsWith(name, prefix)
> {code}
> is implemented. But StringEndsWith and StringContains also exist in 
> /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala
> We can implement these two filters, and rename 
> {code:java}
> PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED 
> {code}
>  to
> {code:java}
> PARQUET_FILTER_PUSHDOWN_STRING_ENABLED 
> {code}






[jira] [Comment Edited] (SPARK-34285) Implement Parquet StringEndsWith、StringContains Filter

2021-01-30 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274458#comment-17274458
 ] 

Attila Zsolt Piros edited comment on SPARK-34285 at 1/31/21, 4:50 AM:
--

[~Xudingyu] predicate pushdown is extremely useful when a column group can be 
dropped altogether.

To support this, statistics containing the min and max value are stored in the 
Parquet file for each group.
In case of "StringStartsWith" you can see that dropping a column group is an 
easy decision (let's say the min is "BBB" and the max is "EEE" in the current 
column group):
- when the pattern is after the max (e.g. "F.*") or
- when the pattern is before the min (e.g. "A.*")
you can safely drop the whole column group.

Regarding "StringEndsWith" and "StringContains" you cannot make any 
decision based on the min and max value (where the min and max come from the 
lexicographical ordering of the strings). 



was (Author: attilapiros):
[~Xudingyu] predicate pushdown is extremely useful when a column group can be 
dropped altogether.

To support this, statistics containing the min and max value are stored in the 
Parquet file for each group.
In case of "StringStartsWith" you can see that dropping a column group is an 
easy decision (let's say the min is "BBB" and the max is "EEE" in the current 
column group):
- when the pattern is after the max (e.g. "F.*") or
- when the pattern is before the min (e.g. "A.*")
you can safely drop the whole column group.

Regarding "StringEndsWith" and "StringContains" you cannot make any 
decision based on the min and max value. 


> Implement Parquet StringEndsWith、StringContains Filter
> --
>
> Key: SPARK-34285
> URL: https://issues.apache.org/jira/browse/SPARK-34285
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xudingyu
>Priority: Major
>
> When creating parquetFilters, currently only
> {code:java}
> case sources.StringStartsWith(name, prefix)
> {code}
> is implemented. But StringEndsWith and StringContains also exist in 
> /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/sources/filters.scala
> We can implement these two filters, and rename 
> {code:java}
> PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED 
> {code}
>  to
> {code:java}
> PARQUET_FILTER_PUSHDOWN_STRING_ENABLED 
> {code}






[jira] [Commented] (SPARK-34280) Avoid migrating un-needed shuffle files

2021-01-30 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275783#comment-17275783
 ] 

Attila Zsolt Piros commented on SPARK-34280:


It would be interesting to know whether the context cleaner was running for 
those blocks beforehand.

If you have the log it would be nice to see whether it contains any of these: 
 - ["Error cleaning 
shuffle"|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala#L223-L237]
 - ["Error deleting 
data"|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/shuffle/IndexShuffleBlockResolver.scala#L107-L121]

If not, re-run it with the logger "org.apache.spark.ContextCleaner" set to 
DEBUG level (see the snippet below).

(I guess this is not much help for you, Holden, but who knows who else reads 
this and looks for answers to similar questions.)
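
One way to enable that DEBUG logging without editing log4j.properties (this 
assumes the log4j 1.x backend shipped with Spark 3.1 and is meant to run on the 
driver, where the ContextCleaner lives):

{code:scala}
// Raise the log level of the ContextCleaner only, e.g. from spark-shell before the re-run.
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.spark.ContextCleaner").setLevel(Level.DEBUG)
{code}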


> Avoid migrating un-needed shuffle files
> ---
>
> Key: SPARK-34280
> URL: https://issues.apache.org/jira/browse/SPARK-34280
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1, 3.1.2
>Reporter: Holden Karau
>Priority: Major
>
> In Spark 3.1 we introduced shuffle migrations. However, it is possible that a 
> shuffle file will still exist after it is no longer needed. I've only 
> observed this in a back port branch with SQL, so I'll do some more digging.






[jira] [Commented] (SPARK-29220) Flaky test: org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]

2021-01-30 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275786#comment-17275786
 ] 

Attila Zsolt Piros commented on SPARK-29220:


I think we can close this: with https://github.com/apache/spark/pull/31363 
the exception is extended with the stack trace of the stuck thread, and 
https://github.com/apache/spark/pull/31397 provides a fix.

> Flaky test: 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.handle large 
> number of containers and tasks (SPARK-18750) [hadoop-3.2][java11]
> --
>
> Key: SPARK-29220
> URL: https://issues.apache.org/jira/browse/SPARK-29220
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests, YARN
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Minor
>
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111229/testReport/]
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111236/testReport/]
> {code:java}
> Error Messageorg.scalatest.exceptions.TestFailedException: 
> java.lang.StackOverflowError did not equal 
> nullStacktracesbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedException: java.lang.StackOverflowError 
> did not equal null
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:527)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.$anonfun$new$1(LocalityPlacementStrategySuite.scala:48)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
>   at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
>   at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
>   at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
>   at 
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
>   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
>   at 
> org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
>   at 
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:396)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
>   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:379)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
>   at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
>   at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
>   at org.scalatest.Suite.run(Suite.scala:1147)
>   at org.scalatest.Suite.run$(Suite.scala:1129)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
>   at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
>   at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
>   at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
>   at 
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:5

[jira] [Commented] (SPARK-29391) Default year-month units

2021-01-31 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275926#comment-17275926
 ] 

Attila Zsolt Piros commented on SPARK-29391:


I am working on this and within a few days a PR will be opened.

> Default year-month units
> 
>
> Key: SPARK-29391
> URL: https://issues.apache.org/jira/browse/SPARK-29391
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> PostgreSQL can assume default year-month units by defaults:
> {code}
> maxim=# SELECT interval '1-2'; 
>interval
> ---
>  1 year 2 mons
> {code}
> but the same produces NULL in Spark:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29391) Default year-month units

2021-01-31 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275952#comment-17275952
 ] 

Attila Zsolt Piros commented on SPARK-29391:


If this helps with the decision: the change itself can be done very elegantly by 
extending the state machine in "IntervalUtils.stringToInterval()" with one 
extra state.
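
Just to illustrate the idea, here is a purely illustrative toy sketch (the 
object and method names are made up; this is not the actual Spark parser):
{code:scala}
// Toy sketch only: a minimal state machine that assumes year-month units
// for a bare "Y-M" interval literal such as "1-2".
object YearMonthSketch {
  sealed trait State
  case object Years  extends State
  case object Months extends State

  /** Parses e.g. "1-2" into Some((1, 2)); returns None for malformed input. */
  def parse(input: String): Option[(Int, Int)] = {
    var state: State = Years
    val years  = new StringBuilder
    val months = new StringBuilder
    var ok = true
    input.trim.foreach { c =>
      (state, c) match {
        case (Years, d) if d.isDigit  => years.append(d)
        case (Years, '-')             => state = Months // the one extra default-unit transition
        case (Months, d) if d.isDigit => months.append(d)
        case _                        => ok = false
      }
    }
    if (ok && years.nonEmpty && months.nonEmpty)
      Some((years.toString.toInt, months.toString.toInt))
    else None
  }
}
{code}
Here parse("1-2") gives Some((1, 2)), i.e. the year-month interpretation that 
PostgreSQL applies to interval '1-2'.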

> Default year-month units
> 
>
> Key: SPARK-29391
> URL: https://issues.apache.org/jira/browse/SPARK-29391
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> PostgreSQL can assume default year-month units by defaults:
> {code}
> maxim=# SELECT interval '1-2'; 
>interval
> ---
>  1 year 2 mons
> {code}
> but the same produces NULL in Spark:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-01-31 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275965#comment-17275965
 ] 

Attila Zsolt Piros commented on SPARK-34194:


[~nchammas] I suggest checking out the config 
"spark.sql.optimizer.metadataOnly".

This way the catalog will be queried directly, but be aware that there is a 
good reason why this config is disabled by default:
[https://github.com/apache/spark/blob/a8eb443bf856dca0882c53e02cf08bf268c4a0ea/sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala#L75-L78]
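
For example, a minimal sketch of trying it out in a Scala session (the 
database/table name below is just a placeholder, and whether the rule actually 
kicks in is subject to the caveats in the code linked above):
{code:scala}
// Sketch only: enable the metadata-only optimization for this session and run
// a query that touches only the partition column of a (placeholder) table.
spark.conf.set("spark.sql.optimizer.metadataOnly", "true")

spark.table("some_db.some_partitioned_table")
  .select("file_date")
  .orderBy(org.apache.spark.sql.functions.col("file_date").desc)
  .limit(1)
  .show()
{code}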

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas to list the top-level {{file_date}} partitions via 
> the AWS CLI take a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness, 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-01 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276874#comment-17276874
 ] 

Attila Zsolt Piros commented on SPARK-34194:


Yes.

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas to list the top-level {{file_date}} partitions via 
> the AWS CLI take a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness, 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34194) Queries that only touch partition columns shouldn't scan through all files

2021-02-01 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276874#comment-17276874
 ] 

Attila Zsolt Piros edited comment on SPARK-34194 at 2/2/21, 6:59 AM:
-

Yes, that is the reason.

[~nchammas] so based on this you should consider closing this issue. 


was (Author: attilapiros):
Yes.

> Queries that only touch partition columns shouldn't scan through all files
> --
>
> Key: SPARK-34194
> URL: https://issues.apache.org/jira/browse/SPARK-34194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> When querying only the partition columns of a partitioned table, it seems 
> that Spark nonetheless scans through all files in the table, even though it 
> doesn't need to.
> Here's an example:
> {code:python}
> >>> data = spark.read.option('mergeSchema', 
> >>> 'false').parquet('s3a://some/dataset')
> [Stage 0:==>  (407 + 12) / 
> 1158]
> {code}
> Note the 1158 tasks. This matches the number of partitions in the table, 
> which is partitioned on a single field named {{file_date}}:
> {code:sh}
> $ aws s3 ls s3://some/dataset | head -n 3
>PRE file_date=2017-05-01/
>PRE file_date=2017-05-02/
>PRE file_date=2017-05-03/
> $ aws s3 ls s3://some/dataset | wc -l
> 1158
> {code}
> The table itself has over 138K files, though:
> {code:sh}
> $ aws s3 ls --recursive --human --summarize s3://some/dataset
> ...
> Total Objects: 138708
>Total Size: 3.7 TiB
> {code}
> Now let's try to query just the {{file_date}} field and see what Spark does.
> {code:python}
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).explain()
> == Physical Plan ==
> TakeOrderedAndProject(limit=1, orderBy=[file_date#11 DESC NULLS LAST], 
> output=[file_date#11])
> +- *(1) ColumnarToRow
>+- FileScan parquet [file_date#11] Batched: true, DataFilters: [], Format: 
> Parquet, Location: InMemoryFileIndex[s3a://some/dataset], PartitionFilters: 
> [], PushedFilters: [], ReadSchema: struct<>
> >>> data.select('file_date').orderBy('file_date', 
> >>> ascending=False).limit(1).show()
> [Stage 2:>   (179 + 12) / 
> 41011]
> {code}
> Notice that Spark has spun up 41,011 tasks. Maybe more will be needed as the 
> job progresses? I'm not sure.
> What I do know is that this operation takes a long time (~20 min) running 
> from my laptop, whereas to list the top-level {{file_date}} partitions via 
> the AWS CLI take a second or two.
> Spark appears to be going through all the files in the table, when it just 
> needs to list the partitions captured in the S3 "directory" structure. The 
> query is only touching {{file_date}}, after all.
> The current workaround for this performance problem / optimizer wastefulness, 
> is to [query the catalog 
> directly|https://stackoverflow.com/a/65724151/877069]. It works, but is a lot 
> of extra work compared to the elegant query against {{file_date}} that users 
> actually intend.
> Spark should somehow know when it is only querying partition fields and skip 
> iterating through all the individual files in a table.
> Tested on Spark 3.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34334) ExecutorPodsAllocator fails to identify some excess requests during downscaling

2021-02-02 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277436#comment-17277436
 ] 

Attila Zsolt Piros commented on SPARK-34334:


I am working on this.

> ExecutorPodsAllocator fails to identify some excess requests during 
> downscaling
> ---
>
> Key: SPARK-34334
> URL: https://issues.apache.org/jira/browse/SPARK-34334
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Minor
>
> During downscaling there are two kinds of POD requests which can lead to POD 
> deletion as it is identified as excess requests by the dynamic allocation:
>  * timed out newly created POD requests (which even haven't reached the Pod 
> Pending state, yet) 
>  * timed out pending pod requests
> The current implementation fails to delete a timed out pending pod requests 
> when there are not enough timed out newly created POD requests to delete but 
> there are some non-timed out ones.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34334) ExecutorPodsAllocator fails to identify some excess requests during downscaling

2021-02-02 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34334:
--

 Summary: ExecutorPodsAllocator fails to identify some excess 
requests during downscaling
 Key: SPARK-34334
 URL: https://issues.apache.org/jira/browse/SPARK-34334
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.1.0, 3.2.0
Reporter: Attila Zsolt Piros


During downscaling there are two kinds of POD requests which can lead to POD 
deletion, as they are identified as excess requests by dynamic allocation:
 * timed out newly created POD requests (which haven't even reached the Pod 
Pending state yet)
 * timed out pending pod requests

The current implementation fails to delete a timed out pending pod request 
when there are not enough timed out newly created POD requests to delete but 
there are some non-timed out ones.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33763) Add metrics for better tracking of dynamic allocation

2021-02-02 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277675#comment-17277675
 ] 

Attila Zsolt Piros commented on SPARK-33763:


I am ready with the executor removals (1 and 4 from the above list), but I would 
like to discuss the resubmitted stages (I think you probably meant stages and not 
jobs) and the fetch failures.

I thought about these two missing metrics (and checked the code too) and I have 
a suggestion: let's combine the two into one single metric, stages resubmitted 
because of a fetch failure.

Justification: the number of fetch failures is very much dependent on the 
cluster size (and even more on the data). When one executor is down, all the 
others fetching from that failed one will report a fetch failure. So this 
information is not that helpful on its own, as it depends on how many reducers 
refer to that single mapper.

[~holden] what do you think? 

> Add metrics for better tracking of dynamic allocation
> -
>
> Key: SPARK-33763
> URL: https://issues.apache.org/jira/browse/SPARK-33763
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> We should add metrics to track the following:
> 1- Graceful decommissions & DA scheduled deletes
> 2- Jobs resubmitted
> 3- Fetch failures
> 4- Unexpected (e.g. non-Spark triggered) executor removals.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29391) Default year-month units

2021-02-03 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros resolved SPARK-29391.

Resolution: Won't Fix

As PostgreSQL feature parity is not a requirement anymore and this syntax is 
not common in other databases.


> Default year-month units
> 
>
> Key: SPARK-29391
> URL: https://issues.apache.org/jira/browse/SPARK-29391
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> PostgreSQL can assume default year-month units by defaults:
> {code}
> maxim=# SELECT interval '1-2'; 
>interval
> ---
>  1 year 2 mons
> {code}
> but the same produces NULL in Spark:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29391) Default year-month units

2021-02-03 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278006#comment-17278006
 ] 

Attila Zsolt Piros edited comment on SPARK-29391 at 2/3/21, 1:35 PM:
-

Closing this as PostgreSQL feature parity is not a requirement anymore and this 
syntax is not common in other databases.



was (Author: attilapiros):
As PostgreSQL feature parity is not a requirement anymore and this syntax is 
not common in other databases.

,

> Default year-month units
> 
>
> Key: SPARK-29391
> URL: https://issues.apache.org/jira/browse/SPARK-29391
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> PostgreSQL can assume default year-month units by defaults:
> {code}
> maxim=# SELECT interval '1-2'; 
>interval
> ---
>  1 year 2 mons
> {code}
> but the same produces NULL in Spark:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29391) Default year-month units

2021-02-03 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278010#comment-17278010
 ] 

Attila Zsolt Piros commented on SPARK-29391:


This is fine, but learning from this case we should check the similar jiras and 
close the ones which are not needed anymore.
Otherwise this might cause some unnecessary frustration.

[~maxgekk] do you agree? Could you please help me with this goal?
If you make a list of similar jiras (or create a jira filter), we can divide 
the work and I can help you with the validation of the issues.

> Default year-month units
> 
>
> Key: SPARK-29391
> URL: https://issues.apache.org/jira/browse/SPARK-29391
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> PostgreSQL can assume default year-month units by defaults:
> {code}
> maxim=# SELECT interval '1-2'; 
>interval
> ---
>  1 year 2 mons
> {code}
> but the same produces NULL in Spark:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29391) Default year-month units

2021-02-03 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278015#comment-17278015
 ] 

Attila Zsolt Piros commented on SPARK-29391:


I found this: https://issues.apache.org/jira/browse/SPARK-27764 :)

> Default year-month units
> 
>
> Key: SPARK-29391
> URL: https://issues.apache.org/jira/browse/SPARK-29391
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> PostgreSQL can assume default year-month units by defaults:
> {code}
> maxim=# SELECT interval '1-2'; 
>interval
> ---
>  1 year 2 mons
> {code}
> but the same produces NULL in Spark:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27764) Feature Parity between PostgreSQL and Spark

2021-02-03 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278065#comment-17278065
 ] 

Attila Zsolt Piros commented on SPARK-27764:


Sorry to chime in, but we have (after SPARK-30374 and SPARK-30375) 55 open 
sub-tasks; are these 55 sub-tasks closeable?
I am asking because I happened to start work on one of them (it was SPARK-29391) 
and after some work it turned out it was not needed, which is a bit frustrating.

I am offering my help too, to avoid further unnecessary frustration for others 
who would start to work on an issue, but we have to discuss our strategy first.
If they are closeable then let's close them. If we need to go through them one 
by one then let's divide the work among us.
Or should I just go ahead and give a one or two week grace period before closing them?
How can I help here?

> Feature Parity between PostgreSQL and Spark
> ---
>
> Key: SPARK-27764
> URL: https://issues.apache.org/jira/browse/SPARK-27764
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> PostgreSQL is one of the most advanced open source databases. This umbrella 
> Jira is trying to track the missing features and bugs. 
> UPDATED: This umbrella tickets basically intend to include bug reports and 
> general issues for the feature parity. For implementation-dependent 
> behaviours and ANS/SQL standard topics, you need to check the two umbrella 
> below;
>  - SPARK-30374 Feature Parity between PostgreSQL and Spark (ANSI/SQL)
>  - SPARK-30375 Feature Parity between PostgreSQL and Spark 
> (implementation-dependent behaviours)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34361) Dynamic allocation on K8s kills executors with running tasks

2021-02-04 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34361:
--

 Summary: Dynamic allocation on K8s kills executors with running 
tasks
 Key: SPARK-34361
 URL: https://issues.apache.org/jira/browse/SPARK-34361
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.1, 3.0.0, 3.0.2, 3.1.0, 3.2.0, 3.1.1, 3.1.2
Reporter: Attila Zsolt Piros


There is a race between the executor POD allocator and the cluster scheduler 
backend. During downscaling (in dynamic allocation) we experienced a lot of 
killed new executors with running tasks on them.

The pattern in the log is the following:

{noformat}
21/02/01 15:12:03 INFO ExecutorMonitor: New executor 312 has registered (new 
total is 138)
...
21/02/01 15:12:03 INFO TaskSetManager: Starting task 247.0 in stage 4.0 (TID 
2079, 100.100.18.138, executor 312, partition 247, PROCESS_LOCAL, 8777 bytes)
21/02/01 15:12:03 INFO ExecutorPodsAllocator: Deleting 3 excess pod requests 
(408,312,307).
...
21/02/01 15:12:04 ERROR TaskSchedulerImpl: Lost executor 312 on 100.100.18.138: 
The executor with id 312 was deleted by a user or the framework.
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34361) Dynamic allocation on K8s kills executors with running tasks

2021-02-04 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278932#comment-17278932
 ] 

Attila Zsolt Piros commented on SPARK-34361:


I am working on this.

> Dynamic allocation on K8s kills executors with running tasks
> 
>
> Key: SPARK-34361
> URL: https://issues.apache.org/jira/browse/SPARK-34361
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.2.0, 3.1.1, 3.1.2
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> There is race between executor POD allocator and cluster scheduler backend. 
> During downscaling (in dynamic allocation) we experienced a lot of killed new 
> executors with running task on them.
> The pattern in the log is the following:
> {noformat}
> 21/02/01 15:12:03 INFO ExecutorMonitor: New executor 312 has registered (new 
> total is 138)
> ...
> 21/02/01 15:12:03 INFO TaskSetManager: Starting task 247.0 in stage 4.0 (TID 
> 2079, 100.100.18.138, executor 312, partition 247, PROCESS_LOCAL, 8777 bytes)
> 21/02/01 15:12:03 INFO ExecutorPodsAllocator: Deleting 3 excess pod requests 
> (408,312,307).
> ...
> 21/02/01 15:12:04 ERROR TaskSchedulerImpl: Lost executor 312 on 
> 100.100.18.138: The executor with id 312 was deleted by a user or the 
> framework.
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"

2021-02-04 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34370:
--

 Summary: Supporting Avro schema evolution for partitioned Hive 
tables using "avro.schema.url"
 Key: SPARK-34370
 URL: https://issues.apache.org/jira/browse/SPARK-34370
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 2.4.0, 2.3.1, 3.1.0, 3.2.0
Reporter: Attila Zsolt Piros






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"

2021-02-04 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34370:
---
Description: 
This came up in 
https://github.com/apache/spark/pull/31133#issuecomment-773567152.


> Supporting Avro schema evolution for partitioned Hive tables using 
> "avro.schema.url"
> 
>
> Key: SPARK-34370
> URL: https://issues.apache.org/jira/browse/SPARK-34370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up in 
> https://github.com/apache/spark/pull/31133#issuecomment-773567152.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"

2021-02-04 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279156#comment-17279156
 ] 

Attila Zsolt Piros commented on SPARK-34370:


I am working on this.

> Supporting Avro schema evolution for partitioned Hive tables using 
> "avro.schema.url"
> 
>
> Key: SPARK-34370
> URL: https://issues.apache.org/jira/browse/SPARK-34370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up in 
> https://github.com/apache/spark/pull/31133#issuecomment-773567152.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"

2021-02-04 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34370:
---
Description: 
This came up in 
https://github.com/apache/spark/pull/31133#issuecomment-773567152.


The use case is the following there is a partitioned Hive table with Avro data. 
The schema is specified via 
https://github.com/apache/spark/pull/31133#discussion_r570561321

  was:
This came up in 
https://github.com/apache/spark/pull/31133#issuecomment-773567152.


The use case is the following there is a partitioned Hive table with Avro data. 
The schema is specified via 


> Supporting Avro schema evolution for partitioned Hive tables using 
> "avro.schema.url"
> 
>
> Key: SPARK-34370
> URL: https://issues.apache.org/jira/browse/SPARK-34370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up in 
> https://github.com/apache/spark/pull/31133#issuecomment-773567152.
> The use case is the following there is a partitioned Hive table with Avro 
> data. The schema is specified via 
> https://github.com/apache/spark/pull/31133#discussion_r570561321



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"

2021-02-04 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34370:
---
Description: 
This came up in 
https://github.com/apache/spark/pull/31133#issuecomment-773567152.


The use case is the following there is a partitioned Hive table with Avro data. 
The schema is specified via 

  was:
This came up in 
https://github.com/apache/spark/pull/31133#issuecomment-773567152.



> Supporting Avro schema evolution for partitioned Hive tables using 
> "avro.schema.url"
> 
>
> Key: SPARK-34370
> URL: https://issues.apache.org/jira/browse/SPARK-34370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up in 
> https://github.com/apache/spark/pull/31133#issuecomment-773567152.
> The use case is the following there is a partitioned Hive table with Avro 
> data. The schema is specified via 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"

2021-02-04 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34370:
---
Description: 
This came up in 
https://github.com/apache/spark/pull/31133#issuecomment-773567152.


The use case is the following there is a partitioned Hive table with Avro data. 
The schema is specified via "avro.schema.url".
With time the schema is evolved and the new schema is set for the table 
"avro.schema.url" when data is read from the old p

  was:
This came up in 
https://github.com/apache/spark/pull/31133#issuecomment-773567152.


The use case is the following there is a partitioned Hive table with Avro data. 
The schema is specified via 
https://github.com/apache/spark/pull/31133#discussion_r570561321


> Supporting Avro schema evolution for partitioned Hive tables using 
> "avro.schema.url"
> 
>
> Key: SPARK-34370
> URL: https://issues.apache.org/jira/browse/SPARK-34370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up in 
> https://github.com/apache/spark/pull/31133#issuecomment-773567152.
> The use case is the following there is a partitioned Hive table with Avro 
> data. The schema is specified via "avro.schema.url".
> With time the schema is evolved and the new schema is set for the table 
> "avro.schema.url" when data is read from the old p



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"

2021-02-04 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34370:
---
Description: 
This came up in 
https://github.com/apache/spark/pull/31133#issuecomment-773567152.


The use case is the following there is a partitioned Hive table with Avro data. 
The schema is specified via "avro.schema.url".
With time the schema is evolved and the new schema is set for the table 
"avro.schema.url" when data is read from the old partition this new evolved 
schema must be used.

  was:
This came up in 
https://github.com/apache/spark/pull/31133#issuecomment-773567152.


The use case is the following there is a partitioned Hive table with Avro data. 
The schema is specified via "avro.schema.url".
With time the schema is evolved and the new schema is set for the table 
"avro.schema.url" when data is read from the old p


> Supporting Avro schema evolution for partitioned Hive tables using 
> "avro.schema.url"
> 
>
> Key: SPARK-34370
> URL: https://issues.apache.org/jira/browse/SPARK-34370
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up in 
> https://github.com/apache/spark/pull/31133#issuecomment-773567152.
> The use case is the following there is a partitioned Hive table with Avro 
> data. The schema is specified via "avro.schema.url".
> With time the schema is evolved and the new schema is set for the table 
> "avro.schema.url" when data is read from the old partition this new evolved 
> schema must be used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation

2021-02-08 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281134#comment-17281134
 ] 

Attila Zsolt Piros commented on SPARK-34389:


[~ranju] I understand your concern, but Spark could get the missing resources at 
any time (depending on the load of the k8s cluster): it might take just a few 
minutes or hours, or it might never happen, but waiting for the resources and 
using the existing one(s) is also an answer to this problem.

And from the log you can see that Spark uses that single successfully allocated 
executor, as 89 jobs have finished.

From the Spark side the request is sent and we assume the resource manager will 
eventually satisfy it. This is very similar to why we do not have retry logic 
for unsatisfied requests:
 
[https://github.com/apache/spark/blob/c03258ebef8fdd1c3d83c1fe5b77732a2069aa53/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L69-L70]

So this problem is external to Spark. The minExecutors setting in this sense 
means the minimum number of executors which will be requested (and during 
downscaling that minimum number of executors will be protected from being 
killed by the driver).

If you accept my explanation, please close this issue.

> Spark job on Kubernetes scheduled For Zero or less than minimum number of 
> executors and Wait indefinitely under resource starvation
> ---
>
> Key: SPARK-34389
> URL: https://issues.apache.org/jira/browse/SPARK-34389
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Ranju
>Priority: Major
> Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt, 
> Steps to reproduce.docx
>
>
> In case Cluster does not have sufficient resource (CPU/ Memory ) for minimum 
> number of executors , the executors goes in Pending State for indefinite time 
> until the resource gets free.
> Suppose, Cluster Configurations are:
> total Memory=204Gi
> used Memory=200Gi
> free memory= 4Gi
> SPARK.EXECUTOR.MEMORY=10G
> SPARK.DYNAMICALLOCTION.MINEXECUTORS=4
> SPARK.DYNAMICALLOCATION.MAXEXECUTORS=8
> Rather, the job should be cancelled if requested number of minimum executors 
> are not available at that point of time because of resource unavailability.
> Currently it is doing partial scheduling or no scheduling and waiting 
> indefinitely. And the job got stuck.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation

2021-02-09 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281645#comment-17281645
 ] 

Attila Zsolt Piros commented on SPARK-34389:


> is it possible to get the available resources of the cluster and match it 
> with the required executor resources and if it satisfies then submits the job

[~ranju] This is out of the scope of Spark. A quick search gave me this: 
https://github.com/kubernetes/kubernetes/issues/17512. You can try out the 
suggestions there, and if you find one good enough for your case, you can use 
it in a script which starts Spark only when the resources are available.
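
For example, just as a sketch (assuming kubectl access to the cluster; the 
output format may differ between Kubernetes versions), the per-node allocatable 
and already-requested resources can be listed with:
{noformat}
kubectl describe nodes | grep -A 8 "Allocated resources"
{noformat}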

> Spark job on Kubernetes scheduled For Zero or less than minimum number of 
> executors and Wait indefinitely under resource starvation
> ---
>
> Key: SPARK-34389
> URL: https://issues.apache.org/jira/browse/SPARK-34389
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Ranju
>Priority: Major
> Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt, 
> Steps to reproduce.docx
>
>
> In case Cluster does not have sufficient resource (CPU/ Memory ) for minimum 
> number of executors , the executors goes in Pending State for indefinite time 
> until the resource gets free.
> Suppose, Cluster Configurations are:
> total Memory=204Gi
> used Memory=200Gi
> free memory= 4Gi
> SPARK.EXECUTOR.MEMORY=10G
> SPARK.DYNAMICALLOCTION.MINEXECUTORS=4
> SPARK.DYNAMICALLOCATION.MAXEXECUTORS=8
> Rather, the job should be cancelled if requested number of minimum executors 
> are not available at that point of time because of resource unavailability.
> Currently it is doing partial scheduling or no scheduling and waiting 
> indefinitely. And the job got stuck.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34372) Speculation results in broken CSV files in Amazon S3

2021-02-09 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281815#comment-17281815
 ] 

Attila Zsolt Piros commented on SPARK-34372:


A direct output committer combined with speculation can lead to this kind of 
problem, and even to data loss.

Please check this out
 https://issues.apache.org/jira/browse/SPARK-10063

Although DirectParquetOutputCommitter has been removed, you are still using 
DirectFileOutputCommitter.

There must be a warning in the logs:
[https://github.com/apache/spark/blob/18b30107adb37d3c7a767a20cc02813f0fdb86da/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L1050-L1057]
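
As a side note, a sketch of one way to rule out this failure mode while you 
investigate (not a full solution; the long-term fix is an output committer that 
is safe with speculative execution) is to turn speculation off for these writes 
via the standard Spark config:
{noformat}
--conf spark.speculation=false
{noformat}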

> Speculation results in broken CSV files in Amazon S3
> 
>
> Key: SPARK-34372
> URL: https://issues.apache.org/jira/browse/SPARK-34372
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.7
> Environment: Amazon EMR with AMI version 5.32.0
>Reporter: Daehee Han
>Priority: Minor
>  Labels: csv, s3, spark, speculation
>
> Hi, we've been experiencing some rows get corrupted while partitioned CSV 
> files were written to Amazon S3. Some records were found broken without any 
> error on Spark. Digging into the root cause, we found out Spark speculation 
> tried to upload a partition being uploaded slowly and ended up uploading only 
> a part of the partition, letting broken data uploaded to S3.
> Here're stacktraces we've found. There are two executor involved - A: the 
> first executor which tried to upload the file, but it took much longer than 
> other executor (but still succeeded), which made spark speculation cut in and 
> kick off another executor B. Executor B started to upload the file too, but 
> was interrupted during uploading (killed: another attempt succeeded), and 
> ended up uploading only a part of the whole file. You can see in the log, the 
> file executor A uploaded (8461990 bytes originally) was overwritten by 
> executor B (uploaded only 3145728 bytes).
>  
> Executor A:
> {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID 
> 13201) 
>  21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty 
> blocks including 10 local blocks and 460 remote blocks 
>  21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote 
> fetches in 18 ms 
>  21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2 
>  21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> true 
>  21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED 
>  21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class
>  21/01/28 17:22:21 INFO  INFO CSEMultipartUploadOutputStream: close 
> closed:false 
> s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv
>  21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart 
> upload of 1 parts 8461990 bytes 
>  21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading 
> \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. 
> Elapsed seconds: 10. 
>  21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of 
> task because needsTaskCommit=false: 
> attempt_20210128172219_0045_m_000426_13201 
>  21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID 
> 13201). 8782 bytes result sent to driver
> {quote}
> Executor B:
> {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task 
> 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID 
> 13245) 
>  21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty 
> blocks including 11 local blocks and 459 remote blocks 
>  21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote 
> fetches in 2 ms 
>  21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2 
>  21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> true 
>  21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED 
>  21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer 
> class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter 
>  21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in 
> stage 45.0 (TID 13245), reason: another attempt succeeded 
>  21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false 
> s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 
>  21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart

[jira] [Comment Edited] (SPARK-33763) Add metrics for better tracking of dynamic allocation

2021-02-09 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282201#comment-17282201
 ] 

Attila Zsolt Piros edited comment on SPARK-33763 at 2/10/21, 3:30 AM:
--

I am not positive about the "stage re-submitted because of fetch failure" 
solution either, as "stages.failedStages.count" is already available and most 
failed stages are retried.

When the tests on my PR (which contains the counter metrics for the different 
loss reasons) are finished, I will reopen it as a non-WIP PR (or remove the WIP 
label).


was (Author: attilapiros):
I am not positive about the "stage re-submitted because of fetch failure" 
solution too as "stages.failedStages.count" is already available and most 
failed stages are retried.

When the tests on my PR (which contains the counter metrics for the different 
loss reasons) are finished I will reopen it as non-WIP PR (or remove the WIP 
label).{{}}

> Add metrics for better tracking of dynamic allocation
> -
>
> Key: SPARK-33763
> URL: https://issues.apache.org/jira/browse/SPARK-33763
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> We should add metrics to track the following:
> 1- Graceful decommissions & DA scheduled deletes
> 2- Jobs resubmitted
> 3- Fetch failures
> 4- Unexpected (e.g. non-Spark triggered) executor removals.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33763) Add metrics for better tracking of dynamic allocation

2021-02-09 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282201#comment-17282201
 ] 

Attila Zsolt Piros commented on SPARK-33763:


I am not positive about the "stage re-submitted because of fetch failure" 
solution too as "stages.failedStages.count" is already available and most 
failed stages are retried.

When the tests on my PR (which contains the counter metrics for the different 
loss reasons) are finished I will reopen it as non-WIP PR (or remove the WIP 
label).{{}}

> Add metrics for better tracking of dynamic allocation
> -
>
> Key: SPARK-33763
> URL: https://issues.apache.org/jira/browse/SPARK-33763
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>
> We should add metrics to track the following:
> 1- Graceful decommissions & DA scheduled deletes
> 2- Jobs resubmitted
> 3- Fetch failures
> 4- Unexpected (e.g. non-Spark triggered) executor removals.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation

2021-02-10 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282614#comment-17282614
 ] 

Attila Zsolt Piros commented on SPARK-34389:


[~ranju] the description of this config is a little bit misleading, because the 
word "pending" has several meanings (it is a little bit overloaded). Here it 
means a request that is newly created and not yet processed by k8s, which is 
different from POD requests that have been accepted by k8s and are in the 
PENDING state.

So for these newly created pending requests we have the timeout: 
[https://github.com/apache/spark/blob/1fbd5764105e2c09caf4ab57a7095dd794307b02/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L126-L134]

And not for PENDING state pod requests:
[https://github.com/apache/spark/blob/1fbd5764105e2c09caf4ab57a7095dd794307b02/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L193-L196]

 

I can correct the description in one of my existing open PRs (i.e. 
[https://github.com/apache/spark/pull/31513]).

 

> Spark job on Kubernetes scheduled For Zero or less than minimum number of 
> executors and Wait indefinitely under resource starvation
> ---
>
> Key: SPARK-34389
> URL: https://issues.apache.org/jira/browse/SPARK-34389
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Ranju
>Priority: Major
> Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt, 
> Steps to reproduce.docx
>
>
> In case Cluster does not have sufficient resource (CPU/ Memory ) for minimum 
> number of executors , the executors goes in Pending State for indefinite time 
> until the resource gets free.
> Suppose, Cluster Configurations are:
> total Memory=204Gi
> used Memory=200Gi
> free memory= 4Gi
> SPARK.EXECUTOR.MEMORY=10G
> SPARK.DYNAMICALLOCTION.MINEXECUTORS=4
> SPARK.DYNAMICALLOCATION.MAXEXECUTORS=8
> Rather, the job should be cancelled if requested number of minimum executors 
> are not available at that point of time because of resource unavailability.
> Currently it is doing partial scheduling or no scheduling and waiting 
> indefinitely. And the job got stuck.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation

2021-02-10 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282871#comment-17282871
 ] 

Attila Zsolt Piros commented on SPARK-34389:


@Ranju from the logs you attached it is clear that all the POD requests were 
processed and reached the PENDING state; check this part from [^Steps to 
reproduce.docx]:

 
{noformat}
exportdata-66af7377812a3b0c-exec-1      1/1     Running       0              
20s                                             

exportdata-66af7377812a3b0c-exec-2      0/1     Pending       0              
19s                                                           
                                              

{noformat}
From such an overloaded k8s cluster, where the available memory is 12GB and 
executors with 10GB are requested (and 1 is already allocated, so the available 
memory is now 2GB), there is no chance to allocate more. So things just work 
as expected.

You have three options: increasing the cluster size, decreasing the load on the 
cluster, or requesting executors with less memory.

> Spark job on Kubernetes scheduled For Zero or less than minimum number of 
> executors and Wait indefinitely under resource starvation
> ---
>
> Key: SPARK-34389
> URL: https://issues.apache.org/jira/browse/SPARK-34389
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Ranju
>Priority: Major
> Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt, 
> Steps to reproduce.docx
>
>
> In case Cluster does not have sufficient resource (CPU/ Memory ) for minimum 
> number of executors , the executors goes in Pending State for indefinite time 
> until the resource gets free.
> Suppose, Cluster Configurations are:
> total Memory=204Gi
> used Memory=200Gi
> free memory= 4Gi
> SPARK.EXECUTOR.MEMORY=10G
> SPARK.DYNAMICALLOCTION.MINEXECUTORS=4
> SPARK.DYNAMICALLOCATION.MAXEXECUTORS=8
> Rather, the job should be cancelled if requested number of minimum executors 
> are not available at that point of time because of resource unavailability.
> Currently it is doing partial scheduling or no scheduling and waiting 
> indefinitely. And the job got stuck.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation

2021-02-11 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282994#comment-17282994
 ] 

Attila Zsolt Piros commented on SPARK-34389:


[~ranju] as I see it, your question is more about the timeout length of the POD 
allocation.
The timeout was added by this PR: 
[https://github.com/apache/spark/pull/30155] and the whole PR is about this 
timeout.

As you can see from the PR description, the default value was chosen based on 
real-world clusters under load, and you can increase it if needed (an 
illustrative example follows below).
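For illustration only, assuming the configuration key introduced by that PR is 
spark.kubernetes.allocation.executor.timeout (please double check the exact 
name and default against your Spark version), increasing it could look like:

{noformat}
--conf spark.kubernetes.allocation.executor.timeout=900s
{noformat}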

> Spark job on Kubernetes scheduled For Zero or less than minimum number of 
> executors and Wait indefinitely under resource starvation
> ---
>
> Key: SPARK-34389
> URL: https://issues.apache.org/jira/browse/SPARK-34389
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Ranju
>Priority: Major
> Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt, 
> Steps to reproduce.docx
>
>
> In case the cluster does not have sufficient resources (CPU/memory) for the 
> minimum number of executors, the executors stay in the Pending state for an 
> indefinite time until resources become free.
> Suppose the cluster configuration is:
> total Memory=204Gi
> used Memory=200Gi
> free memory= 4Gi
> SPARK.EXECUTOR.MEMORY=10G
> SPARK.DYNAMICALLOCATION.MINEXECUTORS=4
> SPARK.DYNAMICALLOCATION.MAXEXECUTORS=8
> Rather, the job should be cancelled if the requested minimum number of 
> executors is not available at that point in time because of resource 
> unavailability.
> Currently it does partial scheduling or no scheduling and waits indefinitely, 
> and the job gets stuck.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34389) Spark job on Kubernetes scheduled For Zero or less than minimum number of executors and Wait indefinitely under resource starvation

2021-02-12 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282614#comment-17282614
 ] 

Attila Zsolt Piros edited comment on SPARK-34389 at 2/12/21, 8:36 AM:
--

[~ranju] the description of this config is a little bit misleading, because the 
word "pending" has more meanings (it is a little bit overloaded). Here it means 
a POD request which is newly created and not yet processed by k8s, which is 
different from a POD request already accepted by k8s and being in the PENDING 
state.

So the timeout applies to these newly created pending requests:
 
[https://github.com/apache/spark/blob/1fbd5764105e2c09caf4ab57a7095dd794307b02/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L126-L134]

And not to POD requests in the PENDING state:
 
[https://github.com/apache/spark/blob/1fbd5764105e2c09caf4ab57a7095dd794307b02/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L193-L196]

I can correct the description in one of my existing open PRs (i.e. 
[https://github.com/apache/spark/pull/31513]).


was (Author: attilapiros):
[~ranju] the description of this config is a little bit misleading. Because the 
word "pending" has more meanings (a little bit overloaded). Here it means it is 
newly created and not processed by k8s which is different from the POD requests 
accepted by k8s and being in PENDING state.

So for this newly created pendings requests we have the timeout: 
[https://github.com/apache/spark/blob/1fbd5764105e2c09caf4ab57a7095dd794307b02/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L126-L134]

And not for PENDING state pod requests:
[https://github.com/apache/spark/blob/1fbd5764105e2c09caf4ab57a7095dd794307b02/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala#L193-L196]

 

I  can correct the description in one of my existing open PR (i.e 
[https://github.com/apache/spark/pull/31513).|https://github.com/apache/spark/pull/31513)]

 

> Spark job on Kubernetes scheduled For Zero or less than minimum number of 
> executors and Wait indefinitely under resource starvation
> ---
>
> Key: SPARK-34389
> URL: https://issues.apache.org/jira/browse/SPARK-34389
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1
>Reporter: Ranju
>Priority: Major
> Attachments: DriverLogs_ExecutorLaunchedLessThanMinExecutor.txt, 
> Steps to reproduce.docx
>
>
> In case the cluster does not have sufficient resources (CPU/memory) for the 
> minimum number of executors, the executors stay in the Pending state for an 
> indefinite time until resources become free.
> Suppose the cluster configuration is:
> total Memory=204Gi
> used Memory=200Gi
> free memory= 4Gi
> SPARK.EXECUTOR.MEMORY=10G
> SPARK.DYNAMICALLOCATION.MINEXECUTORS=4
> SPARK.DYNAMICALLOCATION.MAXEXECUTORS=8
> Rather, the job should be cancelled if the requested minimum number of 
> executors is not available at that point in time because of resource 
> unavailability.
> Currently it does partial scheduling or no scheduling and waits indefinitely, 
> and the job gets stuck.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34426) Add driver and executors POD logs to integration tests log when the test fails

2021-02-12 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34426:
---
Summary: Add driver and executors POD logs to integration tests log when 
the test fails  (was: Add driver and executors POD logs to integration-tests 
log when the test fails)

> Add driver and executors POD logs to integration tests log when the test fails
> --
>
> Key: SPARK-34426
> URL: https://issues.apache.org/jira/browse/SPARK-34426
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Now both the driver and executor logs are lost.
> In https://spark.apache.org/developer-tools.html there is a hint:
> "Getting logs from the pods and containers directly is an exercise left to 
> the reader."
> But when the test is executed by Jenkins and there is a failure, we really 
> need the POD logs to analyze the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34426) Add driver and executors POD logs to integration-tests log when the test fails

2021-02-12 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283974#comment-17283974
 ] 

Attila Zsolt Piros commented on SPARK-34426:


I am working on this.

> Add driver and executors POD logs to integration-tests log when the test fails
> --
>
> Key: SPARK-34426
> URL: https://issues.apache.org/jira/browse/SPARK-34426
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Now both the driver and executor logs are lost.
> In https://spark.apache.org/developer-tools.html there is a hint:
> "Getting logs from the pods and containers directly is an exercise left to 
> the reader."
> But when the test is executed by Jenkins and there is a failure, we really 
> need the POD logs to analyze the problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34426) Add driver and executors POD logs to integration-tests log when the test fails

2021-02-12 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34426:
--

 Summary: Add driver and executors POD logs to integration-tests 
log when the test fails
 Key: SPARK-34426
 URL: https://issues.apache.org/jira/browse/SPARK-34426
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.2.0
Reporter: Attila Zsolt Piros


Now both the driver and executor logs are lost.

In https://spark.apache.org/developer-tools.html there is a hint:
"Getting logs from the pods and containers directly is an exercise left to the 
reader."

But when the test is executed by Jenkins and there is a failure, we really need 
the POD logs to analyze the problem.
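A rough sketch of how the logs could be collected with the fabric8 client that 
the integration tests already use (the helper name and the use of the 
spark-app-selector label are assumptions for illustration, not the final 
implementation):

{noformat}
import io.fabric8.kubernetes.client.KubernetesClient

import scala.collection.JavaConverters._

// Hypothetical helper: print the log of every pod belonging to the given Spark
// application so that it ends up in the integration test log on failure.
def dumpPodLogs(client: KubernetesClient, namespace: String, appId: String): Unit = {
  val pods = client.pods().inNamespace(namespace)
    .withLabel("spark-app-selector", appId)   // label Spark puts on its pods
    .list().getItems.asScala
  pods.foreach { pod =>
    val name = pod.getMetadata.getName
    println(s"===== BEGIN LOG OF POD $name =====")
    println(client.pods().inNamespace(namespace).withName(name).getLog())
    println(s"===== END LOG OF POD $name =====")
  }
}
{noformat}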



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34433) Lock jekyll version by Gemfile and Bundler

2021-02-13 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17284287#comment-17284287
 ] 

Attila Zsolt Piros commented on SPARK-34433:


I am working on this.

> Lock jekyll version by Gemfile and Bundler
> --
>
> Key: SPARK-34433
> URL: https://issues.apache.org/jira/browse/SPARK-34433
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Deploy, Documentation
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> The Jekyll version can be pinned to the specific version 4.2.0 with a Gemfile.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34433) Lock jekyll version by Gemfile and Bundler

2021-02-13 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34433:
--

 Summary: Lock jekyll version by Gemfile and Bundler
 Key: SPARK-34433
 URL: https://issues.apache.org/jira/browse/SPARK-34433
 Project: Spark
  Issue Type: Improvement
  Components: Build, Deploy, Documentation
Affects Versions: 3.2.0
Reporter: Attila Zsolt Piros


The Jekyll version can be pinned to the specific version 4.2.0 with a Gemfile.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34509) Make dynamic allocation upscaling more progressive on K8S

2021-02-23 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34509:
--

 Summary: Make dynamic allocation upscaling more progressive on K8S
 Key: SPARK-34509
 URL: https://issues.apache.org/jira/browse/SPARK-34509
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.0.2, 2.4.7, 2.3.4, 3.2.0
Reporter: Attila Zsolt Piros


Currently even a single late pod request stops upscaling. As we have an 
allocation batch size, it would be better to go up to that limit as soon as 
possible (if serving the pod requests is slow, we have to issue them as early 
as possible).
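For context, a hedged example of the knobs involved (the values are 
illustrative only; verify the defaults of the batch size and delay settings for 
your Spark version):

{noformat}
--conf spark.dynamicAllocation.enabled=true
--conf spark.dynamicAllocation.shuffleTracking.enabled=true
--conf spark.kubernetes.allocation.batch.size=10
--conf spark.kubernetes.allocation.batch.delay=1s
{noformat}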



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34509) Make dynamic allocation upscaling more progressive on K8S

2021-02-23 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289158#comment-17289158
 ] 

Attila Zsolt Piros commented on SPARK-34509:


I am working on this.

> Make dynamic allocation upscaling more progressive on K8S
> -
>
> Key: SPARK-34509
> URL: https://issues.apache.org/jira/browse/SPARK-34509
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.4, 2.4.7, 3.0.2, 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> Currently even a single late pod request stops upscaling. As we have an 
> allocation batch size, it would be better to go up to that limit as soon as 
> possible (if serving the pod requests is slow, we have to issue them as early 
> as possible).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34732) The logForFailedTest throws an exception when driver is not started

2021-03-13 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34732:
--

 Summary: The logForFailedTest throws an exception when driver is 
not started
 Key: SPARK-34732
 URL: https://issues.apache.org/jira/browse/SPARK-34732
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 3.2.0
Reporter: Attila Zsolt Piros


In SPARK-34426 the logForFailedTest method was introduced to add the driver and 
executor logs to integration-tests.log, but when the driver fails to start an 
exception is thrown, as the list of pods is empty while we still try to access 
the first item of the list:


{noformat}
- PVs with local storage *** FAILED ***
  java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
  at java.util.ArrayList.rangeCheck(ArrayList.java:659)
  at java.util.ArrayList.get(ArrayList.java:435)
  at 
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.logForFailedTest(KubernetesSuite.scala:83)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:181)
  at 
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
  at 
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
  at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
  at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
 {noformat}
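A minimal sketch of a defensive fix (the label names and the surrounding test 
plumbing are assumptions based on the integration tests, not the exact patch): 
look the driver pod up as an Option instead of indexing into a possibly empty 
list.

{noformat}
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.KubernetesClient

import scala.collection.JavaConverters._

// Return the driver pod only if it exists, so the caller can skip the log
// collection instead of hitting IndexOutOfBoundsException on get(0).
def findDriverPod(client: KubernetesClient, namespace: String, appLocator: String): Option[Pod] = {
  client.pods()
    .inNamespace(namespace)
    .withLabel("spark-app-locator", appLocator)
    .withLabel("spark-role", "driver")
    .list()
    .getItems.asScala
    .headOption   // None when the driver never started
}
{noformat}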




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34736) Kubernetes and Minikube version upgrade for integration tests

2021-03-13 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17301046#comment-17301046
 ] 

Attila Zsolt Piros commented on SPARK-34736:


Working on this

> Kubernetes and Minikube version upgrade for integration tests
> -
>
> Key: SPARK-34736
> URL: https://issues.apache.org/jira/browse/SPARK-34736
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> As [discussed in the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]:
>  upgrading Minikube version from v0.34.1 to v1.7.3 and kubernetes version 
> from v1.15.12 to v1.17.3.
> Moreover Minikube version will be checked.
> By making this upgrade we can simplify how the kubernetes client is 
> configured for Minikube.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34736) Kubernetes and Minikube version upgrade for integration tests

2021-03-13 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34736:
--

 Summary: Kubernetes and Minikube version upgrade for integration 
tests
 Key: SPARK-34736
 URL: https://issues.apache.org/jira/browse/SPARK-34736
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes, Tests
Affects Versions: 3.2.0
Reporter: Attila Zsolt Piros


As [discussed in the mailing 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]:
upgrading the Minikube version from v0.34.1 to v1.7.3 and the kubernetes 
version from v1.15.12 to v1.17.3.

Moreover, the Minikube version will be checked.

By making this upgrade we can simplify how the kubernetes client is configured 
for Minikube.
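For illustration, the local upgrade boils down to something like the following 
(the resource flags are placeholders, pick values that fit your machine):

{noformat}
minikube delete
minikube config set kubernetes-version v1.17.3
minikube start --cpus 4 --memory 6g
{noformat}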



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-03-14 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34738:
--

 Summary: Upgrade Minikube and kubernetes cluster version on Jenkins
 Key: SPARK-34738
 URL: https://issues.apache.org/jira/browse/SPARK-34738
 Project: Spark
  Issue Type: Task
  Components: jenkins, Kubernetes
Affects Versions: 3.2.0
Reporter: Attila Zsolt Piros


[~shaneknapp] as we discussed [on the mailing 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html],
Minikube can be upgraded to the latest version (v1.18.1) and the kubernetes 
version should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).

[Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
method to configure the kubernetes client. Thanks in advance for using it for 
testing on Jenkins after the Minikube version is updated.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-03-17 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303287#comment-17303287
 ] 

Attila Zsolt Piros commented on SPARK-34738:


Thanks Shane! This sounds great!

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html],
> Minikube can be upgraded to the latest version (v1.18.1) and the kubernetes 
> version should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used

2021-03-18 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34789:
--

 Summary: Introduce Jetty based construct for integration tests 
where HTTP(S) is used
 Key: SPARK-34789
 URL: https://issues.apache.org/jira/browse/SPARK-34789
 Project: Spark
  Issue Type: Task
  Components: Tests
Affects Versions: 3.2.0
Reporter: Attila Zsolt Piros


This came up during 
https://github.com/apache/spark/pull/31877#discussion_r596831803.

Short summary: we have some tests where HTTP(S) is used to access files. The 
current solution uses github urls like 
"https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
This ties two Spark versions together in an unhealthy way: the "master" branch, 
which is a moving part, is connected with the committed test code, which is 
non-moving (as it might even be already released).
So a test running for an earlier version of Spark expects something (filename, 
content, path) from a later release and, what is worse, when the moving version 
is changed the earlier test will break.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used

2021-03-18 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304245#comment-17304245
 ] 

Attila Zsolt Piros commented on SPARK-34789:


I am working on this


> Introduce Jetty based construct for integration tests where HTTP(S) is used
> ---
>
> Key: SPARK-34789
> URL: https://issues.apache.org/jira/browse/SPARK-34789
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up during 
> https://github.com/apache/spark/pull/31877#discussion_r596831803.
> Short summary: we have some tests where HTTP(S) is used to access files. The 
> current solution uses github urls like 
> "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
> This ties two Spark versions together in an unhealthy way: the "master" 
> branch, which is a moving part, is connected with the committed test code, 
> which is non-moving (as it might even be already released).
> So a test running for an earlier version of Spark expects something 
> (filename, content, path) from a later release and, what is worse, when the 
> moving version is changed the earlier test will break. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used

2021-03-18 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34789:
---
Description: 
This came up during 
https://github.com/apache/spark/pull/31877#discussion_r596831803.

Short summary: we have some tests where HTTP(S) is used to access files. The 
current solution uses github urls like 
"https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
This ties two Spark versions together in an unhealthy way: the "master" branch, 
which is a moving part, is connected with the committed test code, which is 
non-moving (as it might even be already released).
So a test running for an earlier version of Spark expects something (filename, 
content, path) from a later release and, what is worse, when the moving version 
is changed the earlier test will break.

The idea is to introduce a method like:

{noformat}
withHttpServer(files) {
}
{noformat}

This helper would use a Jetty ResourceHandler to serve the listed files (or 
directories, or just the root directory it is started from) and stop the server 
in a finally block; a rough sketch follows below.
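A rough Scala sketch of such a helper, only to illustrate the idea (the name, 
the signature and the lack of HTTPS/error handling are assumptions, not the 
final API):

{noformat}
import java.io.File

import org.eclipse.jetty.server.Server
import org.eclipse.jetty.server.handler.ResourceHandler

// Hypothetical test helper: serve `root` over HTTP on an ephemeral port, pass
// the base URL to the test body and always stop the server afterwards.
def withHttpServer[T](root: File)(body: String => T): T = {
  val handler = new ResourceHandler()
  handler.setResourceBase(root.getAbsolutePath)
  handler.setDirectoriesListed(true)

  val server = new Server(0)          // 0 = pick a free port
  server.setHandler(handler)
  server.start()
  try {
    body(server.getURI.toString)      // e.g. "http://localhost:PORT/"
  } finally {
    server.stop()
  }
}
{noformat}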

 

  was:
This came up during 
https://github.com/apache/spark/pull/31877#discussion_r596831803.

Short summary: we have some tests where HTTP(S) is used to access files. The 
current solution uses github urls like 
"https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt";.
This connects two Spark version in an unhealthy way like connecting the 
"master" branch which is moving part with the committed test code which is a 
non-moving (as it might be even released).
So this way a test running for an earlier version of Spark expects something 
(filename, content, path) from a the latter release and what is worse when the 
moving version is changed the earlier test will break. 

 


> Introduce Jetty based construct for integration tests where HTTP(S) is used
> ---
>
> Key: SPARK-34789
> URL: https://issues.apache.org/jira/browse/SPARK-34789
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up during 
> https://github.com/apache/spark/pull/31877#discussion_r596831803.
> Short summary: we have some tests where HTTP(S) is used to access files. The 
> current solution uses github urls like 
> "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
> This ties two Spark versions together in an unhealthy way: the "master" 
> branch, which is a moving part, is connected with the committed test code, 
> which is non-moving (as it might even be already released).
> So a test running for an earlier version of Spark expects something 
> (filename, content, path) from a later release and, what is worse, when the 
> moving version is changed the earlier test will break. 
> The idea is to introduce a method like:
> {noformat}
> withHttpServer(files) {
> }
> {noformat}
> Which uses a Jetty ResourceHandler to serve the listed files (or directories 
> / or just the root where it is started from) and stops the server in the 
> finally.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34684) Hadoop config could not be successfully serialized from driver pods to executor pods

2021-03-23 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17306895#comment-17306895
 ] 

Attila Zsolt Piros commented on SPARK-34684:


> spark pi example job keeps failing where executor do not know how to talk to 
> hdfs

In your example the executor does not need to be able to access HDFS, but the 
driver does.

Have you tried to specify the hdfs url with a hostname, i.e. 
hdfs:///tmp/spark-examples_2.12-3.0.125067.jar => 
hdfs://<namenode-host>/tmp/spark-examples_2.12-3.0.125067.jar?

Have you tried to create a POD from a simple linux image with the hadoop client 
tools and to access HDFS from the command line? See the sketch below.
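For illustration only (the image name, the namespace and the NameNode address 
are placeholders you would have to substitute):

{noformat}
kubectl run hdfs-check -n test --rm -it --image=<image-with-hadoop-client> -- bash
# inside the pod:
hdfs dfs -ls hdfs://<namenode-host>:<port>/tmp
{noformat}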



> Hadoop config could not be successfully serialized from driver pods to 
> executor pods
> ---
>
> Key: SPARK-34684
> URL: https://issues.apache.org/jira/browse/SPARK-34684
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Yue Peng
>Priority: Major
>
> I have set HADOOP_CONF_DIR correctly, and I have verified that the hadoop 
> configs have been stored into a configmap and mounted to the driver. However, 
> the spark pi example job keeps failing because the executor does not know how 
> to talk to hdfs. I highly suspect that there is a bug causing it, as manually 
> creating a configmap storing the hadoop configs and mounting it to the 
> executor in the template file fixes the error. 
>  
> Spark submit command:
> /opt/spark-3.0/bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --deploy-mode cluster --master k8s://https://10.***.18.96:6443 
> --num-executors 1 --conf spark.kubernetes.namespace=test --conf 
> spark.kubernetes.container.image= --conf 
> spark.kubernetes.driver.podTemplateFile=/opt/spark-3.0/conf/spark-driver.template
>  --conf 
> spark.kubernetes.executor.podTemplateFile=/opt/spark-3.0/conf/spark-executor.template
>   --conf spark.kubernetes.file.upload.path=/opt/spark-3.0/examples/jars 
> hdfs:///tmp/spark-examples_2.12-3.0.125067.jar 1000
>  
>  
> Error log:
>  
> 21/03/10 06:59:58 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 608 ms (392 ms spent in bootstraps)
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: SecurityManager: authentication 
> enabled; ui acls disabled; users with view permissions: Set(root); groups 
> with view permissions: Set(); users with modify permissions: Set(root); 
> groups with modify permissions: Set()
> 21/03/10 06:59:59 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 130 ms (104 ms spent in bootstraps)
> 21/03/10 06:59:59 INFO DiskBlockManager: Created local directory at 
> /var/data/spark-0f541e3d-994f-4c7a-843f-f7dac57dfc13/blockmgr-981cfb62-5b27-4d1a-8fbd-eddb466faf1d
> 21/03/10 06:59:59 INFO MemoryStore: MemoryStore started with capacity 2047.2 
> MiB
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
> spark://coarsegrainedschedu...@org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO ResourceUtils: Resources for spark.executor:
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Successfully registered 
> with driver
> 21/03/10 06:59:59 INFO Executor: Starting executor ID 1 on host 100.64.0.192
> 21/03/10 07:00:00 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37956.
> 21/03/10 07:00:00 INFO NettyBlockTransferService: Server created on 
> 100.64.0.192:37956
> 21/03/10 07:00:00 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManager: Initialized BlockManager: 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 0
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assig

[jira] [Commented] (SPARK-34684) Hadoop config could not be successfully serialized from driver pods to executor pods

2021-03-23 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307392#comment-17307392
 ] 

Attila Zsolt Piros commented on SPARK-34684:



> Have you tried to create a POD from a simple linux image with hadoop client 
> tools and access HDFS from command line?



> Hadoop config could not be successfully serialized from driver pods to 
> executor pods
> ---
>
> Key: SPARK-34684
> URL: https://issues.apache.org/jira/browse/SPARK-34684
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Yue Peng
>Priority: Major
>
> I have set HADOOP_CONF_DIR correctly, and I have verified that the hadoop 
> configs have been stored into a configmap and mounted to the driver. However, 
> the spark pi example job keeps failing because the executor does not know how 
> to talk to hdfs. I highly suspect that there is a bug causing it, as manually 
> creating a configmap storing the hadoop configs and mounting it to the 
> executor in the template file fixes the error. 
>  
> Spark submit command:
> /opt/spark-3.0/bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --deploy-mode cluster --master k8s://https://10.***.18.96:6443 
> --num-executors 1 --conf spark.kubernetes.namespace=test --conf 
> spark.kubernetes.container.image= --conf 
> spark.kubernetes.driver.podTemplateFile=/opt/spark-3.0/conf/spark-driver.template
>  --conf 
> spark.kubernetes.executor.podTemplateFile=/opt/spark-3.0/conf/spark-executor.template
>   --conf spark.kubernetes.file.upload.path=/opt/spark-3.0/examples/jars 
> hdfs:///tmp/spark-examples_2.12-3.0.125067.jar 1000
>  
>  
> Error log:
>  
> 21/03/10 06:59:58 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 608 ms (392 ms spent in bootstraps)
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: SecurityManager: authentication 
> enabled; ui acls disabled; users with view permissions: Set(root); groups 
> with view permissions: Set(); users with modify permissions: Set(root); 
> groups with modify permissions: Set()
> 21/03/10 06:59:59 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 130 ms (104 ms spent in bootstraps)
> 21/03/10 06:59:59 INFO DiskBlockManager: Created local directory at 
> /var/data/spark-0f541e3d-994f-4c7a-843f-f7dac57dfc13/blockmgr-981cfb62-5b27-4d1a-8fbd-eddb466faf1d
> 21/03/10 06:59:59 INFO MemoryStore: MemoryStore started with capacity 2047.2 
> MiB
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
> spark://coarsegrainedschedu...@org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO ResourceUtils: Resources for spark.executor:
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Successfully registered 
> with driver
> 21/03/10 06:59:59 INFO Executor: Starting executor ID 1 on host 100.64.0.192
> 21/03/10 07:00:00 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37956.
> 21/03/10 07:00:00 INFO NettyBlockTransferService: Server created on 
> 100.64.0.192:37956
> 21/03/10 07:00:00 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManager: Initialized BlockManager: 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 0
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 1
> 21/03/10 07:00:01 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 21/03/10 07:00:01 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 21/03/10 07:00:01 INFO Executor: Fetching 
> spark://org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078/jars/spark-examples_2.12-3.0.125067.jar
>  w

[jira] [Commented] (SPARK-34684) Hadoop config could not be successfully serialized from driver pods to executor pods

2021-03-24 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307710#comment-17307710
 ] 

Attila Zsolt Piros commented on SPARK-34684:


I did not say there is no error, but by following my advice you can make sure 
your basic setup is right. This is how I can help you right now.

> Hadoop config could not be successfully serialized from driver pods to 
> executor pods
> ---
>
> Key: SPARK-34684
> URL: https://issues.apache.org/jira/browse/SPARK-34684
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.0.2
>Reporter: Yue Peng
>Priority: Major
>
> I have set HADOOP_CONF_DIR correctly, and I have verified that the hadoop 
> configs have been stored into a configmap and mounted to the driver. However, 
> the spark pi example job keeps failing because the executor does not know how 
> to talk to hdfs. I highly suspect that there is a bug causing it, as manually 
> creating a configmap storing the hadoop configs and mounting it to the 
> executor in the template file fixes the error. 
>  
> Spark submit command:
> /opt/spark-3.0/bin/spark-submit --class org.apache.spark.examples.SparkPi 
> --deploy-mode cluster --master k8s://https://10.***.18.96:6443 
> --num-executors 1 --conf spark.kubernetes.namespace=test --conf 
> spark.kubernetes.container.image= --conf 
> spark.kubernetes.driver.podTemplateFile=/opt/spark-3.0/conf/spark-driver.template
>  --conf 
> spark.kubernetes.executor.podTemplateFile=/opt/spark-3.0/conf/spark-executor.template
>   --conf spark.kubernetes.file.upload.path=/opt/spark-3.0/examples/jars 
> hdfs:///tmp/spark-examples_2.12-3.0.125067.jar 1000
>  
>  
> Error log:
>  
> 21/03/10 06:59:58 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 608 ms (392 ms spent in bootstraps)
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls to: root
> 21/03/10 06:59:58 INFO SecurityManager: Changing view acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: Changing modify acls groups to:
> 21/03/10 06:59:58 INFO SecurityManager: SecurityManager: authentication 
> enabled; ui acls disabled; users with view permissions: Set(root); groups 
> with view permissions: Set(); users with modify permissions: Set(root); 
> groups with modify permissions: Set()
> 21/03/10 06:59:59 INFO TransportClientFactory: Successfully created 
> connection to 
> org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc/100.64.0.191:7078
>  after 130 ms (104 ms spent in bootstraps)
> 21/03/10 06:59:59 INFO DiskBlockManager: Created local directory at 
> /var/data/spark-0f541e3d-994f-4c7a-843f-f7dac57dfc13/blockmgr-981cfb62-5b27-4d1a-8fbd-eddb466faf1d
> 21/03/10 06:59:59 INFO MemoryStore: MemoryStore started with capacity 2047.2 
> MiB
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Connecting to driver: 
> spark://coarsegrainedschedu...@org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO ResourceUtils: Resources for spark.executor:
> 21/03/10 06:59:59 INFO ResourceUtils: 
> ==
> 21/03/10 06:59:59 INFO CoarseGrainedExecutorBackend: Successfully registered 
> with driver
> 21/03/10 06:59:59 INFO Executor: Starting executor ID 1 on host 100.64.0.192
> 21/03/10 07:00:00 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37956.
> 21/03/10 07:00:00 INFO NettyBlockTransferService: Server created on 
> 100.64.0.192:37956
> 21/03/10 07:00:00 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:00 INFO BlockManager: Initialized BlockManager: 
> BlockManagerId(1, 100.64.0.192, 37956, None)
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 0
> 21/03/10 07:00:01 INFO CoarseGrainedExecutorBackend: Got assigned task 1
> 21/03/10 07:00:01 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
> 21/03/10 07:00:01 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
> 21/03/10 07:00:01 INFO Executor: Fetching 
> spark://org-apache-spark-examples-sparkpi-0e58b6781aeef2d5-driver-svc.test.svc:7078/jars/spark-examples_2.1

[jira] [Created] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output

2021-03-25 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34869:
--

 Summary: Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with 
pod describe output
 Key: SPARK-34869
 URL: https://issues.apache.org/jira/browse/SPARK-34869
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.2.0
Reporter: Attila Zsolt Piros






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output

2021-03-25 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34869:
---
Description: 

As came up in https://github.com/apache/spark/pull/31923#issuecomment-806706204 
we can extend 


> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output
> 
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204 we can 
> extend 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output

2021-03-25 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34869:
---
Description: 
As came up in https://github.com/apache/spark/pull/31923#issuecomment-806706204 
we can extend the "EXTRA LOGS FOR THE FAILED TEST" section with pod describe 
output.


  was:

As came up in https://github.com/apache/spark/pull/31923#issuecomment-806706204 
we can extend 



> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output
> 
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204 we can 
> extend  "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output

2021-03-25 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros reassigned SPARK-34869:
--

Assignee: Attila Zsolt Piros

> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output
> 
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204 we can 
> extend  "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output

2021-03-25 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17308679#comment-17308679
 ] 

Attila Zsolt Piros commented on SPARK-34869:


I am working on this.

> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output
> 
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204 we can 
> extend  "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output

2021-03-25 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34869:
---
Description: 
As came up in https://github.com/apache/spark/pull/31923#issuecomment-806706204 
we can extend the "EXTRA LOGS FOR THE FAILED TEST" section with describe pods 
output.
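A rough idea of what the extra output could be built from, using the fabric8 
client that is already on the test classpath (the helper name is made up and 
Serialization.asYaml only approximates `kubectl describe`, as it dumps the full 
pod spec and status):

{noformat}
import io.fabric8.kubernetes.client.KubernetesClient
import io.fabric8.kubernetes.client.utils.Serialization

import scala.collection.JavaConverters._

// Hypothetical helper: dump every pod of the test namespace as YAML so the
// scheduling details end up next to the logs of the failed test.
def describePods(client: KubernetesClient, namespace: String): String = {
  client.pods().inNamespace(namespace).list().getItems.asScala
    .map(pod => Serialization.asYaml(pod))
    .mkString("\n---\n")
}
{noformat}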


  was:
As came up in https://github.com/apache/spark/pull/31923#issuecomment-806706204 
we can extend  "EXTRA LOGS FOR THE FAILED TEST" section with pod describe 
output.



> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output
> -
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204 we can 
> extend  "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34869) Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output

2021-03-25 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34869:
---
Summary: Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe 
pods output  (was: Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with pod 
describe output)

> Extend k8s "EXTRA LOGS FOR THE FAILED TEST" section with describe pods output
> -
>
> Key: SPARK-34869
> URL: https://issues.apache.org/jira/browse/SPARK-34869
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> As came up in 
> https://github.com/apache/spark/pull/31923#issuecomment-806706204 we can 
> extend  "EXTRA LOGS FOR THE FAILED TEST" section with pod describe output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-03-31 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312851#comment-17312851
 ] 

Attila Zsolt Piros commented on SPARK-34738:


Shane, as Docker is available on every platform I could check it locally (but I 
can only promise to do that next week). If I manage to reproduce this locally I 
can take a look into the details.

After this I think we should extend our documentation and suggest only those 
drivers where the tests were actually run (or at least mention that Docker is 
used for testing).

Thanks for your effort!

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html],
> Minikube can be upgraded to the latest version (v1.18.1) and the kubernetes 
> version should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-04-01 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17313144#comment-17313144
 ] 

Attila Zsolt Piros commented on SPARK-34738:


So I wanted to check whether I can help next week, but I ran into a different 
problem:

{noformat}
KubernetesSuite:
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED ***
  io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create]  
for kind: [Namespace]  with name: [null]  in namespace: [default]  failed.
  at 
io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
  at 
io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
  at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:349)
  at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.lambda$createNew$0(BaseOperation.java:358)
  at 
io.fabric8.kubernetes.api.model.DoneableNamespace.done(DoneableNamespace.java:31)
  at 
org.apache.spark.deploy.k8s.integrationtest.KubernetesTestComponents.createNamespace(KubernetesTestComponents.scala:49)
  at 
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.$anonfun$new$1(KubernetesSuite.scala:189)
  at org.scalatest.BeforeAndAfter.runTest(BeforeAndAfter.scala:210)
  at org.scalatest.BeforeAndAfter.runTest$(BeforeAndAfter.scala:203)
  at 
org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.runTest(KubernetesSuite.scala:44)
  ...
  Cause: java.net.SocketTimeoutException: connect timed out
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at 
java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
  at 
java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
  at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
  at java.net.Socket.connect(Socket.java:607)
  at okhttp3.internal.platform.Platform.connectSocket(Platform.java:129)
  at 
okhttp3.internal.connection.RealConnection.connectSocket(RealConnection.java:247)
  at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:167)
  at 
okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:258)
  ...
{noformat}


This error definitely comes from the docker driver (at least on my side), as 
with hyperkit there was no such problem.

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html],
> Minikube can be upgraded to the latest version (v1.18.1) and the kubernetes 
> version should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36965) Extend python test runner by logging out the temp output files

2021-10-09 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-36965:
--

 Summary: Extend python test runner by logging out the temp output 
files
 Key: SPARK-36965
 URL: https://issues.apache.org/jira/browse/SPARK-36965
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.3.0
Reporter: Attila Zsolt Piros
Assignee: Attila Zsolt Piros


I was running a python test which was extremely slow and I was surprised that 
the unit-tests.log had not even been created. Looking into the code I found 
that, for the parallel execution, each test has its own temporary output file 
which is only added to the unit-tests.log when the test finishes with a failure 
(after acquiring a lock to avoid parallel writes on unit-tests.log).

To avoid such confusion it would make sense to log out the paths of the 
temporary output files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36965) Extend python test runner by logging out the temp output files

2021-10-09 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-36965:
---
Priority: Minor  (was: Major)

> Extend python test runner by logging out the temp output files
> --
>
> Key: SPARK-36965
> URL: https://issues.apache.org/jira/browse/SPARK-36965
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> I was running a python test which was extremely slow and I was surprised the 
> unit-tests.log has not been even created. Looked into the code I got for the 
> parallel execution each test has its own temporary output file which are only 
> added to the unit-tests.log when a test is finished with a failure (after 
> acquiring a lock to avoid parallel write on unit-tests.log). 
> To avoid such a confusion it would make sense to log out the path of the 
> temporary output files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36965) Extend python test runner by logging out the temp output files

2021-10-09 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-36965:
---
Description: 
I was running a python test which was extremely slow and I was surprised the 
unit-tests.log has not been even created. Looked into the code and as I got 
tests can be executed in parallel and each one has its own temporary output 
file which is only added to the unit-tests.log when a test is finished with a 
failure (after acquiring a lock to avoid parallel write on unit-tests.log). 

To avoid such a confusion it would make sense to log out the path of the 
temporary output files.

  was:
I was running a python test which was extremely slow and I was surprised the 
unit-tests.log has not been even created. Looked into the code I got for the 
parallel execution each test has its own temporary output file which are only 
added to the unit-tests.log when a test is finished with a failure (after 
acquiring a lock to avoid parallel write on unit-tests.log). 

To avoid such a confusion it would make sense to log out the path of the 
temporary output files.


> Extend python test runner by logging out the temp output files
> --
>
> Key: SPARK-36965
> URL: https://issues.apache.org/jira/browse/SPARK-36965
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> I was running a python test which was extremely slow and I was surprised the 
> unit-tests.log has not been even created. Looked into the code and as I got 
> tests can be executed in parallel and each one has its own temporary output 
> file which is only added to the unit-tests.log when a test is finished with a 
> failure (after acquiring a lock to avoid parallel write on unit-tests.log). 
> To avoid such a confusion it would make sense to log out the path of the 
> temporary output files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36965) Extend python test runner by logging out the temp output files

2021-10-09 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-36965:
---
Description: 
I was running a python test which was extremely slow and I was surprised the 
unit-tests.log has not been even created. Looked into the code and as I got the 
tests can be executed in parallel and each one has its own temporary output 
file which is only added to the unit-tests.log when a test is finished with a 
failure (after acquiring a lock to avoid parallel write on unit-tests.log). 

To avoid such a confusion it would make sense to log out the path of the 
temporary output files.

  was:
I was running a python test which was extremely slow and I was surprised the 
unit-tests.log has not been even created. Looked into the code and as I got 
tests can be executed in parallel and each one has its own temporary output 
file which is only added to the unit-tests.log when a test is finished with a 
failure (after acquiring a lock to avoid parallel write on unit-tests.log). 

To avoid such a confusion it would make sense to log out the path of the 
temporary output files.


> Extend python test runner by logging out the temp output files
> --
>
> Key: SPARK-36965
> URL: https://issues.apache.org/jira/browse/SPARK-36965
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> I was running a python test which was extremely slow and I was surprised the 
> unit-tests.log has not been even created. Looked into the code and as I got 
> the tests can be executed in parallel and each one has its own temporary 
> output file which is only added to the unit-tests.log when a test is finished 
> with a failure (after acquiring a lock to avoid parallel write on 
> unit-tests.log). 
> To avoid such a confusion it would make sense to log out the path of the 
> temporary output files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36965) Extend python test runner by logging out the temp output files

2021-10-09 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-36965:
---
Description: 
I was running a python test which was extremely slow and I was surprised the 
unit-tests.log has not been even created. Looked into the code and as I got the 
tests can be executed in parallel and each one has its own temporary output 
file which is only added to the unit-tests.log when a test is finished with a 
failure (after acquiring a lock to avoid parallel write on unit-tests.log). 

To avoid such a confusion it would make sense to log out the path of those 
temporary output files.

  was:
I was running a python test which was extremely slow and I was surprised the 
unit-tests.log has not been even created. Looked into the code and as I got the 
tests can be executed in parallel and each one has its own temporary output 
file which is only added to the unit-tests.log when a test is finished with a 
failure (after acquiring a lock to avoid parallel write on unit-tests.log). 

To avoid such a confusion it would make sense to log out the path of the 
temporary output files.


> Extend python test runner by logging out the temp output files
> --
>
> Key: SPARK-36965
> URL: https://issues.apache.org/jira/browse/SPARK-36965
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
>
> I was running a python test which was extremely slow and I was surprised the 
> unit-tests.log has not been even created. Looked into the code and as I got 
> the tests can be executed in parallel and each one has its own temporary 
> output file which is only added to the unit-tests.log when a test is finished 
> with a failure (after acquiring a lock to avoid parallel write on 
> unit-tests.log). 
> To avoid such a confusion it would make sense to log out the path of those 
> temporary output files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36965) Extend python test runner by logging out the temp output files

2021-10-18 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros resolved SPARK-36965.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34233
[https://github.com/apache/spark/pull/34233]

> Extend python test runner by logging out the temp output files
> --
>
> Key: SPARK-36965
> URL: https://issues.apache.org/jira/browse/SPARK-36965
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Minor
> Fix For: 3.3.0
>
>
> I was running a python test which was extremely slow and I was surprised that 
> the unit-tests.log had not even been created. Looking into the code I found 
> that the tests can be executed in parallel and each one has its own temporary 
> output file which is only appended to unit-tests.log when the test finishes 
> with a failure (after acquiring a lock to avoid parallel writes on 
> unit-tests.log). 
> To avoid such confusion it would make sense to log the path of those 
> temporary output files.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35672) Spark fails to launch executors with very large user classpath lists on YARN

2021-11-16 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros resolved SPARK-35672.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34120
[https://github.com/apache/spark/pull/34120]

> Spark fails to launch executors with very large user classpath lists on YARN
> 
>
> Key: SPARK-35672
> URL: https://issues.apache.org/jira/browse/SPARK-35672
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 3.1.2
> Environment: Linux RHEL7
> Spark 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0
>
>
> When running Spark on YARN, the {{user-class-path}} argument to 
> {{CoarseGrainedExecutorBackend}} is used to pass a list of user JAR URIs to 
> executor processes. The argument is specified once for each JAR, and the URIs 
> are fully-qualified, so the paths can be quite long. With large user JAR 
> lists (say 1000+), this can result in system-level argument length limits 
> being exceeded, typically manifesting as the error message:
> {code}
> /bin/bash: Argument list too long
> {code}
> A [Google 
> search|https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22]
>  indicates that this is not a theoretical problem and afflicts real users, 
> including ours. This issue was originally observed on Spark 2.3, but has been 
> confirmed to exist in the master branch as well.
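
A rough sketch of why the launch command blows up (illustrative only, not Spark code; the jar URI, jar count, and the exact limits are assumptions): each user jar contributes its own `--user-class-path <uri>` argument, so the executor launch command grows linearly with the number of jars until it exceeds the OS argument-size limits (ARG_MAX for the whole argument list, and on Linux MAX_ARG_STRLEN for any single string).

{code:scala}
// Illustration only: estimate how per-jar "--user-class-path <uri>" arguments
// inflate the executor launch command. The values below are made up; the real
// limits are system dependent.
object ArgLengthEstimate {
  def main(args: Array[String]): Unit = {
    val jarUri  = "hdfs://namenode:8020/user/someuser/app/libs/some-dependency-1.2.3.jar"
    val numJars = 2000                                      // a large user classpath
    val perJar  = s"--user-class-path $jarUri".length + 1   // +1 for the separating space
    println(s"approx. extra command-line bytes from user jars: ${numJars * perJar}")
  }
}
{code}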



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37373) Collect LocalSparkContext worker logs in case of test failure

2021-11-18 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-37373:
--

 Summary: Collect LocalSparkContext worker logs in case of test 
failure
 Key: SPARK-37373
 URL: https://issues.apache.org/jira/browse/SPARK-37373
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.2.0
Reporter: Attila Zsolt Piros
Assignee: Attila Zsolt Piros


About 50 test suites use LocalSparkContext by specifying "local-cluster" as the 
cluster URL. In this case the executor logs end up under the worker dir, which is 
a temp directory and as such is deleted at shutdown (for details see 
https://github.com/apache/spark/blob/0a4961df29aab6912492e87e4e719865fe20d981/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala#L70).
So when a test fails and the error was on the executor side, the log is lost.

This affects only local-cluster tests and not standalone tests, where logs are 
kept in the "/work".
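
A minimal sketch of what collecting those logs could look like (this is not the actual change; `WorkerLogCollector`, the `target/worker-logs` location, and the way the worker dirs are obtained are all illustrative assumptions): on test failure, copy the local-cluster worker directories to a location that survives the temp-dir cleanup at shutdown.

{code:scala}
import java.nio.file.{Files, Path, Paths, StandardCopyOption}

// Illustrative helper (not Spark code): preserve the worker dirs of a failed test
// before LocalSparkCluster's shutdown deletes them.
object WorkerLogCollector {
  def preserveWorkerLogs(workerDirs: Seq[Path], testName: String): Unit = {
    val target = Paths.get("target", "worker-logs", testName)
    Files.createDirectories(target)
    workerDirs.foreach { dir =>
      val stream = Files.walk(dir)
      try {
        stream.forEach { src =>
          val dst = target.resolve(dir.relativize(src).toString)
          if (Files.isDirectory(src)) Files.createDirectories(dst)
          else Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING)
        }
      } finally stream.close()
    }
  }
}
{code}

Such a helper would be called from a test-failure hook (e.g. a ScalaTest withFixture override) so that only failing runs pay the copy cost.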






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37373) Collect LocalSparkContext worker logs in case of test failure

2021-11-18 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-37373:
---
Affects Version/s: 3.3.0
   (was: 3.2.0)

> Collect LocalSparkContext worker logs in case of test failure
> -
>
> Key: SPARK-37373
> URL: https://issues.apache.org/jira/browse/SPARK-37373
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>
> About 50 test suites use LocalSparkContext by specifying "local-cluster" as the 
> cluster URL. In this case the executor logs end up under the worker dir, which 
> is a temp directory and as such is deleted at shutdown (for details see 
> https://github.com/apache/spark/blob/0a4961df29aab6912492e87e4e719865fe20d981/core/src/main/scala/org/apache/spark/deploy/LocalSparkCluster.scala#L70).
> So when a test fails and the error was on the executor side, the log is lost.
> This affects only local-cluster tests and not standalone tests, where logs are 
> kept in the "/work".



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39152) StreamCorruptedException cause job failure for disk persisted RDD

2022-05-11 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-39152:
--

 Summary: StreamCorruptedException cause job failure for disk 
persisted RDD
 Key: SPARK-39152
 URL: https://issues.apache.org/jira/browse/SPARK-39152
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 3.4.0
Reporter: Attila Zsolt Piros
Assignee: Attila Zsolt Piros


In case of a disk corruption, a disk-persisted RDD block will lead to job 
failure as the block registration always leads to the same file. So even 
when the task is rescheduled on a different executor the job will fail.

*Example*

First failure (the block is locally available):
{noformat}
22/04/25 07:15:28 ERROR executor.Executor: Exception in task 17024.0 in stage 12.0 (TID 51853)
java.io.StreamCorruptedException: invalid stream header: 
  at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:943)
  at java.io.ObjectInputStream.<init>(ObjectInputStream.java:401)
  at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
  at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63)
  at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122)
  at org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:209)
  at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:617)
  at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:897)
  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
{noformat}

Then the task might be rescheduled on a different executor but as the block is 
registered to the first blockmanager the error will be the same:

{noformat}
java.io.StreamCorruptedException: invalid stream header: 
  at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:943)
  at java.io.ObjectInputStream.<init>(ObjectInputStream.java:401)
  at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
  at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63)
  at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122)
  at org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:209)
  at org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:698)
  at org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:696)
  at scala.Option.map(Option.scala:146)
  at org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:696)
  at org.apache.spark.storage.BlockManager.get(BlockManager.scala:831)
  at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:886)
  at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
{noformat}

My idea is to retry the IO operations a few times and, when all of them have 
failed, deregister the block and let the following task recompute it.
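
A minimal sketch of that retry-then-deregister pattern (this is not the actual BlockManager change; `readBlock`, `removeBlock`, and the attempt count are illustrative stand-ins for the real internals):

{code:scala}
import java.io.IOException
import scala.util.{Failure, Success, Try}

// Sketch: retry a block read a few times; if it keeps failing with an IO-level
// error (StreamCorruptedException is an IOException), drop the registered block
// so a following task recomputes it instead of failing the whole job.
object CorruptBlockRecovery {
  def readWithRecovery[T](blockId: String,
                          readBlock: String => T,
                          removeBlock: String => Unit,
                          maxAttempts: Int = 3): Option[T] = {
    var attempt = 0
    while (attempt < maxAttempts) {
      Try(readBlock(blockId)) match {
        case Success(value)          => return Some(value)
        case Failure(_: IOException) => attempt += 1   // corruption-style failure, retry
        case Failure(other)          => throw other    // unrelated errors propagate
      }
    }
    removeBlock(blockId)   // deregister the corrupt block; caller falls back to recompute
    None
  }
}
{code}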



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39152) StreamCorruptedException cause job failure for disk persisted RDD

2022-05-11 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17535044#comment-17535044
 ] 

Attila Zsolt Piros commented on SPARK-39152:


I am working on this

> StreamCorruptedException cause job failure for disk persisted RDD
> -
>
> Key: SPARK-39152
> URL: https://issues.apache.org/jira/browse/SPARK-39152
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.4.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>
> In case of a disk corruption, a disk-persisted RDD block will lead to job 
> failure as the block registration always leads to the same file. So even 
> when the task is rescheduled on a different executor the job will fail.
> *Example*
> First failure (the block is locally available):
> {noformat}
> 22/04/25 07:15:28 ERROR executor.Executor: Exception in task 17024.0 in stage 12.0 (TID 51853)
> java.io.StreamCorruptedException: invalid stream header: 
>   at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:943)
>   at java.io.ObjectInputStream.<init>(ObjectInputStream.java:401)
>   at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
>   at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63)
>   at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122)
>   at org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:209)
>   at org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:617)
>   at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:897)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> {noformat}
> Then the task might be rescheduled on a different executor but as the block 
> is registered to the first blockmanager the error will be the same:
> {noformat}
> java.io.StreamCorruptedException: invalid stream header: 
>   at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:943)
>   at java.io.ObjectInputStream.<init>(ObjectInputStream.java:401)
>   at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:63)
>   at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:63)
>   at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:122)
>   at org.apache.spark.serializer.SerializerManager.dataDeserializeStream(SerializerManager.scala:209)
>   at org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:698)
>   at org.apache.spark.storage.BlockManager$$anonfun$getRemoteValues$1.apply(BlockManager.scala:696)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:696)
>   at org.apache.spark.storage.BlockManager.get(BlockManager.scala:831)
>   at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:886)
>   at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
> {noformat}
> My idea is to retry the IO operations a few times and, when all of them have 
> failed, deregister the block and let the following task recompute it.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39473) NullPointerException in SparkRackResolver

2022-06-14 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-39473:
--

 Summary: NullPointerException in SparkRackResolver 
 Key: SPARK-39473
 URL: https://issues.apache.org/jira/browse/SPARK-39473
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0, 3.0.3, 3.0.2, 3.0.1, 
3.0.0, 2.4.8
Reporter: Attila Zsolt Piros
Assignee: Attila Zsolt Piros


When the underlying DNSToSwitchMapping's resolver returns a null we get a 
NullPointerException:

{noformat}
22/05/24 08:12:12 980 ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext
java.lang.NullPointerException
  at scala.collection.convert.Wrappers$JListWrapper.isEmpty(Wrappers.scala:87)
  at org.apache.spark.deploy.yarn.SparkRackResolver.coreResolve(SparkRackResolver.scala:76)
{noformat}




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39473) NullPointerException in SparkRackResolver

2022-06-14 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros resolved SPARK-39473.

Resolution: Invalid

This is because of Scala 2.11; on later versions *asScala* is prepared to handle 
null values.
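
Regardless of the Scala version, the result can also be guarded explicitly. A minimal sketch of that defensive pattern (illustrative names only; `rawResolve` stands in for the underlying DNSToSwitchMapping resolve call, which may return null):

{code:scala}
import scala.collection.JavaConverters._

object NullSafeResolve {
  // Wrap the possibly-null Java list in Option before converting, so a null
  // result becomes an empty Seq instead of a NullPointerException.
  def resolveRacks(rawResolve: java.util.List[String] => java.util.List[String],
                   hosts: java.util.List[String]): Seq[String] =
    Option(rawResolve(hosts)).map(_.asScala.toSeq).getOrElse(Seq.empty)
}
{code}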

> NullPointerException in SparkRackResolver 
> --
>
> Key: SPARK-39473
> URL: https://issues.apache.org/jira/browse/SPARK-39473
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.8, 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 
> 3.2.0, 3.2.1
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
>
> When the underlying DNSToSwitchMapping's resolver returns a null we get a 
> NullPointerException:
> {noformat}
> 22/05/24 08:12:12 980 ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.NullPointerException
>   at scala.collection.convert.Wrappers$JListWrapper.isEmpty(Wrappers.scala:87)
>   at org.apache.spark.deploy.yarn.SparkRackResolver.coreResolve(SparkRackResolver.scala:76)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34779) ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs

2021-04-01 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros reassigned SPARK-34779:
--

Assignee: Baohe Zhang

> ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat 
> occurs
> ---
>
> Key: SPARK-34779
> URL: https://issues.apache.org/jira/browse/SPARK-34779
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Baohe Zhang
>Assignee: Baohe Zhang
>Priority: Major
>
> The current implementation of ExecutorMetricsPoller uses the task count in each 
> stage to decide whether to keep a stage entry or not. In the case where the 
> executor has only 1 core, this may cause these issues:
>  # Peak metrics missing (due to the stage entry being removed within a heartbeat 
> interval)
>  # Unnecessary and frequent hashmap entry removal and insertion.
> Assuming an executor with 1 core has 2 tasks (task1 and task2, both belonging to 
> stage (0,0)) to execute within a heartbeat interval, the workflow in the current 
> ExecutorMetricsPoller implementation would be:
> 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count increments 
> to 1
> 2. 1st poll() -> update peak metrics of stage (0, 0)
> 3. task1 end -> stage (0, 0) task count decrements to 0, stage (0, 0) entry 
> removed, peak metrics lost.
> 4. task2 start -> stage (0, 0) entry created in stageTCMP, task count increments 
> to 1
> 5. 2nd poll() -> update peak metrics of stage (0, 0)
> 6. task2 end -> stage (0, 0) task count decrements to 0, stage (0, 0) entry 
> removed, peak metrics lost
> 7. heartbeat() -> empty or inaccurate peak metrics for stage (0,0) reported.
> We can fix the issue by keeping entries with task count = 0 in the stageTCMP map 
> until a heartbeat occurs. At the heartbeat, after reporting the peak metrics for 
> each stage, we scan each stage in stageTCMP and remove entries with task count = 0.
> After the fix, the workflow would be:
> 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count increments 
> to 1
> 2. 1st poll() -> update peak metrics of stage (0, 0)
> 3. task1 end -> stage (0, 0) task count decrements to 0, but the entry (0,0) 
> still remains.
> 4. task2 start -> task count of stage (0,0) increments to 1
> 5. 2nd poll() -> update peak metrics of stage (0, 0)
> 6. task2 end -> stage (0, 0) task count decrements to 0, but the entry (0,0) 
> still remains.
> 7. heartbeat() -> accurate peak metrics for stage (0, 0) reported. Remove the 
> entry for stage (0,0) from stageTCMP because its task count is 0.
>  
> How to verify the behavior? 
> Submit a job with a custom polling interval (e.g., 2s) and 
> spark.executor.cores=1, then check the debug logs of ExecutorMetricsPoller.
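
A minimal sketch of the fixed entry lifecycle (names such as stageTCMP follow the description above; this is not the actual ExecutorMetricsPoller code, and the real implementation has to be more careful about races):

{code:scala}
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong
import scala.collection.JavaConverters._

object StageEntryLifecycle {
  // (stageId, stageAttemptId) -> number of running tasks
  private val stageTCMP = new ConcurrentHashMap[(Int, Int), AtomicLong]()

  def onTaskStart(stage: (Int, Int)): Unit =
    stageTCMP.computeIfAbsent(stage, _ => new AtomicLong(0)).incrementAndGet()

  def onTaskEnd(stage: (Int, Int)): Unit =
    Option(stageTCMP.get(stage)).foreach(_.decrementAndGet())   // entry kept even at 0

  def onHeartbeat(reportPeaks: ((Int, Int)) => Unit): Unit = {
    stageTCMP.asScala.keys.foreach(reportPeaks)          // report peak metrics first
    stageTCMP.asScala.foreach { case (stage, count) =>   // only now drop idle stages
      if (count.get() == 0) stageTCMP.remove(stage)
    }
  }
}
{code}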



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34779) ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat occurs

2021-04-01 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros resolved SPARK-34779.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31871
[https://github.com/apache/spark/pull/31871]

> ExecutorMetricsPoller should keep stage entry in stageTCMP until a heartbeat 
> occurs
> ---
>
> Key: SPARK-34779
> URL: https://issues.apache.org/jira/browse/SPARK-34779
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Baohe Zhang
>Assignee: Baohe Zhang
>Priority: Major
> Fix For: 3.2.0
>
>
> The current implementation of ExecutorMetricsPoller uses the task count in each 
> stage to decide whether to keep a stage entry or not. In the case where the 
> executor has only 1 core, this may cause these issues:
>  # Peak metrics missing (due to the stage entry being removed within a heartbeat 
> interval)
>  # Unnecessary and frequent hashmap entry removal and insertion.
> Assuming an executor with 1 core has 2 tasks (task1 and task2, both belonging to 
> stage (0,0)) to execute within a heartbeat interval, the workflow in the current 
> ExecutorMetricsPoller implementation would be:
> 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count increments 
> to 1
> 2. 1st poll() -> update peak metrics of stage (0, 0)
> 3. task1 end -> stage (0, 0) task count decrements to 0, stage (0, 0) entry 
> removed, peak metrics lost.
> 4. task2 start -> stage (0, 0) entry created in stageTCMP, task count increments 
> to 1
> 5. 2nd poll() -> update peak metrics of stage (0, 0)
> 6. task2 end -> stage (0, 0) task count decrements to 0, stage (0, 0) entry 
> removed, peak metrics lost
> 7. heartbeat() -> empty or inaccurate peak metrics for stage (0,0) reported.
> We can fix the issue by keeping entries with task count = 0 in the stageTCMP map 
> until a heartbeat occurs. At the heartbeat, after reporting the peak metrics for 
> each stage, we scan each stage in stageTCMP and remove entries with task count = 0.
> After the fix, the workflow would be:
> 1. task1 start -> stage (0, 0) entry created in stageTCMP, task count increments 
> to 1
> 2. 1st poll() -> update peak metrics of stage (0, 0)
> 3. task1 end -> stage (0, 0) task count decrements to 0, but the entry (0,0) 
> still remains.
> 4. task2 start -> task count of stage (0,0) increments to 1
> 5. 2nd poll() -> update peak metrics of stage (0, 0)
> 6. task2 end -> stage (0, 0) task count decrements to 0, but the entry (0,0) 
> still remains.
> 7. heartbeat() -> accurate peak metrics for stage (0, 0) reported. Remove the 
> entry for stage (0,0) from stageTCMP because its task count is 0.
>  
> How to verify the behavior? 
> Submit a job with a custom polling interval (e.g., 2s) and 
> spark.executor.cores=1, then check the debug logs of ExecutorMetricsPoller.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-04-08 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317034#comment-17317034
 ] 

Attila Zsolt Piros commented on SPARK-34738:


hi [~shaneknapp]! 

Can you please share with us the test output and the 
"resource-managers/kubernetes/integration-tests/target/integration-tests.log" 
file for point 4?

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34738) Upgrade Minikube and kubernetes cluster version on Jenkins

2021-04-08 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317381#comment-17317381
 ] 

Attila Zsolt Piros commented on SPARK-34738:


No worries. 

I have a guess. Check 
"resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/PVTestsSuite.scala"
there is a match expression to select the node for mount:

{noformat}
  .withMatchExpressions(new NodeSelectorRequirementBuilder()
    .withKey("kubernetes.io/hostname")
    .withOperator("In")
    .withValues("minikube", "m01", "docker-for-desktop", "docker-desktop")
    .build())
{noformat}

This is very suspicious. 
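
One quick way to check that suspicion (a sketch, assuming the fabric8 client can reach the upgraded Minikube cluster; the class and label names are the standard fabric8/Kubernetes ones, everything else is illustrative): list the nodes and compare their kubernetes.io/hostname label against the hard-coded values.

{code:scala}
import scala.collection.JavaConverters._
import io.fabric8.kubernetes.client.DefaultKubernetesClient

object NodeHostnameCheck {
  def main(args: Array[String]): Unit = {
    val client = new DefaultKubernetesClient()   // picks up the local kubeconfig
    try {
      val expected = Set("minikube", "m01", "docker-for-desktop", "docker-desktop")
      client.nodes().list().getItems.asScala.foreach { node =>
        val hostname = node.getMetadata.getLabels.get("kubernetes.io/hostname")
        println(s"node ${node.getMetadata.getName}: hostname=$hostname, " +
          s"matched by PVTestsSuite=${expected.contains(hostname)}")
      }
    } finally client.close()
  }
}
{code}

If the reported hostname is not in that list, the persistent volume's node affinity in PVTestsSuite would never match, which could explain the failures.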

> Upgrade Minikube and kubernetes cluster version on Jenkins
> --
>
> Key: SPARK-34738
> URL: https://issues.apache.org/jira/browse/SPARK-34738
> Project: Spark
>  Issue Type: Task
>  Components: jenkins, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Assignee: Shane Knapp
>Priority: Major
> Attachments: integration-tests.log
>
>
> [~shaneknapp] as we discussed [on the mailing 
> list|http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html]
>  Minikube can be upgraded to the latest (v1.18.1) and kubernetes version 
> should be v1.17.3 (`minikube config set kubernetes-version v1.17.3`).
> [Here|https://github.com/apache/spark/pull/31829] is my PR which uses a new 
> method to configure the kubernetes client. Thanks in advance for using it for 
> testing on Jenkins after the Minikube version is updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


