[jira] [Commented] (SPARK-27654) spark unable to read parquet file- corrupt footer

2019-05-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835306#comment-16835306
 ] 

Hyukjin Kwon commented on SPARK-27654:
--

Please don't just copy and paste the error message; no one can investigate
further from that alone.

How did you reproduce this? Is the Parquet file really not corrupted? Can you
check it via parquet-tools?

What are the expected input and output?
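
For reference, a minimal sketch (not from the original report) of one way to sanity-check the footer from Python, assuming pyarrow is installed and the file is reachable through a local path (e.g. the /dbfs FUSE mount on Databricks); the path is taken from the stack trace and is illustrative only. The parquet-tools CLI ({{parquet-tools meta <file>}}) prints similar footer information.

{code:python}
# Sketch only: read_metadata() parses just the Parquet footer, so it fails
# fast if the footer really is corrupt. The path below is illustrative.
import pyarrow.parquet as pq

metadata = pq.read_metadata("/dbfs/mnt/valassis/data1")
print(metadata)           # schema, row groups, created_by, ...
print(metadata.num_rows)
{code}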

> spark unable to read parquet file- corrupt footer
> -
>
> Key: SPARK-27654
> URL: https://issues.apache.org/jira/browse/SPARK-27654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gautham Rajendiran
>Priority: Major
>
> Reading a large Parquet file produces a corrupt footer error.
>  
> {code:java}
> --- 
> Py4JJavaError Traceback (most recent call last)  in 
> () > 1 df = spark.read.parquet("data1") 2 df.head(1) 
> /databricks/spark/python/pyspark/sql/readwriter.py in parquet(self, *paths) 
> 314 [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')] 
> 315 """ --> 316 return 
> self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths))) 317 318 
> @ignore_unicode_prefix 
> /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in 
> __call__(self, *args) 1255 answer = self.gateway_client.send_command(command) 
> 1256 return_value = get_return_value( -> 1257 answer, self.gateway_client, 
> self.target_id, self.name) 1258 1259 for temp_arg in temp_args: 
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 61 def 
> deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except 
> py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() 
> /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name) 326 raise 
> Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". --> 328 
> format(target_id, ".", name), value) 329 else: 330 raise Py4JError( 
> Py4JJavaError: An error occurred while calling o1045.parquet. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 
> (TID 1458, 10.139.64.5, executor 0): org.apache.spark.SparkException: 
> Exception thrown in awaitResult: at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) at 
> org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:422) at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readParquetFootersInParallel(ParquetFileFormat.scala:602)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:675)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:667)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:304) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
> org.apache.spark.scheduler.Task.run(Task.scala:112) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: 
> Could not read footer for file: FileStatus{path=dbfs:/mnt/valassis/data1; 
> isDirectory=false; length=66061642673; replication=0; blocksize=0; 
> modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; 
> isSymlink=false} at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:615)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:602)
>  at 
> org.apache.spark.util.ThreadUtils$$anonfun$5$$anonfun$apply$2$$anonfun$apply$3.apply(ThreadUtils.scala:419)
>  at 
> 

[jira] [Resolved] (SPARK-27654) spark unable to read parquet file- corrupt footer

2019-05-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27654.
--
Resolution: Invalid

> spark unable to read parquet file- corrupt footer
> -
>
> Key: SPARK-27654
> URL: https://issues.apache.org/jira/browse/SPARK-27654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gautham Rajendiran
>Priority: Major
>
> Reading a large Parquet file produces a corrupt footer error.
>  
> {code:java}
> --- 
> Py4JJavaError Traceback (most recent call last)  in 
> () > 1 df = spark.read.parquet("data1") 2 df.head(1) 
> /databricks/spark/python/pyspark/sql/readwriter.py in parquet(self, *paths) 
> 314 [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')] 
> 315 """ --> 316 return 
> self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths))) 317 318 
> @ignore_unicode_prefix 
> /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in 
> __call__(self, *args) 1255 answer = self.gateway_client.send_command(command) 
> 1256 return_value = get_return_value( -> 1257 answer, self.gateway_client, 
> self.target_id, self.name) 1258 1259 for temp_arg in temp_args: 
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 61 def 
> deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except 
> py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() 
> /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name) 326 raise 
> Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". --> 328 
> format(target_id, ".", name), value) 329 else: 330 raise Py4JError( 
> Py4JJavaError: An error occurred while calling o1045.parquet. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 
> (TID 1458, 10.139.64.5, executor 0): org.apache.spark.SparkException: 
> Exception thrown in awaitResult: at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) at 
> org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:422) at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readParquetFootersInParallel(ParquetFileFormat.scala:602)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:675)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:667)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:304) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
> org.apache.spark.scheduler.Task.run(Task.scala:112) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: 
> Could not read footer for file: FileStatus{path=dbfs:/mnt/valassis/data1; 
> isDirectory=false; length=66061642673; replication=0; blocksize=0; 
> modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; 
> isSymlink=false} at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:615)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:602)
>  at 
> org.apache.spark.util.ThreadUtils$$anonfun$5$$anonfun$apply$2$$anonfun$apply$3.apply(ThreadUtils.scala:419)
>  at 
> org.apache.spark.util.threads.SparkThreadLocalCapturingHelper$class.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:52)
>  at 
> org.apache.spark.util.threads.CapturedSparkThreadLocals.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:71)
>  at 
> org.apache.spark.util.ThreadUtils$$anonfun$5$$anonfun$apply$2.apply(ThreadUtils.scala:419)
>  

[jira] [Updated] (SPARK-27654) spark unable to read parquet file- corrupt footer

2019-05-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27654:
-
Priority: Major  (was: Blocker)

> spark unable to read parquet file- corrupt footer
> -
>
> Key: SPARK-27654
> URL: https://issues.apache.org/jira/browse/SPARK-27654
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Gautham Rajendiran
>Priority: Major
>
> Reading a large Parquet file produces a corrupt footer error.
>  
> {code:java}
> --- 
> Py4JJavaError Traceback (most recent call last)  in 
> () > 1 df = spark.read.parquet("data1") 2 df.head(1) 
> /databricks/spark/python/pyspark/sql/readwriter.py in parquet(self, *paths) 
> 314 [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')] 
> 315 """ --> 316 return 
> self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths))) 317 318 
> @ignore_unicode_prefix 
> /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in 
> __call__(self, *args) 1255 answer = self.gateway_client.send_command(command) 
> 1256 return_value = get_return_value( -> 1257 answer, self.gateway_client, 
> self.target_id, self.name) 1258 1259 for temp_arg in temp_args: 
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 61 def 
> deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except 
> py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() 
> /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name) 326 raise 
> Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". --> 328 
> format(target_id, ".", name), value) 329 else: 330 raise Py4JError( 
> Py4JJavaError: An error occurred while calling o1045.parquet. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 
> (TID 1458, 10.139.64.5, executor 0): org.apache.spark.SparkException: 
> Exception thrown in awaitResult: at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) at 
> org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:422) at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readParquetFootersInParallel(ParquetFileFormat.scala:602)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:675)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:667)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:304) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
> org.apache.spark.scheduler.Task.run(Task.scala:112) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: 
> Could not read footer for file: FileStatus{path=dbfs:/mnt/valassis/data1; 
> isDirectory=false; length=66061642673; replication=0; blocksize=0; 
> modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; 
> isSymlink=false} at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:615)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:602)
>  at 
> org.apache.spark.util.ThreadUtils$$anonfun$5$$anonfun$apply$2$$anonfun$apply$3.apply(ThreadUtils.scala:419)
>  at 
> org.apache.spark.util.threads.SparkThreadLocalCapturingHelper$class.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:52)
>  at 
> org.apache.spark.util.threads.CapturedSparkThreadLocals.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:71)
>  at 
> 

[jira] [Updated] (SPARK-27654) spark unable to read parquet file- corrupt footer

2019-05-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27654:
-
Component/s: (was: Spark Core)
 SQL

> spark unable to read parquet file- corrupt footer
> -
>
> Key: SPARK-27654
> URL: https://issues.apache.org/jira/browse/SPARK-27654
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Gautham Rajendiran
>Priority: Major
>
> Reading a large Parquet file produces a corrupt footer error.
>  
> {code:java}
> --- 
> Py4JJavaError Traceback (most recent call last)  in 
> () > 1 df = spark.read.parquet("data1") 2 df.head(1) 
> /databricks/spark/python/pyspark/sql/readwriter.py in parquet(self, *paths) 
> 314 [('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')] 
> 315 """ --> 316 return 
> self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths))) 317 318 
> @ignore_unicode_prefix 
> /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in 
> __call__(self, *args) 1255 answer = self.gateway_client.send_command(command) 
> 1256 return_value = get_return_value( -> 1257 answer, self.gateway_client, 
> self.target_id, self.name) 1258 1259 for temp_arg in temp_args: 
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 61 def 
> deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except 
> py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() 
> /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name) 326 raise 
> Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". --> 328 
> format(target_id, ".", name), value) 329 else: 330 raise Py4JError( 
> Py4JJavaError: An error occurred while calling o1045.parquet. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 
> (TID 1458, 10.139.64.5, executor 0): org.apache.spark.SparkException: 
> Exception thrown in awaitResult: at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) at 
> org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:422) at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readParquetFootersInParallel(ParquetFileFormat.scala:602)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:675)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:667)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:304) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
> org.apache.spark.scheduler.Task.run(Task.scala:112) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: 
> Could not read footer for file: FileStatus{path=dbfs:/mnt/valassis/data1; 
> isDirectory=false; length=66061642673; replication=0; blocksize=0; 
> modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; 
> isSymlink=false} at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:615)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:602)
>  at 
> org.apache.spark.util.ThreadUtils$$anonfun$5$$anonfun$apply$2$$anonfun$apply$3.apply(ThreadUtils.scala:419)
>  at 
> org.apache.spark.util.threads.SparkThreadLocalCapturingHelper$class.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:52)
>  at 
> org.apache.spark.util.threads.CapturedSparkThreadLocals.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:71)
>  at 
> 

[jira] [Resolved] (SPARK-25139) PythonRunner#WriterThread released block after TaskRunner finally block which invoke BlockManager#releaseAllLocksForTask

2019-05-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25139.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24542
[https://github.com/apache/spark/pull/24542]

> PythonRunner#WriterThread released block after TaskRunner finally block which 
>  invoke BlockManager#releaseAllLocksForTask
> -
>
> Key: SPARK-25139
> URL: https://issues.apache.org/jira/browse/SPARK-25139
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: DENG FEI
>Assignee: Xingbo Jiang
>Priority: Major
> Fix For: 3.0.0
>
>
> We run PySpark Streaming on YARN, and the executor dies with this error: the
> task released its lock when it finished, but the Python writer had not
> actually released the lock yet.
> Normally the task just double-checks the lock, but here the ordering went
> wrong.
> The executor trace log is below:
>  18/08/17 13:52:20 Executor task launch worker for task 137 DEBUG 
> BlockManager: Getting local block input-0-1534485138800 18/08/17 13:52:20 
> Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 
> trying to acquire read lock for input-0-1534485138800 18/08/17 13:52:20 
> Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 
> acquired read lock for input-0-1534485138800 18/08/17 13:52:20 Executor task 
> launch worker for task 137 DEBUG BlockManager: Level for block 
> input-0-1534485138800 is StorageLevel(disk, memory, 1 replicas) 18/08/17 
> 13:52:20 Executor task launch worker for task 137 INFO BlockManager: Found 
> block input-0-1534485138800 locally 18/08/17 13:52:20 Executor task launch 
> worker for task 137 INFO PythonRunner: Times: total = 8, boot = 3, init = 5, 
> finish = 0 18/08/17 13:52:20 stdout writer for python TRACE BlockInfoManager: 
> Task 137 releasing lock for input-0-1534485138800 18/08/17 13:52:20 Executor 
> task launch worker for task 137 INFO Executor: 1 block locks were not 
> released by TID = 137: [input-0-1534485138800] 18/08/17 13:52:20 stdout 
> writer for python ERROR Utils: Uncaught exception in thread stdout writer for 
> python java.lang.AssertionError: assertion failed: Block 
> input-0-1534485138800 is not locked for reading at 
> scala.Predef$.assert(Predef.scala:170) at 
> org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299) 
> at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:769) 
> at 
> org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:540)
>  at 
> org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44)
>  at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:33) 
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213)
>  at 
> org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407)
>  at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
>  18/08/17 13:52:20 stdout writer for python ERROR 
> SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout 
> writer for python,5,main]
>  
> I think we should wait for the WriterThread after Task#run.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25139) PythonRunner#WriterThread released block after TaskRunner finally block which invoke BlockManager#releaseAllLocksForTask

2019-05-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25139:


Assignee: Xingbo Jiang  (was: Hyukjin Kwon)

> PythonRunner#WriterThread released block after TaskRunner finally block which 
>  invoke BlockManager#releaseAllLocksForTask
> -
>
> Key: SPARK-25139
> URL: https://issues.apache.org/jira/browse/SPARK-25139
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: DENG FEI
>Assignee: Xingbo Jiang
>Priority: Major
>
> We run PySpark Streaming on YARN, and the executor dies with this error: the
> task released its lock when it finished, but the Python writer had not
> actually released the lock yet.
> Normally the task just double-checks the lock, but here the ordering went
> wrong.
> The executor trace log is below:
>  18/08/17 13:52:20 Executor task launch worker for task 137 DEBUG 
> BlockManager: Getting local block input-0-1534485138800 18/08/17 13:52:20 
> Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 
> trying to acquire read lock for input-0-1534485138800 18/08/17 13:52:20 
> Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 
> acquired read lock for input-0-1534485138800 18/08/17 13:52:20 Executor task 
> launch worker for task 137 DEBUG BlockManager: Level for block 
> input-0-1534485138800 is StorageLevel(disk, memory, 1 replicas) 18/08/17 
> 13:52:20 Executor task launch worker for task 137 INFO BlockManager: Found 
> block input-0-1534485138800 locally 18/08/17 13:52:20 Executor task launch 
> worker for task 137 INFO PythonRunner: Times: total = 8, boot = 3, init = 5, 
> finish = 0 18/08/17 13:52:20 stdout writer for python TRACE BlockInfoManager: 
> Task 137 releasing lock for input-0-1534485138800 18/08/17 13:52:20 Executor 
> task launch worker for task 137 INFO Executor: 1 block locks were not 
> released by TID = 137: [input-0-1534485138800] 18/08/17 13:52:20 stdout 
> writer for python ERROR Utils: Uncaught exception in thread stdout writer for 
> python java.lang.AssertionError: assertion failed: Block 
> input-0-1534485138800 is not locked for reading at 
> scala.Predef$.assert(Predef.scala:170) at 
> org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299) 
> at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:769) 
> at 
> org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:540)
>  at 
> org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44)
>  at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:33) 
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213)
>  at 
> org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407)
>  at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
>  18/08/17 13:52:20 stdout writer for python ERROR 
> SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout 
> writer for python,5,main]
>  
> I think we should wait for the WriterThread after Task#run.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25139) PythonRunner#WriterThread released block after TaskRunner finally block which invoke BlockManager#releaseAllLocksForTask

2019-05-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25139:


Assignee: Hyukjin Kwon

> PythonRunner#WriterThread released block after TaskRunner finally block which 
>  invoke BlockManager#releaseAllLocksForTask
> -
>
> Key: SPARK-25139
> URL: https://issues.apache.org/jira/browse/SPARK-25139
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 2.3.1
>Reporter: DENG FEI
>Assignee: Hyukjin Kwon
>Priority: Major
>
> We run PySpark Streaming on YARN, and the executor dies with this error: the
> task released its lock when it finished, but the Python writer had not
> actually released the lock yet.
> Normally the task just double-checks the lock, but here the ordering went
> wrong.
> The executor trace log is below:
>  18/08/17 13:52:20 Executor task launch worker for task 137 DEBUG 
> BlockManager: Getting local block input-0-1534485138800 18/08/17 13:52:20 
> Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 
> trying to acquire read lock for input-0-1534485138800 18/08/17 13:52:20 
> Executor task launch worker for task 137 TRACE BlockInfoManager: Task 137 
> acquired read lock for input-0-1534485138800 18/08/17 13:52:20 Executor task 
> launch worker for task 137 DEBUG BlockManager: Level for block 
> input-0-1534485138800 is StorageLevel(disk, memory, 1 replicas) 18/08/17 
> 13:52:20 Executor task launch worker for task 137 INFO BlockManager: Found 
> block input-0-1534485138800 locally 18/08/17 13:52:20 Executor task launch 
> worker for task 137 INFO PythonRunner: Times: total = 8, boot = 3, init = 5, 
> finish = 0 18/08/17 13:52:20 stdout writer for python TRACE BlockInfoManager: 
> Task 137 releasing lock for input-0-1534485138800 18/08/17 13:52:20 Executor 
> task launch worker for task 137 INFO Executor: 1 block locks were not 
> released by TID = 137: [input-0-1534485138800] 18/08/17 13:52:20 stdout 
> writer for python ERROR Utils: Uncaught exception in thread stdout writer for 
> python java.lang.AssertionError: assertion failed: Block 
> input-0-1534485138800 is not locked for reading at 
> scala.Predef$.assert(Predef.scala:170) at 
> org.apache.spark.storage.BlockInfoManager.unlock(BlockInfoManager.scala:299) 
> at org.apache.spark.storage.BlockManager.releaseLock(BlockManager.scala:769) 
> at 
> org.apache.spark.storage.BlockManager$$anonfun$1.apply$mcV$sp(BlockManager.scala:540)
>  at 
> org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44)
>  at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:33) 
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at 
> scala.collection.Iterator$class.foreach(Iterator.scala:893) at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:213)
>  at 
> org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:407)
>  at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:215)
>  at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1991) at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:170)
>  18/08/17 13:52:20 stdout writer for python ERROR 
> SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout 
> writer for python,5,main]
>  
> I think we should wait for the WriterThread after Task#run.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27655) Persistent the table statistics to metadata after fall back to hdfs

2019-05-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27655:


Assignee: Apache Spark

> Persistent the table statistics to metadata after fall back to hdfs
> ---
>
> Key: SPARK-27655
> URL: https://issues.apache.org/jira/browse/SPARK-27655
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: disableFallBackToHdfs.png, enableFallBackToHdfs.png
>
>
> This is a real case. We need to do many joins across different tables, but
> the statistics of some tables are incorrect. The job needs 43 min; it only
> needs 3.7 min after setting
> {{spark.sql.statistics.persistentStatsAfterFallBack=true}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27655) Persistent the table statistics to metadata after fall back to hdfs

2019-05-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27655:


Assignee: (was: Apache Spark)

> Persistent the table statistics to metadata after fall back to hdfs
> ---
>
> Key: SPARK-27655
> URL: https://issues.apache.org/jira/browse/SPARK-27655
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: disableFallBackToHdfs.png, enableFallBackToHdfs.png
>
>
> This is a real case. We need to do many joins across different tables, but
> the statistics of some tables are incorrect. The job needs 43 min; it only
> needs 3.7 min after setting
> {{spark.sql.statistics.persistentStatsAfterFallBack=true}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27655) Persistent the table statistics to metadata after fall back to hdfs

2019-05-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27655:

Attachment: disableFallBackToHdfs.png
enableFallBackToHdfs.png

> Persistent the table statistics to metadata after fall back to hdfs
> ---
>
> Key: SPARK-27655
> URL: https://issues.apache.org/jira/browse/SPARK-27655
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: disableFallBackToHdfs.png, enableFallBackToHdfs.png
>
>
> This is a real case. We need to do many joins across different tables, but
> the statistics of some tables are incorrect. The job needs 43 min; it only
> needs 3.7 min after setting
> {{spark.sql.statistics.persistentStatsAfterFallBack=true}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27655) Persistent the table statistics to metadata after fall back to hdfs

2019-05-07 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27655:
---

 Summary: Persistent the table statistics to metadata after fall 
back to hdfs
 Key: SPARK-27655
 URL: https://issues.apache.org/jira/browse/SPARK-27655
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


This is a real case. We need to do many joins across different tables, but the
statistics of some tables are incorrect. The job needs 43 min; it only needs
3.7 min after setting {{spark.sql.statistics.persistentStatsAfterFallBack=true}}.
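
For context, a minimal sketch (not from the original report) of how such settings would be toggled from PySpark, assuming an active SparkSession named {{spark}}. {{spark.sql.statistics.fallBackToHdfs}} is the existing fallback switch; the persistent-stats flag is only the name proposed in this issue, so whether it exists depends on the eventual patch.

{code:python}
# Sketch only: toggle the relevant SQL configs at runtime.
# spark.sql.statistics.fallBackToHdfs already exists in Spark SQL;
# spark.sql.statistics.persistentStatsAfterFallBack is the flag proposed in
# this issue and may not be available in any released Spark version.
spark.conf.set("spark.sql.statistics.fallBackToHdfs", "true")
spark.conf.set("spark.sql.statistics.persistentStatsAfterFallBack", "true")
{code}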



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27645) Cache result of count function to that RDD

2019-05-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835294#comment-16835294
 ] 

Hyukjin Kwon commented on SPARK-27645:
--

You can wrap {{Dataset}} so that it holds the pre-calculated count, take a
{{count}} argument, or have a class that holds the count.
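
As a rough illustration of the first suggestion (a hypothetical wrapper, not an existing Spark API), something like the sketch below computes the count once and reuses it on later calls; since the underlying data is immutable, the cached value stays valid.

{code:python}
# Hypothetical sketch of the "wrap the Dataset" suggestion. The class and
# attribute names are illustrative only; this is not an existing Spark API.
class CountedDataFrame:
    def __init__(self, df):
        self.df = df
        self._count = None  # filled in on first call

    def count(self):
        if self._count is None:
            self._count = self.df.count()  # runs the Spark count job only once
        return self._count
{code}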

> Cache result of count function to that RDD
> --
>
> Key: SPARK-27645
> URL: https://issues.apache.org/jira/browse/SPARK-27645
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Seungmin Lee
>Priority: Major
>
> I'm not sure whether there has been an update for this (as far as I know, no
> such feature exists). Since an RDD is immutable, why don't we keep the result
> of that RDD's count function and reuse it in future calls?
> Sometimes we only have the RDD variable and don't have a previously computed
> count result.
> In this case, not running the whole count action over the entire dataset
> would be very beneficial in terms of performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-05-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26944.
--
Resolution: Not A Problem

I think it's not a big deal anyway.

> Python unit-tests.log not available in artifacts for a build in Jenkins
> ---
>
> Key: SPARK-26944
> URL: https://issues.apache.org/jira/browse/SPARK-26944
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Assignee: shane knapp
>Priority: Minor
> Attachments: Screen Shot 2019-03-05 at 12.08.43 PM.png
>
>
> I had a PR where the Python unit tests failed. The tests point at the
> `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file,
> but it seems I can't get to that file from the Jenkins UI (are all PRs
> writing to the same file?).
> {code:java}
> 
> Running PySpark tests
> 
> Running PySpark tests. Output is in 
> /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code}
> For reference, please see this build: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console
> This Jira is to make it available under the artifacts for each build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27647) Metric Gauge not threadsafe

2019-05-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835288#comment-16835288
 ] 

Hyukjin Kwon commented on SPARK-27647:
--

I think we should clarify a reproducer or test before fixing it or filing it as 
an issue.

> Metric Gauge not threadsafe
> ---
>
> Key: SPARK-27647
> URL: https://issues.apache.org/jira/browse/SPARK-27647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.2
>Reporter: bettermouse
>Priority: Major
>
> When I read the class DAGSchedulerSource, I found that some Gauges may not be
> thread-safe, for example:
>  metricRegistry.register(MetricRegistry.name("stage", "failedStages"), new 
> Gauge[Int] {
>  override def getValue: Int = dagScheduler.failedStages.size
>  })
> This method may be called from another thread, but the failedStages field is
> not thread-safe.
> The fields runningStages and waitingStages have the same problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27639) InMemoryTableScan shows the table name on UI if possible

2019-05-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27639.
---
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/24534

> InMemoryTableScan shows the table name on UI if possible
> 
>
> Key: SPARK-27639
> URL: https://issues.apache.org/jira/browse/SPARK-27639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> It only shows InMemoryTableScan when scanning InMemoryTable.
> When there are many InMemoryTables, it is difficult to distinguish which one
> is the one we are looking for. This PR shows the table name when scanning an
> InMemoryTable.
> !https://user-images.githubusercontent.com/5399861/57213799-7bccf100-701a-11e9-9872-d90b4a185dc6.png!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27647) Metric Gauge not threadsafe

2019-05-07 Thread bettermouse (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835259#comment-16835259
 ] 

bettermouse edited comment on SPARK-27647 at 5/8/19 2:29 AM:
-

First, in the DAGScheduler thread, the field dagScheduler.failedStages is updated.
Second, DAGSchedulerSource mainly uses the Metrics jar, and this method uses a Gauge.
Metrics calls this method periodically; that is the second thread.

You can see the Gauge class documentation:

A gauge metric is an instantaneous reading of a particular value. To instrument
a queue's depth, for example:
 final Queue<String> queue = new ConcurrentLinkedQueue<String>();
 final Gauge<Integer> queueDepth = new Gauge<Integer>() {
   public Integer getValue() {
     return queue.size();
   }
 };

It uses a ConcurrentLinkedQueue, which is thread-safe. Locally I wrote an
example and found that this method is called in the Metrics thread, so I think
it is not safe: one thread is the DAGScheduler thread, the other is the Metrics
thread that reports the message, and both read the dagScheduler.failedStages
field.

I mean it's not safe to read a variable (dagScheduler.failedStages) from
another thread.


was (Author: bettermouse):
First in DAGScheduler thread, Field dagScheduler.failedStages is updated.
Second DAGSchedulerSource mainly use Metrics jar.This method use Gauge.
Metrics should call this method periodically,this is second thread.


you can see Gauge class.

A gauge metric is an instantaneous reading of a particular value. To instrument 
a queue's depth, for example: 
 final Queue queue = new ConcurrentLinkedQueue();
 final Gauge queueDepth = new Gauge() {
 public Integer getValue() {
 return queue.size();
 }
 };
 
 it uses a ConcurrentLinkedQueue which is thread safe.in my local,I write a 
example,
I find this method is called in Metrics thread.
so I think it is not safe.one thread is DAGScheduler thread. on thread is 
Metrics thread
to report the message.the both read dagScheduler.failedStages field.

> Metric Gauge not threadsafe
> ---
>
> Key: SPARK-27647
> URL: https://issues.apache.org/jira/browse/SPARK-27647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.2
>Reporter: bettermouse
>Priority: Major
>
> When I read the class DAGSchedulerSource, I found that some Gauges may not be
> thread-safe, for example:
>  metricRegistry.register(MetricRegistry.name("stage", "failedStages"), new 
> Gauge[Int] {
>  override def getValue: Int = dagScheduler.failedStages.size
>  })
> This method may be called from another thread, but the failedStages field is
> not thread-safe.
> The fields runningStages and waitingStages have the same problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27647) Metric Gauge not threadsafe

2019-05-07 Thread bettermouse (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835259#comment-16835259
 ] 

bettermouse commented on SPARK-27647:
-

First, in the DAGScheduler thread, the field dagScheduler.failedStages is updated.
Second, DAGSchedulerSource mainly uses the Metrics jar, and this method uses a Gauge.
Metrics calls this method periodically; that is the second thread.

You can see the Gauge class documentation:

A gauge metric is an instantaneous reading of a particular value. To instrument
a queue's depth, for example:
 final Queue<String> queue = new ConcurrentLinkedQueue<String>();
 final Gauge<Integer> queueDepth = new Gauge<Integer>() {
   public Integer getValue() {
     return queue.size();
   }
 };

It uses a ConcurrentLinkedQueue, which is thread-safe. Locally I wrote an
example and found that this method is called in the Metrics thread, so I think
it is not safe: one thread is the DAGScheduler thread, the other is the Metrics
thread that reports the message, and both read the dagScheduler.failedStages
field.

> Metric Gauge not threadsafe
> ---
>
> Key: SPARK-27647
> URL: https://issues.apache.org/jira/browse/SPARK-27647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.2
>Reporter: bettermouse
>Priority: Major
>
> When I read the class DAGSchedulerSource, I found that some Gauges may not be
> thread-safe, for example:
>  metricRegistry.register(MetricRegistry.name("stage", "failedStages"), new 
> Gauge[Int] {
>  override def getValue: Int = dagScheduler.failedStages.size
>  })
> This method may be called from another thread, but the failedStages field is
> not thread-safe.
> The fields runningStages and waitingStages have the same problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23191) Workers registration failes in case of network drop

2019-05-07 Thread zuotingbing (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835256#comment-16835256
 ] 

zuotingbing edited comment on SPARK-23191 at 5/8/19 2:21 AM:
-

See the detailed logs below; the master changed from vmax18 to vmax17.

On master vmax18, the worker was removed because no heartbeat arrived for 60
seconds, but a heartbeat came soon afterwards and the worker was asked to
re-register with master vmax18.

At the same time, the worker had already been registered with master vmax17
when master vmax17 got leadership.

So the worker registration failed: Duplicate worker ID.

 

spark-mr-master-vmax18.log:
{code:java}
2019-03-15 20:22:09,441 INFO ZooKeeperLeaderElectionAgent: We have lost 
leadership
2019-03-15 20:22:14,544 WARN Master: Removing 
worker-20190218183101-vmax18-33129 because we got no heartbeat in 60 seconds
2019-03-15 20:22:14,544 INFO Master: Removing worker 
worker-20190218183101-vmax18-33129 on vmax18:33129
2019-03-15 20:22:14,864 WARN Master: Got heartbeat from unregistered worker 
worker-20190218183101-vmax18-33129. Asking it to re-register.
2019-03-15 20:22:14,975 ERROR Master: Leadership has been revoked -- master 
shutting down.
{code}
 

spark-mr-master-vmax17.log:
{code:java}
2019-03-15 20:22:14,870 INFO Master: Registering worker vmax18:33129 with 21 
cores, 125.0 GB RAM
2019-03-15 20:22:15,261 INFO Master: vmax18:33129 got disassociated, removing 
it.
2019-03-15 20:22:15,263 INFO Master: Removing worker 
worker-20190218183101-vmax18-33129 on vmax18:33129
2019-03-15 20:22:15,311 ERROR Inbox: Ignoring error
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode 
for /spark/master_status/worker_worker-20190218183101-vmax18-33129
{code}
 

spark-mr-worker-vmax18.log:
{code:java}
2019-03-15 20:22:10,474 INFO Worker: Master has changed, new master is at 
spark://vmax17:7077
2019-03-15 20:22:14,862 INFO Worker: Master with url spark://vmax18:7077 
requested this worker to reconnect.
2019-03-15 20:22:14,865 INFO Worker: Not spawning another attempt to register 
with the master, since there is an attempt scheduled already.
2019-03-15 20:22:14,879 ERROR Worker: Worker registration failed: Duplicate 
worker ID
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,896 INFO ShutdownHookManager: Shutdown hook called{code}
 

PS, this results in another issue: the leader stays in the COMPLETING_RECOVERY
state forever.

worker-vmax18 shut down because of the duplicate worker ID and cleared the
worker's node in the persistence engine (we use ZooKeeper). Then the new leader
(master-vmax17) found the worker dead and tried to remove it, including
clearing its node on ZooKeeper, but the node had already been removed during
the worker-vmax18 shutdown, so {color:#ff}*an exception was thrown in function
completeRecovery(), and the leader stays in the COMPLETING_RECOVERY state
forever.*{color}

 


was (Author: zuo.tingbing9):
See these detail logs, master changed from vmax18 to vmax17.

In master vmax18,  worker be removed because got no heartbeat but soon got 
heartbeat and asking to re-register with master vmax18.

In the same time, worker has bean registered with master vmax17 when master 
vmax17 got leadership.

So Worker registration failed: Duplicate worker ID.

 

spark-mr-master-vmax18.log:
**
{code:java}
2019-03-15 20:22:09,441 INFO ZooKeeperLeaderElectionAgent: We have lost 
leadership
2019-03-15 20:22:14,544 WARN Master: Removing 
worker-20190218183101-vmax18-33129 because we got no heartbeat in 60 seconds
2019-03-15 20:22:14,544 INFO Master: Removing worker 
worker-20190218183101-vmax18-33129 on vmax18:33129
2019-03-15 20:22:14,864 WARN Master: Got heartbeat from unregistered worker 
worker-20190218183101-vmax18-33129. Asking it to re-register.
2019-03-15 20:22:14,975 ERROR Master: Leadership has been revoked -- master 
shutting down.
{code}
 

spark-mr-master-vmax17.log:

 
{code:java}
2019-03-15 20:22:14,870 INFO Master: Registering worker vmax18:33129 with 21 
cores, 125.0 GB RAM
2019-03-15 20:22:15,261 INFO Master: vmax18:33129 got disassociated, removing 
it.
2019-03-15 20:22:15,263 INFO Master: Removing worker 
worker-20190218183101-vmax18-33129 on vmax18:33129
2019-03-15 20:22:15,311 ERROR Inbox: Ignoring error
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode 
for /spark/master_status/worker_worker-20190218183101-vmax18-33129
{code}
 

 

spark-mr-worker-vmax18.log:

 
{code:java}
2019-03-15 20:22:10,474 INFO Worker: Master has changed, new master is at 
spark://vmax17:7077
2019-03-15 20:22:14,862 INFO Worker: Master with url spark://vmax18:7077 
requested this worker to reconnect.
2019-03-15 20:22:14,865 INFO Worker: Not spawning another attempt to register 
with the master, since there is an attempt scheduled already.
2019-03-15 20:22:14,879 ERROR Worker: Worker registration failed: Duplicate 
worker ID
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 

[jira] [Updated] (SPARK-27654) spark unable to read parquet file- corrupt footer

2019-05-07 Thread Gautham Rajendiran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautham Rajendiran updated SPARK-27654:
---
Description: 
Reading a large Parquet file produces a corrupt footer error.

 
{code:java}
--- 
Py4JJavaError Traceback (most recent call last)  in 
() > 1 df = spark.read.parquet("data1") 2 df.head(1) 
/databricks/spark/python/pyspark/sql/readwriter.py in parquet(self, *paths) 314 
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')] 315 """ 
--> 316 return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths))) 
317 318 @ignore_unicode_prefix 
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in 
__call__(self, *args) 1255 answer = self.gateway_client.send_command(command) 
1256 return_value = get_return_value( -> 1257 answer, self.gateway_client, 
self.target_id, self.name) 1258 1259 for temp_arg in temp_args: 
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 61 def deco(*a, 
**kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError 
as e: 65 s = e.java_exception.toString() 
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name) 326 raise 
Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". --> 328 
format(target_id, ".", name), value) 329 else: 330 raise Py4JError( 
Py4JJavaError: An error occurred while calling o1045.parquet. : 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 
1458, 10.139.64.5, executor 0): org.apache.spark.SparkException: Exception 
thrown in awaitResult: at 
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) at 
org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:422) at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readParquetFootersInParallel(ParquetFileFormat.scala:602)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:675)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:667)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:304) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
org.apache.spark.scheduler.Task.run(Task.scala:112) at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Could 
not read footer for file: FileStatus{path=dbfs:/mnt/valassis/data1; 
isDirectory=false; length=66061642673; replication=0; blocksize=0; 
modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; 
isSymlink=false} at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:615)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:602)
 at 
org.apache.spark.util.ThreadUtils$$anonfun$5$$anonfun$apply$2$$anonfun$apply$3.apply(ThreadUtils.scala:419)
 at 
org.apache.spark.util.threads.SparkThreadLocalCapturingHelper$class.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:52)
 at 
org.apache.spark.util.threads.CapturedSparkThreadLocals.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:71)
 at 
org.apache.spark.util.ThreadUtils$$anonfun$5$$anonfun$apply$2.apply(ThreadUtils.scala:419)
 at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
 at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) 
at 
scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at 

[jira] [Commented] (SPARK-23191) Workers registration failes in case of network drop

2019-05-07 Thread zuotingbing (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835256#comment-16835256
 ] 

zuotingbing commented on SPARK-23191:
-

See the detailed logs below; the master changed from vmax18 to vmax17.

On master vmax18, the worker was removed because no heartbeat arrived for 60
seconds, but a heartbeat came soon afterwards and the worker was asked to
re-register with master vmax18.

At the same time, the worker had already been registered with master vmax17
when master vmax17 got leadership.

So the worker registration failed: Duplicate worker ID.

 

spark-mr-master-vmax18.log:
{code:java}
2019-03-15 20:22:09,441 INFO ZooKeeperLeaderElectionAgent: We have lost 
leadership
2019-03-15 20:22:14,544 WARN Master: Removing 
worker-20190218183101-vmax18-33129 because we got no heartbeat in 60 seconds
2019-03-15 20:22:14,544 INFO Master: Removing worker 
worker-20190218183101-vmax18-33129 on vmax18:33129
2019-03-15 20:22:14,864 WARN Master: Got heartbeat from unregistered worker 
worker-20190218183101-vmax18-33129. Asking it to re-register.
2019-03-15 20:22:14,975 ERROR Master: Leadership has been revoked -- master 
shutting down.
{code}
 

spark-mr-master-vmax17.log:

 
{code:java}
2019-03-15 20:22:14,870 INFO Master: Registering worker vmax18:33129 with 21 
cores, 125.0 GB RAM
2019-03-15 20:22:15,261 INFO Master: vmax18:33129 got disassociated, removing 
it.
2019-03-15 20:22:15,263 INFO Master: Removing worker 
worker-20190218183101-vmax18-33129 on vmax18:33129
2019-03-15 20:22:15,311 ERROR Inbox: Ignoring error
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode 
for /spark/master_status/worker_worker-20190218183101-vmax18-33129
{code}
 

 

spark-mr-worker-vmax18.log:

 
{code:java}
2019-03-15 20:22:10,474 INFO Worker: Master has changed, new master is at 
spark://vmax17:7077
2019-03-15 20:22:14,862 INFO Worker: Master with url spark://vmax18:7077 
requested this worker to reconnect.
2019-03-15 20:22:14,865 INFO Worker: Not spawning another attempt to register 
with the master, since there is an attempt scheduled already.
2019-03-15 20:22:14,879 ERROR Worker: Worker registration failed: Duplicate 
worker ID
2019-03-15 20:22:14,895 INFO ExecutorRunner: Killing process!
2019-03-15 20:22:14,896 INFO ShutdownHookManager: Shutdown hook called{code}
 

 

PS, this causes another issue: the leader gets stuck in the COMPLETING_RECOVERY 
state.

worker-vmax18 shuts down because of the duplicate worker ID and clears the worker's 
node in the persistence engine (we use ZooKeeper). The new leader (master-vmax17) 
then finds that the worker died, tries to remove it, and tries to clear the node in 
ZooKeeper, but that node was already removed during the worker-vmax18 shutdown, so 
{color:#FF}*an exception is thrown in completeRecovery(), and the leader stays in 
the COMPLETING_RECOVERY state forever.*{color}
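
As an illustration of the cleanup problem above (a sketch only, not Spark's actual 
recovery code; it uses the kazoo ZooKeeper client with a placeholder connection 
string, and the znode path is taken from the log): removing a dead worker's node 
during recovery should treat "node already gone" as success instead of letting the 
exception stall completeRecovery().

{code:python}
# Sketch only: make the znode cleanup idempotent so a NoNode error from a node
# the worker already deleted cannot leave the new leader stuck in recovery.
from kazoo.client import KazooClient
from kazoo.exceptions import NoNodeError

zk = KazooClient(hosts="127.0.0.1:2181")  # placeholder ZooKeeper address
zk.start()

def remove_worker_node(worker_id):
    path = "/spark/master_status/worker_" + worker_id
    try:
        zk.delete(path)
    except NoNodeError:
        # The worker already removed its own node while shutting down
        # (duplicate worker ID), so there is nothing left to clean up.
        pass

remove_worker_node("worker-20190218183101-vmax18-33129")
zk.stop()
{code}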

 

> Workers registration failes in case of network drop
> ---
>
> Key: SPARK-23191
> URL: https://issues.apache.org/jira/browse/SPARK-23191
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.1, 2.3.0
> Environment: OS:- Centos 6.9(64 bit)
>  
>Reporter: Neeraj Gupta
>Priority: Critical
>
> We have a 3 node cluster. We were facing issues of multiple drivers running in 
> some scenarios in production.
> On further investigation we were able to reproduce the scenario in both 1.6.3 
> and 2.2.1 versions with the following steps:
>  # Setup a 3 node cluster. Start master and slaves.
>  # On any node where the worker process is running block the connections on 
> port 7077 using iptables.
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # After about 10-15 secs we get an error on the node that it is unable to 
> connect to the master.
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  
> org.apache.spark.network.server.TransportChannelHandler - Exception in 
> connection from 
> java.io.IOException: Connection timed out
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
>     at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
>     at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
>     at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>     at 
> 

[jira] [Updated] (SPARK-27654) spark unable to read parquet file- corrupt footer

2019-05-07 Thread Gautham Rajendiran (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gautham Rajendiran updated SPARK-27654:
---
Affects Version/s: (was: 2.4.3)
   2.4.0

> spark unable to read parquet file- corrupt footer
> -
>
> Key: SPARK-27654
> URL: https://issues.apache.org/jira/browse/SPARK-27654
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Gautham Rajendiran
>Priority: Blocker
>
> Reading a large parquet file produces a corrupt footer error.
>  
> {code:java}
> --- 
> Py4JJavaError Traceback (most recent call last)  in 
> () > 1 df = spark.read.parquet("/mnt/valassis/data1") 2 
> df.head(1) /databricks/spark/python/pyspark/sql/readwriter.py in 
> parquet(self, *paths) 314 [('name', 'string'), ('year', 'int'), ('month', 
> 'int'), ('day', 'int')] 315 """ --> 316 return 
> self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths))) 317 318 
> @ignore_unicode_prefix 
> /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in 
> __call__(self, *args) 1255 answer = self.gateway_client.send_command(command) 
> 1256 return_value = get_return_value( -> 1257 answer, self.gateway_client, 
> self.target_id, self.name) 1258 1259 for temp_arg in temp_args: 
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 61 def 
> deco(*a, **kw): 62 try: ---> 63 return f(*a, **kw) 64 except 
> py4j.protocol.Py4JJavaError as e: 65 s = e.java_exception.toString() 
> /databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
> get_return_value(answer, gateway_client, target_id, name) 326 raise 
> Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". --> 328 
> format(target_id, ".", name), value) 329 else: 330 raise Py4JError( 
> Py4JJavaError: An error occurred while calling o1045.parquet. : 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 
> (TID 1458, 10.139.64.5, executor 0): org.apache.spark.SparkException: 
> Exception thrown in awaitResult: at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) at 
> org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:422) at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readParquetFootersInParallel(ParquetFileFormat.scala:602)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:675)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:667)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:304) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
> org.apache.spark.scheduler.Task.run(Task.scala:112) at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: 
> Could not read footer for file: FileStatus{path=dbfs:/mnt/valassis/data1; 
> isDirectory=false; length=66061642673; replication=0; blocksize=0; 
> modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; 
> isSymlink=false} at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:615)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:602)
>  at 
> org.apache.spark.util.ThreadUtils$$anonfun$5$$anonfun$apply$2$$anonfun$apply$3.apply(ThreadUtils.scala:419)
>  at 
> org.apache.spark.util.threads.SparkThreadLocalCapturingHelper$class.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:52)
>  at 
> org.apache.spark.util.threads.CapturedSparkThreadLocals.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:71)
>  at 
> 

[jira] [Created] (SPARK-27654) spark unable to read parquet file- corrupt footer

2019-05-07 Thread Gautham Rajendiran (JIRA)
Gautham Rajendiran created SPARK-27654:
--

 Summary: spark unable to read parquet file- corrupt footer
 Key: SPARK-27654
 URL: https://issues.apache.org/jira/browse/SPARK-27654
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Gautham Rajendiran


Reading a large parquet file produces a corrupt footer error.
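
(A sketch, not part of the original report: one quick way to narrow this down is to 
retry the read with corrupt-file skipping enabled and compare what comes back. The 
mount path is the reporter's, and {{spark.sql.files.ignoreCorruptFiles}} is a 
standard Spark SQL setting.)

{code:python}
# Sketch only: with ignoreCorruptFiles enabled, files whose footers cannot be
# read are skipped instead of failing the whole job, which helps tell a single
# bad file apart from a broader problem with the path.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
df = spark.read.parquet("/mnt/valassis/data1")
print(df.count())
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "false")
{code}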

 
{code:java}
--- 
Py4JJavaError Traceback (most recent call last)  in 
() > 1 df = spark.read.parquet("/mnt/valassis/data1") 2 df.head(1) 
/databricks/spark/python/pyspark/sql/readwriter.py in parquet(self, *paths) 314 
[('name', 'string'), ('year', 'int'), ('month', 'int'), ('day', 'int')] 315 """ 
--> 316 return self._df(self._jreader.parquet(_to_seq(self._spark._sc, paths))) 
317 318 @ignore_unicode_prefix 
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in 
__call__(self, *args) 1255 answer = self.gateway_client.send_command(command) 
1256 return_value = get_return_value( -> 1257 answer, self.gateway_client, 
self.target_id, self.name) 1258 1259 for temp_arg in temp_args: 
/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 61 def deco(*a, 
**kw): 62 try: ---> 63 return f(*a, **kw) 64 except py4j.protocol.Py4JJavaError 
as e: 65 s = e.java_exception.toString() 
/databricks/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name) 326 raise 
Py4JJavaError( 327 "An error occurred while calling {0}{1}{2}.\n". --> 328 
format(target_id, ".", name), value) 329 else: 330 raise Py4JError( 
Py4JJavaError: An error occurred while calling o1045.parquet. : 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 
1458, 10.139.64.5, executor 0): org.apache.spark.SparkException: Exception 
thrown in awaitResult: at 
org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:355) at 
org.apache.spark.util.ThreadUtils$.parmap(ThreadUtils.scala:422) at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readParquetFootersInParallel(ParquetFileFormat.scala:602)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:675)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$9.apply(ParquetFileFormat.scala:667)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:817)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:304) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.doRunTask(Task.scala:139) at 
org.apache.spark.scheduler.Task.run(Task.scala:112) at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748) Caused by: java.io.IOException: Could 
not read footer for file: FileStatus{path=dbfs:/mnt/valassis/data1; 
isDirectory=false; length=66061642673; replication=0; blocksize=0; 
modification_time=0; access_time=0; owner=; group=; permission=rw-rw-rw-; 
isSymlink=false} at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:615)
 at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:602)
 at 
org.apache.spark.util.ThreadUtils$$anonfun$5$$anonfun$apply$2$$anonfun$apply$3.apply(ThreadUtils.scala:419)
 at 
org.apache.spark.util.threads.SparkThreadLocalCapturingHelper$class.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:52)
 at 
org.apache.spark.util.threads.CapturedSparkThreadLocals.runWithCaptured(SparkThreadLocalForwardingThreadPoolExecutor.scala:71)
 at 
org.apache.spark.util.ThreadUtils$$anonfun$5$$anonfun$apply$2.apply(ThreadUtils.scala:419)
 at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
 at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) 
at 
scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
 at 

[jira] [Updated] (SPARK-27347) Fix supervised driver retry logic when agent crashes/restarts

2019-05-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27347:
--
Labels:   (was: mesos)

> Fix supervised driver retry logic when agent crashes/restarts
> -
>
> Key: SPARK-27347
> URL: https://issues.apache.org/jira/browse/SPARK-27347
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.1, 2.3.2, 2.4.0
>Reporter: Sam Tran
>Priority: Major
>
> Ran into scenarios where {{--supervised}} Spark jobs were retried multiple 
> times when an agent would crash, come back, and re-register even when those 
> jobs had already relaunched on a different agent.
> That is:
>  * supervised driver is running on agent1
>  * agent1 crashes
>  * driver is relaunched on another agent as `-retry-1`
>  * agent1 comes back online and re-registers with scheduler
>  * spark relaunches the same job as `-retry-2`
>  * now there are two jobs running simultaneously and the first retry job is 
> effectively orphaned within Zookeeper
> This is because when an agent comes back and re-registers, it sends a status 
> update {{TASK_FAILED}} for its old driver-task. The previous logic would 
> indiscriminately remove the {{submissionId}} from Zookeeper's {{launchedDrivers}} 
> node and add it to the {{retryList}} node.
> Then, when a new offer came in, it would relaunch another -retry task even 
> though one was already running.
> Sample log looks something like this: 
> {code:java}
> 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
> ... [offers] ...
> 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver 
> driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
> ...
> 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
> 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed 
> out' reason=REASON_SLAVE_REMOVED
> ...
> 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-1"
> ...
> 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
> 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable 
> agent re-reregistered'
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal 
> executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
> 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with 
> driver-20190115192138-0001 in status update
> ...
> 19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-2"
> ...
> 19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
> 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''{code}
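
A minimal sketch of one plausible shape of the fix (Python for illustration only; 
the real scheduler is Scala, and the dictionaries below are hypothetical stand-ins 
for the launchedDrivers/retryList state): a terminal status update should only 
trigger a retry when its task id matches the task currently tracked for that 
submission, so a stale TASK_FAILED from a re-registered agent cannot orphan an 
already-running retry.

{code:python}
# Sketch only: ignore terminal updates for task ids we are no longer tracking.
launched_drivers = {
    "driver-20190115192138-0001": "driver-20190115192138-0001-retry-1",
}
retry_list = []

def on_status_update(submission_id, task_id, state):
    current_task = launched_drivers.get(submission_id)
    if state != "TASK_FAILED":
        return
    if current_task != task_id:
        # Stale update for an old incarnation of the driver (e.g. the task that
        # was lost when the agent crashed); do not schedule another retry.
        return
    launched_drivers.pop(submission_id, None)
    retry_list.append(submission_id)

# The re-registered agent reports the original (pre-crash) task as failed:
on_status_update("driver-20190115192138-0001",
                 "driver-20190115192138-0001", "TASK_FAILED")
print(retry_list)  # [] -- the running retry-1 task is left alone
{code}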



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27347) Fix supervised driver retry logic when agent crashes/restarts

2019-05-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27347:
--
Summary: Fix supervised driver retry logic when agent crashes/restarts  
(was: [MESOS] Fix supervised driver retry logic when agent crashes/restarts)

> Fix supervised driver retry logic when agent crashes/restarts
> -
>
> Key: SPARK-27347
> URL: https://issues.apache.org/jira/browse/SPARK-27347
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.1, 2.3.2, 2.4.0
>Reporter: Sam Tran
>Priority: Major
>  Labels: mesos
>
> Ran into scenarios where {{--supervised}} Spark jobs were retried multiple 
> times when an agent would crash, come back, and re-register even when those 
> jobs had already relaunched on a different agent.
> That is:
>  * supervised driver is running on agent1
>  * agent1 crashes
>  * driver is relaunched on another agent as `-retry-1`
>  * agent1 comes back online and re-registers with scheduler
>  * spark relaunches the same job as `-retry-2`
>  * now there are two jobs running simultaneously and the first retry job is 
> effectively orphaned within Zookeeper
> This is because when an agent comes back and re-registers, it sends a status 
> update {{TASK_FAILED}} for its old driver-task. The previous logic would 
> indiscriminately remove the {{submissionId}} from Zookeeper's {{launchedDrivers}} 
> node and add it to the {{retryList}} node.
> Then, when a new offer came in, it would relaunch another -retry task even 
> though one was already running.
> Sample log looks something like this: 
> {code:java}
> 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
> ... [offers] ...
> 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver 
> driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
> ...
> 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
> 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed 
> out' reason=REASON_SLAVE_REMOVED
> ...
> 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-1"
> ...
> 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
> 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable 
> agent re-reregistered'
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal 
> executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
> 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with 
> driver-20190115192138-0001 in status update
> ...
> 19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-2"
> ...
> 19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
> 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27653) Add max_by() / min_by() SQL aggregate functions

2019-05-07 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-27653:
--

 Summary: Add max_by() / min_by() SQL aggregate functions
 Key: SPARK-27653
 URL: https://issues.apache.org/jira/browse/SPARK-27653
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen


It would be useful if Spark SQL supported the {{max_by()}} SQL aggregate 
function. Quoting from the [Presto 
docs|https://prestodb.github.io/docs/current/functions/aggregate.html#max_by]:
{quote}max_by(x, y) → [same as x]
 Returns the value of x associated with the maximum value of y over all input 
values.
{quote}
{{min_by}} works similarly.

Technically I can emulate this behavior using window functions but the 
resulting syntax is much more verbose and non-intuitive compared to {{max_by}} 
/ {{min_by}}.
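
For reference, a sketch of the window-function workaround mentioned above (PySpark, 
with made-up column names; this is not the proposed {{max_by}} implementation):

{code:python}
# Sketch only: emulate max_by(value, score) per group with a window function.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1, 10.0), ("b", 1, 30.0), ("c", 2, 20.0)],
    ["value", "grp", "score"])

w = Window.partitionBy("grp").orderBy(F.col("score").desc())
max_by_emulated = (df
    .withColumn("rn", F.row_number().over(w))
    .where(F.col("rn") == 1)
    .select("grp", "value"))  # the value with the maximum score per group
max_by_emulated.show()
{code}

With a native {{max_by}}, the same result would be a single aggregate call (e.g. 
{{SELECT grp, max_by(value, score) FROM t GROUP BY grp}}), which is the point of 
the proposal.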



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27347) Fix supervised driver retry logic when agent crashes/restarts

2019-05-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27347:
--
Shepherd:   (was: Timothy Chen)

> Fix supervised driver retry logic when agent crashes/restarts
> -
>
> Key: SPARK-27347
> URL: https://issues.apache.org/jira/browse/SPARK-27347
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.2.1, 2.3.2, 2.4.0
>Reporter: Sam Tran
>Priority: Major
>  Labels: mesos
>
> Ran into scenarios where {{--supervised}} Spark jobs were retried multiple 
> times when an agent would crash, come back, and re-register even when those 
> jobs had already relaunched on a different agent.
> That is:
>  * supervised driver is running on agent1
>  * agent1 crashes
>  * driver is relaunched on another agent as `-retry-1`
>  * agent1 comes back online and re-registers with scheduler
>  * spark relaunches the same job as `-retry-2`
>  * now there are two jobs running simultaneously and the first retry job is 
> effectively orphaned within Zookeeper
> This is because when an agent comes back and re-registers, it sends a status 
> update {{TASK_FAILED}} for its old driver-task. The previous logic would 
> indiscriminately remove the {{submissionId}} from Zookeeper's {{launchedDrivers}} 
> node and add it to the {{retryList}} node.
> Then, when a new offer came in, it would relaunch another -retry task even 
> though one was already running.
> Sample log looks something like this: 
> {code:java}
> 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: 
> ... [offers] ...
> 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver 
> driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
> ...
> 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
> 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed 
> out' reason=REASON_SLAVE_REMOVED
> ...
> 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-1"
> ...
> 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
> 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable 
> agent re-reregistered'
> ...
> 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal 
> executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
> 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with 
> driver-20190115192138-0001 in status update
> ...
> 19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 
> 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver 
> driver-20190115192138-0001 with taskId: value: 
> "driver-20190115192138-0001-retry-2"
> ...
> 19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
> 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: 
> taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27548) PySpark toLocalIterator does not raise errors from worker

2019-05-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-27548.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24070
[https://github.com/apache/spark/pull/24070]

> PySpark toLocalIterator does not raise errors from worker
> -
>
> Key: SPARK-27548
> URL: https://issues.apache.org/jira/browse/SPARK-27548
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.1
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 3.0.0
>
>
> When using a PySpark RDD local iterator and an error occurs on the worker, it 
> is not picked up by Py4J and raised in the Python driver process. So unless 
> looking at logs, there is no way for the application to know the worker had 
> an error. This is a test that should pass if the error is raised in the 
> driver:
> {code}
> def test_to_local_iterator_error(self):
> def fail(_):
> raise RuntimeError("local iterator error")
> rdd = self.sc.parallelize(range(10)).map(fail)
> with self.assertRaisesRegexp(Exception, "local iterator error"):
> for _ in rdd.toLocalIterator():
> pass{code}
> but it does not raise an exception:
> {noformat}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 428, in main
>     process()
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 423, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 505, in dump_stream
>     vs = list(itertools.islice(iterator, batch))
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/util.py", line 
> 99, in wrapper
>     return f(*args, **kwargs)
>   File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 742, in 
> fail
>     raise RuntimeError("local iterator error")
> RuntimeError: local iterator error
>     at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453)
> ...
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> FAIL
> ==
> FAIL: test_to_local_iterator_error (pyspark.tests.test_rdd.RDDTests)
> --
> Traceback (most recent call last):
>   File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 748, in 
> test_to_local_iterator_error
>     pass
> AssertionError: Exception not raised{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23961) pyspark toLocalIterator throws an exception

2019-05-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-23961:


Assignee: Bryan Cutler

> pyspark toLocalIterator throws an exception
> ---
>
> Key: SPARK-23961
> URL: https://issues.apache.org/jira/browse/SPARK-23961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Michel Lemay
>Assignee: Bryan Cutler
>Priority: Minor
>  Labels: DataFrame, pyspark
>
> Given a dataframe, use toLocalIterator. If we do not consume all records, 
> it will throw: 
> {quote}ERROR PythonRDD: Error while sending iterator
>  java.net.SocketException: Connection reset by peer: socket write error
>  at java.net.SocketOutputStream.socketWrite0(Native Method)
>  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>  at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>  at java.io.DataOutputStream.write(DataOutputStream.java:107)
>  at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>  at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
>  at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
> {quote}
>  
> To reproduce, here is a simple pyspark shell script that shows the error:
> {quote}import itertools
>  df = spark.read.parquet("large parquet folder").cache()
> print(df.count())
>  b = df.toLocalIterator()
>  print(len(list(itertools.islice(b, 20))))
>  b = None # Make the iterator go out of scope.  Throws here.
> {quote}
>  
> Observations:
>  * Consuming all records does not throw.  Taking only a subset of the 
> partitions creates the error.
>  * In another experiment, doing the same on a regular RDD works if we 
> cache/materialize it. If we do not cache the RDD, it throws similarly.
>  * It works in the scala shell
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27548) PySpark toLocalIterator does not raise errors from worker

2019-05-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-27548:


Assignee: Bryan Cutler

> PySpark toLocalIterator does not raise errors from worker
> -
>
> Key: SPARK-27548
> URL: https://issues.apache.org/jira/browse/SPARK-27548
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.1
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Major
>
> When using a PySpark RDD local iterator and an error occurs on the worker, it 
> is not picked up by Py4J and raised in the Python driver process. So unless 
> looking at logs, there is no way for the application to know the worker had 
> an error. This is a test that should pass if the error is raised in the 
> driver:
> {code}
> def test_to_local_iterator_error(self):
> def fail(_):
> raise RuntimeError("local iterator error")
> rdd = self.sc.parallelize(range(10)).map(fail)
> with self.assertRaisesRegexp(Exception, "local iterator error"):
> for _ in rdd.toLocalIterator():
> pass{code}
> but it does not raise an exception:
> {noformat}
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 428, in main
>     process()
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/worker.py", line 
> 423, in process
>     serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/serializers.py", 
> line 505, in dump_stream
>     vs = list(itertools.islice(iterator, batch))
>   File "/home/bryan/git/spark/python/lib/pyspark.zip/pyspark/util.py", line 
> 99, in wrapper
>     return f(*args, **kwargs)
>   File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 742, in 
> fail
>     raise RuntimeError("local iterator error")
> RuntimeError: local iterator error
>     at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:453)
> ...
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> FAIL
> ==
> FAIL: test_to_local_iterator_error (pyspark.tests.test_rdd.RDDTests)
> --
> Traceback (most recent call last):
>   File "/home/bryan/git/spark/python/pyspark/tests/test_rdd.py", line 748, in 
> test_to_local_iterator_error
>     pass
> AssertionError: Exception not raised{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23961) pyspark toLocalIterator throws an exception

2019-05-07 Thread Bryan Cutler (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-23961.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24070
[https://github.com/apache/spark/pull/24070]

> pyspark toLocalIterator throws an exception
> ---
>
> Key: SPARK-23961
> URL: https://issues.apache.org/jira/browse/SPARK-23961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Michel Lemay
>Assignee: Bryan Cutler
>Priority: Minor
>  Labels: DataFrame, pyspark
> Fix For: 3.0.0
>
>
> Given a dataframe, use toLocalIterator. If we do not consume all records, 
> it will throw: 
> {quote}ERROR PythonRDD: Error while sending iterator
>  java.net.SocketException: Connection reset by peer: socket write error
>  at java.net.SocketOutputStream.socketWrite0(Native Method)
>  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>  at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>  at java.io.DataOutputStream.write(DataOutputStream.java:107)
>  at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>  at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
>  at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
> {quote}
>  
> To reproduce, here is a simple pyspark shell script that shows the error:
> {quote}import itertools
>  df = spark.read.parquet("large parquet folder").cache()
> print(df.count())
>  b = df.toLocalIterator()
>  print(len(list(itertools.islice(b, 20))))
>  b = None # Make the iterator go out of scope.  Throws here.
> {quote}
>  
> Observations:
>  * Consuming all records does not throw.  Taking only a subset of the 
> partitions creates the error.
>  * In another experiment, doing the same on a regular RDD works if we 
> cache/materialize it. If we do not cache the RDD, it throws similarly.
>  * It works in the scala shell
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-05-07 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835074#comment-16835074
 ] 

shane knapp commented on SPARK-26944:
-

hmm, actually...  i don't think i will need to do anything.  the code is a 
little confusing, but this is actually behaving as intended.  :)

if all py* unit tests pass, nothing is written to python/unit-tests.log

if a test fails, then that output is written to python/unit-tests.log and 
testing stops.

the *actual* output from the python unit tests (failing or successful) is 
written to target/unit-tests.log along with EVERY OTHER BIT OF UNIT TEST OUTPUT 
FROM THE ENTIRE BUILD (emphasis mine).   

this isn't actually that useful, and the target/unit-tests.log file is 
gigabytes in size.  however, breaking the python-only output out of here will 
most likely be difficult as this output is logged through scala (ie:  
core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala).

does this make sense?
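
(A sketch of the behaviour described above, not the actual python/run-tests.py 
code; the file name follows the comment.)

{code:python}
# Sketch only: copy a test's output into python/unit-tests.log only when the
# test fails, so a fully green run leaves that log empty.
import subprocess

FAILURE_LOG = "python/unit-tests.log"

def run_one_pyspark_test(cmd):
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        with open(FAILURE_LOG, "a") as log:
            log.write(proc.stdout)
            log.write(proc.stderr)
    return proc.returncode
{code}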

> Python unit-tests.log not available in artifacts for a build in Jenkins
> ---
>
> Key: SPARK-26944
> URL: https://issues.apache.org/jira/browse/SPARK-26944
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Assignee: shane knapp
>Priority: Minor
> Attachments: Screen Shot 2019-03-05 at 12.08.43 PM.png
>
>
> I had a pr where the python unit tests failed.  The tests point at the 
> `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
> but I can't get to that from jenkins UI it seems (are all prs writing to the 
> same file?).
> {code:java}
> 
> Running PySpark tests
> 
> Running PySpark tests. Output is in 
> /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code}
> For reference, please see this build: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console
> This Jira is to make it available under the artifacts for each build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-05-07 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835031#comment-16835031
 ] 

shane knapp edited comment on SPARK-26944 at 5/7/19 7:33 PM:
-

going through my queue, i found this and realized i hadn't verified that it 
worked.  and it doesn't.

because there's a bug in python/run-tests.py that skips writing unit test logs 
if things pass.

PR incoming.  this will need to be backported to 2.3 and 2.4.


was (Author: shaneknapp):
going through my queue, i found this and realized i hadn't verified that it 
worked.  and it doesn't.

because there's a bug in python/run-tests.py that overrides the output to a 
different location *after* it prints out where it plans on writing the log file 
to.

PR incoming.  this will need to be backported to 2.3 and 2.4.

> Python unit-tests.log not available in artifacts for a build in Jenkins
> ---
>
> Key: SPARK-26944
> URL: https://issues.apache.org/jira/browse/SPARK-26944
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Assignee: shane knapp
>Priority: Minor
> Attachments: Screen Shot 2019-03-05 at 12.08.43 PM.png
>
>
> I had a pr where the python unit tests failed.  The tests point at the 
> `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
> but I can't get to that from jenkins UI it seems (are all prs writing to the 
> same file?).
> {code:java}
> 
> Running PySpark tests
> 
> Running PySpark tests. Output is in 
> /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code}
> For reference, please see this build: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console
> This Jira is to make it available under the artifacts for each build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26944) Python unit-tests.log not available in artifacts for a build in Jenkins

2019-05-07 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835031#comment-16835031
 ] 

shane knapp commented on SPARK-26944:
-

going through my queue, i found this and realized i hadn't verified that it 
worked.  and it doesn't.

because there's a bug in python/run-tests.py that overrides the output to a 
different location *after* it prints out where it plans on writing the log file 
to.

PR incoming.  this will need to be backported to 2.3 and 2.4.

> Python unit-tests.log not available in artifacts for a build in Jenkins
> ---
>
> Key: SPARK-26944
> URL: https://issues.apache.org/jira/browse/SPARK-26944
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Alessandro Bellina
>Assignee: shane knapp
>Priority: Minor
> Attachments: Screen Shot 2019-03-05 at 12.08.43 PM.png
>
>
> I had a pr where the python unit tests failed.  The tests point at the 
> `/home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log` file, 
> but I can't get to that from jenkins UI it seems (are all prs writing to the 
> same file?).
> {code:java}
> 
> Running PySpark tests
> 
> Running PySpark tests. Output is in 
> /home/jenkins/workspace/SparkPullRequestBuilder/python/unit-tests.log{code}
> For reference, please see this build: 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/102518/console
> This Jira is to make it available under the artifacts for each build.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27294) Multi-cluster Kafka delegation token support

2019-05-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-27294:
--

Assignee: Gabor Somogyi

> Multi-cluster Kafka delegation token support
> 
>
> Key: SPARK-27294
> URL: https://issues.apache.org/jira/browse/SPARK-27294
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>
> Kafka delegation token only supports a single cluster at the moment.
> I've created a small document with the proposed Spark approach 
> [here|https://docs.google.com/document/d/1yuwIxKqUnzo5RxJDIqrWWC2s67hh5Tb1QtfIwEKVWtM/edit?usp=sharing].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27294) Multi-cluster Kafka delegation token support

2019-05-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27294.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24305
[https://github.com/apache/spark/pull/24305]

> Multi-cluster Kafka delegation token support
> 
>
> Key: SPARK-27294
> URL: https://issues.apache.org/jira/browse/SPARK-27294
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> Kafka delegation token only supports a single cluster at the moment.
> I've created a small document with the proposed Spark approach 
> [here|https://docs.google.com/document/d/1yuwIxKqUnzo5RxJDIqrWWC2s67hh5Tb1QtfIwEKVWtM/edit?usp=sharing].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27396) SPIP: Public APIs for extended Columnar Processing Support

2019-05-07 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835014#comment-16835014
 ] 

Thomas Graves commented on SPARK-27396:
---

I just started a vote thread on this SPIP please take a look and vote.

> SPIP: Public APIs for extended Columnar Processing Support
> --
>
> Key: SPARK-27396
> URL: https://issues.apache.org/jira/browse/SPARK-27396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> *SPIP: Columnar Processing Without Arrow Formatting Guarantees.*
>  
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> The Dataset/DataFrame API in Spark currently only exposes to users one row at 
> a time when processing data.  The goals of this are to
>  # Add to the current sql extensions mechanism so advanced users can have 
> access to the physical SparkPlan and manipulate it to provide columnar 
> processing for existing operators, including shuffle.  This will allow them 
> to implement their own cost based optimizers to decide when processing should 
> be columnar and when it should not.
>  # Make any transitions between the columnar memory layout and a row based 
> layout transparent to the users so operations that are not columnar see the 
> data as rows, and operations that are columnar see the data as columns.
>  
> Not Requirements, but things that would be nice to have.
>  # Transition the existing in memory columnar layouts to be compatible with 
> Apache Arrow.  This would make the transformations to Apache Arrow format a 
> no-op. The existing formats are already very close to those layouts in many 
> cases.  This would not be using the Apache Arrow java library, but instead 
> being compatible with the memory 
> [layout|https://arrow.apache.org/docs/format/Layout.html] and possibly only a 
> subset of that layout.
>  
> *Q2.* What problem is this proposal NOT designed to solve? 
> The goal of this is not for ML/AI but to provide APIs for accelerated 
> computing in Spark primarily targeting SQL/ETL like workloads.  ML/AI already 
> have several mechanisms to get data into/out of them. These can be improved 
> but will be covered in a separate SPIP.
> This is not trying to implement any of the processing itself in a columnar 
> way, with the exception of examples for documentation.
> This does not cover exposing the underlying format of the data.  The only way 
> to get at the data in a ColumnVector is through the public APIs.  Exposing 
> the underlying format to improve efficiency will be covered in a separate 
> SPIP.
> This is not trying to implement new ways of transferring data to external 
> ML/AI applications.  That is covered by separate SPIPs already.
> This is not trying to add in generic code generation for columnar processing. 
>  Currently code generation for columnar processing is only supported when 
> translating columns to rows.  We will continue to support this, but will not 
> extend it as a general solution. That will be covered in a separate SPIP if 
> we find it is helpful.  For now columnar processing will be interpreted.
> This is not trying to expose a way to get columnar data into Spark through 
> DataSource V2 or any other similar API.  That would be covered by a separate 
> SPIP if we find it is needed.
>  
> *Q3.* How is it done today, and what are the limits of current practice?
> The current columnar support is limited to 3 areas.
> # Internal implementations of FileFormats can optionally return a 
> ColumnarBatch instead of rows.  The code generation phase knows how to take 
> that columnar data and iterate through it as rows for stages that want rows, 
> which currently is almost everything.  The limitations here are mostly 
> implementation specific. The current standard is to abuse Scala’s type 
> erasure to return ColumnarBatches as the elements of an RDD[InternalRow]. The 
> code generation can handle this because it is generating java code, so it 
> bypasses scala’s type checking and just casts the InternalRow to the desired 
> ColumnarBatch.  This makes it difficult for others to implement the same 
> functionality for different processing because they can only do it through 
> code generation. There really is no clean separate path in the code 
> generation for columnar vs row based. Additionally, because it is only 
> supported through code generation if for any reason code generation would 
> fail there is no backup.  This is typically fine for input formats but can be 
> problematic when we get into more extensive processing.
>  # When caching data it can optionally be cached in a columnar format if the 
> input is also columnar.  This is similar to the first area and 

[jira] [Resolved] (SPARK-27610) Yarn external shuffle service fails to start when spark.shuffle.io.mode=EPOLL

2019-05-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-27610.

   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24502
[https://github.com/apache/spark/pull/24502]

> Yarn external shuffle service fails to start when spark.shuffle.io.mode=EPOLL
> -
>
> Key: SPARK-27610
> URL: https://issues.apache.org/jira/browse/SPARK-27610
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.2
>Reporter: Adrian Muraru
>Priority: Minor
> Fix For: 3.0.0
>
>
> Enabling netty epoll mode in yarn shuffle service 
> ({{spark.shuffle.io.mode=EPOLL}}) makes the Yarn NodeManager abort.
>  Checking the stacktrace, it seems that while the io.netty package is 
> shaded, the native libraries provided by netty-all are not:
>   
> {noformat}
> Caused by: java.io.FileNotFoundException: 
> META-INF/native/liborg_spark_project_netty_transport_native_epoll_x86_64.so{noformat}
> *Full stack trace:*
> {noformat}
> 2019-04-24 23:14:46,372 ERROR [main] nodemanager.NodeManager 
> (NodeManager.java:initAndStartNodeManager(639)) - Error starting NodeManager
> java.lang.UnsatisfiedLinkError: failed to load the required native library
> at 
> org.spark_project.io.netty.channel.epoll.Epoll.ensureAvailability(Epoll.java:81)
> at 
> org.spark_project.io.netty.channel.epoll.EpollEventLoop.(EpollEventLoop.java:55)
> at 
> org.spark_project.io.netty.channel.epoll.EpollEventLoopGroup.newChild(EpollEventLoopGroup.java:134)
> at 
> org.spark_project.io.netty.channel.epoll.EpollEventLoopGroup.newChild(EpollEventLoopGroup.java:35)
> at 
> org.spark_project.io.netty.util.concurrent.MultithreadEventExecutorGroup.(MultithreadEventExecutorGroup.java:84)
> at 
> org.spark_project.io.netty.util.concurrent.MultithreadEventExecutorGroup.(MultithreadEventExecutorGroup.java:58)
> at 
> org.spark_project.io.netty.util.concurrent.MultithreadEventExecutorGroup.(MultithreadEventExecutorGroup.java:47)
> at 
> org.spark_project.io.netty.channel.MultithreadEventLoopGroup.(MultithreadEventLoopGroup.java:59)
> at 
> org.spark_project.io.netty.channel.epoll.EpollEventLoopGroup.(EpollEventLoopGroup.java:104)
> at 
> org.spark_project.io.netty.channel.epoll.EpollEventLoopGroup.(EpollEventLoopGroup.java:91)
> at 
> org.spark_project.io.netty.channel.epoll.EpollEventLoopGroup.(EpollEventLoopGroup.java:68)
> at 
> org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:52)
> at 
> org.apache.spark.network.server.TransportServer.init(TransportServer.java:95)
> at 
> org.apache.spark.network.server.TransportServer.(TransportServer.java:75)
> at 
> org.apache.spark.network.TransportContext.createServer(TransportContext.java:108)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:186)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:147)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:268)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
> Caused by: java.lang.UnsatisfiedLinkError: could not load a native library: 
> org_spark_project_netty_transport_native_epoll_x86_64
> at 
> org.spark_project.io.netty.util.internal.NativeLibraryLoader.load(NativeLibraryLoader.java:205)
> at 
> org.spark_project.io.netty.channel.epoll.Native.loadNativeLibrary(Native.java:207)
> at 
> org.spark_project.io.netty.channel.epoll.Native.(Native.java:65)
> at org.spark_project.io.netty.channel.epoll.Epoll.(Epoll.java:33)
> ... 26 more
> Suppressed: java.lang.UnsatisfiedLinkError: could not load a native 
> library: org_spark_project_netty_transport_native_epoll
> at 
> org.spark_project.io.netty.util.internal.NativeLibraryLoader.load(NativeLibraryLoader.java:205)

[jira] [Assigned] (SPARK-27610) Yarn external shuffle service fails to start when spark.shuffle.io.mode=EPOLL

2019-05-07 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-27610:
--

Assignee: Adrian Muraru

> Yarn external shuffle service fails to start when spark.shuffle.io.mode=EPOLL
> -
>
> Key: SPARK-27610
> URL: https://issues.apache.org/jira/browse/SPARK-27610
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.2
>Reporter: Adrian Muraru
>Assignee: Adrian Muraru
>Priority: Minor
> Fix For: 3.0.0
>
>
> Enabling netty epoll mode in the Yarn shuffle service 
> ({{spark.shuffle.io.mode=EPOLL}}) makes the Yarn NodeManager abort.
>  Checking the stack trace, it seems that while the io.netty package is 
> shaded, the native libraries provided by netty-all are not:
>   
> {noformat}
> Caused by: java.io.FileNotFoundException: 
> META-INF/native/liborg_spark_project_netty_transport_native_epoll_x86_64.so{noformat}
> *Full stack trace:*
> {noformat}
> 2019-04-24 23:14:46,372 ERROR [main] nodemanager.NodeManager 
> (NodeManager.java:initAndStartNodeManager(639)) - Error starting NodeManager
> java.lang.UnsatisfiedLinkError: failed to load the required native library
> at 
> org.spark_project.io.netty.channel.epoll.Epoll.ensureAvailability(Epoll.java:81)
> at 
> org.spark_project.io.netty.channel.epoll.EpollEventLoop.(EpollEventLoop.java:55)
> at 
> org.spark_project.io.netty.channel.epoll.EpollEventLoopGroup.newChild(EpollEventLoopGroup.java:134)
> at 
> org.spark_project.io.netty.channel.epoll.EpollEventLoopGroup.newChild(EpollEventLoopGroup.java:35)
> at 
> org.spark_project.io.netty.util.concurrent.MultithreadEventExecutorGroup.(MultithreadEventExecutorGroup.java:84)
> at 
> org.spark_project.io.netty.util.concurrent.MultithreadEventExecutorGroup.(MultithreadEventExecutorGroup.java:58)
> at 
> org.spark_project.io.netty.util.concurrent.MultithreadEventExecutorGroup.(MultithreadEventExecutorGroup.java:47)
> at 
> org.spark_project.io.netty.channel.MultithreadEventLoopGroup.(MultithreadEventLoopGroup.java:59)
> at 
> org.spark_project.io.netty.channel.epoll.EpollEventLoopGroup.(EpollEventLoopGroup.java:104)
> at 
> org.spark_project.io.netty.channel.epoll.EpollEventLoopGroup.(EpollEventLoopGroup.java:91)
> at 
> org.spark_project.io.netty.channel.epoll.EpollEventLoopGroup.(EpollEventLoopGroup.java:68)
> at 
> org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:52)
> at 
> org.apache.spark.network.server.TransportServer.init(TransportServer.java:95)
> at 
> org.apache.spark.network.server.TransportServer.(TransportServer.java:75)
> at 
> org.apache.spark.network.TransportContext.createServer(TransportContext.java:108)
> at 
> org.apache.spark.network.yarn.YarnShuffleService.serviceInit(YarnShuffleService.java:186)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices.serviceInit(AuxServices.java:147)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.serviceInit(ContainerManagerImpl.java:268)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceInit(NodeManager.java:357)
> at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:636)
> at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:684)
> Caused by: java.lang.UnsatisfiedLinkError: could not load a native library: 
> org_spark_project_netty_transport_native_epoll_x86_64
> at 
> org.spark_project.io.netty.util.internal.NativeLibraryLoader.load(NativeLibraryLoader.java:205)
> at 
> org.spark_project.io.netty.channel.epoll.Native.loadNativeLibrary(Native.java:207)
> at 
> org.spark_project.io.netty.channel.epoll.Native.(Native.java:65)
> at org.spark_project.io.netty.channel.epoll.Epoll.(Epoll.java:33)
> ... 26 more
> Suppressed: java.lang.UnsatisfiedLinkError: could not load a native 
> library: org_spark_project_netty_transport_native_epoll
> at 
> org.spark_project.io.netty.util.internal.NativeLibraryLoader.load(NativeLibraryLoader.java:205)
> at 
> 
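A minimal sketch (not from the issue) of checking, inside the NodeManager JVM,
whether the shaded epoll transport named in the stack trace can actually load its
native library; the shaded class name is taken from the trace above, and reflection
is used because the class only exists inside the shaded shuffle-service jar:
{code:java}
// Probe the shaded netty epoll transport; Epoll.isAvailable() returns false when
// the native library (META-INF/native/...) cannot be loaded.
val epollAvailable =
  try {
    val cls = Class.forName("org.spark_project.io.netty.channel.epoll.Epoll")
    cls.getMethod("isAvailable").invoke(null).asInstanceOf[Boolean]
  } catch {
    case _: Throwable => false   // class missing or static init failed
  }
println(s"shaded netty epoll usable: $epollAvailable")
{code}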

[jira] [Resolved] (SPARK-27590) do not consider skipped tasks when scheduling speculative tasks

2019-05-07 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-27590.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24485
[https://github.com/apache/spark/pull/24485]

> do not consider skipped tasks when scheduling speculative tasks
> ---
>
> Key: SPARK-27590
> URL: https://issues.apache.org/jira/browse/SPARK-27590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27652) Caught Hive MetaException when query by partition (partition col start with underscore)

2019-05-07 Thread Tongqing Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongqing Qiu resolved SPARK-27652.
--
Resolution: Not A Bug

> Caught Hive MetaException when query by partition (partition col start with 
> underscore)
> ---
>
> Key: SPARK-27652
> URL: https://issues.apache.org/jira/browse/SPARK-27652
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.2, 2.4.3
>Reporter: Tongqing Qiu
>Priority: Minor
>
> Create a table, set its location to Parquet data, and run msck repair table to load 
> the data.
> But when querying with a partition column, some errors occur (adding backticks does 
> not help):
> {code:java}
> select count(*) from some_table where `_partition_date` = '2015-01-01'{code}
> Error
> {code:java}
> com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: 
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:783)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:774)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:772)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:322)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$retryLocked$1.apply(HiveClientImpl.scala:230)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$retryLocked$1.apply(HiveClientImpl.scala:222)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:266)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:222)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:305)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:772)
>  at 
> org.apache.spark.sql.hive.client.PoolingHiveClient$$anonfun$getPartitionsByFilter$1.apply(PoolingHiveClient.scala:373)
>  at 
> org.apache.spark.sql.hive.client.PoolingHiveClient$$anonfun$getPartitionsByFilter$1.apply(PoolingHiveClient.scala:372)
>  at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.withHiveClient(PoolingHiveClient.scala:101)
>  at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.getPartitionsByFilter(PoolingHiveClient.scala:372)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1295)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1288)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$withClient$1$$anonfun$apply$1.apply(HiveExternalCatalog.scala:141)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$maybeSynchronized(HiveExternalCatalog.scala:104)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$withClient$1.apply(HiveExternalCatalog.scala:139)
>  at 
> com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:345)
>  at 
> com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:331)
>  at 
> com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:23)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:137)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1288)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:261)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:1045)
>  at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:62)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:277)
>  at 
> 

[jira] [Updated] (SPARK-27652) Caught Hive MetaException when query by partition (partition col start with underscore)

2019-05-07 Thread Tongqing Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongqing Qiu updated SPARK-27652:
-
Description: 
Create a table, set its location to Parquet data, and run msck repair table to load 
the data.

But when querying with a partition column, some errors occur (adding backticks does 
not help):
{code:java}
select count(*) from some_table where `_partition_date` = '2015-01-01'{code}
Error
{code:java}
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: 
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:783)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:774)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:772)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:322)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$retryLocked$1.apply(HiveClientImpl.scala:230)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$retryLocked$1.apply(HiveClientImpl.scala:222)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:266)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:222)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:305)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:772)
 at 
org.apache.spark.sql.hive.client.PoolingHiveClient$$anonfun$getPartitionsByFilter$1.apply(PoolingHiveClient.scala:373)
 at 
org.apache.spark.sql.hive.client.PoolingHiveClient$$anonfun$getPartitionsByFilter$1.apply(PoolingHiveClient.scala:372)
 at 
org.apache.spark.sql.hive.client.PoolingHiveClient.withHiveClient(PoolingHiveClient.scala:101)
 at 
org.apache.spark.sql.hive.client.PoolingHiveClient.getPartitionsByFilter(PoolingHiveClient.scala:372)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1295)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1288)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$withClient$1$$anonfun$apply$1.apply(HiveExternalCatalog.scala:141)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$maybeSynchronized(HiveExternalCatalog.scala:104)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$withClient$1.apply(HiveExternalCatalog.scala:139)
 at 
com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:345)
 at 
com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:331)
 at 
com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:23)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:137)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1288)
 at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:261)
 at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:1045)
 at 
org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:62)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:277)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:277)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:276) 
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
 at 

[jira] [Updated] (SPARK-27652) Caught Hive MetaException when query by partition (partition col start with underscore)

2019-05-07 Thread Tongqing Qiu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongqing Qiu updated SPARK-27652:
-
Affects Version/s: 2.4.1
   2.4.2

> Caught Hive MetaException when query by partition (partition col start with 
> underscore)
> ---
>
> Key: SPARK-27652
> URL: https://issues.apache.org/jira/browse/SPARK-27652
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1, 2.4.2, 2.4.3
>Reporter: Tongqing Qiu
>Priority: Minor
>
> Create a table, set its location to Parquet data, and run msck repair table to load 
> the data.
> But when querying with a partition column, some errors occur (adding backticks does 
> not help):
> {code:java}
> select count(*) from some_table where `_partition_date` = '2015-01-01'{code}
> Error
> {code:java}
> com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: 
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARK at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:783)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:774)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:772)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:322)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$retryLocked$1.apply(HiveClientImpl.scala:230)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$retryLocked$1.apply(HiveClientImpl.scala:222)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:266)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:222)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:305)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:772)
>  at 
> org.apache.spark.sql.hive.client.PoolingHiveClient$$anonfun$getPartitionsByFilter$1.apply(PoolingHiveClient.scala:373)
>  at 
> org.apache.spark.sql.hive.client.PoolingHiveClient$$anonfun$getPartitionsByFilter$1.apply(PoolingHiveClient.scala:372)
>  at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.withHiveClient(PoolingHiveClient.scala:101)
>  at 
> org.apache.spark.sql.hive.client.PoolingHiveClient.getPartitionsByFilter(PoolingHiveClient.scala:372)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1295)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1288)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$withClient$1$$anonfun$apply$1.apply(HiveExternalCatalog.scala:141)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$maybeSynchronized(HiveExternalCatalog.scala:104)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$withClient$1.apply(HiveExternalCatalog.scala:139)
>  at 
> com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:345)
>  at 
> com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:331)
>  at 
> com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:23)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:137)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1288)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:261)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:1045)
>  at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:62)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:277)
>  at 
> 

[jira] [Created] (SPARK-27652) Caught Hive MetaException when query by partition (partition col start with underscore)

2019-05-07 Thread Tongqing Qiu (JIRA)
Tongqing Qiu created SPARK-27652:


 Summary: Caught Hive MetaException when query by partition 
(partition col start with underscore)
 Key: SPARK-27652
 URL: https://issues.apache.org/jira/browse/SPARK-27652
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.3
Reporter: Tongqing Qiu


Create a table, set its location to Parquet data, and run msck repair table to load 
the data.

But when querying with a partition column, some errors occur.

select count(*) from some_table where `_partition_date` = '2015-01-01'

Error
{code:java}
com.databricks.backend.common.rpc.DatabricksExceptions$SQLExecutionException: 
java.lang.RuntimeException: Caught Hive MetaException attempting to get 
partition metadata by filter from Hive. You can set the Spark configuration 
setting spark.sql.hive.manageFilesourcePartitions to false to work around this 
problem, however this will result in degraded performance. Please report a bug: 
https://issues.apache.org/jira/browse/SPARK at 
org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:783)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:774)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:772)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:322)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$retryLocked$1.apply(HiveClientImpl.scala:230)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$retryLocked$1.apply(HiveClientImpl.scala:222)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.synchronizeOnObject(HiveClientImpl.scala:266)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:222)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:305)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:772)
 at 
org.apache.spark.sql.hive.client.PoolingHiveClient$$anonfun$getPartitionsByFilter$1.apply(PoolingHiveClient.scala:373)
 at 
org.apache.spark.sql.hive.client.PoolingHiveClient$$anonfun$getPartitionsByFilter$1.apply(PoolingHiveClient.scala:372)
 at 
org.apache.spark.sql.hive.client.PoolingHiveClient.withHiveClient(PoolingHiveClient.scala:101)
 at 
org.apache.spark.sql.hive.client.PoolingHiveClient.getPartitionsByFilter(PoolingHiveClient.scala:372)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1295)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1288)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$withClient$1$$anonfun$apply$1.apply(HiveExternalCatalog.scala:141)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.org$apache$spark$sql$hive$HiveExternalCatalog$$maybeSynchronized(HiveExternalCatalog.scala:104)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$withClient$1.apply(HiveExternalCatalog.scala:139)
 at 
com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:345)
 at 
com.databricks.backend.daemon.driver.ProgressReporter$.withStatusCode(ProgressReporter.scala:331)
 at 
com.databricks.spark.util.SparkDatabricksProgressReporter$.withStatusCode(ProgressReporter.scala:23)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:137)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1288)
 at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:261)
 at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:1045)
 at 
org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:62)
 at 
org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:277)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:277)
 at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:77)
 at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:276) 
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
 at 
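A minimal sketch (the table is the hypothetical {{some_table}} from the query above) of
the workaround suggested in the exception message, at the documented cost of degraded
partition-pruning performance:
{code:java}
import org.apache.spark.sql.SparkSession

// Disable Hive filesource partition management so the underscore-prefixed partition
// filter is evaluated by Spark instead of being pushed to the Hive metastore.
val spark = SparkSession.builder()
  .enableHiveSupport()
  .config("spark.sql.hive.manageFilesourcePartitions", "false")
  .getOrCreate()

spark.sql("select count(*) from some_table where `_partition_date` = '2015-01-01'").show()
{code}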

[jira] [Commented] (SPARK-27563) automatically get the latest Spark versions in HiveExternalCatalogVersionsSuite

2019-05-07 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834855#comment-16834855
 ] 

Apache Spark commented on SPARK-27563:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/24544

> automatically get the latest Spark versions in 
> HiveExternalCatalogVersionsSuite
> ---
>
> Key: SPARK-27563
> URL: https://issues.apache.org/jira/browse/SPARK-27563
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.4, 3.0.0, 2.4.3
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2019-05-07 Thread Attila Zsolt Piros (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834832#comment-16834832
 ] 

Attila Zsolt Piros commented on SPARK-27651:


I am already working on this.

> Avoid the network when block manager fetches shuffle blocks from the same host
> --
>
> Key: SPARK-27651
> URL: https://issues.apache.org/jira/browse/SPARK-27651
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> When a shuffle block (its content) is fetched, the network is always used, even 
> when it is fetched from an executor (or the external shuffle service) running 
> on the same host.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27566) SIGSEGV in Spark SQL during broadcast

2019-05-07 Thread Martin Studer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834834#comment-16834834
 ] 

Martin Studer commented on SPARK-27566:
---

In an attempt to reproduce the issue I found that I seem to be affected by 
https://issues.apache.org/jira/browse/SPARK-23614. I am not sure how this relates to 
the SIGSEGV, but that's where my investigation stands at the moment.

> SIGSEGV in Spark SQL during broadcast
> 
>
> Key: SPARK-27566
> URL: https://issues.apache.org/jira/browse/SPARK-27566
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: Hortonworks HDP 2.6.5, Spark 2.3.0.2.6.5.1050-37
>Reporter: Martin Studer
>Priority: Major
>
> During execution of a broadcast exchange the JVM aborts with a segmentation 
> fault:
> {noformat}
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7feb5d024ea2, pid=26118, tid=0x7feabf1ca700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_102-b14) (build 
> 1.8.0_102-b14)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.102-b14 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # j  scala.runtime.BoxesRunTime.unboxToLong(Ljava/lang/Object;)J+9
> {noformat}
> The corresponding information from the {{hs_err_pid}} is:
> {noformat}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7feb5d024ea2, pid=26118, tid=0x7feabf1ca700
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_102-b14) (build 
> 1.8.0_102-b14)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.102-b14 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # j  scala.runtime.BoxesRunTime.unboxToLong(Ljava/lang/Object;)J+9
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> ---  T H R E A D  ---
> Current thread (0x7feaf8c54000):  JavaThread "broadcast-exchange-2" 
> daemon [_thread_in_Java, id=30475, 
> stack(0x7feabf0ca000,0x7feabf1cb000)]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 
> 0x003c7fc8
> Registers:
> RAX=0x0007c0011570, RBX=0x003c7f90, RCX=0x0038, 
> RDX=0x000775daf6d0
> RSP=0x7feabf1c8ab0, RBP=0x7feabf1c8af0, RSI=0x7feb0830, 
> RDI=0x0001
> R8 =0x7feb0800b280, R9 =0x7feb0800c6a0, R10=0x7feb73d59100, 
> R11=0x7feb73181700
> R12=0x, R13=0x7feb5c5a5951, R14=0x7feabf1c8b00, 
> R15=0x7feaf8c54000
> RIP=0x7feb5d024ea2, EFLAGS=0x00010283, CSGSFS=0x0033, 
> ERR=0x0004
>   TRAPNO=0x000e
> Top of Stack: (sp=0x7feabf1c8ab0)
> 0x7feabf1c8ab0:   7feabf1c8ab0 7feb5c5a5951
> 0x7feabf1c8ac0:   7feabf1c8b00 7feb5c5a9610
> 0x7feabf1c8ad0:   7feb4e626068 7feb5c5a5970
> 0x7feabf1c8ae0:    7feabf1c8b00
> 0x7feabf1c8af0:   7feabf1c8b68 7feb5d007dd0
> 0x7feabf1c8b00:    7feb5d007dd0
> 0x7feabf1c8b10:   000775daf6d0 
> 0x7feabf1c8b20:   000774fd2048 7feabf1c8b18
> 0x7feabf1c8b30:   7feb4f27548f 7feabf1c8c20
> 0x7feabf1c8b40:   7feb4f275cd0 
> 0x7feabf1c8b50:   7feb4f2755f0 7feabf1c8b10
> 0x7feabf1c8b60:   7feabf1c8bf0 7feabf1c8c78
> 0x7feabf1c8b70:   7feb5d007dd0 
> 0x7feabf1c8b80:    
> 0x7feabf1c8b90:    
> 0x7feabf1c8ba0:    
> 0x7feabf1c8bb0:    
> 0x7feabf1c8bc0:    
> 0x7feabf1c8bd0:    0001
> 0x7feabf1c8be0:    000774fd2048
> 0x7feabf1c8bf0:   000775daf6f8 000775daf6e8
> 0x7feabf1c8c00:    7feb5d008040
> 0x7feabf1c8c10:   0020 7feb5d008040
> 0x7feabf1c8c20:   000774fdb080 0001
> 0x7feabf1c8c30:   000774fd2048 7feabf1c8c28
> 0x7feabf1c8c40:   7feb4f7636a3 7feabf1c8cc8
> 0x7feabf1c8c50:   7feabfb37848 
> 0x7feabf1c8c60:   7feb4f763720 7feabf1c8bf0
> 0x7feabf1c8c70:   7feabf1c8ca0 7feabf1c8d20
> 0x7feabf1c8c80:   7feb5d007dd0 
> 0x7feabf1c8c90:    0006c03f26e8
> 0x7feabf1c8ca0:   0006c03f26e8  
> 

[jira] [Created] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2019-05-07 Thread Attila Zsolt Piros (JIRA)
Attila Zsolt Piros created SPARK-27651:
--

 Summary: Avoid the network when block manager fetches shuffle 
blocks from the same host
 Key: SPARK-27651
 URL: https://issues.apache.org/jira/browse/SPARK-27651
 Project: Spark
  Issue Type: Improvement
  Components: Block Manager
Affects Versions: 3.0.0
Reporter: Attila Zsolt Piros


When a shuffle block (its content) is fetched, the network is always used, even when 
it is fetched from an executor (or the external shuffle service) running on the 
same host.
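A rough sketch of the idea; everything below ({{BlockLocation}}, {{readFromLocalDisk}},
{{fetchOverNetwork}}) is a made-up placeholder to illustrate the optimization, not a
Spark internal API:
{code:java}
import java.net.InetAddress

// When the block's location is the local host, read it straight from disk
// instead of going through the network stack.
case class BlockLocation(host: String, path: String)

def fetchShuffleBlock(loc: BlockLocation,
                      readFromLocalDisk: String => Array[Byte],
                      fetchOverNetwork: BlockLocation => Array[Byte]): Array[Byte] = {
  val localHost = InetAddress.getLocalHost.getHostName
  if (loc.host == localHost) readFromLocalDisk(loc.path)   // skip the network
  else fetchOverNetwork(loc)
}
{code}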



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27645) Cache result of count function to that RDD

2019-05-07 Thread Seungmin Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834797#comment-16834797
 ] 

Seungmin Lee commented on SPARK-27645:
--

I know of course that I can do this:
{code:java}
val count = dataset.count(){code}
But sometimes we have functions that each call count internally.

 
{code:java}
def funcA(dataset: Dataset[_]): Long = {
  // ...
  dataset.count()
}
def funcB(dataset: Dataset[_]): Long = {
  // ...
  dataset.count()
}
{code}
In this case, the count action will run twice. But since a Dataset cannot be 
modified, I suggest allowing the Dataset itself to cache the count value once it 
has been computed.
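A minimal caller-side sketch of the same idea (the {{CountedDataset}} wrapper below is
hypothetical, not a Spark API): compute the count lazily once and share the wrapper
between the two functions.
{code:java}
import org.apache.spark.sql.Dataset

// Hypothetical helper: memoize the count outside the immutable Dataset.
class CountedDataset[T](val ds: Dataset[T]) {
  lazy val cachedCount: Long = ds.count()   // the count action runs at most once here
}

def funcA[T](cd: CountedDataset[T]): Long = cd.cachedCount
def funcB[T](cd: CountedDataset[T]): Long = cd.cachedCount
{code}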

 

> Cache result of count function to that RDD
> --
>
> Key: SPARK-27645
> URL: https://issues.apache.org/jira/browse/SPARK-27645
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Seungmin Lee
>Priority: Major
>
> I'm not sure whether there has been an update on this (as far as I know, 
> there is no such feature): since an RDD is immutable, why don't we keep the 
> result of the count function for that RDD and reuse it in future calls?
> Sometimes we only have the RDD variable but not a previously computed count 
> result.
> In this case, not running the whole count action over the entire dataset would 
> be very beneficial in terms of performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24708) Document the default spark url of master in standalone is "spark://localhost:7070"

2019-05-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-24708.
---
Resolution: Won't Fix

> Document the default spark url of master in standalone is 
> "spark://localhost:7070"
> --
>
> Key: SPARK-24708
> URL: https://issues.apache.org/jira/browse/SPARK-24708
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.1
>Reporter: Chia-Ping Tsai
>Priority: Trivial
>
> In the section "Starting a Cluster Manually" we give an example of starting a 
> worker.
> {code:java}
> ./sbin/start-slave.sh {code}
> However, we only mention the default "web port" so readers may be misled into 
> using the "web port" to start the worker. (of course, I am a "reader" too :()
> It seems to me that adding a brief description of the master's default spark URL 
> can avoid the above ambiguity.
> For example:
> {code:java}
> - Similarly, you can start one or more workers and connect them to the master 
> via:
> + Similarly, you can start one or more workers and connect them to the 
> master's spark URL (default: spark://:7070) via:{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27258) The value of "spark.app.name" or "--name" starts with a number, which causes resourceName to not match the regular expression

2019-05-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27258.
---
Resolution: Won't Fix

> The value of "spark.app.name" or "--name" starts with a number, which causes 
> resourceName to not match the regular expression
> --
>
> Key: SPARK-27258
> URL: https://issues.apache.org/jira/browse/SPARK-27258
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: hehuiyuan
>Priority: Minor
>
> {code:java}
> Exception in thread "main" 
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: 
> POST at: https://xxx:xxx/api/v1/namespaces/xxx/services. Message: Service 
> "1min-machinereg-yf-1544604108931-driver-svc" is invalid: metadata.name: 
> Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a DNS-1035 
> label must consist of lower case alphanumeric characters or '-', start with 
> an alphabetic character, and end with an alphanumeric character (e.g. 
> 'my-name',  or 'abc-123', regex used for validation is 
> '[a-z]([-a-z0-9]*[a-z0-9])?'). Received status: Status(apiVersion=v1, 
> code=422, details=StatusDetails(causes=[StatusCause(field=metadata.name, 
> message=Invalid value: "1min-machinereg-yf-1544604108931-driver-svc": a 
> DNS-1035 label must consist of lower case alphanumeric characters or '-', 
> start with an alphabetic character, and end with an alphanumeric character 
> (e.g. 'my-name',  or 'abc-123', regex used for validation is 
> '[a-z]([-a-z0-9]*[a-z0-9])?'), reason=FieldValueInvalid, 
> additionalProperties={})], group=null, kind=Service, 
> name=1min-machinereg-yf-1544604108931-driver-svc, retryAfterSeconds=null, 
> uid=null, additionalProperties={}).
> {code}
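A minimal sketch (the app names below are hypothetical) of validating an application
name against the DNS-1035 regex quoted in the error before submitting to Kubernetes,
since Spark derives resource names such as the driver service from the app name:
{code:java}
// DNS-1035 label regex taken from the error message above.
val dns1035 = "[a-z]([-a-z0-9]*[a-z0-9])?"

def isValidAppName(name: String): Boolean = name.matches(dns1035)

isValidAppName("1min-machinereg-yf")   // false: starts with a digit
isValidAppName("min-machinereg-yf")    // true
{code}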



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27577) Wrong thresholds selected by BinaryClassificationMetrics when downsampling

2019-05-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27577.
---
   Resolution: Fixed
 Assignee: Shaochen Shi
Fix Version/s: 2.4.4
   3.0.0
   2.3.4

Resolved by https://github.com/apache/spark/pull/24470

> Wrong thresholds selected by BinaryClassificationMetrics when downsampling
> --
>
> Key: SPARK-27577
> URL: https://issues.apache.org/jira/browse/SPARK-27577
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 3.0.0
>Reporter: Shaochen Shi
>Assignee: Shaochen Shi
>Priority: Minor
> Fix For: 2.3.4, 3.0.0, 2.4.4
>
>
> In binary metrics, a threshold means any instance with a score >= threshold 
> will be considered as positive.
> However, in the existing implementation:
>  # When `numBins` is set when creating a `BinaryClassificationMetrics` 
> object, all records (ordered by scores in DESC) will be grouped into chunks.
>  # In each chunk, statistics (in `BinaryLabelCounter`) of records are 
> accumulated while the first record's score (also the largest) is selected as 
> threshold.
>  # All these generated/sampled records form a new smaller data set to 
> calculate binary metrics.
> At the second step, this introduces the bug that the score/threshold of a record 
> gets correlated with wrong values, e.g. a larger `true positive` count and a smaller 
> `false negative` count, when calculating `recallByThreshold`, `precisionByThreshold`, 
> etc.
> Thus, the bug fix is straightforward: pick the last record's score in each chunk as 
> the threshold while the statistics are merged.
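A minimal sketch (assuming {{scoreAndLabels}} is an existing RDD of (score, label)
pairs produced elsewhere) of the code path where the downsampling described above is
triggered:
{code:java}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

def thresholdCurves(scoreAndLabels: RDD[(Double, Double)]): Unit = {
  // numBins > 0 enables the chunk-based downsampling described above.
  val metrics = new BinaryClassificationMetrics(scoreAndLabels, numBins = 100)
  metrics.precisionByThreshold().take(5).foreach(println)
  metrics.recallByThreshold().take(5).foreach(println)
}
{code}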



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC

2019-05-07 Thread Nicholas Studenski (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834770#comment-16834770
 ] 

Nicholas Studenski commented on SPARK-19335:


+1 This feature would be very useful.

> Spark should support doing an efficient DataFrame Upsert via JDBC
> -
>
> Key: SPARK-19335
> URL: https://issues.apache.org/jira/browse/SPARK-19335
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ilya Ganelin
>Priority: Minor
>
> Doing a database update, as opposed to an insert, is useful, particularly when 
> working with streaming applications which may require revisions to previously 
> stored data. 
> Spark DataFrames/Datasets do not currently support an update feature via the 
> JDBC writer, which allows only Overwrite or Append.
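A rough sketch of the usual caller-side workaround (not a Spark feature; the table
name, columns, connection URL and the PostgreSQL-style ON CONFLICT clause are all
assumptions):
{code:java}
import java.sql.DriverManager
import org.apache.spark.sql.DataFrame

// Upsert each partition over a plain JDBC connection, bypassing the DataFrame writer.
def upsert(df: DataFrame, url: String): Unit = {
  df.rdd.foreachPartition { rows =>
    val conn = DriverManager.getConnection(url)
    val stmt = conn.prepareStatement(
      "INSERT INTO target(id, value) VALUES (?, ?) " +
      "ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value")
    try {
      rows.foreach { r =>
        stmt.setLong(1, r.getAs[Long]("id"))
        stmt.setString(2, r.getAs[String]("value"))
        stmt.executeUpdate()
      }
    } finally {
      stmt.close()
      conn.close()
    }
  }
}
{code}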



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27639) InMemoryTableScan shows the table name on UI if possible

2019-05-07 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-27639:

Summary: InMemoryTableScan shows the table name on UI if possible  (was: 
InMemoryTableScan should show the table name on UI)

> InMemoryTableScan shows the table name on UI if possible
> 
>
> Key: SPARK-27639
> URL: https://issues.apache.org/jira/browse/SPARK-27639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> The UI only shows InMemoryTableScan when scanning an InMemoryTable.
> When there are many InMemoryTables, it is difficult to distinguish which one 
> we are looking for. This PR shows the table name when scanning an 
> InMemoryTable. 
> !https://user-images.githubusercontent.com/5399861/57213799-7bccf100-701a-11e9-9872-d90b4a185dc6.png!
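A minimal sketch (assuming an active SparkSession named {{spark}}, as in spark-shell;
the table name is hypothetical) of producing the kind of named cached relation this
node corresponds to:
{code:java}
// Cache a named temp view; scans over it appear as InMemoryTableScan on the SQL tab.
spark.range(1000).createOrReplaceTempView("events")
spark.catalog.cacheTable("events")   // materializes an InMemoryRelation for "events"
spark.table("events").count()        // this scan is the InMemoryTableScan node
{code}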



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23191) Worker registration fails in case of network drop

2019-05-07 Thread wuyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834703#comment-16834703
 ] 

wuyi commented on SPARK-23191:
--

Hi [~neeraj20gupta]

What do you mean by _multiple driver running in some scenario_ and _as a result 
of which duplicate driver is launched?_

Do you mean there are multiple drivers running concurrently for one app?

> Worker registration fails in case of network drop
> ---
>
> Key: SPARK-23191
> URL: https://issues.apache.org/jira/browse/SPARK-23191
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.3, 2.2.1, 2.3.0
> Environment: OS:- Centos 6.9(64 bit)
>  
>Reporter: Neeraj Gupta
>Priority: Critical
>
> We have a 3-node cluster. We were facing issues with multiple drivers running in 
> some scenarios in production.
> On further investigation we were able to reproduce the scenario in both 1.6.3 and 
> 2.2.1 with the following steps:
>  # Setup a 3 node cluster. Start master and slaves.
>  # On any node where the worker process is running block the connections on 
> port 7077 using iptables.
> {code:java}
> iptables -A OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # After about 10-15 seconds the node reports an error that it is unable to 
> connect to the master.
> {code:java}
> 2018-01-23 12:08:51,639 [rpc-client-1-1] WARN  
> org.apache.spark.network.server.TransportChannelHandler - Exception in 
> connection from 
> java.io.IOException: Connection timed out
>     at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>     at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>     at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
>     at 
> io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
>     at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
>     at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
>     at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
>     at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
>     at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> 2018-01-23 12:08:51,647 [dispatcher-event-loop-0] ERROR 
> org.apache.spark.deploy.worker.Worker - Connection to master failed! Waiting 
> for master to reconnect...
> {code}
>  # Once we get this exception we re-enable the connections to port 7077 using
> {code:java}
> iptables -D OUTPUT -p tcp --dport 7077 -j DROP
> {code}
>  # The worker tries to register again with the master but is unable to do so. It 
> gives the following error
> {code:java}
> 2018-01-23 12:08:58,657 [worker-register-master-threadpool-2] WARN  
> org.apache.spark.deploy.worker.Worker - Failed to connect to master 
> :7077
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>     at 
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:100)
>     at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:108)
>     at 
> org.apache.spark.deploy.worker.Worker$$anonfun$org$apache$spark$deploy$worker$Worker$$tryRegisterAllMasters$1$$anon$1.run(Worker.scala:241)
>     at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Failed to connect to :7077
>     at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
>     at 
> 

[jira] [Resolved] (SPARK-27643) Add supported Hive version list in doc

2019-05-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27643.
--
Resolution: Not A Problem

> Add supported Hive version list in doc
> --
>
> Key: SPARK-27643
> URL: https://issues.apache.org/jira/browse/SPARK-27643
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.3.3, 2.4.2, 3.0.0
>Reporter: Zhichao  Zhang
>Priority: Minor
>
> Add a list of supported Hive versions for each Spark version to the documentation.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27645) Cache result of count function to that RDD

2019-05-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27645.
--
Resolution: Invalid

> Cache result of count function to that RDD
> --
>
> Key: SPARK-27645
> URL: https://issues.apache.org/jira/browse/SPARK-27645
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Seungmin Lee
>Priority: Major
>
> I'm not sure whether there has been an update on this (as far as I know, 
> there is no such feature): since an RDD is immutable, why don't we keep the 
> result of the count function for that RDD and reuse it in future calls?
> Sometimes we only have the RDD variable but not a previously computed count 
> result.
> In this case, not running the whole count action over the entire dataset would 
> be very beneficial in terms of performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27645) Cache result of count function to that RDD

2019-05-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834701#comment-16834701
 ] 

Hyukjin Kwon commented on SPARK-27645:
--

I am not sure what you mean. You can run it once and keep the result somewhere in 
your application.

> Cache result of count function to that RDD
> --
>
> Key: SPARK-27645
> URL: https://issues.apache.org/jira/browse/SPARK-27645
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Seungmin Lee
>Priority: Major
>
> I'm not sure whether there has been an update on this (as far as I know, 
> there is no such feature): since an RDD is immutable, why don't we keep the 
> result of the count function for that RDD and reuse it in future calls?
> Sometimes we only have the RDD variable but not a previously computed count 
> result.
> In this case, not running the whole count action over the entire dataset would 
> be very beneficial in terms of performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27644) Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default

2019-05-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27644:
-
Affects Version/s: (was: 3.1.0)
   3.0.0

> Enable spark.sql.optimizer.nestedSchemaPruning.enabled by default
> -
>
> Key: SPARK-27644
> URL: https://issues.apache.org/jira/browse/SPARK-27644
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> We can enable this after resolving all on-going issues and finishing more 
> verifications.
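A minimal sketch (assuming an active SparkSession named {{spark}}; the path and
nested column are hypothetical) of the behaviour this flag controls:
{code:java}
// With nested schema pruning enabled, selecting person.name should only read that
// nested field from Parquet instead of the whole person struct.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
spark.read.parquet("/tmp/people").select("person.name").explain()
{code}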



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27647) Metric Gauge not threadsafe

2019-05-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27647.
--
Resolution: Invalid

> Metric Gauge not threadsafe
> ---
>
> Key: SPARK-27647
> URL: https://issues.apache.org/jira/browse/SPARK-27647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.2
>Reporter: bettermouse
>Priority: Major
>
> When I read the class DAGSchedulerSource, I found that some Gauges may not be 
> thread safe, e.g.:
>  metricRegistry.register(MetricRegistry.name("stage", "failedStages"), new 
> Gauge[Int] {
>  override def getValue: Int = dagScheduler.failedStages.size
>  })
> The getValue method may be called from another thread, but the failedStages field 
> is not thread safe.
> The runningStages and waitingStages fields have the same problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27645) Cache result of count function to that RDD

2019-05-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834697#comment-16834697
 ] 

Hyukjin Kwon commented on SPARK-27645:
--

Please avoid setting a target version; that field is usually reserved for committers.

> Cache result of count function to that RDD
> --
>
> Key: SPARK-27645
> URL: https://issues.apache.org/jira/browse/SPARK-27645
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Seungmin Lee
>Priority: Major
>
> I'm not sure whether there has been an update on this (as far as I know, 
> there is no such feature): since an RDD is immutable, why don't we keep the 
> result of the count function for that RDD and reuse it in future calls?
> Sometimes we only have the RDD variable but not a previously computed count 
> result.
> In this case, not running the whole count action over the entire dataset would 
> be very beneficial in terms of performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27645) Cache result of count function to that RDD

2019-05-07 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-27645:
-
Target Version/s:   (was: 2.4.3)

> Cache result of count function to that RDD
> --
>
> Key: SPARK-27645
> URL: https://issues.apache.org/jira/browse/SPARK-27645
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Seungmin Lee
>Priority: Major
>
> I'm not sure whether there has been an update on this (as far as I know, 
> there is no such feature): since an RDD is immutable, why don't we keep the 
> result of the count function for that RDD and reuse it in future calls?
> Sometimes we only have the RDD variable but not a previously computed count 
> result.
> In this case, not running the whole count action over the entire dataset would 
> be very beneficial in terms of performance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27647) Metric Gauge not threadsafe

2019-05-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834696#comment-16834696
 ] 

Hyukjin Kwon commented on SPARK-27647:
--

How is this called by multiple threads?

> Metric Gauge not threadsafe
> ---
>
> Key: SPARK-27647
> URL: https://issues.apache.org/jira/browse/SPARK-27647
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.2
>Reporter: bettermouse
>Priority: Major
>
> When I read the class DAGSchedulerSource, I found that some Gauges may not be 
> thread safe, e.g.:
>  metricRegistry.register(MetricRegistry.name("stage", "failedStages"), new 
> Gauge[Int] {
>  override def getValue: Int = dagScheduler.failedStages.size
>  })
> The getValue method may be called from another thread, but the failedStages field 
> is not thread safe.
> The runningStages and waitingStages fields have the same problem.
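A hedged sketch (not the actual Spark change) of one way to make such a gauge thread
safe: let the scheduler's single event-loop thread publish the size into an
AtomicInteger, so the metrics reporter thread only reads that published value.
{code:java}
import java.util.concurrent.atomic.AtomicInteger
import com.codahale.metrics.{Gauge, MetricRegistry}

// Updated by the scheduler event-loop thread whenever failedStages changes.
val failedStageCount = new AtomicInteger(0)

val registry = new MetricRegistry()
registry.register(MetricRegistry.name("stage", "failedStages"), new Gauge[Int] {
  override def getValue: Int = failedStageCount.get()   // safe cross-thread read
})
{code}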



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27549) Commit Kafka Source offsets to facilitate external tooling

2019-05-07 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834626#comment-16834626
 ] 

Stavros Kontopoulos commented on SPARK-27549:
-

[~app-tarush] yes, I am working on it.

> Commit Kafka Source offsets to facilitate external tooling
> --
>
> Key: SPARK-27549
> URL: https://issues.apache.org/jira/browse/SPARK-27549
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Tools monitoring consumer lag could benefit from having the option of saving 
> the source offsets. Sources use the implementation of 
> org.apache.spark.sql.sources.v2.reader.streaming.
> SparkDataStream. KafkaMicroBatchStream currently [does not 
> commit|https://github.com/apache/spark/blob/5bf5d9d854db53541956dedb03e2de8eecf65b81/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L170]
>  anything as expected so we could expand that.
> Other streaming engines like 
> [Flink|https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-offset-committing-behaviour-configuration]
>  allow you to enable `auto.commit` at the expense of not having checkpointing.
> The proposal here is to allow committing the source offsets when progress has 
> been made.
> I am also aware that another option would be to have a StreamingQueryListener 
> and intercept when batches are completed and then write the offsets anywhere 
> you need to but it would be great if Kafka integration with Structured 
> Streaming could do some of this work anyway.
> [~c...@koeninger.org]  [~marmbrus] what do you think?
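For reference, a rough sketch of the StreamingQueryListener workaround mentioned in the description (user code with assumed names, not the proposed built-in Kafka commit):
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().appName("offset-listener").getOrCreate()

spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // endOffset is a JSON string of topic -> partition -> offset for each source;
    // ship it to whatever external tool tracks consumer lag (printed here).
    event.progress.sources.foreach(s => println(s"progressed to: ${s.endOffset}"))
  }
})
{code}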



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27650) separate the row iterator functionality from ColumnarBatch

2019-05-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27650:


Assignee: Wenchen Fan  (was: Apache Spark)

> separate the row iterator functionality from ColumnarBatch
> 
>
> Key: SPARK-27650
> URL: https://issues.apache.org/jira/browse/SPARK-27650
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27650) separate the row iterator functionality from ColumnarBatch

2019-05-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27650:


Assignee: Apache Spark  (was: Wenchen Fan)

> separate the row iterator functionality from ColumnarBatch
> 
>
> Key: SPARK-27650
> URL: https://issues.apache.org/jira/browse/SPARK-27650
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27650) separate the row iterator functionality from ColumnarBatch

2019-05-07 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-27650:
---

 Summary: separate the row iterator functionality from ColumnarBatch
 Key: SPARK-27650
 URL: https://issues.apache.org/jira/browse/SPARK-27650
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27549) Commit Kafka Source offsets to facilitate external tooling

2019-05-07 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834621#comment-16834621
 ] 

Gabor Somogyi commented on SPARK-27549:
---

Yeah, please see the last comment.

> Commit Kafka Source offsets to facilitate external tooling
> --
>
> Key: SPARK-27549
> URL: https://issues.apache.org/jira/browse/SPARK-27549
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Tools monitoring consumer lag could benefit from having the option of saving 
> the source offsets. Sources use the implementation of 
> org.apache.spark.sql.sources.v2.reader.streaming.
> SparkDataStream. KafkaMicroBatchStream currently [does not 
> commit|https://github.com/apache/spark/blob/5bf5d9d854db53541956dedb03e2de8eecf65b81/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L170]
>  anything as expected so we could expand that.
> Other streaming engines like 
> [Flink|https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-offset-committing-behaviour-configuration]
>  allow you to enable `auto.commit` at the expense of not having checkpointing.
> The proposal here is to allow committing the source offsets when progress has 
> been made.
> I am also aware that another option would be to have a StreamingQueryListener 
> and intercept when batches are completed and then write the offsets anywhere 
> you need to but it would be great if Kafka integration with Structured 
> Streaming could do some of this work anyway.
> [~c...@koeninger.org]  [~marmbrus] what do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27649) Unify the way you use 'spark.network.timeout'

2019-05-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27649:


Assignee: Apache Spark

> Unify the way you use 'spark.network.timeout'
> -
>
> Key: SPARK-27649
> URL: https://issues.apache.org/jira/browse/SPARK-27649
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Minor
>
> For historical reasons, Structured Streaming still uses the old configuration key
> {code:java}
> spark.network.timeout{code}
> in some places, even though 
> {code:java}
> org.apache.spark.internal.config.Network.NETWORK_TIMEOUT{code}
> is now available.
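Roughly, the cleanup amounts to replacing the raw string key with the typed config entry. A sketch of the two styles, assuming Spark-internal code (SparkConf.get(ConfigEntry) is not public API):
{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.internal.config.Network

val conf = new SparkConf()

// old style: string key plus a hand-rolled default at each call site
val timeoutOldSecs = conf.getTimeAsSeconds("spark.network.timeout", "120s")

// preferred style: the ConfigEntry carries the key, type and default in one place
val timeoutNewSecs = conf.get(Network.NETWORK_TIMEOUT)
{code}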



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27649) Unify the way you use 'spark.network.timeout'

2019-05-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27649:


Assignee: (was: Apache Spark)

> Unify the way you use 'spark.network.timeout'
> -
>
> Key: SPARK-27649
> URL: https://issues.apache.org/jira/browse/SPARK-27649
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Priority: Minor
>
> For historical reasons, Structured Streaming still uses the old configuration key
> {code:java}
> spark.network.timeout{code}
> in some places, even though 
> {code:java}
> org.apache.spark.internal.config.Network.NETWORK_TIMEOUT{code}
> is now available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27649) Unify the way you use 'spark.network.timeout'

2019-05-07 Thread jiaan.geng (JIRA)
jiaan.geng created SPARK-27649:
--

 Summary: Unify the way you use 'spark.network.timeout'
 Key: SPARK-27649
 URL: https://issues.apache.org/jira/browse/SPARK-27649
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.0, 2.3.0
Reporter: jiaan.geng


For historical reasons, Structured Streaming still uses the old configuration key
{code:java}
spark.network.timeout{code}
in some places, even though 
{code:java}
org.apache.spark.internal.config.Network.NETWORK_TIMEOUT{code}
is now available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27549) Commit Kafka Source offsets to facilitate external tooling

2019-05-07 Thread Tarush Grover (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834603#comment-16834603
 ] 

Tarush Grover commented on SPARK-27549:
---

Anyone working on this?

> Commit Kafka Source offsets to facilitate external tooling
> --
>
> Key: SPARK-27549
> URL: https://issues.apache.org/jira/browse/SPARK-27549
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Tools monitoring consumer lag could benefit from having the option of saving 
> the source offsets. Sources use the implementation of 
> org.apache.spark.sql.sources.v2.reader.streaming.
> SparkDataStream. KafkaMicroBatchStream currently [does not 
> commit|https://github.com/apache/spark/blob/5bf5d9d854db53541956dedb03e2de8eecf65b81/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchStream.scala#L170]
>  anything as expected so we could expand that.
> Other streaming engines like 
> [Flink|https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-offset-committing-behaviour-configuration]
>  allow you to enable `auto.commit` at the expense of not having checkpointing.
> The proposal here is to allow committing the source offsets when progress has 
> been made.
> I am also aware that another option would be to have a StreamingQueryListener 
> and intercept when batches are completed and then write the offsets anywhere 
> you need to but it would be great if Kafka integration with Structured 
> Streaming could do some of this work anyway.
> [~c...@koeninger.org]  [~marmbrus] what do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27638) date format yyyy-M-dd string comparison not handled properly

2019-05-07 Thread peng bo (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834322#comment-16834322
 ] 

peng bo edited comment on SPARK-27638 at 5/7/19 9:21 AM:
-

[~maxgekk] 

I'd love to propose a PR for this. However, I am in the middle of something; I 
will try to do it by the end of this week, if that's convenient for you as well.

Besides, what's your suggestion about corner cases like `date_col > 
'invalid_date_string'` mentioned by [~cloud_fan] ? Switch back to string 
comparison?

Thanks


was (Author: pengbo):
[~maxgekk] 

I'd love propose a PR for this. However i am in the middle of something, I will 
try to do it by the end of this week if that's convenient for you as well.

Besides, what's your suggestion about corner cases like `date_col > 
'invalid_date_string'` mentioned by [~cloud_fan] ? Switch back to string 
comparison?

Thanks

> date format yyyy-M-dd string comparison not handled properly 
> -
>
> Key: SPARK-27638
> URL: https://issues.apache.org/jira/browse/SPARK-27638
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: peng bo
>Priority: Major
>
> The example below works with both MySQL and Hive, but not with Spark.
> {code:java}
> mysql> select * from date_test where date_col >= '2000-1-1';
> +------------+
> | date_col   |
> +------------+
> | 2000-01-01 |
> +------------+
> {code}
> The reason is that Spark casts both sides to String type during date and 
> string comparison for partial date support. Please find more details in 
> https://issues.apache.org/jira/browse/SPARK-8420.
> Based on some tests, the behavior of Date and String comparison in Hive and 
> MySQL is:
> Hive: cast to Date; partial dates are not supported.
> MySQL: cast to Date; certain "partial date" forms are supported by defining 
> specific date string parse rules. Check out {{str_to_datetime}} in 
> https://github.com/mysql/mysql-server/blob/5.5/sql-common/my_time.c
> Here are 2 proposals:
> a. Follow the MySQL parse rules, but some partial date string comparison cases 
> won't be supported either. 
> b. Cast the String value to Date; if the cast succeeds, compare using 
> date.toString, otherwise fall back to the original string.
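To make the reported behavior concrete, a small sketch of the failing comparison and an explicit workaround; the date_test table comes from the example above, and the results assume the Spark 2.4 behavior described in this report:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, to_date}

val spark = SparkSession.builder().appName("date-compare").getOrCreate()
val df = spark.read.table("date_test")

// As described above, both sides are cast to String, so '2000-01-01' >= '2000-1-1'
// is evaluated lexicographically and the row is dropped.
df.filter(col("date_col") >= lit("2000-1-1")).show()

// Workaround: parse the literal explicitly so the comparison stays in the Date domain.
df.filter(col("date_col") >= to_date(lit("2000-1-1"), "yyyy-M-d")).show()
{code}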



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-05-07 Thread tommy duan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tommy duan updated SPARK-27648:
---
Description: 
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that the use of memory was reduced.
 But a problem arises, as the running time increases, the storage memory of 
executor is growing (see Executors - > Storage Memory from the Spark on Yarn 
Resource Manager UI).
 This program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time is not more than one day, Executor 
memory abnormalities will occur).
 The script submitted by the program under spark2.4 is as follows:
{code:java}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G 
{code}
Under Spark 2.4, I counted the size of executor memory as time went by during 
the running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|
|131.7H|559.1MB/1.5GB|4.245254366|
|135.4H|575MB/1.5GB|4.246676514|
|153.6H|641.2MB/1.5GB|4.174479167|
|219H|888.1MB/1.5GB|4.055251142|
|263H|1126.4MB/1.5GB|4.282889734|
|309H|1228.8MB/1.5GB|3.976699029|

  was:
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that the use of memory was reduced.
 But a problem arises, as the running time increases, the storage memory of 
executor is growing (see Executors - > Storage Memory from the Spark on Yarn 
Resource Manager UI).
 This program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time is not more than one day, Executor 
memory abnormalities will occur).
 The script submitted by the program under spark2.4 is as follows:
{code:java}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G 
{code}
On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|

[jira] [Created] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-05-07 Thread tommy duan (JIRA)
tommy duan created SPARK-27648:
--

 Summary: In Spark2.4 Structured Streaming:The executor storage 
memory increasing over time
 Key: SPARK-27648
 URL: https://issues.apache.org/jira/browse/SPARK-27648
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: tommy duan


*Spark Program Code Business:*
Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
*1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before
My spark-submit script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that memory usage was reduced.
But a problem appeared: as the running time increases, the storage memory of 
the executor keeps growing (see "Executors -> Storage Memory" in the Spark UI).
The program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time was not more than one day before 
Executor memory abnormalities occurred).

The script submitted by the program under spark2.4 is as follows:
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|
|131.7H|559.1MB/1.5GB|4.245254366|
|135.4H|575MB/1.5GB|4.246676514|
|153.6H|641.2MB/1.5GB|4.174479167|
|219H|888.1MB/1.5GB|4.055251142|
|263H|1126.4MB/1.5GB|4.282889734|
|309H|1228.8MB/1.5GB|3.976699029|
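For clarity, the "Memory growth rate" column above is simply the reported storage memory divided by the elapsed hours; a quick check against the first and last rows (a sketch using only the numbers from the table):
{code:java}
// (hours elapsed, reported storage memory in MB) taken from the table above
val samples = Seq((23.5, 41.6), (309.0, 1228.8))
samples.foreach { case (hours, mb) =>
  println(f"$mb%.1f MB / $hours%.1f h = ${mb / hours}%.3f MB/hour")
}
// prints about 1.770 MB/hour for the first row and 3.977 MB/hour for the last,
// matching the table's 1.770212766 and 3.976699029
{code}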



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-05-07 Thread tommy duan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tommy duan updated SPARK-27648:
---
Description: 
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that the use of memory was reduced.
But a problem arises, as the running time increases, the storage memory of 
executor is growing (see Executors - > Storage Memory from the Spark on Yarn 
Resource Manager UI).
This program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time is not more than one day, Executor 
memory abnormalities will occur).
The script submitted by the program under spark2.4 is as follows:
{code:java}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G 
{code}
On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|
|131.7H|559.1MB/1.5GB|4.245254366|
|135.4H|575MB/1.5GB|4.246676514|
|153.6H|641.2MB/1.5GB|4.174479167|
|219H|888.1MB/1.5GB|4.055251142|
|263H|1126.4MB/1.5GB|4.282889734|
|309H|1228.8MB/1.5GB|3.976699029|

  was:
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that memory usage was reduced.
 But a problem appeared: as the running time increases, the storage memory of 
the executor keeps growing (see "Executors -> Storage Memory" in the Spark UI).
 The program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time was not more than one day before 
Executor memory abnormalities occurred).

The script submitted by the program under spark2.4 is as follows:
{code:java}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G 
{code}
On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|

[jira] [Updated] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-05-07 Thread tommy duan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tommy duan updated SPARK-27648:
---
Description: 
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that memory usage was reduced.
 But a problem appeared: as the running time increases, the storage memory of 
the executor keeps growing (see "Executors -> Storage Memory" in the Spark UI).
 The program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time was not more than one day before 
Executor memory abnormalities occurred).

The script submitted by the program under spark2.4 is as follows:
{code:java}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G 
{code}
On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|
|131.7H|559.1MB/1.5GB|4.245254366|
|135.4H|575MB/1.5GB|4.246676514|
|153.6H|641.2MB/1.5GB|4.174479167|
|219H|888.1MB/1.5GB|4.055251142|
|263H|1126.4MB/1.5GB|4.282889734|
|309H|1228.8MB/1.5GB|3.976699029|

  was:
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that memory usage was reduced.
 But a problem appeared: as the running time increases, the storage memory of 
the executor keeps growing (see "Executors -> Storage Memory" in the Spark UI).
 The program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time was not more than one day before 
Executor memory abnormalities occurred).

The script submitted by the program under spark2.4 is as follows:

 
{code:java}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G 
{code}
On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|

[jira] [Updated] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-05-07 Thread tommy duan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tommy duan updated SPARK-27648:
---
Description: 
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that the use of memory was reduced.
 But a problem arises, as the running time increases, the storage memory of 
executor is growing (see Executors - > Storage Memory from the Spark on Yarn 
Resource Manager UI).
 This program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time is not more than one day, Executor 
memory abnormalities will occur).
 The script submitted by the program under spark2.4 is as follows:
{code:java}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G 
{code}
On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|
|131.7H|559.1MB/1.5GB|4.245254366|
|135.4H|575MB/1.5GB|4.246676514|
|153.6H|641.2MB/1.5GB|4.174479167|
|219H|888.1MB/1.5GB|4.055251142|
|263H|1126.4MB/1.5GB|4.282889734|
|309H|1228.8MB/1.5GB|3.976699029|

  was:
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that the use of memory was reduced.
 But a problem arises, as the running time increases, the storage memory of 
executor is growing (see Executors - > Storage Memory from the Spark on Yarn 
Resource Manager UI).
 This program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time is not more than one day, Executor 
memory abnormalities will occur).
 The script submitted by the program under spark2.4 is as follows:
{code:java}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G 
{code}
On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|

[jira] [Updated] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-05-07 Thread tommy duan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tommy duan updated SPARK-27648:
---
Description: 
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that memory usage was reduced.
 But a problem appeared: as the running time increases, the storage memory of 
the executor keeps growing (see "Executors -> Storage Memory" in the Spark UI).
 The program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time was not more than one day before 
Executor memory abnormalities occurred).

The script submitted by the program under spark2.4 is as follows:
{code:java}
 {code}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G \{code}
 On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|
|131.7H|559.1MB/1.5GB|4.245254366|
|135.4H|575MB/1.5GB|4.246676514|
|153.6H|641.2MB/1.5GB|4.174479167|
|219H|888.1MB/1.5GB|4.055251142|
|263H|1126.4MB/1.5GB|4.282889734|
|309H|1228.8MB/1.5GB|3.976699029|

  was:
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day. The solution is 
to set the executor memory larger than before. My spark-submit script is as 
follows: 

 

In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that memory usage was reduced.
 But a problem appeared: as the running time increases, the storage memory of 
the executor keeps growing (see "Executors -> Storage Memory" in the Spark UI).
 The program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time was not more than one day before 
Executor memory abnormalities occurred).

The script submitted by the program under spark2.4 is as follows:
{code:java}
 {code}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G \{code}
 On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|
|131.7H|559.1MB/1.5GB|4.245254366|
|135.4H|575MB/1.5GB|4.246676514|
|153.6H|641.2MB/1.5GB|4.174479167|
|219H|888.1MB/1.5GB|4.055251142|

[jira] [Updated] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-05-07 Thread tommy duan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tommy duan updated SPARK-27648:
---
Description: 
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that memory usage was reduced.
 But a problem appeared: as the running time increases, the storage memory of 
the executor keeps growing (see "Executors -> Storage Memory" in the Spark UI).
 The program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time was not more than one day before 
Executor memory abnormalities occurred).

The script submitted by the program under spark2.4 is as follows:

 
{code:java}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G 
{code}
On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|
|131.7H|559.1MB/1.5GB|4.245254366|
|135.4H|575MB/1.5GB|4.246676514|
|153.6H|641.2MB/1.5GB|4.174479167|
|219H|888.1MB/1.5GB|4.055251142|
|263H|1126.4MB/1.5GB|4.282889734|
|309H|1228.8MB/1.5GB|3.976699029|

  was:
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that memory usage was reduced.
 But a problem appeared: as the running time increases, the storage memory of 
the executor keeps growing (see "Executors -> Storage Memory" in the Spark UI).
 The program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time was not more than one day before 
Executor memory abnormalities occurred).

The script submitted by the program under spark2.4 is as follows:
{code:java}
 {code}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G \{code}
 On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|

[jira] [Updated] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-05-07 Thread tommy duan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tommy duan updated SPARK-27648:
---
Description: 
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that the use of memory was reduced.
 But a problem arises, as the running time increases, the storage memory of 
executor is growing (see Executors - > Storage Memory from the Spark on Yarn 
Resource Manager UI).
 This program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time is not more than one day, Executor 
memory abnormalities will occur).
 The script submitted by the program under spark2.4 is as follows:
{code:java}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G 
{code}
On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|
|131.7H|559.1MB/1.5GB|4.245254366|
|135.4H|575MB/1.5GB|4.246676514|
|153.6H|641.2MB/1.5GB|4.174479167|
|219H|888.1MB/1.5GB|4.055251142|
|263H|1126.4MB/1.5GB|4.282889734|
|309H|1228.8MB/1.5GB|3.976699029|

  was:
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before. My spark-submit 
script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that the use of memory was reduced.
But a problem arises, as the running time increases, the storage memory of 
executor is growing (see Executors - > Storage Memory from the Spark on Yarn 
Resource Manager UI).
This program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time is not more than one day, Executor 
memory abnormalities will occur).
The script submitted by the program under spark2.4 is as follows:
{code:java}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G 
{code}
On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|

[jira] [Updated] (SPARK-27648) In Spark2.4 Structured Streaming:The executor storage memory increasing over time

2019-05-07 Thread tommy duan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

tommy duan updated SPARK-27648:
---
Description: 
*Spark Program Code Business:*
 Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
 *1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day. The solution is 
to set the executor memory larger than before. My spark-submit script is as 
follows: 

 

In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
 So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that memory usage was reduced.
 But a problem appeared: as the running time increases, the storage memory of 
the executor keeps growing (see "Executors -> Storage Memory" in the Spark UI).
 The program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time was not more than one day before 
Executor memory abnormalities occurred).

The script submitted by the program under spark2.4 is as follows:
{code:java}
 {code}
/spark-submit \
 --conf “spark.yarn.executor.memoryOverhead=4096M”
 --num-executors 15 \
 --executor-memory 3G \
 --executor-cores 2 \
 --driver-memory 6G \{code}
 On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|
|131.7H|559.1MB/1.5GB|4.245254366|
|135.4H|575MB/1.5GB|4.246676514|
|153.6H|641.2MB/1.5GB|4.174479167|
|219H|888.1MB/1.5GB|4.055251142|
|263H|1126.4MB/1.5GB|4.282889734|
|309H|1228.8MB/1.5GB|3.976699029|

  was:
*Spark Program Code Business:*
Read the topic on kafka, aggregate the stream data sources, and then output it 
to another topic line of kafka.

*Problem Description:*
*1) Using spark structured streaming in CDH environment (spark 2.2)*, memory 
overflow problems often occur (because of too many versions of state stored in 
memory, this bug has been modified in spark 2.4).
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
Executor memory exceptions occur when running with this submit resource under 
SPARK 2.2 and the normal running time does not exceed one day.

The solution is to set the executor memory larger than before
My spark-submit script is as follows:
{code:java}
/spark-submit \
--conf "spark.yarn.executor.memoryOverhead=4096M" \
--num-executors 15 \
--executor-memory 46G \
--executor-cores 3 \
--driver-memory 6G \
...{code}
In this case, the spark program can be guaranteed to run stably for a long 
time, and the executor storage memory is less than 10M (it has been running 
stably for more than 20 days).

*2) From the upgrade information of Spark 2.4, we can see that the problem of 
large memory consumption of state storage has been solved in Spark 2.4.* 
So we upgraded spark to SPARK 2.4 under CDH, tried to run the spark program, 
and found that memory usage was reduced.
But a problem appeared: as the running time increases, the storage memory of 
the executor keeps growing (see "Executors -> Storage Memory" in the Spark UI).
The program has been running for 14 days (under SPARK 2.2, running with this 
submit resource, the normal running time was not more than one day before 
Executor memory abnormalities occurred).

The script submitted by the program under spark2.4 is as follows:
{code:java}
/spark-submit \
--conf “spark.yarn.executor.memoryOverhead=4096M”
--num-executors 15 \
--executor-memory 3G \
--executor-cores 2 \
--driver-memory 6G \{code}
On Spark 2.4, I counted the size of executor memory as time went by during the 
running of the spark program:
|Run-time(hour)|Storage Memory size(MB)|Memory growth rate(MB/hour)|
|23.5H|41.6MB/1.5GB|1.770212766|
|108.4H|460.2MB/1.5GB|4.245387454|
|131.7H|559.1MB/1.5GB|4.245254366|
|135.4H|575MB/1.5GB|4.246676514|
|153.6H|641.2MB/1.5GB|4.174479167|
|219H|888.1MB/1.5GB|4.055251142|
|263H|1126.4MB/1.5GB|4.282889734|