[jira] [Commented] (SPARK-22792) PySpark UDF registering issue

2017-12-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292163#comment-16292163
 ] 

Hyukjin Kwon commented on SPARK-22792:
--

Does this work in 2.1.x or a lower version? I just want to check whether this is a 
regression.

Also, could you put together a minimised reproducer? It is hard to follow as reported. 
Does the object you used work with plain pickle locally?
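For example, a minimal local check along these lines would tell us whether the 
function (and whatever it references) serialises at all. This is only a sketch; the 
function name and body below are placeholders, not the reporter's code.

{code:python}
import pickle

from pyspark import cloudpickle  # the serializer Spark uses for Python UDFs


def hypernym_fn_plain(tokens):          # placeholder for the real UDF body
    return [t.upper() for t in tokens]  # replace with the actual logic


if __name__ == "__main__":
    # cloudpickle serialises the function by value, together with the module
    # globals it references -- the reported traceback fails inside that step
    # (save_function_tuple -> save(f_globals)).
    try:
        cloudpickle.dumps(hypernym_fn_plain)
        print("cloudpickle: OK")
    except Exception as e:
        print("cloudpickle failed:", e)

    # Also try plain pickle on any object the function captures (a model,
    # a WordNet handle, etc.), since such a dependency is usually what
    # refuses to pickle. The dict below is just a stand-in.
    some_captured_object = {"example": "replace with the real dependency"}
    try:
        pickle.dumps(some_captured_object)
        print("pickle: OK")
    except Exception as e:
        print("pickle failed:", e)
{code}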

> PySpark UDF registering issue
> -
>
> Key: SPARK-22792
> URL: https://issues.apache.org/jira/browse/SPARK-22792
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Windows OS, Python pycharm ,Spark
>Reporter: Annamalai Venugopal
>  Labels: windows
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I am doing a project with pyspark i am struck with an issue
> Traceback (most recent call last):
>   File "C:/Users/avenugopal/PycharmProjects/POC_for_vectors/main.py", line 
> 187, in <module>
> hypernym_extracted_data = result.withColumn("hypernym_extracted_data", 
> hypernym_fn(F.column("token_extracted_data")))
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1957, in wrapper
> return udf_obj(*args)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1916, in __call__
> judf = self._judf
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1900, in _judf
> self._judf_placeholder = self._create_judf()
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1909, in _create_judf
> wrapped_func = _wrap_function(sc, self.func, self.returnType)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1866, in _wrap_function
> pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\rdd.py",
>  line 2374, in _prepare_for_python_RDD
> pickled_command = ser.dumps(command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\serializers.py",
>  line 460, in dumps
> return cloudpickle.dumps(obj, 2)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 704, in dumps
> cp.dump(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 148, in dump
> return Pickler.dump(self, obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 409, in dump
> self.save(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 736, in save_tuple
> save(element)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 821, in save_dict
> self._batch_setitems(obj.items())
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 852, in _batch_setitems
> save(v)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in 

[jira] [Commented] (SPARK-22792) PySpark UDF registering issue

2017-12-14 Thread Annamalai Venugopal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292162#comment-16292162
 ] 

Annamalai Venugopal commented on SPARK-22792:
-

Sorry, I am new to this. I'll change it now.

> PySpark UDF registering issue
> -
>
> Key: SPARK-22792
> URL: https://issues.apache.org/jira/browse/SPARK-22792
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Windows OS, Python pycharm ,Spark
>Reporter: Annamalai Venugopal
>  Labels: windows
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I am doing a project with pyspark i am struck with an issue
> Traceback (most recent call last):
>   File "C:/Users/avenugopal/PycharmProjects/POC_for_vectors/main.py", line 
> 187, in <module>
> hypernym_extracted_data = result.withColumn("hypernym_extracted_data", 
> hypernym_fn(F.column("token_extracted_data")))
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1957, in wrapper
> return udf_obj(*args)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1916, in __call__
> judf = self._judf
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1900, in _judf
> self._judf_placeholder = self._create_judf()
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1909, in _create_judf
> wrapped_func = _wrap_function(sc, self.func, self.returnType)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1866, in _wrap_function
> pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\rdd.py",
>  line 2374, in _prepare_for_python_RDD
> pickled_command = ser.dumps(command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\serializers.py",
>  line 460, in dumps
> return cloudpickle.dumps(obj, 2)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 704, in dumps
> cp.dump(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 148, in dump
> return Pickler.dump(self, obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 409, in dump
> self.save(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 736, in save_tuple
> save(element)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 821, in save_dict
> self._batch_setitems(obj.items())
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 852, in _batch_setitems
> save(v)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 821, in save_dict
> 

[jira] [Updated] (SPARK-22792) PySpark UDF registering issue

2017-12-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22792:
-
Fix Version/s: (was: 2.2.1)

> PySpark UDF registering issue
> -
>
> Key: SPARK-22792
> URL: https://issues.apache.org/jira/browse/SPARK-22792
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Windows OS, Python pycharm ,Spark
>Reporter: Annamalai Venugopal
>  Labels: windows
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I am doing a project with pyspark i am struck with an issue
> Traceback (most recent call last):
>   File "C:/Users/avenugopal/PycharmProjects/POC_for_vectors/main.py", line 
> 187, in <module>
> hypernym_extracted_data = result.withColumn("hypernym_extracted_data", 
> hypernym_fn(F.column("token_extracted_data")))
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1957, in wrapper
> return udf_obj(*args)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1916, in __call__
> judf = self._judf
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1900, in _judf
> self._judf_placeholder = self._create_judf()
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1909, in _create_judf
> wrapped_func = _wrap_function(sc, self.func, self.returnType)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1866, in _wrap_function
> pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\rdd.py",
>  line 2374, in _prepare_for_python_RDD
> pickled_command = ser.dumps(command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\serializers.py",
>  line 460, in dumps
> return cloudpickle.dumps(obj, 2)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 704, in dumps
> cp.dump(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 148, in dump
> return Pickler.dump(self, obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 409, in dump
> self.save(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 736, in save_tuple
> save(element)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 821, in save_dict
> self._batch_setitems(obj.items())
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 852, in _batch_setitems
> save(v)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 821, in save_dict
> self._batch_setitems(obj.items())
>   File 
> 

[jira] [Updated] (SPARK-22792) PySpark UDF registering issue

2017-12-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22792:
-
Target Version/s:   (was: 2.2.1)

> PySpark UDF registering issue
> -
>
> Key: SPARK-22792
> URL: https://issues.apache.org/jira/browse/SPARK-22792
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Windows OS, Python pycharm ,Spark
>Reporter: Annamalai Venugopal
>  Labels: windows
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I am doing a project with pyspark i am struck with an issue
> Traceback (most recent call last):
>   File "C:/Users/avenugopal/PycharmProjects/POC_for_vectors/main.py", line 
> 187, in <module>
> hypernym_extracted_data = result.withColumn("hypernym_extracted_data", 
> hypernym_fn(F.column("token_extracted_data")))
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1957, in wrapper
> return udf_obj(*args)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1916, in __call__
> judf = self._judf
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1900, in _judf
> self._judf_placeholder = self._create_judf()
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1909, in _create_judf
> wrapped_func = _wrap_function(sc, self.func, self.returnType)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1866, in _wrap_function
> pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\rdd.py",
>  line 2374, in _prepare_for_python_RDD
> pickled_command = ser.dumps(command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\serializers.py",
>  line 460, in dumps
> return cloudpickle.dumps(obj, 2)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 704, in dumps
> cp.dump(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 148, in dump
> return Pickler.dump(self, obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 409, in dump
> self.save(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 736, in save_tuple
> save(element)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 821, in save_dict
> self._batch_setitems(obj.items())
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 852, in _batch_setitems
> save(v)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 821, in save_dict
> self._batch_setitems(obj.items())
>   File 
> 

[jira] [Updated] (SPARK-22792) PySpark UDF registering issue

2017-12-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-22792:
-
Priority: Major  (was: Blocker)

> PySpark UDF registering issue
> -
>
> Key: SPARK-22792
> URL: https://issues.apache.org/jira/browse/SPARK-22792
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Windows OS, Python pycharm ,Spark
>Reporter: Annamalai Venugopal
>  Labels: windows
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I am doing a project with pyspark i am struck with an issue
> Traceback (most recent call last):
>   File "C:/Users/avenugopal/PycharmProjects/POC_for_vectors/main.py", line 
> 187, in <module>
> hypernym_extracted_data = result.withColumn("hypernym_extracted_data", 
> hypernym_fn(F.column("token_extracted_data")))
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1957, in wrapper
> return udf_obj(*args)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1916, in __call__
> judf = self._judf
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1900, in _judf
> self._judf_placeholder = self._create_judf()
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1909, in _create_judf
> wrapped_func = _wrap_function(sc, self.func, self.returnType)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1866, in _wrap_function
> pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\rdd.py",
>  line 2374, in _prepare_for_python_RDD
> pickled_command = ser.dumps(command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\serializers.py",
>  line 460, in dumps
> return cloudpickle.dumps(obj, 2)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 704, in dumps
> cp.dump(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 148, in dump
> return Pickler.dump(self, obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 409, in dump
> self.save(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 736, in save_tuple
> save(element)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 821, in save_dict
> self._batch_setitems(obj.items())
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 852, in _batch_setitems
> save(v)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 821, in save_dict
> self._batch_setitems(obj.items())
>   File 
> 

[jira] [Commented] (SPARK-22792) PySpark UDF registering issue

2017-12-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292158#comment-16292158
 ] 

Hyukjin Kwon commented on SPARK-22792:
--

Please don't set Blocker priority or a target version; those are usually reserved for 
committers. Likewise, please leave the fix version unset, since we set it only when the 
issue is actually fixed.

> PySpark UDF registering issue
> -
>
> Key: SPARK-22792
> URL: https://issues.apache.org/jira/browse/SPARK-22792
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.2.1
> Environment: Windows OS, Python pycharm ,Spark
>Reporter: Annamalai Venugopal
>Priority: Blocker
>  Labels: windows
> Fix For: 2.2.1
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> I am doing a project with pyspark i am struck with an issue
> Traceback (most recent call last):
>   File "C:/Users/avenugopal/PycharmProjects/POC_for_vectors/main.py", line 
> 187, in <module>
> hypernym_extracted_data = result.withColumn("hypernym_extracted_data", 
> hypernym_fn(F.column("token_extracted_data")))
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1957, in wrapper
> return udf_obj(*args)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1916, in __call__
> judf = self._judf
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1900, in _judf
> self._judf_placeholder = self._create_judf()
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1909, in _create_judf
> wrapped_func = _wrap_function(sc, self.func, self.returnType)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
>  line 1866, in _wrap_function
> pickled_command, broadcast_vars, env, includes = 
> _prepare_for_python_RDD(sc, command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\rdd.py",
>  line 2374, in _prepare_for_python_RDD
> pickled_command = ser.dumps(command)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\serializers.py",
>  line 460, in dumps
> return cloudpickle.dumps(obj, 2)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 704, in dumps
> cp.dump(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 148, in dump
> return Pickler.dump(self, obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 409, in dump
> self.save(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 736, in save_tuple
> save(element)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 821, in save_dict
> self._batch_setitems(obj.items())
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 852, in _batch_setitems
> save(v)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 249, in save_function
> self.save_function_tuple(obj)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
>  line 297, in save_function_tuple
> save(f_globals)
>   File 
> "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
> line 476, in save
> f(self, 

[jira] [Updated] (SPARK-22794) Spark Job failed, but the state is succeeded in Yarn Web

2017-12-14 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-22794:
--
Attachment: task_is_succeeded_in_yarn_web.png

> Spark Job failed, but the state is succeeded in Yarn Web
> 
>
> Key: SPARK-22794
> URL: https://issues.apache.org/jira/browse/SPARK-22794
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
> Attachments: task_is_succeeded_in_yarn_web.png
>
>
> I run a job in yarn mode, the job is failed:
> {noformat}
> 17/12/15 11:55:16 INFO SharedState: Warehouse path is '/apps/hive/warehouse'.
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does 
> not exist: hdfs://node1.huleilei.h3c.com:8020/user/hdfs/sss;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {noformat}
> but in the yarn web, the job state is succeeded.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22794) Spark Job failed, but the state is succeeded in Yarn Web

2017-12-14 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-22794:
--
Description: 
I run a job in YARN mode and the job fails:

{noformat}
17/12/15 11:55:16 INFO SharedState: Warehouse path is '/apps/hive/warehouse'.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does 
not exist: hdfs://node1.huleilei.h3c.com:8020/user/hdfs/sss;
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
{noformat}


but in the YARN web UI, the job state is shown as succeeded.

  was:
I run a job in yarn mode, the job is failed:

{noformat}
17/12/15 11:55:16 INFO SharedState: Warehouse path is '/apps/hive/warehouse'.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does 
not exist: hdfs://node1.huleilei.h3c.com:8020/user/hdfs/sss;
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
{noformat}


but in the yarn web, the job state is succeeded:
!C:\Users\g15071\Downloads\task_is_succeeded_in_yarn_web.png!


> Spark Job failed, but the state is succeeded in Yarn Web
> 
>
> Key: SPARK-22794
> URL: https://issues.apache.org/jira/browse/SPARK-22794
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.2.1
>Reporter: KaiXinXIaoLei
>
> I run a job in yarn mode, the job is failed:
> {noformat}
> 17/12/15 11:55:16 INFO SharedState: Warehouse path is '/apps/hive/warehouse'.
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does 
> not exist: hdfs://node1.huleilei.h3c.com:8020/user/hdfs/sss;
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {noformat}
> but in the yarn web, the job state is succeeded.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22794) Spark Job failed, but the state is succeeded in Yarn Web

2017-12-14 Thread KaiXinXIaoLei (JIRA)
KaiXinXIaoLei created SPARK-22794:
-

 Summary: Spark Job failed, but the state is succeeded in Yarn Web
 Key: SPARK-22794
 URL: https://issues.apache.org/jira/browse/SPARK-22794
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.2.1
Reporter: KaiXinXIaoLei


I run a job in YARN mode and the job fails:

{noformat}
17/12/15 11:55:16 INFO SharedState: Warehouse path is '/apps/hive/warehouse'.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does 
not exist: hdfs://node1.huleilei.h3c.com:8020/user/hdfs/sss;
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
{noformat}


but in the YARN web UI, the job state is shown as succeeded:
!C:\Users\g15071\Downloads\task_is_succeeded_in_yarn_web.png!
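A minimal sketch of the reported scenario (an assumed reproduction, not the original 
job; the HDFS path is a placeholder) is a driver that fails on a nonexistent input 
path while the application is submitted to YARN:

{code:python}
# e.g. spark-submit --master yarn repro.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yarn-final-state-repro").getOrCreate()

# Reading a path that does not exist raises
# org.apache.spark.sql.AnalysisException: Path does not exist: ...
# in the driver, yet the YARN web UI reportedly still shows the
# application as succeeded.
df = spark.read.text("hdfs:///user/hdfs/sss")
df.show()
{code}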



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2017-12-14 Thread Mayank Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Agarwal updated SPARK-22371:
---
Attachment: ShuffleIssue.java
Helper.scala

> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
> Attachments: Helper.scala, ShuffleIssue.java, ShuffleIssue.java, 
> driver-thread-dump-spark2.1.txt
>
>
> Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler 
> thread is stopped because of *Attempted to access garbage collected 
> accumulator 5605982*.
> from our investigation it look like accumulator is cleaned by GC first and 
> same accumulator is used for merging the results from executor on task 
> completion event.
> As the error java.lang.IllegalAccessError is LinkageError which is treated as 
> FatalError so dag-scheduler loop is finished with below exception.
> ---ERROR stack trace --
> Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 5605982
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I am attaching the thread dump of driver as well 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22793) Memory leak in Spark Thrift Server

2017-12-14 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-22793:

Description: 
1. Start HiveThriftServer2.
2. Connect to the Thrift server through beeline.
3. Close beeline.
4. Repeat steps 2 and 3 several times; this leaks memory.

We found many directories that are never dropped under the paths
{code:java}
hive.exec.local.scratchdir
{code} and 
{code:java}
hive.exec.scratchdir
{code} . As we know, the scratchdir is added to deleteOnExit when it is created, so the 
FileSystem deleteOnExit cache keeps growing until the JVM terminates.

In addition, we used 
{code:java}
jmap -histo:live [PID]
{code} to print the object histogram of the HiveThriftServer2 process: the counts of 
"org.apache.spark.sql.hive.client.HiveClientImpl" and 
"org.apache.hadoop.hive.ql.session.SessionState" keep increasing even after all 
beeline connections are closed, which leaks memory.




  was:
1. Start HiveThriftServer2
2. Connect to thriftserver through beeline
3. Close the beeline
4. repeat step2 and step 3 for several times

we found there are many directories never be dropped under the path
{code:java}
hive.exec.local.scratchdir
{code} and 
{code:java}
hive.exec.scratchdir
{code} , as we know the scratchdir has been added to deleteOnExit when it be 
created. So it means that the cache size of FileSystem deleteOnExit will keep 
increasing until JVM terminated.

In addition, we use 
{code:java}
jmap -histo:live [PID]
{code} to printout the size of objects in HiveThriftServer2 Process, we can 
find the object "org.apache.spark.sql.hive.client.HiveClientImpl" and 
"org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though we 
closed all the beeline connections, which caused the leak of Memory.





> Memory leak in Spark Thrift Server
> --
>
> Key: SPARK-22793
> URL: https://issues.apache.org/jira/browse/SPARK-22793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: zuotingbing
>Priority: Critical
>
> 1. Start HiveThriftServer2.
> 2. Connect to thriftserver through beeline.
> 3. Close the beeline.
> 4. repeat step2 and step 3 for several times, which caused the leak of Memory.
> we found there are many directories never be dropped under the path
> {code:java}
> hive.exec.local.scratchdir
> {code} and 
> {code:java}
> hive.exec.scratchdir
> {code} , as we know the scratchdir has been added to deleteOnExit when it be 
> created. So it means that the cache size of FileSystem deleteOnExit will keep 
> increasing until JVM terminated.
> In addition, we use 
> {code:java}
> jmap -histo:live [PID]
> {code} to printout the size of objects in HiveThriftServer2 Process, we can 
> find the object "org.apache.spark.sql.hive.client.HiveClientImpl" and 
> "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though 
> we closed all the beeline connections, which caused the leak of Memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22793) Memory leak in Spark Thrift Server

2017-12-14 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-22793:

Description: 
1. Start HiveThriftServer2
2. Connect to thriftserver through beeline
3. Close the beeline
4. repeat step2 and step 3 for several times

we found there are many directories never be dropped under the path
{code:java}
hive.exec.local.scratchdir
{code} and 
{code:java}
hive.exec.scratchdir
{code} , as we know the scratchdir has been added to deleteOnExit when it be 
created. So it means that the cache size of FileSystem deleteOnExit will keep 
increasing until JVM terminated.

In addition, we use 
{code:java}
jmap -histo:live [PID]
{code} to printout the size of objects in HiveThriftServer2 Process, we can 
find the object "org.apache.spark.sql.hive.client.HiveClientImpl" and 
"org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though we 
closed all the beeline connections, which caused the leak of Memory.




  was:
1. Start HiveThriftServer2
2. Connect to thriftserver through beeline
3. Close the beeline
4. repeat step2 and step 3 for several times

we found there are many directories never be dropped under the path
{code:java}
hive.exec.local.scratchdir
{code} and 
{code:java}
hive.exec.scratchdir
{code} , as we know the scratchdir is added to deleteOnExit when it be created. 
So it means that the cache size of FileSystem deleteOnExit will keep increasing 
until JVM terminated.

In addition, we use 
{code:java}
jmap -histo:live [PID]
{code} to printout the size of objects in HiveThriftServer2 Process, we can 
find the object "org.apache.spark.sql.hive.client.HiveClientImpl" and 
"org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though we 
closed all the beeline connections, which caused the leak of Memory.





> Memory leak in Spark Thrift Server
> --
>
> Key: SPARK-22793
> URL: https://issues.apache.org/jira/browse/SPARK-22793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2
>Reporter: zuotingbing
>Priority: Critical
>
> 1. Start HiveThriftServer2
> 2. Connect to thriftserver through beeline
> 3. Close the beeline
> 4. repeat step2 and step 3 for several times
> we found there are many directories never be dropped under the path
> {code:java}
> hive.exec.local.scratchdir
> {code} and 
> {code:java}
> hive.exec.scratchdir
> {code} , as we know the scratchdir has been added to deleteOnExit when it be 
> created. So it means that the cache size of FileSystem deleteOnExit will keep 
> increasing until JVM terminated.
> In addition, we use 
> {code:java}
> jmap -histo:live [PID]
> {code} to printout the size of objects in HiveThriftServer2 Process, we can 
> find the object "org.apache.spark.sql.hive.client.HiveClientImpl" and 
> "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though 
> we closed all the beeline connections, which caused the leak of Memory.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22793) Memory leak in Spark Thrift Server

2017-12-14 Thread zuotingbing (JIRA)
zuotingbing created SPARK-22793:
---

 Summary: Memory leak in Spark Thrift Server
 Key: SPARK-22793
 URL: https://issues.apache.org/jira/browse/SPARK-22793
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.2
Reporter: zuotingbing
Priority: Critical


1. Start HiveThriftServer2.
2. Connect to the Thrift server through beeline.
3. Close beeline.
4. Repeat steps 2 and 3 several times.

We found many directories that are never dropped under the paths
{code:java}
hive.exec.local.scratchdir
{code} and 
{code:java}
hive.exec.scratchdir
{code} . As we know, the scratchdir is added to deleteOnExit when it is created, so the 
FileSystem deleteOnExit cache keeps growing until the JVM terminates.

In addition, we used 
{code:java}
jmap -histo:live [PID]
{code} to print the object histogram of the HiveThriftServer2 process: the counts of 
"org.apache.spark.sql.hive.client.HiveClientImpl" and 
"org.apache.hadoop.hive.ql.session.SessionState" keep increasing even after all 
beeline connections are closed, which leaks memory.
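A rough reproduction loop along these lines (an illustration only: it assumes the 
PyHive client in place of beeline, and the host, port, and user are placeholders) can 
be paired with the jmap check above:

{code:python}
import time

from pyhive import hive  # assumed client; the report itself used beeline

# Open and close many sessions against HiveThriftServer2, then compare
# `jmap -histo:live <thriftserver-pid>` histograms between runs: the
# HiveClientImpl and SessionState counts are reported to keep growing.
for i in range(200):
    conn = hive.Connection(host="localhost", port=10000, username="spark")
    cursor = conn.cursor()
    cursor.execute("SELECT 1")
    cursor.fetchall()
    cursor.close()
    conn.close()  # the session's scratch dirs stay registered in deleteOnExit
    time.sleep(0.1)
{code}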






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22792) PySpark UDF registering issue

2017-12-14 Thread Annamalai Venugopal (JIRA)
Annamalai Venugopal created SPARK-22792:
---

 Summary: PySpark UDF registering issue
 Key: SPARK-22792
 URL: https://issues.apache.org/jira/browse/SPARK-22792
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.2.1
 Environment: Windows OS, Python pycharm ,Spark
Reporter: Annamalai Venugopal
Priority: Blocker
 Fix For: 2.2.1


I am doing a project with PySpark and I am stuck with the following issue:

Traceback (most recent call last):
  File "C:/Users/avenugopal/PycharmProjects/POC_for_vectors/main.py", line 187, 
in <module>
hypernym_extracted_data = result.withColumn("hypernym_extracted_data", 
hypernym_fn(F.column("token_extracted_data")))
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
 line 1957, in wrapper
return udf_obj(*args)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
 line 1916, in __call__
judf = self._judf
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
 line 1900, in _judf
self._judf_placeholder = self._create_judf()
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
 line 1909, in _create_judf
wrapped_func = _wrap_function(sc, self.func, self.returnType)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py",
 line 1866, in _wrap_function
pickled_command, broadcast_vars, env, includes = 
_prepare_for_python_RDD(sc, command)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\rdd.py",
 line 2374, in _prepare_for_python_RDD
pickled_command = ser.dumps(command)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\serializers.py",
 line 460, in dumps
return cloudpickle.dumps(obj, 2)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
 line 704, in dumps
cp.dump(obj)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
 line 148, in dump
return Pickler.dump(self, obj)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
line 409, in dump
self.save(obj)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
line 736, in save_tuple
save(element)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
 line 249, in save_function
self.save_function_tuple(obj)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
 line 297, in save_function_tuple
save(f_globals)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
line 821, in save_dict
self._batch_setitems(obj.items())
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
line 852, in _batch_setitems
save(v)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
 line 249, in save_function
self.save_function_tuple(obj)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
 line 297, in save_function_tuple
save(f_globals)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
line 476, in save
f(self, obj) # Call unbound method with explicit self
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
line 821, in save_dict
self._batch_setitems(obj.items())
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
line 852, in _batch_setitems
save(v)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", 
line 521, in save
self.save_reduce(obj=obj, *rv)
  File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py",
 line 565, in save_reduce
"args[0] from 

[jira] [Comment Edited] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2017-12-14 Thread Mayank Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292134#comment-16292134
 ] 

Mayank Agarwal edited comment on SPARK-22371 at 12/15/17 7:14 AM:
--

Hi, sorry for the late reply.

From our analysis, this error seems to occur when jobs run in parallel on several 
datasets, those datasets are unioned, and a Full GC is triggered at the same time, 
which clears the accumulators of the union's child datasets.

I am attaching a small program that reproduces this error frequently.

[^ShuffleIssue.java]


was (Author: mayank.agarwal2305):
[^ShuffleIssue.java]

> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
> Attachments: ShuffleIssue.java, driver-thread-dump-spark2.1.txt
>
>
> Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler 
> thread is stopped because of *Attempted to access garbage collected 
> accumulator 5605982*.
> from our investigation it look like accumulator is cleaned by GC first and 
> same accumulator is used for merging the results from executor on task 
> completion event.
> As the error java.lang.IllegalAccessError is LinkageError which is treated as 
> FatalError so dag-scheduler loop is finished with below exception.
> ---ERROR stack trace --
> Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 5605982
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I am attaching the thread dump of driver as well 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2017-12-14 Thread Mayank Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292134#comment-16292134
 ] 

Mayank Agarwal edited comment on SPARK-22371 at 12/15/17 7:13 AM:
--

[^ShuffleIssue.java]


was (Author: mayank.agarwal2305):
Hi, Sorry for late reply.

From our analysis this error seems to come when there are jobs running 
parallel on datasets and union of all those datasets and Full GC triggered at 
same time which clears the accumulate of children dataset of union.

I am attaching a small program from which this error comes frequently.


> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
> Attachments: ShuffleIssue.java, driver-thread-dump-spark2.1.txt
>
>
> Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler 
> thread is stopped because of *Attempted to access garbage collected 
> accumulator 5605982*.
> from our investigation it look like accumulator is cleaned by GC first and 
> same accumulator is used for merging the results from executor on task 
> completion event.
> As the error java.lang.IllegalAccessError is LinkageError which is treated as 
> FatalError so dag-scheduler loop is finished with below exception.
> ---ERROR stack trace --
> Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 5605982
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I am attaching the thread dump of driver as well 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982

2017-12-14 Thread Mayank Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mayank Agarwal updated SPARK-22371:
---
Attachment: ShuffleIssue.java

Hi, sorry for the late reply.

From our analysis, this error seems to occur when jobs run in parallel on several 
datasets and on a union of all those datasets, and a Full GC is triggered at the 
same time, which clears the accumulators of the union's child datasets.

I am attaching a small program that reproduces this error frequently.
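
For reference, a minimal Scala sketch of the workload shape described above 
(parallel jobs over several datasets plus their union, with a Full GC in flight). 
This is only an illustration, not the attached ShuffleIssue.java, and it is not 
guaranteed to reproduce the error on every run:

{code:scala}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object UnionGcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[4]").appName("union-gc-sketch").getOrCreate()

    // several child datasets plus a union of all of them
    val children = (1 to 8).map(i => spark.range(0, 1000000L).selectExpr(s"id * $i as v"))
    val unioned  = children.reduce(_ union _)

    // run jobs on the children and on the union in parallel
    val jobs = children.map(df => Future(df.groupBy(col("v") % 100).count().count())) :+
      Future(unioned.groupBy(col("v") % 100).count().count())

    System.gc() // encourage a Full GC while task-completion events are still arriving
    jobs.foreach(f => Await.result(f, 10.minutes))
    spark.stop()
  }
}
{code}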


> dag-scheduler-event-loop thread stopped with error  Attempted to access 
> garbage collected accumulator 5605982
> -
>
> Key: SPARK-22371
> URL: https://issues.apache.org/jira/browse/SPARK-22371
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Mayank Agarwal
> Attachments: ShuffleIssue.java, driver-thread-dump-spark2.1.txt
>
>
> Our Spark jobs are getting stuck in DAGScheduler.runJob because the 
> dag-scheduler thread has stopped with *Attempted to access garbage collected 
> accumulator 5605982*.
> From our investigation it looks like the accumulator is cleaned up by GC first, 
> and then the same accumulator is used for merging results from executors on a 
> task-completion event.
> Since java.lang.IllegalAccessError is a LinkageError, which is treated as a 
> fatal error, the dag-scheduler event loop terminates with the exception below.
> ---ERROR stack trace --
> Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: 
> Attempted to access garbage collected accumulator 5605982
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253)
>   at 
> org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> I am attaching the thread dump of driver as well 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22753) Get rid of dataSource.writeAndRead

2017-12-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22753.
-
   Resolution: Fixed
 Assignee: Li Yuanjian
Fix Version/s: 2.3.0

> Get rid of dataSource.writeAndRead
> --
>
> Key: SPARK-22753
> URL: https://issues.apache.org/jira/browse/SPARK-22753
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Li Yuanjian
>Assignee: Li Yuanjian
>Priority: Minor
> Fix For: 2.3.0
>
>
> Code clean work for getting rid of dataSource.writeAndRead. 
> As the discussion in https://github.com/apache/spark/pull/16481 and 
> https://github.com/apache/spark/pull/18975#discussion_r155454606
> Currently the BaseRelation returned by `dataSource.writeAndRead` only used in 
> `CreateDataSourceTableAsSelect`, planForWriting and writeAndRead has some 
> common code paths. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22791) Redact Output of Explain

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22791:


Assignee: Xiao Li  (was: Apache Spark)

> Redact Output of Explain
> 
>
> Key: SPARK-22791
> URL: https://issues.apache.org/jira/browse/SPARK-22791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> When calling explain on a query, the output can contain sensitive 
> information. We should provide a way for admins/users to redact such information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22791) Redact Output of Explain

2017-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292131#comment-16292131
 ] 

Apache Spark commented on SPARK-22791:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19985

> Redact Output of Explain
> 
>
> Key: SPARK-22791
> URL: https://issues.apache.org/jira/browse/SPARK-22791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> When calling explain on a query, the output can contain sensitive 
> information. We should provide a way for admins/users to redact such information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22791) Redact Output of Explain

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22791:


Assignee: Apache Spark  (was: Xiao Li)

> Redact Output of Explain
> 
>
> Key: SPARK-22791
> URL: https://issues.apache.org/jira/browse/SPARK-22791
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> When calling explain on a query, the output can contain sensitive 
> information. We should provide a way for admins/users to redact such information.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22791) Redact Output of Explain

2017-12-14 Thread Xiao Li (JIRA)
Xiao Li created SPARK-22791:
---

 Summary: Redact Output of Explain
 Key: SPARK-22791
 URL: https://issues.apache.org/jira/browse/SPARK-22791
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: Xiao Li
Assignee: Xiao Li


When calling explain on a query, the output can contain sensitive information. 
We should provide a way for admins/users to redact such information.
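
As a hypothetical illustration (connection details made up), a plan printed by 
explain can surface data source options verbatim, which is the kind of output a 
redaction knob would need to scrub:

{code:scala}
// made-up JDBC connection details, for illustration only
val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql://db.internal:5432/sales")
  .option("dbtable", "orders")
  .option("user", "admin")
  .option("password", "s3cr3t")   // ends up in the relation's options
  .load()

// the printed physical plan includes the relation options, password included
df.explain(true)
{code}

A regex-based configuration, similar in spirit to the existing 
spark.redaction.regex used for redacting Spark config values, is one plausible 
shape for such a knob.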




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22787) Add a TPCH query suite

2017-12-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22787.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add a TPCH query suite
> --
>
> Key: SPARK-22787
> URL: https://issues.apache.org/jira/browse/SPARK-22787
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> Add a test suite to ensure all the TPC-H queries can be successfully 
> analyzed, optimized and compiled without hitting the max iteration threshold.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22647) Docker files for image creation

2017-12-14 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292113#comment-16292113
 ] 

Anirudh Ramanathan commented on SPARK-22647:


I think we haven't run that particular configuration in production on our fork 
so far. Although I don't anticipate issues due to the switch, I'd be more 
inclined to go with what we've tested so far, and maybe switch to CentOS on our 
fork first.

> Docker files for image creation
> ---
>
> Key: SPARK-22647
> URL: https://issues.apache.org/jira/browse/SPARK-22647
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>
> This covers the dockerfiles that need to be shipped to enable the Kubernetes 
> backend for Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22765) Create a new executor allocation scheme based on that of MR

2017-12-14 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292048#comment-16292048
 ] 

Xuefu Zhang commented on SPARK-22765:
-

[~tgraves], I think it would help if SPARK-21656 made a close-to-zero idle 
time workable. This is one source of inefficiency. Our version is too old to 
backport the fix, but we will try it out when we upgrade.

The second source of inefficiency comes from the fact that Spark favors bigger 
containers. A 4-core container might be running one task while wasting the 
other cores/memory, and the executor cannot die as long as there is one task 
running. One might argue that a user could configure 1-core containers under 
dynamic allocation, but that is probably not optimal in other respects.

The third reason one might favor MR-styled scheduling is its simplicity and 
efficiency. We frequently found that, for heavy workloads, the scheduler cannot 
really keep up with tasks starting and finishing, especially when the tasks 
finish fast.

For cost-conscious users, cluster-level resource efficiency is probably what is 
looked at. My suspicion is that an enhanced MR-styled scheduler, simple and 
performant, would improve resource efficiency significantly over a typical use 
of dynamic allocation, without sacrificing much performance.

As a starting point, we will first benchmark with SPARK-21656 when possible.

> Create a new executor allocation scheme based on that of MR
> ---
>
> Key: SPARK-22765
> URL: https://issues.apache.org/jira/browse/SPARK-22765
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 1.6.0
>Reporter: Xuefu Zhang
>
> Many users migrating their workloads from MR to Spark see a significant 
> resource consumption hike (e.g., SPARK-22683). While this might not be a 
> concern for users that are more performance-centric, for others who are 
> conscious about cost such a hike creates a migration obstacle. This situation 
> can get worse as more users move to the cloud.
> Dynamic allocation makes it possible to deploy Spark in multi-tenant 
> environments. With its performance-centric design, its inefficiency has also 
> unfortunately shown up, especially when compared with MR. Thus, it is believed 
> that an MR-styled scheduler still has its merits. Based on our research, the 
> inefficiency associated with dynamic allocation comes from many aspects, such 
> as executors idling out, bigger executors, and many stages (rather than only 
> 2 stages in MR) in a Spark job.
> Rather than fine-tuning dynamic allocation for efficiency, the proposal here 
> is to add a new, efficiency-centric scheduling scheme based on that of MR. 
> Such an MR-based scheme can be further enhanced and better adapted to the 
> Spark execution model. This alternative is expected to offer a good 
> performance improvement over MR while achieving similar or even better 
> efficiency than MR.
> Inputs are greatly welcome!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22790) add a configurable factor to describe HadoopFsRelation's size

2017-12-14 Thread Nan Zhu (JIRA)
Nan Zhu created SPARK-22790:
---

 Summary: add a configurable factor to describe HadoopFsRelation's 
size
 Key: SPARK-22790
 URL: https://issues.apache.org/jira/browse/SPARK-22790
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Nan Zhu


As per the discussion in 
https://github.com/apache/spark/pull/19864#discussion_r156847927

The current HadoopFsRelation size estimate is based purely on the underlying 
file size, which is not accurate and makes execution vulnerable to errors like 
OOM.

Users can enable CBO with the functionality in 
https://github.com/apache/spark/pull/19864 to avoid this issue.

This JIRA proposes adding a configurable factor to the sizeInBytes method of 
the HadoopFsRelation class so that users can mitigate this problem without CBO.
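
A minimal sketch of the idea; the real change would apply such a factor inside 
HadoopFsRelation.sizeInBytes, and any config name for it is still to be decided:

{code:scala}
// fileBytes: total size of the files backing the relation, as reported by the file index
// factor:    user-provided multiplier (e.g. to account for compression / columnar decoding)
def estimatedSizeInBytes(fileBytes: Long, factor: Double): Long =
  math.ceil(fileBytes * factor).toLong

// e.g. 1 GiB of compressed Parquet with an assumed 3x in-memory expansion
val estimate = estimatedSizeInBytes(1L << 30, 3.0)
{code}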



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22790) add a configurable factor to describe HadoopFsRelation's size

2017-12-14 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291985#comment-16291985
 ] 

Nan Zhu commented on SPARK-22790:
-

created per discussion in https://github.com/apache/spark/pull/19864 (will file 
a PR soon)

> add a configurable factor to describe HadoopFsRelation's size
> -
>
> Key: SPARK-22790
> URL: https://issues.apache.org/jira/browse/SPARK-22790
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Nan Zhu
>
> As per the discussion in 
> https://github.com/apache/spark/pull/19864#discussion_r156847927
> The current HadoopFsRelation size estimate is based purely on the underlying 
> file size, which is not accurate and makes execution vulnerable to errors 
> like OOM.
> Users can enable CBO with the functionality in 
> https://github.com/apache/spark/pull/19864 to avoid this issue.
> This JIRA proposes adding a configurable factor to the sizeInBytes method of 
> the HadoopFsRelation class so that users can mitigate this problem without CBO.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22781) Support creating streaming dataset with ORC files

2017-12-14 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291960#comment-16291960
 ] 

Dongjoon Hyun commented on SPARK-22781:
---

Hi, [~tdas] and [~zsxwing].
Could you give me some advice on this issue?

> Support creating streaming dataset with ORC files
> -
>
> Key: SPARK-22781
> URL: https://issues.apache.org/jira/browse/SPARK-22781
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>
> This issue adds support for creating a streaming dataset with the ORC file format.
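> A hedged usage sketch of the kind of API this targets; the exact reader method 
> may differ once implemented, and the path is made up:
> {code:scala}
> import org.apache.spark.sql.types._
>
> val schema = new StructType().add("id", LongType).add("name", StringType)
> val orcStream = spark.readStream
>   .schema(schema)            // streaming file sources need a user-provided schema
>   .format("orc")
>   .load("/path/to/orc-input")
> {code}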



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22789) Add ContinuousExecution for continuous processing queries

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22789:


Assignee: (was: Apache Spark)

> Add ContinuousExecution for continuous processing queries
> -
>
> Key: SPARK-22789
> URL: https://issues.apache.org/jira/browse/SPARK-22789
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22789) Add ContinuousExecution for continuous processing queries

2017-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291900#comment-16291900
 ] 

Apache Spark commented on SPARK-22789:
--

User 'joseph-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/19984

> Add ContinuousExecution for continuous processing queries
> -
>
> Key: SPARK-22789
> URL: https://issues.apache.org/jira/browse/SPARK-22789
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22789) Add ContinuousExecution for continuous processing queries

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22789:


Assignee: Apache Spark

> Add ContinuousExecution for continuous processing queries
> -
>
> Key: SPARK-22789
> URL: https://issues.apache.org/jira/browse/SPARK-22789
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22047) HiveExternalCatalogVersionsSuite is Flaky on Jenkins

2017-12-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22047.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.3.0

> HiveExternalCatalogVersionsSuite is Flaky on Jenkins
> 
>
> Key: SPARK-22047
> URL: https://issues.apache.org/jira/browse/SPARK-22047
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Armin Braun
>Assignee: Wenchen Fan
>  Labels: flaky-test
> Fix For: 2.3.0
>
>
> HiveExternalCatalogVersionsSuite fails quite a bit lately e.g.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/3490/testReport/junit/org.apache.spark.sql.hive/HiveExternalCatalogVersionsSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> {code}
> Error Message
> org.scalatest.exceptions.TestFailedException: spark-submit returned with exit 
> code 1. Command line: './bin/spark-submit' '--name' 'prepare testing tables' 
> '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 
> 'spark.master.rest.enabled=false' '--conf' 
> 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0'
>  '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' 
> '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0'
>  
> '/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/test120059455549609580.py'
>   2017-09-17 04:26:11.641 - stderr> Error: Could not find or load main class 
> org.apache.spark.launcher.Main
> Stacktrace
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> spark-submit returned with exit code 1.
> Command line: './bin/spark-submit' '--name' 'prepare testing tables' 
> '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 
> 'spark.master.rest.enabled=false' '--conf' 
> 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0'
>  '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' 
> '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0'
>  
> '/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/test120059455549609580.py'
> 2017-09-17 04:26:11.641 - stderr> Error: Could not find or load main class 
> org.apache.spark.launcher.Main
>
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1089)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1560)
>   at 
> org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:81)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:38)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:120)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:105)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:105)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null

2017-12-14 Thread Anvesh R (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291805#comment-16291805
 ] 

Anvesh R commented on SPARK-22036:
--

+1 Issue reproduced on spark-2.2.0 : 

Data at s3 location - s3://bucket/spark-sql-jira/ :
-
100|9

drop table if exists test;
CREATE EXTERNAL TABLE `test` (
a decimal(38,10),
b decimal(38,10)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://bucket/spark-sql-jira/';

spark-sql> select a,(a*b*0.98765432100) from test;
100 9876444.4445679
Time taken: 11.033 seconds, Fetched 1 row(s)

spark-sql> select a,(a*b*0.987654321000) from test;
100 NULL
Time taken: 0.523 seconds, Fetched 1 row(s)

Changing a column's scale from decimal(38,10) to decimal(38,9) also helped, but 
we would lose precision.

> BigDecimal multiplication sometimes returns null
> 
>
> Key: SPARK-22036
> URL: https://issues.apache.org/jira/browse/SPARK-22036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Olivier Blanvillain
>
> The multiplication of two BigDecimal numbers sometimes returns null. Here is 
> a minimal reproduction:
> {code:java}
> object Main extends App {
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.SparkSession
>   import spark.implicits._
>   val conf = new 
> SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", 
> "false")
>   val spark = 
> SparkSession.builder().config(conf).appName("REPL").getOrCreate()
>   implicit val sqlContext = spark.sqlContext
>   case class X2(a: BigDecimal, b: BigDecimal)
>   val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), 
> BigDecimal(-1000.1
>   val result = ds.select(ds("a") * ds("b")).collect.head
>   println(result) // [null]
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22496) beeline display operation log

2017-12-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22496.
-
Resolution: Fixed

> beeline display operation log
> -
>
> Key: SPARK-22496
> URL: https://issues.apache.org/jira/browse/SPARK-22496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StephenZou
>Priority: Minor
>
> For now, when an end user runs queries in beeline or in Hue through STS, no 
> logs are displayed, and the end user has to wait until the job finishes or 
> fails. Progress information is needed to show end users how the job is running 
> if they are not familiar with the YARN RM or the standalone Spark master UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22789) Add ContinuousExecution for continuous processing queries

2017-12-14 Thread Jose Torres (JIRA)
Jose Torres created SPARK-22789:
---

 Summary: Add ContinuousExecution for continuous processing queries
 Key: SPARK-22789
 URL: https://issues.apache.org/jira/browse/SPARK-22789
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jose Torres






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22733) refactor StreamExecution for extensibility

2017-12-14 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-22733.
--
   Resolution: Fixed
 Assignee: Jose Torres
Fix Version/s: 2.3.0

> refactor StreamExecution for extensibility
> --
>
> Key: SPARK-22733
> URL: https://issues.apache.org/jira/browse/SPARK-22733
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Assignee: Jose Torres
> Fix For: 2.3.0
>
>
> StreamExecution currently mixes together meta-logic (tracking and signalling 
> progress, persistence, reporting) with the core behavior of generating 
> batches and running them. We want to reuse the former but not the latter in 
> continuous execution, so we need to split them up.
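> One way to picture the split, with illustrative names only (the actual refactor 
> may carve things differently):
> {code:scala}
> // shared meta-logic lives in an abstract base; the batch-producing loop is left abstract
> abstract class StreamExecutionBase {
>   def start(): Unit = {
>     // tracking and signalling progress, persistence, reporting ... then:
>     runActivatedStream()
>   }
>   protected def runActivatedStream(): Unit
> }
>
> // the micro-batch flavour supplies the core behaviour
> class MicroBatchExecutionSketch extends StreamExecutionBase {
>   protected def runActivatedStream(): Unit = {
>     // construct batches and run them
>   }
> }
> {code}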



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-22778:
--

Assignee: Yinan Li

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Assignee: Yinan Li
>Priority: Critical
> Fix For: 2.3.0
>
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully

2017-12-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22778.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19972
[https://github.com/apache/spark/pull/19972]

> Kubernetes scheduler at master failing to run applications successfully
> ---
>
> Key: SPARK-22778
> URL: https://issues.apache.org/jira/browse/SPARK-22778
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Anirudh Ramanathan
>Priority: Critical
> Fix For: 2.3.0
>
>
> Building images based on master and deploying Spark PI results in the 
> following error.
> 2017-12-13 19:57:19 INFO  SparkContext:54 - Successfully stopped SparkContext
> Exception in thread "main" org.apache.spark.SparkException: Could not parse 
> Master URL: 'k8s:https://xx.yy.zz.ww'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741)
>   at org.apache.spark.SparkContext.(SparkContext.scala:496)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Shutdown hook called
> 2017-12-13 19:57:19 INFO  ShutdownHookManager:54 - Deleting directory 
> /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd
> This is likely an artifact seen because of changes in master, or our 
> submission code in the reviews. We haven't seen this on our fork. Hopefully 
> once integration tests are ported against upstream/master, we will catch 
> these issues earlier. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3419) Scheduler shouldn't delay running a task when executors don't reside at any of its preferred locations

2017-12-14 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-3419.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

This has been fixed for a long time; it looks like it was fixed as part of 
SPARK-4939 (though that sounds unrelated).

> Scheduler shouldn't delay running a task when executors don't reside at any 
> of its preferred locations 
> ---
>
> Key: SPARK-3419
> URL: https://issues.apache.org/jira/browse/SPARK-3419
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Reporter: Sandy Ryza
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22788) HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support"

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22788:


Assignee: (was: Apache Spark)

> HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support"
> -
>
> Key: SPARK-22788
> URL: https://issues.apache.org/jira/browse/SPARK-22788
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Trivial
>
> Code: 
> {noformat}
> if (conf.getBoolean("hdfs.append.support", false) || 
> dfs.isInstanceOf[RawLocalFileSystem]) {
>   dfs.append(dfsPath)
> } else {
>   throw new IllegalStateException("File exists and there is no append 
> support!")
> }
> {noformat}
> This causes the exception to be thrown if you enable 
> {{writeAheadLog.closeFileAfterWrite}} with HDFS.
> The correct config, from {{DFSConfigKeys}}, is {{dfs.support.append}}, and 
> the default value is true.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22788) HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support"

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22788:


Assignee: Apache Spark

> HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support"
> -
>
> Key: SPARK-22788
> URL: https://issues.apache.org/jira/browse/SPARK-22788
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Trivial
>
> Code: 
> {noformat}
> if (conf.getBoolean("hdfs.append.support", false) || 
> dfs.isInstanceOf[RawLocalFileSystem]) {
>   dfs.append(dfsPath)
> } else {
>   throw new IllegalStateException("File exists and there is no append 
> support!")
> }
> {noformat}
> This causes the exception to be thrown if you enable 
> {{writeAheadLog.closeFileAfterWrite}} with HDFS.
> The correct config, from {{DFSConfigKeys}}, is {{dfs.support.append}}, and 
> the default value is true.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22788) HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support"

2017-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291602#comment-16291602
 ] 

Apache Spark commented on SPARK-22788:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19983

> HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support"
> -
>
> Key: SPARK-22788
> URL: https://issues.apache.org/jira/browse/SPARK-22788
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Trivial
>
> Code: 
> {noformat}
> if (conf.getBoolean("hdfs.append.support", false) || 
> dfs.isInstanceOf[RawLocalFileSystem]) {
>   dfs.append(dfsPath)
> } else {
>   throw new IllegalStateException("File exists and there is no append 
> support!")
> }
> {noformat}
> This causes the exception to be thrown if you enable 
> {{writeAheadLog.closeFileAfterWrite}} with HDFS.
> The correct config, from {{DFSConfigKeys}}, is {{dfs.support.append}}, and 
> the default value is true.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22788) HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support"

2017-12-14 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-22788:
--

 Summary: HdfsUtils.getOutputStream uses non-existent Hadoop conf 
"hdfs.append.support"
 Key: SPARK-22788
 URL: https://issues.apache.org/jira/browse/SPARK-22788
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 1.6.0
Reporter: Marcelo Vanzin
Priority: Trivial


Code: 

{noformat}
if (conf.getBoolean("hdfs.append.support", false) || 
dfs.isInstanceOf[RawLocalFileSystem]) {
  dfs.append(dfsPath)
} else {
  throw new IllegalStateException("File exists and there is no append 
support!")
}
{noformat}

This causes the exception to be thrown if you enable 
{{writeAheadLog.closeFileAfterWrite}} with HDFS.

The correct config, from {{DFSConfigKeys}}, is {{dfs.support.append}}, and the 
default value is true.
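
A minimal sketch of what the corrected check could look like, reusing the names 
from the snippet above; the eventual fix may instead drop the check entirely:

{code:scala}
// "dfs.support.append" is the key defined in DFSConfigKeys; it defaults to true
val appendSupported =
  conf.getBoolean("dfs.support.append", true) || dfs.isInstanceOf[RawLocalFileSystem]

if (appendSupported) {
  dfs.append(dfsPath)
} else {
  throw new IllegalStateException("File exists and there is no append support!")
}
{code}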



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-14 Thread Julien Cuquemelle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Cuquemelle updated SPARK-22683:
--
Description: 
While migrating a series of jobs from MR to Spark using dynamicAllocation, I've 
noticed almost a doubling (+114% exactly) of resource consumption of Spark 
w.r.t MR, for a wall clock time gain of 43%

About the context: 
- resource usage stands for vcore-hours allocation for the whole job, as seen 
by YARN
- I'm talking about a series of jobs because we provide our users with a way to 
define experiments (via UI / DSL) that automatically get translated to Spark / 
MR jobs and submitted on the cluster
- we submit around 500 of such jobs each day
- these jobs are usually one shot, and the amount of processing can vary a lot 
between jobs, and as such finding an efficient number of executors for each job 
is difficult to get right, which is the reason I took the path of dynamic 
allocation.  
- Some of the tests have been scheduled on an idle queue, some on a full queue.
- experiments have been conducted with spark.executor-cores = 5 and 10, only 
results for 5 cores have been reported because efficiency was overall better 
than with 10 cores
- the figures I give are averaged over a representative sample of those jobs 
(about 600 jobs) ranging from tens to thousands splits in the data partitioning 
and between 400 to 9000 seconds of wall clock time.
- executor idle timeout is set to 30s;
 

Definition: 
- let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
which represent the max number of tasks an executor will process in parallel.
- the current behaviour of dynamic allocation is to allocate enough containers 
to have one taskSlot per task, which minimizes latency but wastes resources on 
executor allocation and idling overhead when tasks are small.

The results using the proposal (described below) over the job sample (600 jobs):
- by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
resource usage, for a 37% (against 43%) reduction in wall clock time for Spark 
w.r.t MR
- by trying to minimize the average resource consumption, I ended up with 6 
tasks per core, with a 30% resource usage reduction, for a similar wall clock 
time w.r.t. MR

What did I try to solve the issue with existing parameters (summing up a few 
points mentioned in the comments) ?
- change dynamicAllocation.maxExecutors: this would need to be adapted for each 
job (tens to thousands splits can occur), and essentially remove the interest 
of using the dynamic allocation.
- use dynamicAllocation.backlogTimeout: 
- setting this parameter right to avoid creating unused executors is very 
dependent on wall clock time. One basically needs to solve the exponential ramp 
up for the target time, so this is not an option for my use case where I don't 
want per-job tuning. 
- I've still done a series of experiments, details in the comments. Result 
is that after manual tuning, the best I could get was a similar resource 
consumption at the expense of 20% more wall clock time, or a similar wall clock 
time at the expense of 60% more resource consumption than what I got using my 
proposal @ 6 tasks per slot (this value being optimized over a much larger 
range of jobs as already stated)
- as mentioned in another comment, tampering with the exponential ramp up 
might yield task imbalance, and such old executors could become contention 
points for other executors trying to remotely access blocks in the old 
executors (not witnessed in the jobs I'm talking about, but we did see this 
behavior in other jobs)

Proposal: 

Simply add a tasksPerExecutorSlot parameter, which makes it possible to specify 
how many tasks a single taskSlot should ideally execute to mitigate the 
overhead of executor allocation.

PR: https://github.com/apache/spark/pull/19881
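
As an illustration of the arithmetic (the parameter name below is hypothetical), 
the executor target would simply be damped by the per-slot task budget, as 
sketched here:

{code:scala}
// taskSlots per executor = spark.executor.cores / spark.task.cpus
// tasksPerSlot           = hypothetical spark.dynamicAllocation.tasksPerExecutorSlot (default 1)
def targetExecutors(pendingAndRunningTasks: Int, executorCores: Int, taskCpus: Int, tasksPerSlot: Int): Int = {
  val taskSlotsPerExecutor = executorCores / taskCpus
  math.ceil(pendingAndRunningTasks.toDouble / (taskSlotsPerExecutor * tasksPerSlot)).toInt
}

targetExecutors(2000, 5, 1, 1)   // 400 executors: one task slot per task (current behaviour)
targetExecutors(2000, 5, 1, 6)   // 67 executors: each slot is expected to process ~6 tasks
{code}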

  was:
While migrating a series of jobs from MR to Spark using dynamicAllocation, I've 
noticed almost a doubling (+114% exactly) of resource consumption of Spark 
w.r.t MR, for a wall clock time gain of 43%

About the context: 
- resource usage stands for vcore-hours allocation for the whole job, as seen 
by YARN
- I'm talking about a series of jobs because we provide our users with a way to 
define experiments (via UI / DSL) that automatically get translated to Spark / 
MR jobs and submitted on the cluster
- we submit around 500 of such jobs each day
- these jobs are usually one shot, and the amount of processing can vary a lot 
between jobs, and as such finding an efficient number of executors for each job 
is difficult to get right, which is the reason I took the path of dynamic 
allocation.  
- Some of the tests have been scheduled on an idle queue, some on a full queue.
- experiments have been conducted with spark.executor-cores = 5 and 10, only 
results for 5 cores have been reported because efficiency was overall 

[jira] [Assigned] (SPARK-22787) Add a TPCH query suite

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22787:


Assignee: Apache Spark  (was: Xiao Li)

> Add a TPCH query suite
> --
>
> Key: SPARK-22787
> URL: https://issues.apache.org/jira/browse/SPARK-22787
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Add a test suite to ensure all the TPC-H queries can be successfully 
> analyzed, optimized and compiled without hitting the max iteration threshold.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22787) Add a TPCH query suite

2017-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291486#comment-16291486
 ] 

Apache Spark commented on SPARK-22787:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19982

> Add a TPCH query suite
> --
>
> Key: SPARK-22787
> URL: https://issues.apache.org/jira/browse/SPARK-22787
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Add a test suite to ensure all the TPC-H queries can be successfully 
> analyzed, optimized and compiled without hitting the max iteration threshold.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22787) Add a TPCH query suite

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22787:


Assignee: Xiao Li  (was: Apache Spark)

> Add a TPCH query suite
> --
>
> Key: SPARK-22787
> URL: https://issues.apache.org/jira/browse/SPARK-22787
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Add a test suite to ensure all the TPC-H queries can be successfully 
> analyzed, optimized and compiled without hitting the max iteration threshold.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22787) Add a TPCH query suite

2017-12-14 Thread Xiao Li (JIRA)
Xiao Li created SPARK-22787:
---

 Summary: Add a TPCH query suite
 Key: SPARK-22787
 URL: https://issues.apache.org/jira/browse/SPARK-22787
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.2.1
Reporter: Xiao Li
Assignee: Xiao Li


Add a test suite to ensure all the TPC-H queries can be successfully analyzed, 
optimized and compiled without hitting the max iteration threshold.
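
A rough sketch of what such a suite could look like; the query-loading helper and 
file layout here are assumptions, and the suite is assumed to run with the TPC-H 
tables already registered as temporary views:

{code:scala}
val tpchQueries = (1 to 22).map(i => s"q$i")

tpchQueries.foreach { name =>
  test(name) {
    // assumed layout: one .sql file per query on the test classpath
    val queryString = resourceToString(s"tpch/$name.sql")
    // forcing executedPlan runs analysis, optimization and physical planning
    val plan = sql(queryString).queryExecution.executedPlan
    assert(plan != null)
  }
}
{code}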



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14822) Add lazy executor startup to Mesos Scheduler

2017-12-14 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291404#comment-16291404
 ] 

Imran Rashid commented on SPARK-14822:
--

[~mgummelt] is this still relevant? It seems like dynamic allocation on Mesos 
already covers this (not precisely the same, but pretty close).

> Add lazy executor startup to Mesos Scheduler
> 
>
> Key: SPARK-14822
> URL: https://issues.apache.org/jira/browse/SPARK-14822
> Project: Spark
>  Issue Type: Task
>  Components: Mesos
>Reporter: Michael Gummelt
>
> As we deprecate fine-grained mode, we need to make sure we have alternative 
> solutions for its benefits.
> Its two benefits are:
> 0. lazy executor startup
>   In fine-grained mode, executors are brought up only as tasks are scheduled. 
>  This means that a user doesn't have to set {{spark.cores.max}} to ensure 
> that the app doesn't consume all resources in the cluster.
> 1. relinquishing cores
>   As Spark tasks terminate, the mesos task it was bound to terminates as 
> well, thus relinquishing the cores assigned to it.
> I'd like to add {{0.}} to coarse-grained mode, possibly enabled with a 
> configuration param.  If https://issues.apache.org/jira/browse/MESOS-1279 
> ever happens, we can add {{1.}} as well.
> cc [~tnachen] [~dragos] [~skonto] [~andrewor14]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16496) Add wholetext as option for reading text in SQL.

2017-12-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-16496.
-
   Resolution: Fixed
 Assignee: Prashant Sharma
Fix Version/s: 2.3.0

> Add wholetext as option for reading text in SQL.
> 
>
> Key: SPARK-16496
> URL: https://issues.apache.org/jira/browse/SPARK-16496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
> Fix For: 2.3.0
>
>
> In multiple text analysis problems, it is not often desirable for the rows to 
> be split by "\n". There exists a wholeText reader for RDD API, and this JIRA 
> just adds the same support for Dataset API.
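> A small usage sketch (paths are made up):
> {code:scala}
> // Dataset API: with the new option, each file becomes a single row
> val docs = spark.read.option("wholetext", true).text("/data/docs")
>
> // existing RDD API equivalent: (path, fileContent) pairs
> val docsRdd = spark.sparkContext.wholeTextFiles("/data/docs")
> {code}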



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22047) HiveExternalCatalogVersionsSuite is Flaky on Jenkins

2017-12-14 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291364#comment-16291364
 ] 

Imran Rashid commented on SPARK-22047:
--

[~cloud_fan] do you think we can close this now? I don't recall seeing this fail 
recently, though I also haven't looked very extensively. (It seems 
spark-tests.appspot.com has stopped updating ...)

> HiveExternalCatalogVersionsSuite is Flaky on Jenkins
> 
>
> Key: SPARK-22047
> URL: https://issues.apache.org/jira/browse/SPARK-22047
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Armin Braun
>  Labels: flaky-test
>
> HiveExternalCatalogVersionsSuite fails quite a bit lately e.g.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/3490/testReport/junit/org.apache.spark.sql.hive/HiveExternalCatalogVersionsSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> {code}
> Error Message
> org.scalatest.exceptions.TestFailedException: spark-submit returned with exit 
> code 1. Command line: './bin/spark-submit' '--name' 'prepare testing tables' 
> '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 
> 'spark.master.rest.enabled=false' '--conf' 
> 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0'
>  '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' 
> '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0'
>  
> '/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/test120059455549609580.py'
>   2017-09-17 04:26:11.641 - stderr> Error: Could not find or load main class 
> org.apache.spark.launcher.Main
> Stacktrace
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 
> spark-submit returned with exit code 1.
> Command line: './bin/spark-submit' '--name' 'prepare testing tables' 
> '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 
> 'spark.master.rest.enabled=false' '--conf' 
> 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0'
>  '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' 
> '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0'
>  
> '/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/test120059455549609580.py'
> 2017-09-17 04:26:11.641 - stderr> Error: Could not find or load main class 
> org.apache.spark.launcher.Main
>
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at org.scalatest.Assertions$class.fail(Assertions.scala:1089)
>   at org.scalatest.FunSuite.fail(FunSuite.scala:1560)
>   at 
> org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:81)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:38)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:120)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:105)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:105)
>   at 
> org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: 

[jira] [Assigned] (SPARK-22774) Add compilation check for generated code in TPCDSQuerySuite

2017-12-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-22774:
---

Assignee: Kazuaki Ishizaki

> Add compilation check for generated code in TPCDSQuerySuite
> ---
>
> Key: SPARK-22774
> URL: https://issues.apache.org/jira/browse/SPARK-22774
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.3.0
>
>
> {{TPCDSQuerySuite}} already checks whether analysis can be performed 
> correctly. In addition, it would be good to check whether generated Java code 
> can be compiled correctly
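> Roughly the kind of check this asks for, using internal codegen APIs; treat it 
> as a sketch, with the query string assumed to be in scope:
> {code:scala}
> import org.apache.spark.sql.execution.WholeStageCodegenExec
> import org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator
>
> // tpcdsQueryString: the SQL text of one TPC-DS query (assumed to be in scope)
> val plan = sql(tpcdsQueryString).queryExecution.executedPlan
> plan.foreach {
>   case s: WholeStageCodegenExec =>
>     val (_, source) = s.doCodeGen()   // generated Java source for this subtree
>     CodeGenerator.compile(source)     // fails fast if the generated code does not compile
>   case _ =>
> }
> {code}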



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22774) Add compilation check for generated code in TPCDSQuerySuite

2017-12-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22774.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19971
[https://github.com/apache/spark/pull/19971]

> Add compilation check for generated code in TPCDSQuerySuite
> ---
>
> Key: SPARK-22774
> URL: https://issues.apache.org/jira/browse/SPARK-22774
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
> Fix For: 2.3.0
>
>
> {{TPCDSQuerySuite}} already checks whether analysis can be performed 
> correctly. In addition, it would be good to check whether generated Java code 
> can be compiled correctly



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22786) only use AppStatusPlugin in history server

2017-12-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291238#comment-16291238
 ] 

Marcelo Vanzin commented on SPARK-22786:


As I mentioned in the PR, I don't see why this is an issue without an 
explanation of exactly what problem it's creating.

> only use AppStatusPlugin in history server
> --
>
> Key: SPARK-22786
> URL: https://issues.apache.org/jira/browse/SPARK-22786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22786) only use AppStatusPlugin in history server

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22786:


Assignee: Wenchen Fan  (was: Apache Spark)

> only use AppStatusPlugin in history server
> --
>
> Key: SPARK-22786
> URL: https://issues.apache.org/jira/browse/SPARK-22786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22786) only use AppStatusPlugin in history server

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22786:


Assignee: Apache Spark  (was: Wenchen Fan)

> only use AppStatusPlugin in history server
> --
>
> Key: SPARK-22786
> URL: https://issues.apache.org/jira/browse/SPARK-22786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22786) only use AppStatusPlugin in history server

2017-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291234#comment-16291234
 ] 

Apache Spark commented on SPARK-22786:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19981

> only use AppStatusPlugin in history server
> --
>
> Key: SPARK-22786
> URL: https://issues.apache.org/jira/browse/SPARK-22786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22786) only use AppStatusPlugin in history server

2017-12-14 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22786:
---

 Summary: only use AppStatusPlugin in history server
 Key: SPARK-22786
 URL: https://issues.apache.org/jira/browse/SPARK-22786
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-14 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291141#comment-16291141
 ] 

Julien Cuquemelle edited comment on SPARK-22683 at 12/14/17 5:09 PM:
-

[~tgraves], thanks a lot for your remarks, I've updated the description and 
also included a summary of various results and comments I got.

Answers about your other questions: 

"The fact you are asking for 5+cores per executor will naturally waste more 
resources when the executor isn't being used"
In fact the resource usage will be similar with fewer cores, because if I set
1 core per executor, dynamic allocation will ask for 5 times more executors.

"But if we can find something that by defaults works better for the majority of 
workloads that it makes sense to improve"
I'm pretty sure 2 tasks per Slot would work for a very large set of workloads, 
especially short jobs

"As with any config though, how do I know what to set the tasksPerSlot as? it 
requires configuration and it could affect performance."
I agree; what I'm trying to show in my argument is that:
- I don't have any parameter today to do what I want without optimizing each
job, which is not feasible in my use case
- the efficiency of this parameter seems coarser-grained than that of other
parameters (sweet-spot values are valid over a broader range of jobs than
maxNbExe or backLogTimeout)
- it seems to me some settings are quite simple to understand: if I want to
minimize latency, keep the default value; if I want to save some resources,
use a value of 2; if I want to really minimize resource consumption, find a
higher number by analysis or aim at meeting a time budget

About dynamic allocation: 
with the default backlogTimeout of 1s, the exponential ramp-up is in practice
very similar to an upfront request, given the duration of the jobs. I think
upfront allocation could be used instead of exponential, but this wouldn't
change the issue, which is related to the target number of executors.
I don't think asking upfront vs. exponentially has any effect on how YARN
yields containers.

"Above you say "When running with 6 tasks per executor slot, our Spark jobs 
consume in average 30% less vcorehours than the MR jobs, this setting being 
valid for different workload sizes." Was this with this patch applied or 
without?"
The patch was applied; without it you cannot set the number of tasks per
taskSlot (I mentioned "executor slot", which is incorrect; I was referring to
taskSlot).

"the WallTimeGain wrt MR (%) , does this mean positive numbers ran faster then 
MR? "
Positive numbers mean faster in Spark.

"why is running with 6 or 8 slower? is it shuffle issues or mistuning with gc, 
or just unknown overhead?"
running with 6 tasks per taskSlot means that 6 tasks will be processed
sequentially by 6 times fewer task slots, which naturally increases wall-clock
time.


was (Author: jcuquemelle):
[~tgraves], thanks a lot for your remarks, I've updated the description and 
also included a summary of various results and comments I got.

Answers about your other questions: 

"The fact you are asking for 5+cores per executor will naturally waste more 
resources when the executor isn't being used"
In fact the resource usage will be similar with fewer cores, because if I set 1 
core per exe, the dynamic allocation will ask for 5 times more exes

"But if we can find something that by defaults works better for the majority of 
workloads that it makes sense to improve"
I'm pretty sure 2 tasks per Slot would work for a very large set of workloads, 
especially short jobs

"As with any config though, how do I know what to set the tasksPerSlot as? it 
requires configuration and it could affect performance."
I agree, what I'm trying to show in my argumentation is that:
- I don't have any parameter today to do what I want without optimizing each 
job, which is not feasible in my use case
- the granularity of the efficiency of this parameter seems coarser that other 
parameters (sweetspot values are valid on a more broader range of jobs than 
maxNbExe or backLogTimeout
- it seems to me some settings are quite simple to understand : if I want to 
minimize latency, let the default value; If I want to save some resources, use 
a value of 2; If I want to really minimize resource consumption, find a higher 
number by analysis or aim at maximizing a time budget

About dynamic allocation : 
with the default setting of 1s of backlogTimeout, the exponential ramp up is in 
practise very similar to an upfront request, regarding the duration of jobs. I 
think upfront allocation could be used instead of exponential, but this 
wouldn't change the issue which is related to the target number of exes
I don't think asking upfront vs exponential has any effect over how Yarn yields 
containers.

"Above you say "When running with 6 tasks per executor slot, our Spark jobs 
consume in average 30% less vcorehours than the MR 

[jira] [Comment Edited] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-14 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291141#comment-16291141
 ] 

Julien Cuquemelle edited comment on SPARK-22683 at 12/14/17 5:09 PM:
-

[~tgraves], thanks a lot for your remarks, I've updated the description and 
also included a summary of various results and comments I got.

Answers about your other questions: 

"The fact you are asking for 5+cores per executor will naturally waste more 
resources when the executor isn't being used"
In fact the resource usage will be similar with fewer cores, because if I set
1 core per executor, dynamic allocation will ask for 5 times more executors.

"But if we can find something that by defaults works better for the majority of 
workloads that it makes sense to improve"
I'm pretty sure 2 tasks per Slot would work for a very large set of workloads, 
especially short jobs

"As with any config though, how do I know what to set the tasksPerSlot as? it 
requires configuration and it could affect performance."
I agree; what I'm trying to show in my argument is that:
- I don't have any parameter today to do what I want without optimizing each
job, which is not feasible in my use case
- the efficiency of this parameter seems coarser-grained than that of other
parameters (sweet-spot values are valid over a broader range of jobs than
maxNbExe or backLogTimeout)
- it seems to me some settings are quite simple to understand: if I want to
minimize latency, keep the default value; if I want to save some resources,
use a value of 2; if I want to really minimize resource consumption, find a
higher number by analysis or aim at maximizing a time budget

About dynamic allocation: 
with the default backlogTimeout of 1s, the exponential ramp-up is in practice
very similar to an upfront request, given the duration of the jobs. I think
upfront allocation could be used instead of exponential, but this wouldn't
change the issue, which is related to the target number of executors.
I don't think asking upfront vs. exponentially has any effect on how YARN
yields containers.

"Above you say "When running with 6 tasks per executor slot, our Spark jobs 
consume in average 30% less vcorehours than the MR jobs, this setting being 
valid for different workload sizes." Was this with this patch applied or 
without?"
The patch was applied; without it you cannot set the number of tasks per
taskSlot (I mentioned "executor slot", which is incorrect; I was referring to
taskSlot).

"the WallTimeGain wrt MR (%) , does this mean positive numbers ran faster then 
MR? "
Positive numbers mean faster in Spark.

"why is running with 6 or 8 slower? is it shuffle issues or mistuning with gc, 
or just unknown overhead?"
running with 6 tasks per taskSlot means that 6 tasks will be processed
sequentially by 6 times fewer task slots, which naturally increases wall-clock
time.


was (Author: jcuquemelle):
[~tgraves], thanks a lot for your remarks, I've updated the description and 
also included a summary of various results and comments I got.

Answers about your other questions: 

"The fact you are asking for 5+cores per executor will naturally waste more 
resources when the executor isn't being used"
In fact the resource usage will be similar with fewer cores, because if I set 1 
core per exe, the dynamic allocation will ask for 5 times more exes

"But if we can find something that by defaults works better for the majority of 
workloads that it makes sense to improve"
I'm pretty sure 2 tasks per Slot would work for a very large set of workloads, 
especially short jobs

"As with any config though, how do I know what to set the tasksPerSlot as? it 
requires configuration and it could affect performance."
I agree, what I'm trying to show in my argumentation is that:
- I don't have any parameter today to do what I want without optimizing each 
job, which is not feasible in my use case
- the granularity of the efficiency of this parameter seems coarser that other 
parameters (sweetspot values are valid on a more broader range of jobs than 
maxNbExe or backLogTimeout
- it seems to me some settings are quite simple to understand : if I want to 
minimize latency, let the default value; If I want to save some resources, use 
a value of 2; If I want to really minimize resource consumption, do an analysis 
or aim at maximizing a time budget

About dynamic allocation : 
with the default setting of 1s of backlogTimeout, the exponential ramp up is in 
practise very similar to an upfront request, regarding the duration of jobs. I 
think upfront allocation could be used instead of exponential, but this 
wouldn't change the issue which is related to the target number of exes
I don't think asking upfront vs exponential has any effect over how Yarn yields 
containers.

"Above you say "When running with 6 tasks per executor slot, our Spark jobs 
consume in average 30% less vcorehours than the MR jobs, this 

[jira] [Assigned] (SPARK-22496) beeline display operation log

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22496:


Assignee: (was: Apache Spark)

> beeline display operation log
> -
>
> Key: SPARK-22496
> URL: https://issues.apache.org/jira/browse/SPARK-22496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StephenZou
>Priority: Minor
>
> For now, when an end user runs queries in beeline or in Hue through STS,
> no logs are displayed; the end user will wait until the job finishes or
> fails. Progress information is needed to show end users how the job is
> running if they are not familiar with the YARN RM or the standalone Spark
> master UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22496) beeline display operation log

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22496:


Assignee: Apache Spark

> beeline display operation log
> -
>
> Key: SPARK-22496
> URL: https://issues.apache.org/jira/browse/SPARK-22496
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: StephenZou
>Assignee: Apache Spark
>Priority: Minor
>
> For now, when an end user runs queries in beeline or in Hue through STS,
> no logs are displayed; the end user will wait until the job finishes or
> fails. Progress information is needed to show end users how the job is
> running if they are not familiar with the YARN RM or the standalone Spark
> master UI.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-14 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291156#comment-16291156
 ] 

Julien Cuquemelle commented on SPARK-22683:
---

I did see SPARK-16158 before opening a new ticket; my proposal seemed simple
enough not to need a full pluggable policy (which I do agree with, but which
seems much more difficult to get accepted, IMHO).
[~tgraves], you mentioned yourself that the current policy could be improved, 
and that the complexity from the users' point of view should not be overlooked 
:-)

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to mitigate this (summing up a few points mentioned in the 
> comments)?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependant on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention 
> points for other exes trying to remotely access blocks in the old exes (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881



--
This message was sent by Atlassian 

[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-14 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291141#comment-16291141
 ] 

Julien Cuquemelle commented on SPARK-22683:
---

[~tgraves], thanks a lot for your remarks, I've updated the description and 
also included a summary of various results and comments I got.

Answers about your other questions: 

"The fact you are asking for 5+cores per executor will naturally waste more 
resources when the executor isn't being used"
In fact the resource usage will be similar with fewer cores, because if I set
1 core per executor, dynamic allocation will ask for 5 times more executors.

"But if we can find something that by defaults works better for the majority of 
workloads that it makes sense to improve"
I'm pretty sure 2 tasks per Slot would work for a very large set of workloads, 
especially short jobs

"As with any config though, how do I know what to set the tasksPerSlot as? it 
requires configuration and it could affect performance."
I agree; what I'm trying to show in my argument is that:
- I don't have any parameter today to do what I want without optimizing each
job, which is not feasible in my use case
- the efficiency of this parameter seems coarser-grained than that of other
parameters (sweet-spot values are valid over a broader range of jobs than
maxNbExe or backLogTimeout)
- it seems to me some settings are quite simple to understand: if I want to
minimize latency, keep the default value; if I want to save some resources,
use a value of 2; if I want to really minimize resource consumption, do an
analysis or aim at maximizing a time budget

About dynamic allocation: 
with the default backlogTimeout of 1s, the exponential ramp-up is in practice
very similar to an upfront request, given the duration of the jobs. I think
upfront allocation could be used instead of exponential, but this wouldn't
change the issue, which is related to the target number of executors.
I don't think asking upfront vs. exponentially has any effect on how YARN
yields containers.

"Above you say "When running with 6 tasks per executor slot, our Spark jobs 
consume in average 30% less vcorehours than the MR jobs, this setting being 
valid for different workload sizes." Was this with this patch applied or 
without?"
The patch was applied; without it you cannot set the number of tasks per
taskSlot (I mentioned "executor slot", which is incorrect; I was referring to
taskSlot).

"the WallTimeGain wrt MR (%) , does this mean positive numbers ran faster then 
MR? "
Positive numbers mean faster in Spark.

"why is running with 6 or 8 slower? is it shuffle issues or mistuning with gc, 
or just unknown overhead?"
running with 6 tasks per taskSlot means that 6 tasks will be processed
sequentially by 6 times fewer task slots, which naturally increases wall-clock
time.
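
To make the effect of the proposal concrete, here is a minimal sketch (not the
actual patch; the names are illustrative) of how a tasks-per-slot factor
changes the executor target computed by dynamic allocation:

{code:scala}
// With tasksPerSlot = 1 this is the current behaviour: one task slot per
// pending task. Larger values let several tasks share a slot sequentially.
def maxExecutorsNeeded(pendingTasks: Int,
                       executorCores: Int,
                       taskCpus: Int,
                       tasksPerSlot: Int): Int = {
  val slotsPerExecutor = executorCores / taskCpus
  val tasksPerExecutor = slotsPerExecutor * tasksPerSlot
  (pendingTasks + tasksPerExecutor - 1) / tasksPerExecutor  // ceiling division
}

// Example: 600 pending tasks, 5 cores per executor, 1 cpu per task.
// tasksPerSlot = 1 -> 120 executors requested; tasksPerSlot = 6 -> 20.
{code}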

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> 

[jira] [Resolved] (SPARK-22785) remove ColumnVector.anyNullsSet

2017-12-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22785.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19980
[https://github.com/apache/spark/pull/19980]

> remove ColumnVector.anyNullsSet
> ---
>
> Key: SPARK-22785
> URL: https://issues.apache.org/jira/browse/SPARK-22785
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-14 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291037#comment-16291037
 ] 

Thomas Graves commented on SPARK-22683:
---

Another way to approach this is to have a pluggable policy here. Right now it
does an exponential increase when requesting containers; we could make that
pluggable or configurable, so that you could request everything at once (which
I have seen a use case for), and for this use case you could have a policy
that starts out requesting containers much more slowly.
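
For what it's worth, a hedged sketch of what such a pluggable policy could
look like; this trait does not exist in Spark and is only meant to illustrate
the idea:

{code:scala}
// Illustrative only: a ramp-up policy decides the next executor target from
// the current target and the pending work.
trait ExecutorRampUpPolicy {
  def nextTarget(currentTarget: Int, pendingTasks: Int, slotsPerExecutor: Int): Int
}

// Current behaviour, roughly: double the target until it covers the backlog.
object ExponentialPolicy extends ExecutorRampUpPolicy {
  def nextTarget(currentTarget: Int, pendingTasks: Int, slotsPerExecutor: Int): Int =
    math.min(math.max(currentTarget * 2, 1),
      (pendingTasks + slotsPerExecutor - 1) / slotsPerExecutor)
}

// "All at once": jump straight to one slot per pending task.
object AllAtOncePolicy extends ExecutorRampUpPolicy {
  def nextTarget(currentTarget: Int, pendingTasks: Int, slotsPerExecutor: Int): Int =
    (pendingTasks + slotsPerExecutor - 1) / slotsPerExecutor
}
{code}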

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to mitigate this (summing up a few points mentioned in the 
> comments)?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependant on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention 
> points for other exes trying to remotely access blocks in the old exes (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-14 Thread Julien Cuquemelle (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Cuquemelle updated SPARK-22683:
--
Description: 
While migrating a series of jobs from MR to Spark using dynamicAllocation,
I've noticed more than a doubling (+114% exactly) of the resource consumption
of Spark w.r.t. MR, for a wall-clock time gain of 43%.

About the context: 
- resource usage stands for vcore-hours allocation for the whole job, as seen 
by YARN
- I'm talking about a series of jobs because we provide our users with a way to 
define experiments (via UI / DSL) that automatically get translated to Spark / 
MR jobs and submitted on the cluster
- we submit around 500 such jobs each day
- these jobs are usually one-shot, and the amount of processing can vary a lot
between jobs; as such, finding an efficient number of executors for each job
is difficult to get right, which is the reason I took the path of dynamic
allocation.
- Some of the tests have been scheduled on an idle queue, some on a full queue.
- experiments have been conducted with spark.executor.cores = 5 and 10; only
results for 5 cores are reported because efficiency was overall better than
with 10 cores
- the figures I give are averaged over a representative sample of those jobs
(about 600 jobs) ranging from tens to thousands of splits in the data
partitioning and between 400 and 9000 seconds of wall-clock time.
- executor idle timeout is set to 30s;
 

Definition: 
- let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
which represent the max number of tasks an executor will process in parallel.
- the current behaviour of dynamic allocation is to allocate enough
containers to have one taskSlot per task, which minimizes latency but wastes
resources when tasks are small relative to the executor allocation and idling
overhead.

The results using the proposal (described below) over the job sample (600 jobs):
- by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
resource usage, for a 37% (against 43%) reduction in wall clock time for Spark 
w.r.t MR
- by trying to minimize the average resource consumption, I ended up with 6 
tasks per core, with a 30% resource usage reduction, for a similar wall clock 
time w.r.t. MR

What did I try to mitigate this (summing up a few points mentioned in the 
comments)?
- change dynamicAllocation.maxExecutors: this would need to be adapted for
each job (tens to thousands of splits can occur), and essentially removes the
benefit of using dynamic allocation.
- use dynamicAllocation.backlogTimeout: 
- setting this parameter right to avoid creating unused executors is very
dependent on wall-clock time. One basically needs to solve the exponential
ramp-up for the target time. So this is not an option for my use case, where I
don't want per-job tuning.
- I've still done a series of experiments, details in the comments. Result 
is that after manual tuning, the best I could get was a similar resource 
consumption at the expense of 20% more wall clock time, or a similar wall clock 
time at the expense of 60% more resource consumption than what I got using my 
proposal @ 6 tasks per slot (this value being optimized over a much larger 
range of jobs as already stated)
- as mentioned in another comment, tampering with the exponential ramp-up
might yield task imbalance, and old executors could then become contention
points for other executors trying to remotely access blocks in them (not
witnessed in the jobs I'm talking about, but we did see this behavior in other
jobs)

Proposal: 

Simply add a tasksPerExecutorSlot parameter, which makes it possible to specify 
how many tasks a single taskSlot should ideally execute to mitigate the 
overhead of executor allocation.

PR: https://github.com/apache/spark/pull/19881
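
For illustration, this is how the proposed knob would sit next to the existing
dynamic-allocation settings; the exact configuration key is whatever the PR
settles on, so the key used below is only a placeholder:

{code:scala}
import org.apache.spark.SparkConf

// "spark.dynamicAllocation.tasksPerExecutorSlot" is a placeholder for the key
// proposed in the PR, not an existing Spark configuration.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.executor.cores", "5")
  .set("spark.dynamicAllocation.tasksPerExecutorSlot", "6")
{code}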

  was:
let's say an executor has spark.executor.cores / spark.task.cpus taskSlots

The current dynamic allocation policy allocates enough executors
to have each taskSlot execute a single task, which minimizes latency, 
but wastes resources when tasks are small regarding executor allocation
and idling overhead. 

By adding the tasksPerExecutorSlot, it is made possible to specify how many 
tasks
a single slot should ideally execute to mitigate the overhead of executor
allocation.

PR: https://github.com/apache/spark/pull/19881


> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>  Labels: pull-request-available
>
> While migrating a 

[jira] [Comment Edited] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2017-12-14 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280548#comment-16280548
 ] 

Julien Cuquemelle edited comment on SPARK-22683 at 12/14/17 3:25 PM:
-

I don't understand your statement about delaying executor addition. I want to
cap the number of executors adaptively with respect to the current number of
tasks, not delay their creation.

Doing this with dynamicAllocation.maxExecutors requires each job to be tuned 
for efficiency; when we're doing experiments, a lot of jobs are one shot, so 
they can't be fine tuned.
The proposal gives a way to have an adaptive behaviour for a family of jobs.

Regarding slowing the ramp-up of executors with schedulerBacklogTimeout, I've
run 2 series of experiments with this parameter (7 similar jobs in each test
case, average figures reported in the following table), one on a busy queue
and the other on an idle queue. I'll report only the idle queue, as the
figures on the busy queue are even worse for the schedulerBacklogTimeout
approach: 

The first row uses the default 1s schedulerBacklogTimeout together with the 6
tasks per executorSlot mentioned above; the other rows use the default
dynamicAllocation behaviour and only change schedulerBacklogTimeout.

||SparkWallTimeSec||Spk-vCores-H||taskPerExeSlot||schedulerBacklogTimeout||
|693.571429|37.142857|6|1.0|
|584.857143|69.571429|1|30.0|
|763.428571|54.285714|1|60.0|
|826.714286|39.571429|1|90.0|

So basically I can tune the backlogTimeout to get a similar vCores-H 
consumption at the expense of almost 20% more wallClockTime, or I can tune the 
parameter to get about the same wallClockTime at the expense of about 60% more 
vcoreH consumption (very roughly extrapolated between 30 and 60 secs for 
schedulerBacklogTimeout).

It does not seem to solve the issue I'm trying to address; moreover, this
would again need to be tuned for each specific job's duration (to find the 90s
timeout that gives similar resource consumption, I had to solve the
exponential ramp-up against the duration of a job that had already run, which
is not feasible in experimental use cases).
The previous experiments that allowed me to find the sweet spot at 6 tasks per
slot involved job wallClockTimes between 400 and 9000 seconds.

Another way to look at the new parameter I'm proposing is as a simple way to
tune the latency / resource-consumption tradeoff.
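
As a rough model of why the timeout has to be solved against each job's
duration (a sketch assuming the request target roughly doubles every sustained
backlog interval; the function is illustrative, not a Spark API):

{code:scala}
// Executors are requested in waves of roughly 1, 2, 4, ..., so reaching a
// target of N takes about ceil(log2(N)) backlog intervals. With a 90s timeout
// and a ~100-executor target that is about 7 * 90s, i.e. over 10 minutes of
// ramp-up, which only pays off for jobs whose wall-clock time is much larger.
def rampUpSeconds(targetExecutors: Int, backlogTimeoutSec: Double): Double =
  math.ceil(math.log(targetExecutors) / math.log(2)) * backlogTimeoutSec
{code}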


was (Author: jcuquemelle):
I don't understand your statement about delaying executor addition ? I want to 
cap the number of executors in an adaptive way regarding the current number of 
tasks, not delay their creation.

Doing this with dynamicAllocation.maxExecutors requires each job to be tuned 
for efficiency; when we're doing experiments, a lot of jobs are one shot, so 
they can't be fine tuned.
The proposal gives a way to have an adaptive behaviour for a family of jobs.

Regarding slowing the ramp up of executors with schedulerBacklogTimeout, I've 
made experiments to play with this parameter; I have made 2 series of 
experiments (7 jobs on each test case, average figures reported in the 
following table), one on a busy queue, and the other on an idle queue. I'll 
report only the idle queue, as the figures 
on the busy queue are even worse for the schedulerBacklogTimeout approach: 

First row is using the default 1s for the schedulerBacklogTimeout, and uses the 
6 tasks per executorSlot I've mentioned above, other rows use the default 
dynamicAllocation behaviour and only change schedulerBacklogTimeout

||SparkWallTimeSec||Spk-vCores-H||taskPerExeSlot||schedulerBacklogTimeout||
|693.571429|37.142857|6|1.0|
|584.857143|69.571429|1|30.0|
|763.428571|54.285714|1|60.0|
|826.714286|39.571429|1|90.0|

So basically I can tune the backlogTimeout to get a similar vCores-H 
consumption at the expense of almost 20% more wallClockTime, or I can tune the 
parameter to get about the same wallClockTime at the expense of about 60% more 
vcoreH consumption (very roughly extrapolated between 30 and 60 secs for 
schedulerBacklogTimeout).

It does not seem to solve the issue I'm trying to address, moreover this would 
again need to be tuned for each specific job's duration (to find the 90s 
timeout to get the similar resource consumption, I had to solve the exponential 
ramp-up with the duration of the already run job, which is not feasible in 
experimental use cases ).
The previous experiments that allowed me to find the sweet spot at 6 tasks per 
slot has involved job wallClockTimes between 400 and 9000 seconds

Another way to have a look at this new parameter I'm proposing is to have a 
simple way to tune the latency / resource consumption tradeoff. 

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 

[jira] [Commented] (SPARK-22359) Improve the test coverage of window functions

2017-12-14 Thread Attila Zsolt Piros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290964#comment-16290964
 ] 

Attila Zsolt Piros commented on SPARK-22359:


I would like to join and take the subtask SPARK-22362.


> Improve the test coverage of window functions
> -
>
> Key: SPARK-22359
> URL: https://issues.apache.org/jira/browse/SPARK-22359
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jiang Xingbo
>
> There are already quite a few integration tests using window functions, but
> the unit test coverage for window functions is not ideal.
> We'd like to test the following aspects:
> * Specifications
> ** different partition clauses (none, one, multiple)
> ** different order clauses (none, one, multiple, asc/desc, nulls first/last)
> * Frames and their combinations
> ** OffsetWindowFunctionFrame
> ** UnboundedWindowFunctionFrame
> ** SlidingWindowFunctionFrame
> ** UnboundedPrecedingWindowFunctionFrame
> ** UnboundedFollowingWindowFunctionFrame
> * Aggregate function types
> ** Declarative
> ** Imperative
> ** UDAF
> * Spilling
> ** Cover the conditions that WindowExec should spill at least once 
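
A minimal sketch of the kind of unit-level check being asked for, assuming a
SparkSession named {{spark}} is available in the test suite (the frames named
in the comments refer to the list above):

{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("a", 3), ("b", 10)).toDF("grp", "v")

// One ordered partition clause with a sliding frame (SlidingWindowFunctionFrame).
val sliding = Window.partitionBy("grp").orderBy("v").rowsBetween(-1, 1)
// One unbounded frame (UnboundedWindowFunctionFrame).
val unbounded = Window.partitionBy("grp").orderBy("v")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.select($"grp", $"v",
  sum($"v").over(sliding).as("sliding_sum"),
  sum($"v").over(unbounded).as("total")).show()
{code}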



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22785) remove ColumnVector.anyNullsSet

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22785:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove ColumnVector.anyNullsSet
> ---
>
> Key: SPARK-22785
> URL: https://issues.apache.org/jira/browse/SPARK-22785
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22785) remove ColumnVector.anyNullsSet

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22785:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove ColumnVector.anyNullsSet
> ---
>
> Key: SPARK-22785
> URL: https://issues.apache.org/jira/browse/SPARK-22785
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22785) remove ColumnVector.anyNullsSet

2017-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290797#comment-16290797
 ] 

Apache Spark commented on SPARK-22785:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19980

> remove ColumnVector.anyNullsSet
> ---
>
> Key: SPARK-22785
> URL: https://issues.apache.org/jira/browse/SPARK-22785
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22785) remove ColumnVector.anyNullsSet

2017-12-14 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22785:
---

 Summary: remove ColumnVector.anyNullsSet
 Key: SPARK-22785
 URL: https://issues.apache.org/jira/browse/SPARK-22785
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22775) move dictionary related APIs from ColumnVector to WritableColumnVector

2017-12-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-22775.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19970
[https://github.com/apache/spark/pull/19970]

> move dictionary related APIs from ColumnVector to WritableColumnVector
> --
>
> Key: SPARK-22775
> URL: https://issues.apache.org/jira/browse/SPARK-22775
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22644) Make ML testsuite support StructuredStreaming test

2017-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290773#comment-16290773
 ] 

Apache Spark commented on SPARK-22644:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/19979

> Make ML testsuite support StructuredStreaming test
> --
>
> Key: SPARK-22644
> URL: https://issues.apache.org/jira/browse/SPARK-22644
> Project: Spark
>  Issue Type: Test
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 2.3.0
>
>
> We need to add some helper code to make testing ML transformers & models 
> easier with streaming data. These tests might help us catch any remaining 
> issues and we could encourage future PRs to use these tests to prevent new 
> Models & Transformers from having issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22784) Configure reading buffer size in Spark History Server

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22784:


Assignee: Apache Spark

> Configure reading buffer size in Spark History Server
> -
>
> Key: SPARK-22784
> URL: https://issues.apache.org/jira/browse/SPARK-22784
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Mikhail Erofeev
>Assignee: Apache Spark
>Priority: Minor
> Attachments: replay-baseline.svg
>
>
> Motivation:
> Our Spark History Server spends most of the backfill time inside 
> BufferedReader and StringBuffer. It happens because average line size of our 
> events is ~1.500.000 chars (due to a lot of partitions and iterations), 
> whereas the default buffer size is 2048 bytes. See the attached flame graph.
> Implementation:
> I've added logging of spent time and line size for each job.
> Parametrised ReplayListenerBus with a new buffer size parameter. 
> Measured the best buffer size. x20 of the average line size (30mb) gives 32% 
> speedup in a local test.
> Result:
> Backfill of Spark History and reading to the cache will be up to 30% faster 
> after tuning.
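
Not the patch itself, just an illustration of the knob being exposed: the JDK
reader already accepts an explicit buffer size, and the proposal is to let
ReplayListenerBus use a configurable value instead of the small default (the
helper name below is illustrative):

{code:scala}
// Illustrative only: size the reader buffer to a multiple of the average
// event-log line length instead of relying on the default buffer size.
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

def openEventLog(in: InputStream, bufferSizeBytes: Int): BufferedReader =
  new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8), bufferSizeBytes)
{code}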



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22782) Boost speed, use kafka010 consumer kafka

2017-12-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22782.
---
Resolution: Invalid

This sounds like some kind of comment for the mailing list. There's no specific 
problem or change described here.

> Boost speed, use kafka010 consumer kafka
> 
>
> Key: SPARK-22782
> URL: https://issues.apache.org/jira/browse/SPARK-22782
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
> Environment: kafka-version:  0.10
> spark-version: 2.2.0
> schedule spark on yarn
>Reporter: licun
>Priority: Critical
>
> We use Spark Structured Streaming to consume Kafka, but we find the
> consumer speed is too slow compared to Spark Streaming. We set the Kafka
> option "maxOffsetsPerTrigger" to 1.
> By adding a debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we
> found the following situation.
>
> The debug code:
>
> {code:scala}
> private def poll(pollTimeoutMs: Long): Unit = {
>   val startTime = System.currentTimeMillis()
>   val p = consumer.poll(pollTimeoutMs)
>   val r = p.records(topicPartition)
>   val endTime = System.currentTimeMillis()
>   val delta = endTime - startTime
>   logError(s"Polled $groupId ${p.partitions()}, Size: ${r.size}, spend: ${delta} ms")
>   fetchedData = r.iterator
> }
> {code}
> The log:
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22784) Configure reading buffer size in Spark History Server

2017-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290745#comment-16290745
 ] 

Apache Spark commented on SPARK-22784:
--

User 'MikhailErofeev' has created a pull request for this issue:
https://github.com/apache/spark/pull/19978

> Configure reading buffer size in Spark History Server
> -
>
> Key: SPARK-22784
> URL: https://issues.apache.org/jira/browse/SPARK-22784
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Mikhail Erofeev
>Priority: Minor
> Attachments: replay-baseline.svg
>
>
> Motivation:
> Our Spark History Server spends most of the backfill time inside 
> BufferedReader and StringBuffer. It happens because average line size of our 
> events is ~1.500.000 chars (due to a lot of partitions and iterations), 
> whereas the default buffer size is 2048 bytes. See the attached flame graph.
> Implementation:
> I've added logging of spent time and line size for each job.
> Parametrised ReplayListenerBus with a new buffer size parameter. 
> Measured the best buffer size. x20 of the average line size (30mb) gives 32% 
> speedup in a local test.
> Result:
> Backfill of Spark History and reading to the cache will be up to 30% faster 
> after tuning.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22784) Configure reading buffer size in Spark History Server

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22784:


Assignee: (was: Apache Spark)

> Configure reading buffer size in Spark History Server
> -
>
> Key: SPARK-22784
> URL: https://issues.apache.org/jira/browse/SPARK-22784
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Mikhail Erofeev
>Priority: Minor
> Attachments: replay-baseline.svg
>
>
> Motivation:
> Our Spark History Server spends most of the backfill time inside 
> BufferedReader and StringBuffer. It happens because average line size of our 
> events is ~1.500.000 chars (due to a lot of partitions and iterations), 
> whereas the default buffer size is 2048 bytes. See the attached flame graph.
> Implementation:
> I've added logging of spent time and line size for each job.
> Parametrised ReplayListenerBus with a new buffer size parameter. 
> Measured the best buffer size. x20 of the average line size (30mb) gives 32% 
> speedup in a local test.
> Result:
> Backfill of Spark History and reading to the cache will be up to 30% faster 
> after tuning.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22784) Configure reading buffer size in Spark History Server

2017-12-14 Thread Mikhail Erofeev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Erofeev updated SPARK-22784:

Description: 
Motivation:
Our Spark History Server spends most of the backfill time inside
BufferedReader and StringBuffer. This happens because the average line size of
our events is ~1,500,000 characters (due to a lot of partitions and
iterations), whereas the default buffer size is 2048 bytes. See the attached
flame graph.

Implementation:
I've added logging of the time spent and the line size for each job,
parametrised ReplayListenerBus with a new buffer size parameter, and measured
the best buffer size: 20x the average line size (30 MB) gives a 32% speedup in
a local test.

Result:
Backfill of Spark History and reading to the cache will be up to 30% faster 
after tuning.

  was:
Motivation:
Our Spark History Server spends most of its warm-up time inside BufferedReader 
and StringBuffer. It happens because average line size of our events is 
~1.500.000 chars (due to a lot of partitions and iterations), whereas the 
default buffer size is 2048 bytes. See the attached flame graph.

Implementation:
I've added logging of spent time and line size for each job.
Parametrised ReplayListenerBus with new buffer size parameter. 
Measured best buffer size. x20 of average line size (30mb) gives 32% speedup in 
a local test.

Result:
Warm-up of Spark History and reading to cache will be up to 30% faster after 
tuning.


> Configure reading buffer size in Spark History Server
> -
>
> Key: SPARK-22784
> URL: https://issues.apache.org/jira/browse/SPARK-22784
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Mikhail Erofeev
>Priority: Minor
> Attachments: replay-baseline.svg
>
>
> Motivation:
> Our Spark History Server spends most of the backfill time inside 
> BufferedReader and StringBuffer. This happens because the average line size 
> of our events is ~1,500,000 characters (due to many partitions and 
> iterations), whereas the default buffer size is 2048 bytes. See the attached 
> flame graph.
> Implementation:
> I've added logging of the time spent and the line size for each job, 
> parametrised ReplayListenerBus with a new buffer-size parameter, and measured 
> the best buffer size: 20x the average line size (30 MB) gives a 32% speedup 
> in a local test.
> Result:
> Backfill of the Spark History Server and reading into the cache will be up 
> to 30% faster after tuning.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22784) Configure reading buffer size in Spark History Server

2017-12-14 Thread Mikhail Erofeev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Erofeev updated SPARK-22784:

Attachment: replay-baseline.svg

> Configure reading buffer size in Spark History Server
> -
>
> Key: SPARK-22784
> URL: https://issues.apache.org/jira/browse/SPARK-22784
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Mikhail Erofeev
>Priority: Minor
> Attachments: replay-baseline.svg
>
>
> Motivation:
> Our Spark History Server spends most of its warm-up time inside 
> BufferedReader and StringBuffer. This happens because the average line size 
> of our events is ~1,500,000 characters (due to many partitions and 
> iterations), whereas the default buffer size is 2048 bytes. See the attached 
> flame graph.
> Implementation:
> I've added logging of the time spent and the line size for each job, 
> parametrised ReplayListenerBus with a new buffer-size parameter, and measured 
> the best buffer size: 20x the average line size (30 MB) gives a 32% speedup 
> in a local test.
> Result:
> Warm-up of the Spark History Server and reading into the cache will be up to 
> 30% faster after tuning.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22752) FileNotFoundException while reading from Kafka

2017-12-14 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290659#comment-16290659
 ] 

Marco Gaido commented on SPARK-22752:
-

Thanks [~zsxwing], you are right. I am closing this as a duplicate.

> FileNotFoundException while reading from Kafka
> --
>
> Key: SPARK-22752
> URL: https://issues.apache.org/jira/browse/SPARK-22752
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> We are running a stateful structured streaming job which reads from Kafka and 
> writes to HDFS. And we are hitting this exception:
> {noformat}
> 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): 
> java.lang.IllegalStateException: Error reading delta file 
> /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, 
> part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta 
> does not exist
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
>   at scala.Option.getOrElse(Option.scala:121)
> {noformat}
> Of course, the file doesn't exist in HDFS, and the {{state/0/0}} directory 
> contains no files at all, while there are some files in the commits and 
> offsets folders. I am not sure about the cause of this behavior. It seems to 
> happen the second time the job is started, after the first run failed, so it 
> looks like task failures can trigger it. It might also be related to 
> watermarks, since there were problems with the incoming data and the 
> watermark ended up filtering out all of it.
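
For readers trying to reproduce the setup, the job described above is 
essentially a watermarked streaming aggregation reading from Kafka and 
checkpointing to HDFS. A minimal sketch of such a query (bootstrap servers, 
topic, paths, and window sizes are placeholders, not values from the original 
environment):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

// Sketch only: a stateful (aggregating) query whose state store lives under
// the checkpoint directory, similar in shape to the job in this report.
object StatefulKafkaToHdfsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stateful-kafka").getOrCreate()
    import spark.implicits._

    val counts = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")   // placeholder
      .option("subscribe", "events")                      // placeholder topic
      .load()
      .selectExpr("CAST(value AS STRING) AS value", "timestamp")
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"))
      .count()

    counts.writeStream
      .format("parquet")
      .option("path", "/data/out")                        // placeholder HDFS path
      .option("checkpointLocation", "/checkpointDir")     // state/0/0 lives under here
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
{code}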



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22784) Increase reading buffer size in Spark History Server

2017-12-14 Thread Mikhail Erofeev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Erofeev updated SPARK-22784:

Affects Version/s: (was: 2.0.0)
   2.2.1
 Target Version/s:   (was: 2.2.1)

> Increase reading buffer size in Spark History Server
> 
>
> Key: SPARK-22784
> URL: https://issues.apache.org/jira/browse/SPARK-22784
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Mikhail Erofeev
>Priority: Minor
>
> Motivation:
> Our Spark History Server spends most of its warm-up time inside 
> BufferedReader and StringBuffer. This happens because the average line size 
> of our events is ~1,500,000 characters (due to many partitions and 
> iterations), whereas the default buffer size is 2048 bytes. See the attached 
> flame graph.
> Implementation:
> I've added logging of the time spent and the line size for each job, 
> parametrised ReplayListenerBus with a new buffer-size parameter, and measured 
> the best buffer size: 20x the average line size (30 MB) gives a 32% speedup 
> in a local test.
> Result:
> Warm-up of the Spark History Server and reading into the cache will be up to 
> 30% faster after tuning.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22784) Configure reading buffer size in Spark History Server

2017-12-14 Thread Mikhail Erofeev (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikhail Erofeev updated SPARK-22784:

Summary: Configure reading buffer size in Spark History Server  (was: 
Increase reading buffer size in Spark History Server)

> Configure reading buffer size in Spark History Server
> -
>
> Key: SPARK-22784
> URL: https://issues.apache.org/jira/browse/SPARK-22784
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Mikhail Erofeev
>Priority: Minor
>
> Motivation:
> Our Spark History Server spends most of its warm-up time inside 
> BufferedReader and StringBuffer. This happens because the average line size 
> of our events is ~1,500,000 characters (due to many partitions and 
> iterations), whereas the default buffer size is 2048 bytes. See the attached 
> flame graph.
> Implementation:
> I've added logging of the time spent and the line size for each job, 
> parametrised ReplayListenerBus with a new buffer-size parameter, and measured 
> the best buffer size: 20x the average line size (30 MB) gives a 32% speedup 
> in a local test.
> Result:
> Warm-up of the Spark History Server and reading into the cache will be up to 
> 30% faster after tuning.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22752) FileNotFoundException while reading from Kafka

2017-12-14 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-22752.
-
Resolution: Duplicate

> FileNotFoundException while reading from Kafka
> --
>
> Key: SPARK-22752
> URL: https://issues.apache.org/jira/browse/SPARK-22752
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>
> We are running a stateful structured streaming job which reads from Kafka and 
> writes to HDFS. And we are hitting this exception:
> {noformat}
> 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null.
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): 
> java.lang.IllegalStateException: Error reading delta file 
> /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, 
> part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta 
> does not exist
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360)
>   at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359)
>   at scala.Option.getOrElse(Option.scala:121)
> {noformat}
> Of course, the file doesn't exist in HDFS, and the {{state/0/0}} directory 
> contains no files at all, while there are some files in the commits and 
> offsets folders. I am not sure about the cause of this behavior. It seems to 
> happen the second time the job is started, after the first run failed, so it 
> looks like task failures can trigger it. It might also be related to 
> watermarks, since there were problems with the incoming data and the 
> watermark ended up filtering out all of it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22784) Increase reading buffer size in Spark History Server

2017-12-14 Thread Mikhail Erofeev (JIRA)
Mikhail Erofeev created SPARK-22784:
---

 Summary: Increase reading buffer size in Spark History Server
 Key: SPARK-22784
 URL: https://issues.apache.org/jira/browse/SPARK-22784
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Mikhail Erofeev
Priority: Minor


Motivation:
Our Spark History Server spends most of its warm-up time inside BufferedReader 
and StringBuffer. This happens because the average line size of our events is 
~1,500,000 characters (due to many partitions and iterations), whereas the 
default buffer size is 2048 bytes. See the attached flame graph.

Implementation:
I've added logging of the time spent and the line size for each job, 
parametrised ReplayListenerBus with a new buffer-size parameter, and measured 
the best buffer size: 20x the average line size (30 MB) gives a 32% speedup in 
a local test.

Result:
Warm-up of the Spark History Server and reading into the cache will be up to 
30% faster after tuning.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications

2017-12-14 Thread omkar kankalapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290620#comment-16290620
 ] 

omkar kankalapati edited comment on SPARK-22783 at 12/14/17 9:39 AM:
-

EventLoggingListener (org.apache.spark.scheduler.EventLoggingListener) opens 
the event log file (with an .inprogress suffix) for writing in its start() 
method as a Writer object.

On every event, logEvent() is invoked, which simply appends to the writer 
without checking the size or performing any rotation. The writer is closed and 
the file is renamed to drop the ".inprogress" suffix only in the stop() method.

Thus, the *.inprogress file is held open throughout the application's lifetime 
and keeps growing.

It would be very helpful if EventLoggingListener could be enhanced to support 
rotating the .inprogress file when it reaches a (configured) threshold 
size/interval.


was (Author: omkar.kankalapati):
EventLoggingListener (org.apache.spark.scheduler.EventLoggingListener) opens 
the the log file named  .inprogress file for writing in start() 
method as a Writer object.

On all the events, logEvent() is invoked, which simply appends to the writer, 
without checking the size/performing any rotation.  The writer is closed and 
file is renamed to remove the ".inprogress" suffix only in stop() method.

Thus, the *.inprogress file will be held open throughout and keeps growing. 

It would be very helpful if EventLoggingListener can be enhanced to support 
rotating the file  .inprogress file when it reaches a (configured) threshhold 
size/interval. 
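
For what it's worth, the rotation being asked for could look roughly like the 
sketch below: a writer that rolls over to a new part file once a configured 
size threshold is crossed. This uses local java.io files for brevity; the real 
listener writes through Hadoop's FileSystem API, and the class and file-naming 
scheme here are made up for illustration.

{code:scala}
import java.io.{BufferedWriter, File, FileWriter}

// Minimal sketch of size-based rotation; not an actual Spark API.
class RotatingEventWriter(baseDir: File, baseName: String, maxBytes: Long) {
  private var part = 0
  private var written = 0L
  private var out = open()

  private def open(): BufferedWriter = {
    val f = new File(baseDir, s"$baseName.$part.inprogress")
    new BufferedWriter(new FileWriter(f))
  }

  def logEvent(json: String): Unit = {
    out.write(json)
    out.newLine()
    written += json.length + 1
    if (written >= maxBytes) rotate()   // roll to a new part file past the threshold
  }

  private def rotate(): Unit = {
    out.close()                          // closed parts can now be archived or cleaned up
    part += 1
    written = 0L
    out = open()
  }

  def stop(): Unit = out.close()
}
{code}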

> event log directory(spark-history) filled by large .inprogress files for 
> spark streaming applications
> -
>
> Key: SPARK-22783
> URL: https://issues.apache.org/jira/browse/SPARK-22783
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.1.0
> Environment: Linux(Generic)
>Reporter: omkar kankalapati
>Priority: Critical
>
> When running long-running streaming applications, HDFS storage gets filled 
> up with large *.inprogress files in the hdfs://spark-history/ directory.
> For example:
>  hadoop fs -du -h /spark-history
> 234 /spark-history/.inprogress
> 46.6 G  /spark-history/.inprogress
> Instead of continuing to write to a very large (multi-GB) .inprogress file, 
> Spark should rotate the current log file when it reaches a given size (for 
> example, 100 MB) or time interval, and perhaps expose a configuration 
> parameter for the size/interval.
> This is also mentioned in SPARK-12140 as a concern.
> It is very important and useful to support rotating the log files, because 
> users may have a limited HDFS quota and these large files consume it.
> Also, users do not have a viable workaround:
> 1) They cannot move the files to another location, because moving the file 
> causes event logging to stop.
> 2) Copying the .inprogress file to another location and then truncating the 
> original fails, because the file is still held open by EventLoggingListener 
> for writing:
> hdfs dfs -truncate -w 0 /spark-history/.inprogress
> truncate: Failed to TRUNCATE_FILE /spark-history/.inprogress 
> for DFSClient_NONMAPREDUCE_<#ID>on  because this file lease is currently 
> owned by DFSClient_NONMAPREDUCE_<#ID> on 
> The only workaround available is to disable the event logging for streaming 
> applications by setting "spark.eventLog.enabled" to false



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications

2017-12-14 Thread omkar kankalapati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290620#comment-16290620
 ] 

omkar kankalapati commented on SPARK-22783:
---

EventLoggingListener (org.apache.spark.scheduler.EventLoggingListener) opens 
the event log file (with an .inprogress suffix) for writing in its start() 
method as a Writer object.

On every event, logEvent() is invoked, which simply appends to the writer 
without checking the size or performing any rotation. The writer is closed and 
the file is renamed to drop the ".inprogress" suffix only in the stop() method.

Thus, the *.inprogress file is held open throughout the application's lifetime 
and keeps growing.

It would be very helpful if EventLoggingListener could be enhanced to support 
rotating the .inprogress file when it reaches a (configured) threshold 
size/interval.

> event log directory(spark-history) filled by large .inprogress files for 
> spark streaming applications
> -
>
> Key: SPARK-22783
> URL: https://issues.apache.org/jira/browse/SPARK-22783
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.1.0
> Environment: Linux(Generic)
>Reporter: omkar kankalapati
>Priority: Critical
>
> When running long-running streaming applications, HDFS storage gets filled 
> up with large *.inprogress files in the hdfs://spark-history/ directory.
> For example:
>  hadoop fs -du -h /spark-history
> 234 /spark-history/.inprogress
> 46.6 G  /spark-history/.inprogress
> Instead of continuing to write to a very large (multi-GB) .inprogress file, 
> Spark should rotate the current log file when it reaches a given size (for 
> example, 100 MB) or time interval, and perhaps expose a configuration 
> parameter for the size/interval.
> This is also mentioned in SPARK-12140 as a concern.
> It is very important and useful to support rotating the log files, because 
> users may have a limited HDFS quota and these large files consume it.
> Also, users do not have a viable workaround:
> 1) They cannot move the files to another location, because moving the file 
> causes event logging to stop.
> 2) Copying the .inprogress file to another location and then truncating the 
> original fails, because the file is still held open by EventLoggingListener 
> for writing:
> hdfs dfs -truncate -w 0 /spark-history/.inprogress
> truncate: Failed to TRUNCATE_FILE /spark-history/.inprogress 
> for DFSClient_NONMAPREDUCE_<#ID>on  because this file lease is currently 
> owned by DFSClient_NONMAPREDUCE_<#ID> on 
> The only workaround available is to disable the event logging for streaming 
> applications by setting "spark.eventLog.enabled" to false



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications

2017-12-14 Thread omkar kankalapati (JIRA)
omkar kankalapati created SPARK-22783:
-

 Summary: event log directory(spark-history) filled by large 
.inprogress files for spark streaming applications
 Key: SPARK-22783
 URL: https://issues.apache.org/jira/browse/SPARK-22783
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0, 1.6.0
 Environment: Linux(Generic)
Reporter: omkar kankalapati
Priority: Critical


When running long-running streaming applications, HDFS storage gets filled up 
with large *.inprogress files in the hdfs://spark-history/ directory.

For example:

 hadoop fs -du -h /spark-history

234 /spark-history/.inprogress

46.6 G  /spark-history/.inprogress


Instead of continuing to write to a very large (multi-GB) .inprogress file, 
Spark should rotate the current log file when it reaches a given size (for 
example, 100 MB) or time interval, and perhaps expose a configuration parameter 
for the size/interval.

This is also mentioned in SPARK-12140 as a concern.

It is very important and useful to support rotating the log files, because 
users may have a limited HDFS quota and these large files consume it.

Also, users do not have a viable workaround:

1) They cannot move the files to another location, because moving the file 
causes event logging to stop.

2) Copying the .inprogress file to another location and then truncating the 
original fails, because the file is still held open by EventLoggingListener 
for writing:

hdfs dfs -truncate -w 0 /spark-history/.inprogress
truncate: Failed to TRUNCATE_FILE /spark-history/.inprogress 
for DFSClient_NONMAPREDUCE_<#ID>on  because this file lease is currently 
owned by DFSClient_NONMAPREDUCE_<#ID> on 


The only workaround available is to disable the event logging for streaming 
applications by setting "spark.eventLog.enabled" to false



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22782) Boost speed, use kafka010 consumer kafka

2017-12-14 Thread licun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

licun updated SPARK-22782:
--
Description: 
We use Spark Structured Streaming to consume from Kafka, but we find the 
consumer speed is too slow compared to Spark Streaming. We set the Kafka 
option "maxOffsetsPerTrigger" to 1.
   By adding a debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we 
found the following:

   The debug code:
  
{code:scala}
private def poll(pollTimeoutMs: Long): Unit = {
  val startTime = System.currentTimeMillis()
  val p = consumer.poll(pollTimeoutMs)
  val r = p.records(topicPartition)
  val endTime = System.currentTimeMillis()
  val delta = endTime - startTime
  logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: ${delta} ms")
  fetchedData = r.iterator
}
{code}

   The log:
   


  was:
We use spark structured streaming to consumer kafka, but we find the 
consumer speed is too slow compare spark streaming . we set kafka 
"maxOffsetsPerTrigger": 1. 
   By adding debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we  found 
the following situation: 
 
   The debug code:
   
{code:scala}
private def poll(pollTimeoutMs: Long): Unit = {
val startTime = System.currentTimeMillis()
val p = consumer.poll(pollTimeoutMs)
val r = p.records(topicPartition)
val endTime = System.currentTimeMillis()
val delta = endTime - startTime
logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: 
${delta} ms")
fetchedData = r.iterator
  }
{code}



> Boost speed, use kafka010 consumer kafka
> 
>
> Key: SPARK-22782
> URL: https://issues.apache.org/jira/browse/SPARK-22782
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
> Environment: kafka-version:  0.10
> spark-version: 2.2.0
> schedule spark on yarn
>Reporter: licun
>Priority: Critical
>
> We use Spark Structured Streaming to consume from Kafka, but we find the 
> consumer speed is too slow compared to Spark Streaming. We set the Kafka 
> option "maxOffsetsPerTrigger" to 1.
> By adding a debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we 
> found the following:
>
> The debug code:
>   
> {code:scala}
> private def poll(pollTimeoutMs: Long): Unit = {
>   val startTime = System.currentTimeMillis()
>   val p = consumer.poll(pollTimeoutMs)
>   val r = p.records(topicPartition)
>   val endTime = System.currentTimeMillis()
>   val delta = endTime - startTime
>   logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: ${delta} ms")
>   fetchedData = r.iterator
> }
> {code}
>The log:
>
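
For context, the poll being timed above runs inside a Structured Streaming 
query on the Kafka source with a per-trigger offset cap. A minimal sketch of 
such a query (bootstrap servers, topic, cap value, and trigger interval are 
placeholders, not the reporter's settings):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// Sketch only: the shape of a rate-capped Structured Streaming read from Kafka.
object KafkaRateCapSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-rate-cap").getOrCreate()

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")  // placeholder
      .option("subscribe", "events")                     // placeholder topic
      .option("maxOffsetsPerTrigger", 10000L)            // caps offsets per micro-batch
      .load()

    df.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()
      .awaitTermination()
  }
}
{code}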



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22782) Boost speed, use kafka010 consumer kafka

2017-12-14 Thread licun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

licun updated SPARK-22782:
--
Description: 
We use Spark Structured Streaming to consume from Kafka, but we find the 
consumer speed is too slow compared to Spark Streaming. We set the Kafka 
option "maxOffsetsPerTrigger" to 1.
   By adding a debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we 
found the following:

   The debug code:
   
{code:scala}
private def poll(pollTimeoutMs: Long): Unit = {
  val startTime = System.currentTimeMillis()
  val p = consumer.poll(pollTimeoutMs)
  val r = p.records(topicPartition)
  val endTime = System.currentTimeMillis()
  val delta = endTime - startTime
  logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: ${delta} ms")
  fetchedData = r.iterator
}
{code}


  was:
We use spark structured streaming to consumer kafka, but we find the 
consumer speed is too slow compare spark streaming . we set kafka 
"maxOffsetsPerTrigger": 1. 
   By adding debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we  found 
the following situation: 
 
   The debug code:
   {color:red}private def poll(pollTimeoutMs: Long): Unit = {
val startTime = System.currentTimeMillis()
val p = consumer.poll(pollTimeoutMs)
val r = p.records(topicPartition)
val endTime = System.currentTimeMillis()
val delta = endTime - startTime
logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: 
${delta} ms")
fetchedData = r.iterator
  }{color}


> Boost speed, use kafka010 consumer kafka
> 
>
> Key: SPARK-22782
> URL: https://issues.apache.org/jira/browse/SPARK-22782
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.2.0
> Environment: kafka-version:  0.10
> spark-version: 2.2.0
> schedule spark on yarn
>Reporter: licun
>Priority: Critical
>
> We use Spark Structured Streaming to consume from Kafka, but we find the 
> consumer speed is too slow compared to Spark Streaming. We set the Kafka 
> option "maxOffsetsPerTrigger" to 1.
> By adding a debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we 
> found the following:
>
> The debug code:
>
> {code:scala}
> private def poll(pollTimeoutMs: Long): Unit = {
>   val startTime = System.currentTimeMillis()
>   val p = consumer.poll(pollTimeoutMs)
>   val r = p.records(topicPartition)
>   val endTime = System.currentTimeMillis()
>   val delta = endTime - startTime
>   logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: ${delta} ms")
>   fetchedData = r.iterator
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22782) Boost speed, use kafka010 consumer kafka

2017-12-14 Thread licun (JIRA)
licun created SPARK-22782:
-

 Summary: Boost speed, use kafka010 consumer kafka
 Key: SPARK-22782
 URL: https://issues.apache.org/jira/browse/SPARK-22782
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.2.0, 2.1.0
 Environment: kafka-version:  0.10
spark-version: 2.2.0
schedule spark on yarn
Reporter: licun
Priority: Critical


We use Spark Structured Streaming to consume from Kafka, but we find the 
consumer speed is too slow compared to Spark Streaming. We set the Kafka 
option "maxOffsetsPerTrigger" to 1.
   By adding a debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we 
found the following:

   The debug code:
   {color:red}private def poll(pollTimeoutMs: Long): Unit = {
val startTime = System.currentTimeMillis()
val p = consumer.poll(pollTimeoutMs)
val r = p.records(topicPartition)
val endTime = System.currentTimeMillis()
val delta = endTime - startTime
logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: 
${delta} ms")
fetchedData = r.iterator
  }{color}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22771) SQL concat for binary

2017-12-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22771:


Assignee: Apache Spark

> SQL concat for binary 
> --
>
> Key: SPARK-22771
> URL: https://issues.apache.org/jira/browse/SPARK-22771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Fernando Pereira
>Assignee: Apache Spark
>Priority: Minor
>
> The spark.sql {{concat}} function automatically casts its arguments to 
> StringType and returns a String.
> This might be the behavior of traditional databases; however, Spark has 
> Binary as a standard type, and concat'ing binary values seems reasonable if 
> it returns another binary sequence.
> Take Python as an example, where both {{bytes}} and {{unicode}} represent 
> text: concat'ing two values of the same type yields that type, and when they 
> are intermixed (str + unicode) the most generic type (unicode) is returned.
> Following the same principle, I believe that concat'ing binary values should 
> return a binary value.
> In terms of Spark behavior, this would affect only the case where all 
> arguments are binary; all other cases should remain unchanged.
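
A small illustration of the proposal using the DataFrame API (column names are 
made up): with the current behavior the result column comes out as a string, 
while under the change described above an all-binary input would keep 
BinaryType.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat}

// Sketch only: shows where the proposed behavior would differ.
val spark = SparkSession.builder().appName("binary-concat-example").getOrCreate()
import spark.implicits._

val df = Seq((Array[Byte](1, 2), Array[Byte](3, 4))).toDF("a", "b")
val out = df.select(concat(col("a"), col("b")).as("ab"))

// Current: ab is StringType because the arguments are cast to string first.
// Proposed: ab stays BinaryType because every argument is binary.
out.printSchema()
{code}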



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22771) SQL concat for binary

2017-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290551#comment-16290551
 ] 

Apache Spark commented on SPARK-22771:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/19977

> SQL concat for binary 
> --
>
> Key: SPARK-22771
> URL: https://issues.apache.org/jira/browse/SPARK-22771
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Fernando Pereira
>Priority: Minor
>
> The spark.sql {{concat}} function automatically casts its arguments to 
> StringType and returns a String.
> This might be the behavior of traditional databases; however, Spark has 
> Binary as a standard type, and concat'ing binary values seems reasonable if 
> it returns another binary sequence.
> Take Python as an example, where both {{bytes}} and {{unicode}} represent 
> text: concat'ing two values of the same type yields that type, and when they 
> are intermixed (str + unicode) the most generic type (unicode) is returned.
> Following the same principle, I believe that concat'ing binary values should 
> return a binary value.
> In terms of Spark behavior, this would affect only the case where all 
> arguments are binary; all other cases should remain unchanged.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12

2017-12-14 Thread liyunzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290518#comment-16290518
 ] 

liyunzhang commented on SPARK-22660:


[~srowen]: there is another modification involving limit() in 
TaskSetManager.scala. Very sorry for not including it in the last commit.
{code}
       abort(s"$msg Exception during serialization: $e")
       throw new TaskNotSerializableException(e)
     }
-    if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
+    if (serializedTask.limit() > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
       !emittedTaskSizeWarning) {
       emittedTaskSizeWarning = true
       logWarning(s"Stage ${task.stageId} contains a task of very large size " +
-        s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
+        s"(${serializedTask.limit() / 1024} KB). The maximum recommended task size is " +
         s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
     }
     addRunningTask(taskId)
@@ -502,7 +502,7 @@ private[spark] class TaskSetManager(
     // val timeTaken = clock.getTime() - startTime
     val taskName = s"task ${info.id} in stage ${taskSet.id}"

     logInfo(s"Starting $taskName (TID $taskId, $host, executor ${info.executorId}, " +
-      s"partition ${task.partitionId}, $taskLocality, ${serializedTask.limit} bytes)")
+      s"partition ${task.partitionId}, $taskLocality, ${serializedTask.limit()} bytes)")

     sched.dagScheduler.taskStarted(task, info)
     new TaskDescription(
{code}

Can you help review?
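
As a stand-alone illustration of the pattern this patch applies (not actual 
Spark code): calling the Buffer accessors with explicit parentheses avoids the 
overload ambiguity this ticket describes for the Scala 2.12 build.

{code:scala}
import java.nio.ByteBuffer

// Minimal sketch: prefer the explicit empty-parens calls, which compile
// cleanly where the parameterless field-style access is reported as ambiguous.
object BufferLimitSketch {
  def sizeOf(buf: ByteBuffer): Int = buf.limit()        // instead of buf.limit
  def positionOf(buf: ByteBuffer): Int = buf.position() // instead of buf.position

  def main(args: Array[String]): Unit = {
    val buf = ByteBuffer.allocate(64)
    buf.put(Array[Byte](1, 2, 3))
    buf.flip()
    println(s"position=${positionOf(buf)}, limit=${sizeOf(buf)}") // position=0, limit=3
  }
}
{code}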

> Use position() and limit() to fix ambiguity issue in scala-2.12
> ---
>
> Key: SPARK-22660
> URL: https://issues.apache.org/jira/browse/SPARK-22660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Build with scala-2.12 using the following steps:
> 1. Change the pom.xml to scala-2.12:
>  ./dev/change-scala-version.sh 2.12
> 2. Build with -Pscala-2.12:
> for hive on spark
> {code}
> ./dev/make-distribution.sh   --tgz -Pscala-2.12 -Phadoop-2.7  -Pyarn 
> -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> for spark sql
> {code}
> ./dev/make-distribution.sh  --tgz -Pscala-2.12 -Phadoop-2.7  -Pyarn -Phive 
> -Dhadoop.version=2.7.3>log.sparksql 2>&1
> {code}
> get following error
> #Error1
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: 
> error: cannot find   symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK 9; 
> HADOOP-12760 will be the long-term fix.
> #Error2
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455:
>  ambiguous reference to overloaded definition, method limit in class 
> ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> method limit in class Buffer of type ()Int
> match expected type ?
>  val resultSize = serializedDirectResult.limit
> error 
> {code}
> The limit method was moved from ByteBuffer to the superclass Buffer and it 
> can no longer be called without (). The same applies to the position method.
> #Error3
> {code}
> home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415:
>  ambiguous reference to overloaded definition, [error] both method putAll in 
> class Properties of type (x$1: java.util.Map[_, _])Unit [error] and  method 
> putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: 
> Object])Unit [error] match argument types (java.util.Map[String,String])
>  [error] properties.putAll(propsMap.asJava)
>  [error]^
> [error] 
> /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427:
>  ambiguous reference to overloaded definition, [error] both method putAll in 
> class Properties of type (x$1: java.util.Map[_, _])Unit [error] and  method 
> putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: 
> Object])Unit [error] match argument types (java.util.Map[String,String])
>  [error]   props.putAll(outputSerdeProps.toMap.asJava)
>  [error] ^
> 

[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12

2017-12-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290514#comment-16290514
 ] 

Apache Spark commented on SPARK-22660:
--

User 'kellyzly' has created a pull request for this issue:
https://github.com/apache/spark/pull/19976

> Use position() and limit() to fix ambiguity issue in scala-2.12
> ---
>
> Key: SPARK-22660
> URL: https://issues.apache.org/jira/browse/SPARK-22660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.0
>Reporter: liyunzhang
>Assignee: liyunzhang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Build with scala-2.12 using the following steps:
> 1. Change the pom.xml to scala-2.12:
>  ./dev/change-scala-version.sh 2.12
> 2. Build with -Pscala-2.12:
> for hive on spark
> {code}
> ./dev/make-distribution.sh   --tgz -Pscala-2.12 -Phadoop-2.7  -Pyarn 
> -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> for spark sql
> {code}
> ./dev/make-distribution.sh  --tgz -Pscala-2.12 -Phadoop-2.7  -Pyarn -Phive 
> -Dhadoop.version=2.7.3>log.sparksql 2>&1
> {code}
> get following error
> #Error1
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: 
> error: cannot find   symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK 9; 
> HADOOP-12760 will be the long-term fix.
> #Error2
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455:
>  ambiguous reference to overloaded definition, method limit in class 
> ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> method limit in class Buffer of type ()Int
> match expected type ?
>  val resultSize = serializedDirectResult.limit
> error 
> {code}
> The limit method was moved from ByteBuffer to the superclass Buffer and it 
> can no longer be called without (). The same applies to the position method.
> #Error3
> {code}
> home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415:
>  ambiguous reference to overloaded definition, [error] both method putAll in 
> class Properties of type (x$1: java.util.Map[_, _])Unit [error] and  method 
> putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: 
> Object])Unit [error] match argument types (java.util.Map[String,String])
>  [error] properties.putAll(propsMap.asJava)
>  [error]^
> [error] 
> /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427:
>  ambiguous reference to overloaded definition, [error] both method putAll in 
> class Properties of type (x$1: java.util.Map[_, _])Unit [error] and  method 
> putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: 
> Object])Unit [error] match argument types (java.util.Map[String,String])
>  [error]   props.putAll(outputSerdeProps.toMap.asJava)
>  [error] ^
>  {code}
>  This is because the key type is Object instead of String, which is unsafe.
> After fixing these three errors, the build compiles successfully.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org