[jira] [Commented] (SPARK-22792) PySpark UDF registering issue
[ https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292163#comment-16292163 ] Hyukjin Kwon commented on SPARK-22792:
--
Does this work in 2.1.x or a lower version? I just want to check whether this is a regression. Also, could you maybe make a minimised reproducer? It seems hard to follow as is. Does the object you used work with normal, regular pickle locally?

> PySpark UDF registering issue
> -
>
> Key: SPARK-22792
> URL: https://issues.apache.org/jira/browse/SPARK-22792
> Project: Spark
> Issue Type: Question
> Components: PySpark
> Affects Versions: 2.2.1
> Environment: Windows OS, Python PyCharm, Spark
> Reporter: Annamalai Venugopal
> Labels: windows
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> I am doing a project with PySpark and I am stuck with an issue:
>
> Traceback (most recent call last):
>   File "C:/Users/avenugopal/PycharmProjects/POC_for_vectors/main.py", line 187, in
>     hypernym_extracted_data = result.withColumn("hypernym_extracted_data", hypernym_fn(F.column("token_extracted_data")))
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py", line 1957, in wrapper
>     return udf_obj(*args)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py", line 1916, in __call__
>     judf = self._judf
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py", line 1900, in _judf
>     self._judf_placeholder = self._create_judf()
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py", line 1909, in _create_judf
>     wrapped_func = _wrap_function(sc, self.func, self.returnType)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py", line 1866, in _wrap_function
>     pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\rdd.py", line 2374, in _prepare_for_python_RDD
>     pickled_command = ser.dumps(command)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\serializers.py", line 460, in dumps
>     return cloudpickle.dumps(obj, 2)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 704, in dumps
>     cp.dump(obj)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 148, in dump
>     return Pickler.dump(self, obj)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 409, in dump
>     self.save(obj)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 476, in save
>     f(self, obj)  # Call unbound method with explicit self
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 736, in save_tuple
>     save(element)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 476, in save
>     f(self, obj)  # Call unbound method with explicit self
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 249, in save_function
>     self.save_function_tuple(obj)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 297, in save_function_tuple
>     save(f_globals)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 476, in save
>     f(self, obj)  # Call unbound method with explicit self
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 821, in save_dict
>     self._batch_setitems(obj.items())
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 852, in _batch_setitems
>     save(v)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 476, in save
>     f(self, obj)  # Call unbound method with explicit self
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 249, in save_function
>     self.save_function_tuple(obj)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 297, in save_function_tuple
>     save(f_globals)
>   File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 476, in
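The "does it work with regular pickle" check suggested above can be sketched roughly like this. This is a generic, Spark-free snippet: `picklable` is a hypothetical helper name, and `threading.Lock` merely stands in for whatever unpicklable object the UDF may be closing over.

```python
import pickle
import threading

def picklable(obj):
    """Return True if obj survives a round trip through plain pickle."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False

print(picklable([1, 2, 3]))          # True: plain data pickles fine
print(picklable(threading.Lock()))   # False: a lock cannot be pickled
```

If an object used inside the UDF fails this check locally, cloudpickle is likely to fail the same way while walking the function's globals, as the traceback above shows (`save_function_tuple` then `save(f_globals)`).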
[jira] [Commented] (SPARK-22792) PySpark UDF registering issue
[ https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292162#comment-16292162 ] Annamalai Venugopal commented on SPARK-22792:
--
Sorry, I am new to this. I'll change it now.
[jira] [Updated] (SPARK-22792) PySpark UDF registering issue
[ https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-22792:
--
Fix Version/s: (was: 2.2.1)
[jira] [Updated] (SPARK-22792) PySpark UDF registering issue
[ https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-22792:
--
Target Version/s: (was: 2.2.1)
[jira] [Updated] (SPARK-22792) PySpark UDF registering issue
[ https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-22792:
--
Priority: Major (was: Blocker)
[jira] [Commented] (SPARK-22792) PySpark UDF registering issue
[ https://issues.apache.org/jira/browse/SPARK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292158#comment-16292158 ] Hyukjin Kwon commented on SPARK-22792:
--
Please don't set the Blocker priority or a target version; those are usually reserved for committers. Likewise, the fix version is usually set only when an issue is actually fixed.
[jira] [Updated] (SPARK-22794) Spark Job failed, but the state is succeeded in Yarn Web
[ https://issues.apache.org/jira/browse/SPARK-22794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-22794:
--
Attachment: task_is_succeeded_in_yarn_web.png

> Spark Job failed, but the state is succeeded in Yarn Web
>
> Key: SPARK-22794
> URL: https://issues.apache.org/jira/browse/SPARK-22794
> Project: Spark
> Issue Type: Bug
> Components: YARN
> Affects Versions: 2.2.1
> Reporter: KaiXinXIaoLei
> Attachments: task_is_succeeded_in_yarn_web.png
>
> I run a job in YARN mode and the job fails:
> {noformat}
> 17/12/15 11:55:16 INFO SharedState: Warehouse path is '/apps/hive/warehouse'.
> Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://node1.huleilei.h3c.com:8020/user/hdfs/sss;
> at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
> at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
> {noformat}
> but in the YARN web UI, the job state is SUCCEEDED.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22794) Spark Job failed, but the state is succeeded in Yarn Web
[ https://issues.apache.org/jira/browse/SPARK-22794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] KaiXinXIaoLei updated SPARK-22794:
--
Description:
I run a job in YARN mode and the job fails:
{noformat}
17/12/15 11:55:16 INFO SharedState: Warehouse path is '/apps/hive/warehouse'.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://node1.huleilei.h3c.com:8020/user/hdfs/sss;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
{noformat}
but in the YARN web UI, the job state is SUCCEEDED.

was:
I run a job in yarn mode, the job is failed:
{noformat}
17/12/15 11:55:16 INFO SharedState: Warehouse path is '/apps/hive/warehouse'.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://node1.huleilei.h3c.com:8020/user/hdfs/sss;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
{noformat}
but in the yarn web, the job state is succeeded: !C:\Users\g15071\Downloads\task_is_succeeded_in_yarn_web.png!
[jira] [Created] (SPARK-22794) Spark Job failed, but the state is succeeded in Yarn Web
KaiXinXIaoLei created SPARK-22794:
--
Summary: Spark Job failed, but the state is succeeded in Yarn Web
Key: SPARK-22794
URL: https://issues.apache.org/jira/browse/SPARK-22794
Project: Spark
Issue Type: Bug
Components: YARN
Affects Versions: 2.2.1
Reporter: KaiXinXIaoLei

I run a job in YARN mode and the job fails:
{noformat}
17/12/15 11:55:16 INFO SharedState: Warehouse path is '/apps/hive/warehouse'.
Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://node1.huleilei.h3c.com:8020/user/hdfs/sss;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
{noformat}
but in the YARN web UI, the job state is SUCCEEDED:
!C:\Users\g15071\Downloads\task_is_succeeded_in_yarn_web.png!
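Until the YARN status reporting itself is addressed, one driver-side mitigation sometimes used is to make the driver's exit status reflect the failure explicitly. This is a hedged sketch only: `run_job` is a hypothetical stand-in for the real driver body (in a real job, `spark.read` on a missing HDFS path raises `AnalysisException`), and whether YARN honors the exit code depends on the deploy mode.

```python
import sys

def run_job():
    """Hypothetical driver body; stands in for Spark code that fails."""
    raise RuntimeError("Path does not exist: hdfs://.../user/hdfs/sss")

def main():
    try:
        run_job()
    except Exception as exc:
        print("job failed: %s" % exc, file=sys.stderr)
        return 1  # non-zero exit status signals failure to the launcher
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The point of the sketch is simply that a swallowed exception leaves the process exiting 0, which looks like success to anything watching the process.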
[jira] [Updated] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Agarwal updated SPARK-22371: --- Attachment: ShuffleIssue.java Helper.scala > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal > Attachments: Helper.scala, ShuffleIssue.java, ShuffleIssue.java, > driver-thread-dump-spark2.1.txt > > > Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > from our investigation it look like accumulator is cleaned by GC first and > same accumulator is used for merging the results from executor on task > completion event. > As the error java.lang.IllegalAccessError is LinkageError which is treated as > FatalError so dag-scheduler loop is finished with below exception. 
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
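The failure mode described above (accumulator cleaned by GC before a late task-completion event merges into it) can be sketched with the standard library alone. This is an illustrative model, not Spark's real AccumulatorContext: the driver keeps accumulators in a registry of weak references, so once the last strong reference is collected, a later lookup fails:

```python
# Stdlib sketch (illustrative, not Spark's real AccumulatorContext): the
# driver registry holds only weak references to accumulators. If the only
# strong reference is garbage collected before a late task-completion
# event merges into it, the lookup fails, mirroring the
# IllegalAccessError in the stack trace above.
import gc
import weakref

class Accumulator:
    def __init__(self, acc_id):
        self.id = acc_id
        self.value = 0

registry = {}  # id -> weakref, like AccumulatorContext's originals map

def register(acc):
    registry[acc.id] = weakref.ref(acc)

def get(acc_id):
    ref = registry.get(acc_id)
    acc = ref() if ref is not None else None
    if acc is None:
        raise RuntimeError(
            "Attempted to access garbage collected accumulator %d" % acc_id)
    return acc

acc = Accumulator(5605982)
register(acc)
assert get(5605982).id == 5605982

del acc          # driver drops its strong reference (e.g. dataset GC'd)
gc.collect()     # a Full GC clears the weak reference

try:
    get(5605982)  # a late task-completion event now fails
except RuntimeError as e:
    print(e)
```

This also explains why the error correlates with a Full GC firing while parallel jobs on unioned datasets are still completing tasks.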
[jira] [Updated] (SPARK-22793) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-22793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-22793: Description: 1. Start HiveThriftServer2. 2. Connect to thriftserver through beeline. 3. Close the beeline. 4. repeat step2 and step 3 for several times, which caused the leak of Memory. we found there are many directories never be dropped under the path {code:java} hive.exec.local.scratchdir {code} and {code:java} hive.exec.scratchdir {code} , as we know the scratchdir has been added to deleteOnExit when it be created. So it means that the cache size of FileSystem deleteOnExit will keep increasing until JVM terminated. In addition, we use {code:java} jmap -histo:live [PID] {code} to printout the size of objects in HiveThriftServer2 Process, we can find the object "org.apache.spark.sql.hive.client.HiveClientImpl" and "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though we closed all the beeline connections, which caused the leak of Memory. was: 1. Start HiveThriftServer2 2. Connect to thriftserver through beeline 3. Close the beeline 4. repeat step2 and step 3 for several times we found there are many directories never be dropped under the path {code:java} hive.exec.local.scratchdir {code} and {code:java} hive.exec.scratchdir {code} , as we know the scratchdir has been added to deleteOnExit when it be created. So it means that the cache size of FileSystem deleteOnExit will keep increasing until JVM terminated. In addition, we use {code:java} jmap -histo:live [PID] {code} to printout the size of objects in HiveThriftServer2 Process, we can find the object "org.apache.spark.sql.hive.client.HiveClientImpl" and "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though we closed all the beeline connections, which caused the leak of Memory. 
> Memory leak in Spark Thrift Server > -- > > Key: SPARK-22793 > URL: https://issues.apache.org/jira/browse/SPARK-22793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: zuotingbing >Priority: Critical > > 1. Start HiveThriftServer2. > 2. Connect to thriftserver through beeline. > 3. Close the beeline. > 4. repeat step2 and step 3 for several times, which caused the leak of Memory. > we found there are many directories never be dropped under the path > {code:java} > hive.exec.local.scratchdir > {code} and > {code:java} > hive.exec.scratchdir > {code} , as we know the scratchdir has been added to deleteOnExit when it be > created. So it means that the cache size of FileSystem deleteOnExit will keep > increasing until JVM terminated. > In addition, we use > {code:java} > jmap -histo:live [PID] > {code} to printout the size of objects in HiveThriftServer2 Process, we can > find the object "org.apache.spark.sql.hive.client.HiveClientImpl" and > "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though > we closed all the beeline connections, which caused the leak of Memory. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22793) Memory leak in Spark Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-22793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zuotingbing updated SPARK-22793: Description: 1. Start HiveThriftServer2 2. Connect to thriftserver through beeline 3. Close the beeline 4. repeat step2 and step 3 for several times we found there are many directories never be dropped under the path {code:java} hive.exec.local.scratchdir {code} and {code:java} hive.exec.scratchdir {code} , as we know the scratchdir has been added to deleteOnExit when it be created. So it means that the cache size of FileSystem deleteOnExit will keep increasing until JVM terminated. In addition, we use {code:java} jmap -histo:live [PID] {code} to printout the size of objects in HiveThriftServer2 Process, we can find the object "org.apache.spark.sql.hive.client.HiveClientImpl" and "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though we closed all the beeline connections, which caused the leak of Memory. was: 1. Start HiveThriftServer2 2. Connect to thriftserver through beeline 3. Close the beeline 4. repeat step2 and step 3 for several times we found there are many directories never be dropped under the path {code:java} hive.exec.local.scratchdir {code} and {code:java} hive.exec.scratchdir {code} , as we know the scratchdir is added to deleteOnExit when it be created. So it means that the cache size of FileSystem deleteOnExit will keep increasing until JVM terminated. In addition, we use {code:java} jmap -histo:live [PID] {code} to printout the size of objects in HiveThriftServer2 Process, we can find the object "org.apache.spark.sql.hive.client.HiveClientImpl" and "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though we closed all the beeline connections, which caused the leak of Memory. 
> Memory leak in Spark Thrift Server > -- > > Key: SPARK-22793 > URL: https://issues.apache.org/jira/browse/SPARK-22793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: zuotingbing >Priority: Critical > > 1. Start HiveThriftServer2 > 2. Connect to thriftserver through beeline > 3. Close the beeline > 4. repeat step2 and step 3 for several times > we found there are many directories never be dropped under the path > {code:java} > hive.exec.local.scratchdir > {code} and > {code:java} > hive.exec.scratchdir > {code} , as we know the scratchdir has been added to deleteOnExit when it be > created. So it means that the cache size of FileSystem deleteOnExit will keep > increasing until JVM terminated. > In addition, we use > {code:java} > jmap -histo:live [PID] > {code} to printout the size of objects in HiveThriftServer2 Process, we can > find the object "org.apache.spark.sql.hive.client.HiveClientImpl" and > "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though > we closed all the beeline connections, which caused the leak of Memory. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22793) Memory leak in Spark Thrift Server
zuotingbing created SPARK-22793: --- Summary: Memory leak in Spark Thrift Server Key: SPARK-22793 URL: https://issues.apache.org/jira/browse/SPARK-22793 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.2 Reporter: zuotingbing Priority: Critical 1. Start HiveThriftServer2. 2. Connect to the thriftserver through beeline. 3. Close the beeline session. 4. Repeat steps 2 and 3 several times. We found many directories under the paths {code:java} hive.exec.local.scratchdir {code} and {code:java} hive.exec.scratchdir {code} that are never dropped. As we know, the scratchdir is added to deleteOnExit when it is created, which means the FileSystem deleteOnExit cache keeps growing until the JVM terminates. In addition, we used {code:java} jmap -histo:live [PID] {code} to print out the sizes of objects in the HiveThriftServer2 process: the counts of "org.apache.spark.sql.hive.client.HiveClientImpl" and "org.apache.hadoop.hive.ql.session.SessionState" keep increasing even though we closed all the beeline connections, which causes the memory leak.
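The growth pattern described above can be sketched in a few lines of plain Python (illustrative only, not Hadoop's actual FileSystem class): a deleteOnExit set that is only drained at process exit grows without bound in a long-running server, because closing a session never removes its entries:

```python
# Illustrative stdlib sketch of the leak pattern: Hadoop's FileSystem
# keeps a deleteOnExit set that is only drained when the JVM exits. In a
# long-running HiveThriftServer2, every beeline session adds scratch
# directories to the set, and closing the session does not remove them.
class FileSystem:
    def __init__(self):
        self._delete_on_exit = set()  # never drained while process lives

    def delete_on_exit(self, path):
        self._delete_on_exit.add(path)

fs = FileSystem()
for session_id in range(1000):  # repeated connect/close cycles
    fs.delete_on_exit("/tmp/hive/scratch/session-%d" % session_id)
    # closing the beeline session does NOT remove the entry

print(len(fs._delete_on_exit))  # 1000 entries retained
```

In a process that never exits, every retained path (and the session objects that created it) is exactly the kind of monotonic growth the jmap histogram shows.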
[jira] [Created] (SPARK-22792) PySpark UDF registering issue
Annamalai Venugopal created SPARK-22792: --- Summary: PySpark UDF registering issue Key: SPARK-22792 URL: https://issues.apache.org/jira/browse/SPARK-22792 Project: Spark Issue Type: Question Components: PySpark Affects Versions: 2.2.1 Environment: Windows OS, Python pycharm ,Spark Reporter: Annamalai Venugopal Priority: Blocker Fix For: 2.2.1 I am doing a project with pyspark i am struck with an issue Traceback (most recent call last): File "C:/Users/avenugopal/PycharmProjects/POC_for_vectors/main.py", line 187, in hypernym_extracted_data = result.withColumn("hypernym_extracted_data", hypernym_fn(F.column("token_extracted_data"))) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py", line 1957, in wrapper return udf_obj(*args) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py", line 1916, in __call__ judf = self._judf File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py", line 1900, in _judf self._judf_placeholder = self._create_judf() File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py", line 1909, in _create_judf wrapped_func = _wrap_function(sc, self.func, self.returnType) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\functions.py", line 1866, in _wrap_function pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\rdd.py", line 2374, in _prepare_for_python_RDD pickled_command = ser.dumps(command) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\serializers.py", line 460, in dumps return cloudpickle.dumps(obj, 2) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 704, in 
dumps cp.dump(obj) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 148, in dump return Pickler.dump(self, obj) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 409, in dump self.save(obj) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 476, in save f(self, obj) # Call unbound method with explicit self File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 736, in save_tuple save(element) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 476, in save f(self, obj) # Call unbound method with explicit self File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 249, in save_function self.save_function_tuple(obj) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 297, in save_function_tuple save(f_globals) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 476, in save f(self, obj) # Call unbound method with explicit self File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 821, in save_dict self._batch_setitems(obj.items()) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 852, in _batch_setitems save(v) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 476, in save f(self, obj) # Call unbound method with explicit self File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 249, in save_function self.save_function_tuple(obj) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 297, in save_function_tuple save(f_globals) File 
"C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 476, in save f(self, obj) # Call unbound method with explicit self File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 821, in save_dict self._batch_setitems(obj.items()) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 852, in _batch_setitems save(v) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\pickle.py", line 521, in save self.save_reduce(obj=obj, *rv) File "C:\Users\avenugopal\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\cloudpickle.py", line 565, in save_reduce "args[0] from
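The traceback above is truncated, but it stops inside cloudpickle's save_function_tuple/save_dict, which points at the usual cause: the UDF, or a global it references, drags an unserializable object (such as the SparkContext) into the pickled closure. Note that stdlib pickle serializes top-level functions by reference, while cloudpickle serializes them by value together with referenced globals; the core failure can still be reproduced with the stdlib by pickling the offending object directly. A hedged sketch, using a thread lock as a stand-in for the unpicklable object:

```python
# Stdlib reproduction of the failure mode (illustrative): a thread lock
# stands in for any unpicklable handle such as a SparkContext or socket.
import pickle
import threading

sc_like = threading.Lock()  # unpicklable, like a SparkContext

def udf_closure(token):
    # refers to the module-global sc_like, so serializing this function
    # *by value* (as cloudpickle does for UDFs) must also serialize
    # sc_like, which is where the dump fails
    with sc_like:
        return token.upper()

try:
    pickle.dumps(sc_like)  # the step that fails inside save_dict
except TypeError as e:
    print("cannot pickle:", e)
```

The usual fix is to make sure the UDF body only references plain data and picklable helpers, creating any non-serializable resources inside the function rather than capturing them from the driver's globals.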
[jira] [Comment Edited] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292134#comment-16292134 ] Mayank Agarwal edited comment on SPARK-22371 at 12/15/17 7:14 AM: -- Hi, Sorry for late reply. >From our analysis this error seems to come when there are jobs running >parallel on datasets and union of all those datasets and Full GC triggered at >same time which clears the accumulate of children dataset of union. I am attaching a small program from which this error comes frequently. [^ShuffleIssue.java] was (Author: mayank.agarwal2305): [^ShuffleIssue.java] > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal > Attachments: ShuffleIssue.java, driver-thread-dump-spark2.1.txt > > > Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > from our investigation it look like accumulator is cleaned by GC first and > same accumulator is used for merging the results from executor on task > completion event. > As the error java.lang.IllegalAccessError is LinkageError which is treated as > FatalError so dag-scheduler loop is finished with below exception. 
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292134#comment-16292134 ] Mayank Agarwal edited comment on SPARK-22371 at 12/15/17 7:13 AM: -- [^ShuffleIssue.java] was (Author: mayank.agarwal2305): Hi, Sorry for late reply. >From our analysis this error seems to come when there are jobs running >parallel on datasets and union of all those datasets and Full GC triggered at >same time which clears the accumulate of children dataset of union. I am attaching a small program from which this error comes frequently. > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal > Attachments: ShuffleIssue.java, driver-thread-dump-spark2.1.txt > > > Our Spark Jobs are getting stuck on DagScheduler.runJob as dagscheduler > thread is stopped because of *Attempted to access garbage collected > accumulator 5605982*. > from our investigation it look like accumulator is cleaned by GC first and > same accumulator is used for merging the results from executor on task > completion event. > As the error java.lang.IllegalAccessError is LinkageError which is treated as > FatalError so dag-scheduler loop is finished with below exception. 
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22371) dag-scheduler-event-loop thread stopped with error Attempted to access garbage collected accumulator 5605982
[ https://issues.apache.org/jira/browse/SPARK-22371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Agarwal updated SPARK-22371: --- Attachment: ShuffleIssue.java Hi, sorry for the late reply. From our analysis, this error seems to occur when jobs are running in parallel on datasets, those datasets are unioned, and a Full GC is triggered at the same time, which clears the accumulators of the union's child datasets. I am attaching a small program that reproduces this error frequently. > dag-scheduler-event-loop thread stopped with error Attempted to access > garbage collected accumulator 5605982 > - > > Key: SPARK-22371 > URL: https://issues.apache.org/jira/browse/SPARK-22371 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Mayank Agarwal > Attachments: ShuffleIssue.java, driver-thread-dump-spark2.1.txt > > > Our Spark jobs are getting stuck in DAGScheduler.runJob because the dag-scheduler > thread is stopped with *Attempted to access garbage collected accumulator 5605982*. > From our investigation it looks like the accumulator is cleaned by GC first, and the > same accumulator is then used for merging the results from executors on a task > completion event. > As the error java.lang.IllegalAccessError is a LinkageError, which is treated as > a FatalError, the dag-scheduler loop terminates with the exception below.
> ---ERROR stack trace -- > Exception in thread "dag-scheduler-event-loop" java.lang.IllegalAccessError: > Attempted to access garbage collected accumulator 5605982 > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:253) > at > org.apache.spark.util.AccumulatorContext$$anonfun$get$1.apply(AccumulatorV2.scala:249) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.AccumulatorContext$.get(AccumulatorV2.scala:249) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1083) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1080) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1080) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1647) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > I am attaching the thread dump of driver as well -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22753) Get rid of dataSource.writeAndRead
[ https://issues.apache.org/jira/browse/SPARK-22753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22753. - Resolution: Fixed Assignee: Li Yuanjian Fix Version/s: 2.3.0 > Get rid of dataSource.writeAndRead > -- > > Key: SPARK-22753 > URL: https://issues.apache.org/jira/browse/SPARK-22753 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Li Yuanjian >Assignee: Li Yuanjian >Priority: Minor > Fix For: 2.3.0 > > > Code cleanup work for getting rid of dataSource.writeAndRead, as discussed in > https://github.com/apache/spark/pull/16481 and > https://github.com/apache/spark/pull/18975#discussion_r155454606 > Currently the BaseRelation returned by `dataSource.writeAndRead` is only used in > `CreateDataSourceTableAsSelect`; planForWriting and writeAndRead share some > common code paths.
[jira] [Assigned] (SPARK-22791) Redact Output of Explain
[ https://issues.apache.org/jira/browse/SPARK-22791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22791: Assignee: Xiao Li (was: Apache Spark) > Redact Output of Explain > > > Key: SPARK-22791 > URL: https://issues.apache.org/jira/browse/SPARK-22791 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Xiao Li >Assignee: Xiao Li > > When calling explain on a query, the output can contain sensitive > information. We should provide a way for admins/users to redact such information.
[jira] [Commented] (SPARK-22791) Redact Output of Explain
[ https://issues.apache.org/jira/browse/SPARK-22791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292131#comment-16292131 ] Apache Spark commented on SPARK-22791: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19985 > Redact Output of Explain > > > Key: SPARK-22791 > URL: https://issues.apache.org/jira/browse/SPARK-22791 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Xiao Li >Assignee: Xiao Li > > When calling explain on a query, the output can contain sensitive > information. We should provide a way for admins/users to redact such information.
[jira] [Assigned] (SPARK-22791) Redact Output of Explain
[ https://issues.apache.org/jira/browse/SPARK-22791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22791: Assignee: Apache Spark (was: Xiao Li) > Redact Output of Explain > > > Key: SPARK-22791 > URL: https://issues.apache.org/jira/browse/SPARK-22791 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Xiao Li >Assignee: Apache Spark > > When calling explain on a query, the output can contain sensitive > information. We should provide a way for admins/users to redact such information.
[jira] [Created] (SPARK-22791) Redact Output of Explain
Xiao Li created SPARK-22791: --- Summary: Redact Output of Explain Key: SPARK-22791 URL: https://issues.apache.org/jira/browse/SPARK-22791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: Xiao Li Assignee: Xiao Li When calling explain on a query, the output can contain sensitive information. We should provide a way for admins/users to redact such information.
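Spark already redacts sensitive property values elsewhere using a configurable regex (spark.redaction.regex); whether the explain-output fix reuses exactly that mechanism is not stated here, so the following is only an illustrative sketch of regex-driven redaction over key/value pairs:

```python
# Hedged sketch of regex-based redaction over key/value pairs. The
# pattern and replacement text are illustrative, modeled on Spark's
# spark.redaction.regex behavior, not copied from the actual fix.
import re

REDACTION_REGEX = re.compile(r"(?i)secret|password|token")
REDACTION_REPLACEMENT = "*********(redacted)"

def redact(kvs):
    """Replace values whose keys match the sensitive-key regex."""
    return [
        (k, REDACTION_REPLACEMENT if REDACTION_REGEX.search(k) else v)
        for k, v in kvs
    ]

props = [
    ("spark.ssl.keyStorePassword", "hunter2"),
    ("spark.sql.shuffle.partitions", "200"),
]
for k, v in redact(props):
    print(k, "=", v)
```

Matching on keys rather than values keeps the redaction cheap and predictable; an admin can widen the regex without risking false positives on ordinary plan text.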
[jira] [Resolved] (SPARK-22787) Add a TPCH query suite
[ https://issues.apache.org/jira/browse/SPARK-22787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22787. - Resolution: Fixed Fix Version/s: 2.3.0 > Add a TPCH query suite > -- > > Key: SPARK-22787 > URL: https://issues.apache.org/jira/browse/SPARK-22787 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.1 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.3.0 > > > Add a test suite to ensure all the TPC-H queries can be successfully > analyzed, optimized and compiled without hitting the max iteration threshold.
[jira] [Commented] (SPARK-22647) Docker files for image creation
[ https://issues.apache.org/jira/browse/SPARK-22647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292113#comment-16292113 ] Anirudh Ramanathan commented on SPARK-22647: I think we haven't run that particular configuration in production on our fork so far. Although I don't anticipate issues due to the switch, I'd be more inclined to go with what we've tested so far, and maybe switch to CentOS on our fork first. > Docker files for image creation > --- > > Key: SPARK-22647 > URL: https://issues.apache.org/jira/browse/SPARK-22647 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan > > This covers the Dockerfiles that need to be shipped to enable the Kubernetes > backend for Spark.
[jira] [Commented] (SPARK-22765) Create a new executor allocation scheme based on that of MR
[ https://issues.apache.org/jira/browse/SPARK-22765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292048#comment-16292048 ] Xuefu Zhang commented on SPARK-22765: - [~tgraves], I think it would help if SPARK-21656 can make a close-to-zero idle time work. This is one source of inefficiency. Our version is too old to backport the fix, but we will try this out when we upgrade. The second source of inefficiency lies in the fact that Spark favors bigger containers. A 4-core container might be running one task while wasting the other cores/memory; the executor cannot die as long as one task is running. One might argue that a user could configure 1-core containers under dynamic allocation, but this is probably not optimal in other respects. The third reason one might favor MR-styled scheduling is its simplicity and efficiency. We frequently found that, for heavy workloads, the scheduler cannot really keep up with tasks coming and going, especially when tasks finish fast. For cost-conscious users, cluster-level resource efficiency is probably what matters. My suspicion is that an enhanced MR-styled scheduler, simple and performant, would improve resource efficiency significantly over a typical use of dynamic allocation, without sacrificing much performance. As a starting point, we will first benchmark with SPARK-21656 when possible. > Create a new executor allocation scheme based on that of MR > --- > > Key: SPARK-22765 > URL: https://issues.apache.org/jira/browse/SPARK-22765 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 1.6.0 >Reporter: Xuefu Zhang > > Many users migrating their workload from MR to Spark find a significant > resource consumption hike (i.e., SPARK-22683). While this might not be a > concern for users that are more performance-centric, for others conscious > about cost, such a hike creates a migration obstacle. This situation can get > worse as more users are moving to the cloud.
> Dynamic allocation makes it possible for Spark to be deployed in a multi-tenant > environment. With its performance-centric design, its inefficiency has also > unfortunately shown up, especially when compared with MR. Thus, it's believed > that an MR-styled scheduler still has its merits. Based on our research, the > inefficiency associated with dynamic allocation shows up in many aspects, such as > executors idling out, bigger executors, and many stages in a Spark job (rather > than only 2 stages in MR), etc. > Rather than fine-tuning dynamic allocation for efficiency, the proposal here > is to add a new, efficiency-centric scheduling scheme based on that of MR. > Such an MR-based scheme can be further enhanced and better adapted to the Spark > execution model. This alternative is expected to offer a good performance > improvement compared to MR, with similar or even better efficiency > than MR. > Inputs are greatly welcome! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22790) add a configurable factor to describe HadoopFsRelation's size
Nan Zhu created SPARK-22790: --- Summary: add a configurable factor to describe HadoopFsRelation's size Key: SPARK-22790 URL: https://issues.apache.org/jira/browse/SPARK-22790 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Nan Zhu As per the discussion in https://github.com/apache/spark/pull/19864#discussion_r156847927, the current HadoopFsRelation size estimate is based purely on the underlying file size, which is not accurate and makes the execution vulnerable to errors like OOM. Users can enable CBO with the functionality in https://github.com/apache/spark/pull/19864 to avoid this issue. This JIRA proposes to add a configurable factor to the sizeInBytes method in the HadoopFsRelation class so that users can mitigate this problem without CBO. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22790) add a configurable factor to describe HadoopFsRelation's size
[ https://issues.apache.org/jira/browse/SPARK-22790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291985#comment-16291985 ] Nan Zhu commented on SPARK-22790: - created per discussion in https://github.com/apache/spark/pull/19864 (will file a PR soon) > add a configurable factor to describe HadoopFsRelation's size > - > > Key: SPARK-22790 > URL: https://issues.apache.org/jira/browse/SPARK-22790 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Nan Zhu > > as per discussion in > https://github.com/apache/spark/pull/19864#discussion_r156847927 > the current HadoopFsRelation is purely based on the underlying file size > which is not accurate and makes the execution vulnerable to errors like OOM > Users can enable CBO with the functionalities in > https://github.com/apache/spark/pull/19864 to avoid this issue > This JIRA proposes to add a configurable factor to sizeInBytes method in > HadoopFsRelation class so that users can mitigate this problem without CBO -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
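The proposed mitigation is simple to model. The sketch below is a hypothetical Python illustration of the idea, not Spark's actual API: the function name, parameter, and default are illustrative; the real config key and semantics would be settled in the PR. The point is just that the raw on-disk byte count gets scaled by a user-configurable factor before feeding into planning decisions.

```python
# Hypothetical sketch of the proposal: scale the raw file size by a
# configurable factor before using it as a relation size estimate.
# Names and defaults are illustrative, not Spark's actual API.
def estimated_size_in_bytes(total_file_bytes: int, size_factor: float = 1.0) -> int:
    """Approximate the in-memory size of a file-based relation.

    A factor > 1.0 compensates for compressed/encoded formats (Parquet, ORC)
    expanding when loaded, which is what makes the raw file size misleading.
    """
    if size_factor <= 0:
        raise ValueError("size factor must be positive")
    return int(total_file_bytes * size_factor)

# 1 GiB of compressed files, assumed to expand roughly 3x in memory:
print(estimated_size_in_bytes(1 << 30, 3.0))  # 3221225472
```

With the factor left at its default the estimate degrades to today's behaviour, which is what makes such a knob safe to introduce without CBO.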
[jira] [Commented] (SPARK-22781) Support creating streaming dataset with ORC files
[ https://issues.apache.org/jira/browse/SPARK-22781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291960#comment-16291960 ] Dongjoon Hyun commented on SPARK-22781: --- Hi, [~tdas] and [~zsxwing]. Could you give me some advice on this issue? > Support creating streaming dataset with ORC files > - > > Key: SPARK-22781 > URL: https://issues.apache.org/jira/browse/SPARK-22781 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun > > This issue supports creating streaming dataset with ORC file format -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22789) Add ContinuousExecution for continuous processing queries
[ https://issues.apache.org/jira/browse/SPARK-22789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22789: Assignee: (was: Apache Spark) > Add ContinuousExecution for continuous processing queries > - > > Key: SPARK-22789 > URL: https://issues.apache.org/jira/browse/SPARK-22789 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22789) Add ContinuousExecution for continuous processing queries
[ https://issues.apache.org/jira/browse/SPARK-22789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291900#comment-16291900 ] Apache Spark commented on SPARK-22789: -- User 'joseph-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/19984 > Add ContinuousExecution for continuous processing queries > - > > Key: SPARK-22789 > URL: https://issues.apache.org/jira/browse/SPARK-22789 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22789) Add ContinuousExecution for continuous processing queries
[ https://issues.apache.org/jira/browse/SPARK-22789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22789: Assignee: Apache Spark > Add ContinuousExecution for continuous processing queries > - > > Key: SPARK-22789 > URL: https://issues.apache.org/jira/browse/SPARK-22789 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22047) HiveExternalCatalogVersionsSuite is Flaky on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22047. - Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 2.3.0 > HiveExternalCatalogVersionsSuite is Flaky on Jenkins > > > Key: SPARK-22047 > URL: https://issues.apache.org/jira/browse/SPARK-22047 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Armin Braun >Assignee: Wenchen Fan > Labels: flaky-test > Fix For: 2.3.0 > > > HiveExternalCatalogVersionsSuite fails quite a bit lately e.g. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/3490/testReport/junit/org.apache.spark.sql.hive/HiveExternalCatalogVersionsSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > {code} > Error Message > org.scalatest.exceptions.TestFailedException: spark-submit returned with exit > code 1. Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0' > > '/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/test120059455549609580.py' > 2017-09-17 04:26:11.641 - stderr> Error: Could not find or load main class > org.apache.spark.launcher.Main > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > spark-submit returned with exit code 1. 
> Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0' > > '/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/test120059455549609580.py' > 2017-09-17 04:26:11.641 - stderr> Error: Could not find or load main class > org.apache.spark.launcher.Main > > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at org.scalatest.Assertions$class.fail(Assertions.scala:1089) > at org.scalatest.FunSuite.fail(FunSuite.scala:1560) > at > org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:81) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:38) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:120) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:105) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:105) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31) > at > 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22036) BigDecimal multiplication sometimes returns null
[ https://issues.apache.org/jira/browse/SPARK-22036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291805#comment-16291805 ] Anvesh R commented on SPARK-22036: -- +1 Issue reproduced on spark-2.2.0. Data at the s3 location s3://bucket/spark-sql-jira/ (a single row): 100|9 drop table if exists test; CREATE EXTERNAL TABLE `test` ( a decimal(38,10), b decimal(38,10) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 's3://bucket/spark-sql-jira/'; spark-sql> select a,(a*b*0.98765432100) from test; 100 9876444.4445679 Time taken: 11.033 seconds, Fetched 1 row(s) spark-sql> select a,(a*b*0.987654321000) from test; 100 NULL Time taken: 0.523 seconds, Fetched 1 row(s) Changing a column's scale from decimal(38,10) to decimal(38,9) also helped, but we would lose precision. > BigDecimal multiplication sometimes returns null > > > Key: SPARK-22036 > URL: https://issues.apache.org/jira/browse/SPARK-22036 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Olivier Blanvillain > > The multiplication of two BigDecimal numbers sometimes returns null. Here is > a minimal reproduction: > {code:java} > object Main extends App { > import org.apache.spark.{SparkConf, SparkContext} > import org.apache.spark.sql.SparkSession > import spark.implicits._ > val conf = new > SparkConf().setMaster("local[*]").setAppName("REPL").set("spark.ui.enabled", > "false") > val spark = > SparkSession.builder().config(conf).appName("REPL").getOrCreate() > implicit val sqlContext = spark.sqlContext > case class X2(a: BigDecimal, b: BigDecimal) > val ds = sqlContext.createDataset(List(X2(BigDecimal(-0.1267333984375), > BigDecimal(-1000.1)))) > val result = ds.select(ds("a") * ds("b")).collect.head > println(result) // [null] > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
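The arithmetic behind the null in the minimal reproducer can be sketched outside Spark. Scala BigDecimal fields are inferred by Spark as decimal(38,18), and Spark 2.2 derives the multiplication result type as precision = p1 + p2 + 1 (capped at 38) with scale = s1 + s2 (not reduced), giving decimal(38,36) and leaving only 2 digits for the integer part; the actual product needs 3, so the value cannot be represented and comes back null instead of being rounded. A plain-Python sketch of that rule (a model for explanation, not Spark's code):

```python
from decimal import Decimal

MAX_PRECISION = 38  # Spark SQL's maximum decimal precision

def multiply_type(p1, s1, p2, s2):
    """Result type Spark 2.2 derived for decimal multiplication:
    precision = p1 + p2 + 1 capped at 38, scale = s1 + s2 (not reduced)."""
    return min(p1 + p2 + 1, MAX_PRECISION), s1 + s2

def representable(value, precision, scale):
    """A value fits decimal(p, s) only if its integer part needs <= p - s digits."""
    int_part = int(abs(value))
    int_digits = len(str(int_part)) if int_part > 0 else 0
    return int_digits <= precision - scale

# Scala BigDecimal columns are inferred as decimal(38, 18), so the
# reproducer multiplies decimal(38,18) by decimal(38,18):
p, s = multiply_type(38, 18, 38, 18)
print((p, s))  # (38, 36): only 2 digits remain for the integer part

product = Decimal("-0.1267333984375") * Decimal("-1000.1")
print(product)  # 126.74607177734375 -> needs 3 integer digits
print(representable(product, p, s))  # False: Spark 2.2 turns this into null
```

This is also why reducing a column's scale (as noted in the comment above) sidesteps the null at the cost of precision: a smaller input scale leaves more room for integer digits in the result type.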
[jira] [Resolved] (SPARK-22496) beeline display operation log
[ https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-22496. - Resolution: Fixed > beeline display operation log > - > > Key: SPARK-22496 > URL: https://issues.apache.org/jira/browse/SPARK-22496 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: StephenZou >Priority: Minor > > For now, when end users run queries in beeline or in Hue through STS, > no logs are displayed; the end user must wait until the job finishes or fails. > Progress information is needed to inform end users how the job is running if > they are not familiar with the YARN RM or the standalone Spark master UI. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22789) Add ContinuousExecution for continuous processing queries
Jose Torres created SPARK-22789: --- Summary: Add ContinuousExecution for continuous processing queries Key: SPARK-22789 URL: https://issues.apache.org/jira/browse/SPARK-22789 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jose Torres -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22733) refactor StreamExecution for extensibility
[ https://issues.apache.org/jira/browse/SPARK-22733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-22733. -- Resolution: Fixed Assignee: Jose Torres Fix Version/s: 2.3.0 > refactor StreamExecution for extensibility > -- > > Key: SPARK-22733 > URL: https://issues.apache.org/jira/browse/SPARK-22733 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >Assignee: Jose Torres > Fix For: 2.3.0 > > > StreamExecution currently mixes together meta-logic (tracking and signalling > progress, persistence, reporting) with the core behavior of generating > batches and running them. We want to reuse the former but not the latter in > continuous execution, so we need to split them up. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully
[ https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-22778: -- Assignee: Yinan Li > Kubernetes scheduler at master failing to run applications successfully > --- > > Key: SPARK-22778 > URL: https://issues.apache.org/jira/browse/SPARK-22778 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Assignee: Yinan Li >Priority: Critical > Fix For: 2.3.0 > > > Building images based on master and deploying Spark PI results in the > following error. > 2017-12-13 19:57:19 INFO SparkContext:54 - Successfully stopped SparkContext > Exception in thread "main" org.apache.spark.SparkException: Could not parse > Master URL: 'k8s:https://xx.yy.zz.ww' > at > org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741) > at org.apache.spark.SparkContext.(SparkContext.scala:496) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918) > at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) > at org.apache.spark.examples.SparkPi.main(SparkPi.scala) > 2017-12-13 19:57:19 INFO ShutdownHookManager:54 - Shutdown hook called > 2017-12-13 19:57:19 INFO ShutdownHookManager:54 - Deleting directory > /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd > This is likely an artifact seen because of changes in master, or our > submission code in the reviews. We haven't seen this on our fork. Hopefully > once integration tests are ported against upstream/master, we will catch > these issues earlier. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22778) Kubernetes scheduler at master failing to run applications successfully
[ https://issues.apache.org/jira/browse/SPARK-22778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-22778. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19972 [https://github.com/apache/spark/pull/19972] > Kubernetes scheduler at master failing to run applications successfully > --- > > Key: SPARK-22778 > URL: https://issues.apache.org/jira/browse/SPARK-22778 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Anirudh Ramanathan >Priority: Critical > Fix For: 2.3.0 > > > Building images based on master and deploying Spark PI results in the > following error. > 2017-12-13 19:57:19 INFO SparkContext:54 - Successfully stopped SparkContext > Exception in thread "main" org.apache.spark.SparkException: Could not parse > Master URL: 'k8s:https://xx.yy.zz.ww' > at > org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2741) > at org.apache.spark.SparkContext.(SparkContext.scala:496) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2490) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:927) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:918) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:918) > at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31) > at org.apache.spark.examples.SparkPi.main(SparkPi.scala) > 2017-12-13 19:57:19 INFO ShutdownHookManager:54 - Shutdown hook called > 2017-12-13 19:57:19 INFO ShutdownHookManager:54 - Deleting directory > /tmp/spark-b47515c2-6750-4a37-aa68-6ee12da5d2bd > This is likely an artifact seen because of changes in master, or our > submission code in the reviews. We haven't seen this on our fork. 
Hopefully > once integration tests are ported against upstream/master, we will catch > these issues earlier. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3419) Scheduler shouldn't delay running a task when executors don't reside at any of its preferred locations
[ https://issues.apache.org/jira/browse/SPARK-3419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-3419. - Resolution: Fixed Fix Version/s: 1.3.0 This has been fixed for a long time; it looks like it was fixed as part of SPARK-4939 (though that sounds unrelated). > Scheduler shouldn't delay running a task when executors don't reside at any > of its preferred locations > --- > > Key: SPARK-3419 > URL: https://issues.apache.org/jira/browse/SPARK-3419 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Sandy Ryza > Fix For: 1.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22788) HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support"
[ https://issues.apache.org/jira/browse/SPARK-22788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22788: Assignee: (was: Apache Spark) > HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support" > - > > Key: SPARK-22788 > URL: https://issues.apache.org/jira/browse/SPARK-22788 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Trivial > > Code: > {noformat} > if (conf.getBoolean("hdfs.append.support", false) || > dfs.isInstanceOf[RawLocalFileSystem]) { > dfs.append(dfsPath) > } else { > throw new IllegalStateException("File exists and there is no append > support!") > } > {noformat} > This makes the exception to be thrown if you enable > {{writeAheadLog.closeFileAfterWrite}} with HDFS. > The correct config, from {{DFSConfigKeys}}, is {{dfs.support.append}}, and > the default value is true. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22788) HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support"
[ https://issues.apache.org/jira/browse/SPARK-22788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22788: Assignee: Apache Spark > HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support" > - > > Key: SPARK-22788 > URL: https://issues.apache.org/jira/browse/SPARK-22788 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Trivial > > Code: > {noformat} > if (conf.getBoolean("hdfs.append.support", false) || > dfs.isInstanceOf[RawLocalFileSystem]) { > dfs.append(dfsPath) > } else { > throw new IllegalStateException("File exists and there is no append > support!") > } > {noformat} > This makes the exception to be thrown if you enable > {{writeAheadLog.closeFileAfterWrite}} with HDFS. > The correct config, from {{DFSConfigKeys}}, is {{dfs.support.append}}, and > the default value is true. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22788) HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support"
[ https://issues.apache.org/jira/browse/SPARK-22788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291602#comment-16291602 ] Apache Spark commented on SPARK-22788: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/19983 > HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support" > - > > Key: SPARK-22788 > URL: https://issues.apache.org/jira/browse/SPARK-22788 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 1.6.0 >Reporter: Marcelo Vanzin >Priority: Trivial > > Code: > {noformat} > if (conf.getBoolean("hdfs.append.support", false) || > dfs.isInstanceOf[RawLocalFileSystem]) { > dfs.append(dfsPath) > } else { > throw new IllegalStateException("File exists and there is no append > support!") > } > {noformat} > This makes the exception to be thrown if you enable > {{writeAheadLog.closeFileAfterWrite}} with HDFS. > The correct config, from {{DFSConfigKeys}}, is {{dfs.support.append}}, and > the default value is true. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22788) HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support"
Marcelo Vanzin created SPARK-22788: -- Summary: HdfsUtils.getOutputStream uses non-existent Hadoop conf "hdfs.append.support" Key: SPARK-22788 URL: https://issues.apache.org/jira/browse/SPARK-22788 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 1.6.0 Reporter: Marcelo Vanzin Priority: Trivial Code: {noformat} if (conf.getBoolean("hdfs.append.support", false) || dfs.isInstanceOf[RawLocalFileSystem]) { dfs.append(dfsPath) } else { throw new IllegalStateException("File exists and there is no append support!") } {noformat} This causes the exception to be thrown when you enable {{writeAheadLog.closeFileAfterWrite}} with HDFS. The correct config, from {{DFSConfigKeys}}, is {{dfs.support.append}}, and the default value is true. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
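The bug is a classic misspelled-config-key failure: Hadoop's Configuration.getBoolean returns the supplied default for any key that is not set, so a lookup against a key that does not exist can never see the cluster's real setting. A small Python sketch of that lookup behaviour (a toy stand-in for the Hadoop Configuration class, not its real implementation):

```python
def get_boolean(conf: dict, key: str, default: bool) -> bool:
    """Toy model of Hadoop Configuration.getBoolean: an unknown key
    silently falls back to the supplied default."""
    value = conf.get(key)
    return default if value is None else value.strip().lower() == "true"

# A typical HDFS deployment, where append support is on by default:
hadoop_conf = {"dfs.support.append": "true"}

# Buggy lookup: "hdfs.append.support" does not exist, so the default (false)
# wins and the append branch is never taken on HDFS.
print(get_boolean(hadoop_conf, "hdfs.append.support", False))  # False

# Corrected lookup: the real key from DFSConfigKeys with its real default.
print(get_boolean(hadoop_conf, "dfs.support.append", True))    # True
```

Because the wrong key combines with a default of false, the failure is silent at configuration time and only surfaces as the IllegalStateException above when an existing WAL file needs to be appended to.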
[jira] [Updated] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Cuquemelle updated SPARK-22683: -- Description: While migrating a series of jobs from MR to Spark using dynamicAllocation, I've noticed almost a doubling (+114% exactly) of resource consumption of Spark w.r.t MR, for a wall clock time gain of 43% About the context: - resource usage stands for vcore-hours allocation for the whole job, as seen by YARN - I'm talking about a series of jobs because we provide our users with a way to define experiments (via UI / DSL) that automatically get translated to Spark / MR jobs and submitted on the cluster - we submit around 500 such jobs each day - these jobs are usually one-shot, and the amount of processing can vary a lot between jobs, and as such finding an efficient number of executors for each job is difficult to get right, which is the reason I took the path of dynamic allocation. - Some of the tests have been scheduled on an idle queue, some on a full queue. - experiments have been conducted with spark.executor.cores = 5 and 10; only results for 5 cores have been reported because efficiency was overall better than with 10 cores - the figures I give are averaged over a representative sample of those jobs (about 600 jobs) ranging from tens to thousands of splits in the data partitioning and between 400 and 9000 seconds of wall clock time. - executor idle timeout is set to 30s; Definition: - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, which represent the max number of tasks an executor will process in parallel. - the current behaviour of the dynamic allocation is to allocate enough containers to have one taskSlot per task, which minimizes latency, but wastes resources on executor allocation and idling overhead when tasks are small. 
The results using the proposal (described below) over the job sample (600 jobs): - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in resource usage, for a 37% (against 43%) reduction in wall clock time for Spark w.r.t MR - by trying to minimize the average resource consumption, I ended up with 6 tasks per core, with a 30% resource usage reduction, for a similar wall clock time w.r.t. MR What did I try to solve the issue with existing parameters (summing up a few points mentioned in the comments)? - change dynamicAllocation.maxExecutors: this would need to be adapted for each job (tens to thousands of splits can occur), and essentially removes the benefit of using dynamic allocation. - use dynamicAllocation.backlogTimeout: - setting this parameter right to avoid creating unused executors is very dependent on wall clock time. One basically needs to solve the exponential ramp up for the target time. So this is not an option for my use case where I don't want per-job tuning. - I've still done a series of experiments, details in the comments. The result is that after manual tuning, the best I could get was a similar resource consumption at the expense of 20% more wall clock time, or a similar wall clock time at the expense of 60% more resource consumption than what I got using my proposal @ 6 tasks per slot (this value being optimized over a much larger range of jobs as already stated) - as mentioned in another comment, tampering with the exponential ramp up might yield task imbalance, and such old executors could become contention points for other executors trying to remotely access blocks held in them (not witnessed in the jobs I'm talking about, but we did see this behavior in other jobs) Proposal: Simply add a tasksPerExecutorSlot parameter, which makes it possible to specify how many tasks a single taskSlot should ideally execute to mitigate the overhead of executor allocation. 
PR: https://github.com/apache/spark/pull/19881
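The proposal boils down to dividing the executor target by the desired number of tasks per slot. A hedged sketch of that computation in Python (the parameter name follows the JIRA text; the exact semantics are defined in the linked PR, and this is an illustration, not Spark's scheduler code):

```python
import math

def target_executors(pending_tasks: int, executor_cores: int,
                     task_cpus: int = 1, tasks_per_slot: int = 1) -> int:
    """Executors to request under dynamic allocation.

    With tasks_per_slot = 1 this models the current behaviour (one task slot
    per pending task); larger values deliberately under-provision so each
    slot runs several tasks back to back, trading latency for efficiency.
    """
    slots_per_executor = executor_cores // task_cpus
    return math.ceil(pending_tasks / (slots_per_executor * tasks_per_slot))

# 10,000 pending tasks on 5-core executors (the configuration reported above):
print(target_executors(10_000, 5))                    # 2000 (current behaviour)
print(target_executors(10_000, 5, tasks_per_slot=6))  # 334  (proposal @ 6 tasks/slot)
```

The sixfold drop in requested containers is what drives the reported 30% resource-usage reduction: each executor is amortized over several short tasks instead of being allocated, used once, and idled out.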
[jira] [Assigned] (SPARK-22787) Add a TPCH query suite
[ https://issues.apache.org/jira/browse/SPARK-22787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22787: Assignee: Apache Spark (was: Xiao Li) > Add a TPCH query suite > -- > > Key: SPARK-22787 > URL: https://issues.apache.org/jira/browse/SPARK-22787 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.1 >Reporter: Xiao Li >Assignee: Apache Spark > > Add a test suite to ensure all the TPC-H queries can be successfully > analyzed, optimized and compiled without hitting the max iteration threshold. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22787) Add a TPCH query suite
[ https://issues.apache.org/jira/browse/SPARK-22787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291486#comment-16291486 ] Apache Spark commented on SPARK-22787: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/19982 > Add a TPCH query suite > -- > > Key: SPARK-22787 > URL: https://issues.apache.org/jira/browse/SPARK-22787 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.1 >Reporter: Xiao Li >Assignee: Xiao Li > > Add a test suite to ensure all the TPC-H queries can be successfully > analyzed, optimized and compiled without hitting the max iteration threshold.
[jira] [Assigned] (SPARK-22787) Add a TPCH query suite
[ https://issues.apache.org/jira/browse/SPARK-22787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22787: Assignee: Xiao Li (was: Apache Spark) > Add a TPCH query suite > -- > > Key: SPARK-22787 > URL: https://issues.apache.org/jira/browse/SPARK-22787 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.2.1 >Reporter: Xiao Li >Assignee: Xiao Li > > Add a test suite to ensure all the TPC-H queries can be successfully > analyzed, optimized and compiled without hitting the max iteration threshold.
[jira] [Created] (SPARK-22787) Add a TPCH query suite
Xiao Li created SPARK-22787: --- Summary: Add a TPCH query suite Key: SPARK-22787 URL: https://issues.apache.org/jira/browse/SPARK-22787 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.2.1 Reporter: Xiao Li Assignee: Xiao Li Add a test suite to ensure all the TPC-H queries can be successfully analyzed, optimized and compiled without hitting the max iteration threshold.
[jira] [Commented] (SPARK-14822) Add lazy executor startup to Mesos Scheduler
[ https://issues.apache.org/jira/browse/SPARK-14822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291404#comment-16291404 ] Imran Rashid commented on SPARK-14822: -- [~mgummelt] is this still relevant? seems like dynamic allocation on mesos already covers this (not precisely the same but pretty close) > Add lazy executor startup to Mesos Scheduler > > > Key: SPARK-14822 > URL: https://issues.apache.org/jira/browse/SPARK-14822 > Project: Spark > Issue Type: Task > Components: Mesos >Reporter: Michael Gummelt > > As we deprecate fine-grained mode, we need to make sure we have alternative > solutions for its benefits. > Its two benefits are: > 0. lazy executor startup > In fine-grained mode, executors are brought up only as tasks are scheduled. > This means that a user doesn't have to set {{spark.cores.max}} to ensure > that the app doesn't consume all resources in the cluster. > 1. relinquishing cores > As Spark tasks terminate, the mesos task it was bound to terminates as > well, thus relinquishing the cores assigned to it. > I'd like to add {{0.}} to coarse-grained mode, possibly enabled with a > configuration param. If https://issues.apache.org/jira/browse/MESOS-1279 > ever happens, we can add {{1.}} as well. > cc [~tnachen] [~dragos] [~skonto] [~andrewor14]
[jira] [Resolved] (SPARK-16496) Add wholetext as option for reading text in SQL.
[ https://issues.apache.org/jira/browse/SPARK-16496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-16496. - Resolution: Fixed Assignee: Prashant Sharma Fix Version/s: 2.3.0 > Add wholetext as option for reading text in SQL. > > > Key: SPARK-16496 > URL: https://issues.apache.org/jira/browse/SPARK-16496 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prashant Sharma >Assignee: Prashant Sharma > Fix For: 2.3.0 > > > In multiple text analysis problems, it is often not desirable for the rows to > be split by "\n". There exists a wholeText reader for the RDD API, and this JIRA > just adds the same support for the Dataset API.
[jira] [Commented] (SPARK-22047) HiveExternalCatalogVersionsSuite is Flaky on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-22047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291364#comment-16291364 ] Imran Rashid commented on SPARK-22047: -- [~cloud_fan] think we can close this now? I don't recall seeing this fail recently, though I also haven't looked very extensively. (seems spark-tests.appspot.com has stopped updating ...) > HiveExternalCatalogVersionsSuite is Flaky on Jenkins > > > Key: SPARK-22047 > URL: https://issues.apache.org/jira/browse/SPARK-22047 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Armin Braun > Labels: flaky-test > > HiveExternalCatalogVersionsSuite fails quite a bit lately e.g. > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.7/3490/testReport/junit/org.apache.spark.sql.hive/HiveExternalCatalogVersionsSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > {code} > Error Message > org.scalatest.exceptions.TestFailedException: spark-submit returned with exit > code 1. Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0' > > '/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/test120059455549609580.py' > 2017-09-17 04:26:11.641 - stderr> Error: Could not find or load main class > org.apache.spark.launcher.Main > Stacktrace > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: > spark-submit returned with exit code 1. 
> Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/warehouse-b266cb0e-5180-4ba8-80a3-b790b3be3aa0' > > '/home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7/target/tmp/test120059455549609580.py' > 2017-09-17 04:26:11.641 - stderr> Error: Could not find or load main class > org.apache.spark.launcher.Main > > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at org.scalatest.Assertions$class.fail(Assertions.scala:1089) > at org.scalatest.FunSuite.fail(FunSuite.scala:1560) > at > org.apache.spark.sql.hive.SparkSubmitTestUtils$class.runSparkSubmit(SparkSubmitTestUtils.scala:81) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.runSparkSubmit(HiveExternalCatalogVersionsSuite.scala:38) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:120) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite$$anonfun$beforeAll$1.apply(HiveExternalCatalogVersionsSuite.scala:105) > at scala.collection.immutable.List.foreach(List.scala:381) > at > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite.beforeAll(HiveExternalCatalogVersionsSuite.scala:105) > at > org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:212) > at > org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:210) > at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:31) > at > 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:314) > at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:480) > at sbt.ForkMain$Run$2.call(ForkMain.java:296) > at sbt.ForkMain$Run$2.call(ForkMain.java:286) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code}
[jira] [Assigned] (SPARK-22774) Add compilation check for generated code in TPCDSQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-22774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-22774: --- Assignee: Kazuaki Ishizaki > Add compilation check for generated code in TPCDSQuerySuite > --- > > Key: SPARK-22774 > URL: https://issues.apache.org/jira/browse/SPARK-22774 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Assignee: Kazuaki Ishizaki > Fix For: 2.3.0 > > > {{TPCDSQuerySuite}} already checks whether analysis can be performed > correctly. In addition, it would be good to check whether generated Java code > can be compiled correctly
[jira] [Resolved] (SPARK-22774) Add compilation check for generated code in TPCDSQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-22774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22774. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19971 [https://github.com/apache/spark/pull/19971] > Add compilation check for generated code in TPCDSQuerySuite > --- > > Key: SPARK-22774 > URL: https://issues.apache.org/jira/browse/SPARK-22774 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki > Fix For: 2.3.0 > > > {{TPCDSQuerySuite}} already checks whether analysis can be performed > correctly. In addition, it would be good to check whether generated Java code > can be compiled correctly
[jira] [Commented] (SPARK-22786) only use AppStatusPlugin in history server
[ https://issues.apache.org/jira/browse/SPARK-22786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291238#comment-16291238 ] Marcelo Vanzin commented on SPARK-22786: As I mentioned in the PR, I don't see why this is an issue without an explanation of exactly what problem it's creating. > only use AppStatusPlugin in history server > -- > > Key: SPARK-22786 > URL: https://issues.apache.org/jira/browse/SPARK-22786 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Assigned] (SPARK-22786) only use AppStatusPlugin in history server
[ https://issues.apache.org/jira/browse/SPARK-22786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22786: Assignee: Wenchen Fan (was: Apache Spark) > only use AppStatusPlugin in history server > -- > > Key: SPARK-22786 > URL: https://issues.apache.org/jira/browse/SPARK-22786 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Assigned] (SPARK-22786) only use AppStatusPlugin in history server
[ https://issues.apache.org/jira/browse/SPARK-22786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22786: Assignee: Apache Spark (was: Wenchen Fan) > only use AppStatusPlugin in history server > -- > > Key: SPARK-22786 > URL: https://issues.apache.org/jira/browse/SPARK-22786 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >
[jira] [Commented] (SPARK-22786) only use AppStatusPlugin in history server
[ https://issues.apache.org/jira/browse/SPARK-22786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291234#comment-16291234 ] Apache Spark commented on SPARK-22786: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19981 > only use AppStatusPlugin in history server > -- > > Key: SPARK-22786 > URL: https://issues.apache.org/jira/browse/SPARK-22786 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Created] (SPARK-22786) only use AppStatusPlugin in history server
Wenchen Fan created SPARK-22786: --- Summary: only use AppStatusPlugin in history server Key: SPARK-22786 URL: https://issues.apache.org/jira/browse/SPARK-22786 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Comment Edited] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291141#comment-16291141 ] Julien Cuquemelle edited comment on SPARK-22683 at 12/14/17 5:09 PM: - [~tgraves], thanks a lot for your remarks, I've updated the description and also included a summary of various results and comments I got. Answers about your other questions: "The fact you are asking for 5+cores per executor will naturally waste more resources when the executor isn't being used" In fact the resource usage will be similar with fewer cores, because if I set 1 core per exe, the dynamic allocation will ask for 5 times more exes "But if we can find something that by defaults works better for the majority of workloads that it makes sense to improve" I'm pretty sure 2 tasks per Slot would work for a very large set of workloads, especially short jobs "As with any config though, how do I know what to set the tasksPerSlot as? it requires configuration and it could affect performance." I agree, what I'm trying to show in my argumentation is that: - I don't have any parameter today to do what I want without optimizing each job, which is not feasible in my use case - the granularity of the efficiency of this parameter seems coarser than other parameters (sweet-spot values are valid on a broader range of jobs than maxNbExe or backLogTimeout) - it seems to me some settings are quite simple to understand: if I want to minimize latency, leave the default value; if I want to save some resources, use a value of 2; if I want to really minimize resource consumption, find a higher number by analysis or aim at meeting a time budget About dynamic allocation: with the default setting of 1s of backlogTimeout, the exponential ramp-up is in practice very similar to an upfront request, regarding the duration of jobs.
I think upfront allocation could be used instead of exponential, but this wouldn't change the issue, which is related to the target number of exes. I don't think asking upfront vs exponential has any effect over how Yarn yields containers. "Above you say "When running with 6 tasks per executor slot, our Spark jobs consume in average 30% less vcorehours than the MR jobs, this setting being valid for different workload sizes." Was this with this patch applied or without?" The patch was applied; if not, you cannot set the number of tasks per taskSlot (I mentioned "executor slot", which is incorrect, I was referring to taskSlot) "the WallTimeGain wrt MR (%) , does this mean positive numbers ran faster then MR? " Positive numbers mean faster in Spark. "why is running with 6 or 8 slower? is it shuffle issues or mistuning with gc, or just unknown overhead?" running with 6 tasks per taskSlot means that 6 tasks will be processed sequentially by 6 times fewer task slots was (Author: jcuquemelle): [~tgraves], thanks a lot for your remarks, I've updated the description and also included a summary of various results and comments I got. Answers about your other questions: "The fact you are asking for 5+cores per executor will naturally waste more resources when the executor isn't being used" In fact the resource usage will be similar with fewer cores, because if I set 1 core per exe, the dynamic allocation will ask for 5 times more exes "But if we can find something that by defaults works better for the majority of workloads that it makes sense to improve" I'm pretty sure 2 tasks per Slot would work for a very large set of workloads, especially short jobs "As with any config though, how do I know what to set the tasksPerSlot as? it requires configuration and it could affect performance."
I agree, what I'm trying to show in my argumentation is that: - I don't have any parameter today to do what I want without optimizing each job, which is not feasible in my use case - the granularity of the efficiency of this parameter seems coarser than other parameters (sweet-spot values are valid on a broader range of jobs than maxNbExe or backLogTimeout) - it seems to me some settings are quite simple to understand: if I want to minimize latency, leave the default value; if I want to save some resources, use a value of 2; if I want to really minimize resource consumption, find a higher number by analysis or aim at maximizing a time budget About dynamic allocation: with the default setting of 1s of backlogTimeout, the exponential ramp-up is in practice very similar to an upfront request, regarding the duration of jobs. I think upfront allocation could be used instead of exponential, but this wouldn't change the issue, which is related to the target number of exes. I don't think asking upfront vs exponential has any effect over how Yarn yields containers. "Above you say "When running with 6 tasks per executor slot, our Spark jobs consume in average 30% less vcorehours than the MR
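The first answer above (resource usage being similar regardless of cores per executor) can be sketched numerically: under the current one-slot-per-pending-task policy, the requested vcore footprint does not depend on how slots are packed into executors. A small illustrative Python sketch (names are hypothetical, not Spark internals):

```python
import math

def executors_requested(pending_tasks, executor_cores, task_cpus=1):
    # Current dynamic-allocation policy: request enough executors to give
    # every pending task its own task slot.
    slots_per_executor = executor_cores // task_cpus
    return math.ceil(pending_tasks / slots_per_executor)

# 600 pending tasks:
#   5-core executors -> 120 executors = 600 vcores
#   1-core executors -> 600 executors = 600 vcores
# Same vcore footprint either way, just packed differently.
```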
[jira] [Assigned] (SPARK-22496) beeline display operation log
[ https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22496: Assignee: (was: Apache Spark) > beeline display operation log > - > > Key: SPARK-22496 > URL: https://issues.apache.org/jira/browse/SPARK-22496 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: StephenZou >Priority: Minor > > For now, when end users run queries in beeline or in hue through STS, > no logs are displayed, and end users will wait until the job finishes or fails. > Progress information is needed to inform end users how the job is running if > they are not familiar with the yarn RM or standalone spark master ui.
[jira] [Assigned] (SPARK-22496) beeline display operation log
[ https://issues.apache.org/jira/browse/SPARK-22496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22496: Assignee: Apache Spark > beeline display operation log > - > > Key: SPARK-22496 > URL: https://issues.apache.org/jira/browse/SPARK-22496 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: StephenZou >Assignee: Apache Spark >Priority: Minor > > For now, when end users run queries in beeline or in hue through STS, > no logs are displayed, and end users will wait until the job finishes or fails. > Progress information is needed to inform end users how the job is running if > they are not familiar with the yarn RM or standalone spark master ui.
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291156#comment-16291156 ] Julien Cuquemelle commented on SPARK-22683: --- I did see SPARK-16158 before opening a new ticket, my proposal seemed simple enough for me not to need a full pluggable policy (with which I do agree, but seems much more difficult to be accepted IMHO) [~tgraves], you mentioned yourself that the current policy could be improved, and that the complexity from the users' point of view should not be overlooked :-) > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. 
> - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to mitigate this (summing up a few points mentioned in the > comments)? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881
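The backlogTimeout point quoted above ("one basically needs to solve the exponential ramp up for the target time") can be made concrete: dynamic-allocation requests grow geometrically, roughly one doubling per sustained backlog interval, so at the default 1s timeout the ramp-up finishes almost immediately relative to jobs lasting 400-9000s. A hedged Python sketch of that doubling model (function name is illustrative; it assumes the documented 1, 2, 4, 8, ... request growth):

```python
def rounds_to_reach(target_executors, initial_request=1):
    """Number of backlog-timeout intervals before the requested executor
    count reaches `target_executors`, assuming requests double each
    interval while tasks remain backlogged."""
    requested, rounds = initial_request, 0
    while requested < target_executors:
        requested *= 2
        rounds += 1
    return rounds

# Even 1000 executors are requested within about 10 one-second intervals,
# which is why the default ramp-up behaves like an upfront allocation
# for jobs of this duration.
```

This also illustrates why tuning backlogTimeout per job is awkward: the interval needed to avoid over-allocation depends on where in this geometric series the job's wall clock time falls.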
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291141#comment-16291141 ] Julien Cuquemelle commented on SPARK-22683: --- [~tgraves], thanks a lot for your remarks, I've updated the description and also included a summary of various results and comments I got. Answers about your other questions: "The fact you are asking for 5+cores per executor will naturally waste more resources when the executor isn't being used" In fact the resource usage will be similar with fewer cores, because if I set 1 core per exe, the dynamic allocation will ask for 5 times more exes "But if we can find something that by defaults works better for the majority of workloads that it makes sense to improve" I'm pretty sure 2 tasks per Slot would work for a very large set of workloads, especially short jobs "As with any config though, how do I know what to set the tasksPerSlot as? it requires configuration and it could affect performance." I agree, what I'm trying to show in my argumentation is that: - I don't have any parameter today to do what I want without optimizing each job, which is not feasible in my use case - the granularity of the efficiency of this parameter seems coarser than other parameters (sweet-spot values are valid on a broader range of jobs than maxNbExe or backLogTimeout) - it seems to me some settings are quite simple to understand: if I want to minimize latency, leave the default value; if I want to save some resources, use a value of 2; if I want to really minimize resource consumption, do an analysis or aim at maximizing a time budget About dynamic allocation: with the default setting of 1s of backlogTimeout, the exponential ramp-up is in practice very similar to an upfront request, regarding the duration of jobs.
I think upfront allocation could be used instead of exponential, but this wouldn't change the issue, which is related to the target number of executors; I don't think asking upfront vs exponentially has any effect on how Yarn yields containers. "Above you say "When running with 6 tasks per executor slot, our Spark jobs consume in average 30% less vcorehours than the MR jobs, this setting being valid for different workload sizes." Was this with this patch applied or without?" The patch was applied; without it you cannot set the number of tasks per taskSlot (I mentioned "executor slot", which is incorrect; I was referring to taskSlot) "the WallTimeGain wrt MR (%) , does this mean positive numbers ran faster then MR? " Positive numbers mean faster in Spark. "why is running with 6 or 8 slower? is it shuffle issues or mistuning with gc, or just unknown overhead?" Running with 6 tasks per taskSlot means that 6 tasks will be processed sequentially by 6 times fewer task slots > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient 
number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough >
[jira] [Resolved] (SPARK-22785) remove ColumnVector.anyNullsSet
[ https://issues.apache.org/jira/browse/SPARK-22785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22785. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19980 [https://github.com/apache/spark/pull/19980] > remove ColumnVector.anyNullsSet > --- > > Key: SPARK-22785 > URL: https://issues.apache.org/jira/browse/SPARK-22785 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291037#comment-16291037 ] Thomas Graves commented on SPARK-22683: --- Another way to approach this is to have a pluggable policy here. Right now we have it do an exponential increase in requesting containers; we could make that pluggable or configurable so that you could do it all at once (which I have seen a use case for), and then for this use case you could have a policy that starts out requesting them much more slowly. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. 
> - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to mitigate this (summing up a few points mentioned in the > comments)? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v6.4.14#64029)
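The pluggable ramp-up policy floated in the comment above could look roughly like this. This is a hypothetical sketch: `RampUpPolicy`, `ExponentialRampUp`, and `AllAtOnce` are invented names for illustration, not Spark APIs.

```scala
// Hypothetical sketch of a pluggable container ramp-up policy; none of
// these names exist in Spark -- they only illustrate the idea above.
trait RampUpPolicy {
  // Given the executors already requested and the total currently needed,
  // return the number to request in the next round.
  def nextTarget(currentlyRequested: Int, needed: Int): Int
}

// Approximation of today's behaviour: double the request each round,
// capped at what is actually needed.
object ExponentialRampUp extends RampUpPolicy {
  def nextTarget(currentlyRequested: Int, needed: Int): Int =
    math.min(needed, math.max(1, currentlyRequested * 2))
}

// The "all at once" policy mentioned in the comment: request upfront.
object AllAtOnce extends RampUpPolicy {
  def nextTarget(currentlyRequested: Int, needed: Int): Int = needed
}
```

A slow-start policy for the use case in this issue would then be a third implementation of the same trait.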
[jira] [Updated] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Cuquemelle updated SPARK-22683: -- Description: While migrating a series of jobs from MR to Spark using dynamicAllocation, I've noticed almost a doubling (+114% exactly) of resource consumption of Spark w.r.t MR, for a wall clock time gain of 43% About the context: - resource usage stands for vcore-hours allocation for the whole job, as seen by YARN - I'm talking about a series of jobs because we provide our users with a way to define experiments (via UI / DSL) that automatically get translated to Spark / MR jobs and submitted on the cluster - we submit around 500 of such jobs each day - these jobs are usually one shot, and the amount of processing can vary a lot between jobs, and as such finding an efficient number of executors for each job is difficult to get right, which is the reason I took the path of dynamic allocation. - Some of the tests have been scheduled on an idle queue, some on a full queue. - experiments have been conducted with spark.executor-cores = 5 and 10, only results for 5 cores have been reported because efficiency was overall better than with 10 cores - the figures I give are averaged over a representative sample of those jobs (about 600 jobs) ranging from tens to thousands splits in the data partitioning and between 400 to 9000 seconds of wall clock time. - executor idle timeout is set to 30s; Definition: - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, which represent the max number of tasks an executor will process in parallel. - the current behaviour of the dynamic allocation is to allocate enough containers to have one taskSlot per task, which minimizes latency, but wastes resources when tasks are small regarding executor allocation and idling overhead. 
The results using the proposal (described below) over the job sample (600 jobs): - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in resource usage, for a 37% (against 43%) reduction in wall clock time for Spark w.r.t MR - by trying to minimize the average resource consumption, I ended up with 6 tasks per core, with a 30% resource usage reduction, for a similar wall clock time w.r.t. MR What did I try to mitigate this (summing up a few points mentioned in the comments)? - change dynamicAllocation.maxExecutors: this would need to be adapted for each job (tens to thousands splits can occur), and essentially remove the interest of using the dynamic allocation. - use dynamicAllocation.backlogTimeout: - setting this parameter right to avoid creating unused executors is very dependant on wall clock time. One basically needs to solve the exponential ramp up for the target time. So this is not an option for my use case where I don't want a per-job tuning. - I've still done a series of experiments, details in the comments. Result is that after manual tuning, the best I could get was a similar resource consumption at the expense of 20% more wall clock time, or a similar wall clock time at the expense of 60% more resource consumption than what I got using my proposal @ 6 tasks per slot (this value being optimized over a much larger range of jobs as already stated) - as mentioned in another comment, tampering with the exponential ramp up might yield task imbalance and such old executors could become contention points for other exes trying to remotely access blocks in the old exes (not witnessed in the jobs I'm talking about, but we did see this behavior in other jobs) Proposal: Simply add a tasksPerExecutorSlot parameter, which makes it possible to specify how many tasks a single taskSlot should ideally execute to mitigate the overhead of executor allocation. 
PR: https://github.com/apache/spark/pull/19881 was: let's say an executor has spark.executor.cores / spark.task.cpus taskSlots The current dynamic allocation policy allocates enough executors to have each taskSlot execute a single task, which minimizes latency, but wastes resources when tasks are small regarding executor allocation and idling overhead. By adding the tasksPerExecutorSlot, it is made possible to specify how many tasks a single slot should ideally execute to mitigate the overhead of executor allocation. PR: https://github.com/apache/spark/pull/19881 > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle > Labels: pull-request-available > > While migrating a
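As a rough illustration of the tasksPerExecutorSlot proposal (this is not the code from the PR; the object and parameter names are invented), the executor target under a tasks-per-slot factor could be computed as:

```scala
// Rough sketch of the tasksPerExecutorSlot idea described above.
// NOT the actual PR code; names and rounding are illustrative only.
object ExecutorTarget {
  def target(pendingTasks: Int,
             executorCores: Int,   // spark.executor.cores
             taskCpus: Int,        // spark.task.cpus
             tasksPerSlot: Int): Int = {
    val slotsPerExecutor = executorCores / taskCpus
    // Default dynamic allocation corresponds to tasksPerSlot = 1:
    // one taskSlot per pending task.
    math.ceil(pendingTasks.toDouble / (slotsPerExecutor * tasksPerSlot)).toInt
  }
}
```

For example, with 600 pending tasks and 5-core executors (1 CPU per task), the default policy targets 120 executors, while tasksPerSlot = 6 targets 20, each slot then processing 6 tasks sequentially.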
[jira] [Comment Edited] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16280548#comment-16280548 ] Julien Cuquemelle edited comment on SPARK-22683 at 12/14/17 3:25 PM: - I don't understand your statement about delaying executor addition? I want to cap the number of executors in an adaptive way with regard to the current number of tasks, not delay their creation. Doing this with dynamicAllocation.maxExecutors requires each job to be tuned for efficiency; when we're doing experiments, a lot of jobs are one shot, so they can't be fine-tuned. The proposal gives a way to have an adaptive behaviour for a family of jobs. Regarding slowing the ramp-up of executors with schedulerBacklogTimeout: I've experimented with this parameter, making 2 series of experiments (7 similar jobs on each test case, average figures reported in the table below), one on a busy queue and the other on an idle queue. I'll report only the idle queue, as the figures on the busy queue are even worse for the schedulerBacklogTimeout approach. The first row uses the default 1s for the schedulerBacklogTimeout together with the 6 tasks per executorSlot I've mentioned above; the other rows use the default dynamicAllocation behaviour and only change schedulerBacklogTimeout:
||SparkWallTimeSec||Spk-vCores-H||taskPerExeSlot||schedulerBacklogTimeout||
|693.571429|37.142857|6|1.0|
|584.857143|69.571429|1|30.0|
|763.428571|54.285714|1|60.0|
|826.714286|39.571429|1|90.0|
So basically I can tune the backlogTimeout to get a similar vCores-H consumption at the expense of almost 20% more wallClockTime, or I can tune it to get about the same wallClockTime at the expense of about 60% more vCores-H consumption (very roughly extrapolated between 30 and 60 secs of schedulerBacklogTimeout). 
It does not seem to solve the issue I'm trying to address; moreover, this would again need to be tuned for each specific job's duration (to find the 90s timeout that gives similar resource consumption, I had to solve the exponential ramp-up against the duration of the already-run job, which is not feasible in experimental use cases). The previous experiments that allowed me to find the sweet spot at 6 tasks per slot have involved job wallClockTimes between 400 and 9000 seconds. Another way to look at this new parameter I'm proposing is as a simple way to tune the latency / resource consumption tradeoff. was (Author: jcuquemelle): I don't understand your statement about delaying executor addition ? I want to cap the number of executors in an adaptive way regarding the current number of tasks, not delay their creation. Doing this with dynamicAllocation.maxExecutors requires each job to be tuned for efficiency; when we're doing experiments, a lot of jobs are one shot, so they can't be fine tuned. The proposal gives a way to have an adaptive behaviour for a family of jobs. Regarding slowing the ramp up of executors with schedulerBacklogTimeout, I've made experiments to play with this parameter; I have made 2 series of experiments (7 jobs on each test case, average figures reported in the following table), one on a busy queue, and the other on an idle queue. 
I'll report only the idle queue, as the figures on the busy queue are even worse for the schedulerBacklogTimeout approach: First row is using the default 1s for the schedulerBacklogTimeout, and uses the 6 tasks per executorSlot I've mentioned above, other rows use the default dynamicAllocation behaviour and only change schedulerBacklogTimeout ||SparkWallTimeSec||Spk-vCores-H||taskPerExeSlot||schedulerBacklogTimeout|| |693.571429|37.142857|6|1.0| |584.857143|69.571429|1|30.0| |763.428571|54.285714|1|60.0| |826.714286|39.571429|1|90.0| So basically I can tune the backlogTimeout to get a similar vCores-H consumption at the expense of almost 20% more wallClockTime, or I can tune the parameter to get about the same wallClockTime at the expense of about 60% more vcoreH consumption (very roughly extrapolated between 30 and 60 secs for schedulerBacklogTimeout). It does not seem to solve the issue I'm trying to address, moreover this would again need to be tuned for each specific job's duration (to find the 90s timeout to get the similar resource consumption, I had to solve the exponential ramp-up with the duration of the already run job, which is not feasible in experimental use cases ). The previous experiments that allowed me to find the sweet spot at 6 tasks per slot has involved job wallClockTimes between 400 and 9000 seconds Another way to have a look at this new parameter I'm proposing is to have a simple way to tune the latency / resource consumption tradeoff. > DynamicAllocation wastes resources by allocating containers that will barely > be used >
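The "solve the exponential ramp-up for the target time" step mentioned above can be sketched as follows. This is an illustration only: it assumes executor requests start at 1 and double every schedulerBacklogTimeout interval, which approximates (rather than reproduces) Spark's actual schedule.

```scala
// Illustrative sketch, under an assumed pure-doubling schedule:
// how many backlogTimeout intervals until the request reaches the target?
def rampUpRounds(targetExecutors: Int): Int = {
  require(targetExecutors >= 1)
  var requested = 1
  var rounds = 1
  while (requested < targetExecutors) {
    requested *= 2   // one doubling per backlogTimeout interval
    rounds += 1
  }
  rounds
}

def rampUpTimeSec(targetExecutors: Int, backlogTimeoutSec: Double): Double =
  rampUpRounds(targetExecutors) * backlogTimeoutSec
```

Under this assumption, reaching 100 executors with a 90s backlogTimeout takes on the order of 720s, comparable to the shorter jobs in the 400-9000s sample, which is why the timeout has to be re-solved per job.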
[jira] [Commented] (SPARK-22359) Improve the test coverage of window functions
[ https://issues.apache.org/jira/browse/SPARK-22359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290964#comment-16290964 ] Attila Zsolt Piros commented on SPARK-22359: I would like to join and take the subtask SPARK-22362. > Improve the test coverage of window functions > - > > Key: SPARK-22359 > URL: https://issues.apache.org/jira/browse/SPARK-22359 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jiang Xingbo > > There are already quite a few integration tests using window functions, but > the unit test coverage for window functions is not ideal. > We'd like to test the following aspects: > * Specifications > ** different partition clauses (none, one, multiple) > ** different order clauses (none, one, multiple, asc/desc, nulls first/last) > * Frames and their combinations > ** OffsetWindowFunctionFrame > ** UnboundedWindowFunctionFrame > ** SlidingWindowFunctionFrame > ** UnboundedPrecedingWindowFunctionFrame > ** UnboundedFollowingWindowFunctionFrame > * Aggregate function types > ** Declarative > ** Imperative > ** UDAF > * Spilling > ** Cover the conditions that WindowExec should spill at least once -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22785) remove ColumnVector.anyNullsSet
[ https://issues.apache.org/jira/browse/SPARK-22785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22785: Assignee: Wenchen Fan (was: Apache Spark) > remove ColumnVector.anyNullsSet > --- > > Key: SPARK-22785 > URL: https://issues.apache.org/jira/browse/SPARK-22785 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22785) remove ColumnVector.anyNullsSet
[ https://issues.apache.org/jira/browse/SPARK-22785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22785: Assignee: Apache Spark (was: Wenchen Fan) > remove ColumnVector.anyNullsSet > --- > > Key: SPARK-22785 > URL: https://issues.apache.org/jira/browse/SPARK-22785 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22785) remove ColumnVector.anyNullsSet
[ https://issues.apache.org/jira/browse/SPARK-22785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290797#comment-16290797 ] Apache Spark commented on SPARK-22785: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19980 > remove ColumnVector.anyNullsSet > --- > > Key: SPARK-22785 > URL: https://issues.apache.org/jira/browse/SPARK-22785 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22785) remove ColumnVector.anyNullsSet
Wenchen Fan created SPARK-22785: --- Summary: remove ColumnVector.anyNullsSet Key: SPARK-22785 URL: https://issues.apache.org/jira/browse/SPARK-22785 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22775) move dictionary related APIs from ColumnVector to WritableColumnVector
[ https://issues.apache.org/jira/browse/SPARK-22775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-22775. - Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 19970 [https://github.com/apache/spark/pull/19970] > move dictionary related APIs from ColumnVector to WritableColumnVector > -- > > Key: SPARK-22775 > URL: https://issues.apache.org/jira/browse/SPARK-22775 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22644) Make ML testsuite support StructuredStreaming test
[ https://issues.apache.org/jira/browse/SPARK-22644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290773#comment-16290773 ] Apache Spark commented on SPARK-22644: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/19979 > Make ML testsuite support StructuredStreaming test > -- > > Key: SPARK-22644 > URL: https://issues.apache.org/jira/browse/SPARK-22644 > Project: Spark > Issue Type: Test > Components: ML >Affects Versions: 2.2.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.3.0 > > > We need to add some helper code to make testing ML transformers & models > easier with streaming data. These tests might help us catch any remaining > issues and we could encourage future PRs to use these tests to prevent new > Models & Transformers from having issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22784) Configure reading buffer size in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22784: Assignee: Apache Spark > Configure reading buffer size in Spark History Server > - > > Key: SPARK-22784 > URL: https://issues.apache.org/jira/browse/SPARK-22784 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Mikhail Erofeev >Assignee: Apache Spark >Priority: Minor > Attachments: replay-baseline.svg > > > Motivation: > Our Spark History Server spends most of the backfill time inside > BufferedReader and StringBuffer. It happens because average line size of our > events is ~1.500.000 chars (due to a lot of partitions and iterations), > whereas the default buffer size is 2048 bytes. See the attached flame graph. > Implementation: > I've added logging of spent time and line size for each job. > Parametrised ReplayListenerBus with a new buffer size parameter. > Measured the best buffer size. x20 of the average line size (30mb) gives 32% > speedup in a local test. > Result: > Backfill of Spark History and reading to the cache will be up to 30% faster > after tuning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22782) Boost speed, use kafka010 consumer kafka
[ https://issues.apache.org/jira/browse/SPARK-22782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-22782. --- Resolution: Invalid This sounds like some kind of comment for the mailing list. There's no specific problem or change described here. > Boost speed, use kafka010 consumer kafka > > > Key: SPARK-22782 > URL: https://issues.apache.org/jira/browse/SPARK-22782 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0 > Environment: kafka-version: 0.10 > spark-version: 2.2.0 > schedule spark on yarn >Reporter: licun >Priority: Critical > > We use spark structured streaming to consume Kafka, but we find the > consumer speed is too slow compared to Spark Streaming. We set kafka > "maxOffsetsPerTrigger": 1. > By adding a debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we > found the following situation: > > The debug code: >
> {code:scala}
> private def poll(pollTimeoutMs: Long): Unit = {
>   val startTime = System.currentTimeMillis()
>   val p = consumer.poll(pollTimeoutMs)
>   val r = p.records(topicPartition)
>   val endTime = System.currentTimeMillis()
>   val delta = endTime - startTime
>   logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: ${delta} ms")
>   fetchedData = r.iterator
> }
> {code}
> The log: > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22784) Configure reading buffer size in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290745#comment-16290745 ] Apache Spark commented on SPARK-22784: -- User 'MikhailErofeev' has created a pull request for this issue: https://github.com/apache/spark/pull/19978 > Configure reading buffer size in Spark History Server > - > > Key: SPARK-22784 > URL: https://issues.apache.org/jira/browse/SPARK-22784 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Mikhail Erofeev >Priority: Minor > Attachments: replay-baseline.svg > > > Motivation: > Our Spark History Server spends most of the backfill time inside > BufferedReader and StringBuffer. It happens because average line size of our > events is ~1.500.000 chars (due to a lot of partitions and iterations), > whereas the default buffer size is 2048 bytes. See the attached flame graph. > Implementation: > I've added logging of spent time and line size for each job. > Parametrised ReplayListenerBus with a new buffer size parameter. > Measured the best buffer size. x20 of the average line size (30mb) gives 32% > speedup in a local test. > Result: > Backfill of Spark History and reading to the cache will be up to 30% faster > after tuning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22784) Configure reading buffer size in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22784: Assignee: (was: Apache Spark) > Configure reading buffer size in Spark History Server > - > > Key: SPARK-22784 > URL: https://issues.apache.org/jira/browse/SPARK-22784 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Mikhail Erofeev >Priority: Minor > Attachments: replay-baseline.svg > > > Motivation: > Our Spark History Server spends most of the backfill time inside > BufferedReader and StringBuffer. It happens because average line size of our > events is ~1.500.000 chars (due to a lot of partitions and iterations), > whereas the default buffer size is 2048 bytes. See the attached flame graph. > Implementation: > I've added logging of spent time and line size for each job. > Parametrised ReplayListenerBus with a new buffer size parameter. > Measured the best buffer size. x20 of the average line size (30mb) gives 32% > speedup in a local test. > Result: > Backfill of Spark History and reading to the cache will be up to 30% faster > after tuning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22784) Configure reading buffer size in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Erofeev updated SPARK-22784: Description: Motivation: Our Spark History Server spends most of the backfill time inside BufferedReader and StringBuffer. It happens because average line size of our events is ~1.500.000 chars (due to a lot of partitions and iterations), whereas the default buffer size is 2048 bytes. See the attached flame graph. Implementation: I've added logging of spent time and line size for each job. Parametrised ReplayListenerBus with a new buffer size parameter. Measured the best buffer size. x20 of the average line size (30mb) gives 32% speedup in a local test. Result: Backfill of Spark History and reading to the cache will be up to 30% faster after tuning. was: Motivation: Our Spark History Server spends most of its warm-up time inside BufferedReader and StringBuffer. It happens because average line size of our events is ~1.500.000 chars (due to a lot of partitions and iterations), whereas the default buffer size is 2048 bytes. See the attached flame graph. Implementation: I've added logging of spent time and line size for each job. Parametrised ReplayListenerBus with new buffer size parameter. Measured best buffer size. x20 of average line size (30mb) gives 32% speedup in a local test. Result: Warm-up of Spark History and reading to cache will be up to 30% faster after tuning. > Configure reading buffer size in Spark History Server > - > > Key: SPARK-22784 > URL: https://issues.apache.org/jira/browse/SPARK-22784 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Mikhail Erofeev >Priority: Minor > Attachments: replay-baseline.svg > > > Motivation: > Our Spark History Server spends most of the backfill time inside > BufferedReader and StringBuffer. 
It happens because average line size of our > events is ~1.500.000 chars (due to a lot of partitions and iterations), > whereas the default buffer size is 2048 bytes. See the attached flame graph. > Implementation: > I've added logging of spent time and line size for each job. > Parametrised ReplayListenerBus with a new buffer size parameter. > Measured the best buffer size. x20 of the average line size (30mb) gives 32% > speedup in a local test. > Result: > Backfill of Spark History and reading to the cache will be up to 30% faster > after tuning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22784) Configure reading buffer size in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Erofeev updated SPARK-22784: Attachment: replay-baseline.svg > Configure reading buffer size in Spark History Server > - > > Key: SPARK-22784 > URL: https://issues.apache.org/jira/browse/SPARK-22784 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Mikhail Erofeev >Priority: Minor > Attachments: replay-baseline.svg > > > Motivation: > Our Spark History Server spends most of its warm-up time inside > BufferedReader and StringBuffer. It happens because average line size of our > events is ~1.500.000 chars (due to a lot of partitions and iterations), > whereas the default buffer size is 2048 bytes. See the attached flame graph. > Implementation: > I've added logging of spent time and line size for each job. > Parametrised ReplayListenerBus with new buffer size parameter. > Measured best buffer size. x20 of average line size (30mb) gives 32% speedup > in a local test. > Result: > Warm-up of Spark History and reading to cache will be up to 30% faster after > tuning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22752) FileNotFoundException while reading from Kafka
[ https://issues.apache.org/jira/browse/SPARK-22752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290659#comment-16290659 ] Marco Gaido commented on SPARK-22752: - thanks [~zsxwing]. You are right. I am closing this as duplicate, thanks. > FileNotFoundException while reading from Kafka > -- > > Key: SPARK-22752 > URL: https://issues.apache.org/jira/browse/SPARK-22752 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Marco Gaido > > We are running a stateful structured streaming job which reads from Kafka and > writes to HDFS. And we are hitting this exception: > {noformat} > 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 > (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): > java.lang.IllegalStateException: Error reading delta file > /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, > part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta > does not exist > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) > at scala.Option.getOrElse(Option.scala:121) > at > 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) > at scala.Option.getOrElse(Option.scala:121) > {noformat} > Of course, the file doesn't exist in HDFS. And in the {{state/0/0}} directory > there is no file at all. While we have some files in the commits and offsets > folders. I am not sure about the reason of this behavior. It seems to happen > on the second time the job is started, after the first one failed. So it > looks like task failures can generate it. Or it might be related to > watermarks, since there are some problems related to the incoming data for > which the watermark was filtering all the incoming data. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22784) Increase reading buffer size in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Erofeev updated SPARK-22784: Affects Version/s: (was: 2.0.0) 2.2.1 Target Version/s: (was: 2.2.1) > Increase reading buffer size in Spark History Server > > > Key: SPARK-22784 > URL: https://issues.apache.org/jira/browse/SPARK-22784 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Mikhail Erofeev >Priority: Minor > > Motivation: > Our Spark History Server spends most of its warm-up time inside > BufferedReader and StringBuffer. It happens because average line size of our > events is ~1.500.000 chars (due to a lot of partitions and iterations), > whereas the default buffer size is 2048 bytes. See the attached flame graph. > Implementation: > I've added logging of spent time and line size for each job. > Parametrised ReplayListenerBus with new buffer size parameter. > Measured best buffer size. x20 of average line size (30mb) gives 32% speedup > in a local test. > Result: > Warm-up of Spark History and reading to cache will be up to 30% faster after > tuning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22784) Configure reading buffer size in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-22784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Erofeev updated SPARK-22784: Summary: Configure reading buffer size in Spark History Server (was: Increase reading buffer size in Spark History Server) > Configure reading buffer size in Spark History Server > - > > Key: SPARK-22784 > URL: https://issues.apache.org/jira/browse/SPARK-22784 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Mikhail Erofeev >Priority: Minor > > Motivation: > Our Spark History Server spends most of its warm-up time inside > BufferedReader and StringBuffer. It happens because average line size of our > events is ~1.500.000 chars (due to a lot of partitions and iterations), > whereas the default buffer size is 2048 bytes. See the attached flame graph. > Implementation: > I've added logging of spent time and line size for each job. > Parametrised ReplayListenerBus with new buffer size parameter. > Measured best buffer size. x20 of average line size (30mb) gives 32% speedup > in a local test. > Result: > Warm-up of Spark History and reading to cache will be up to 30% faster after > tuning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-22752) FileNotFoundException while reading from Kafka
[ https://issues.apache.org/jira/browse/SPARK-22752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido resolved SPARK-22752. - Resolution: Duplicate > FileNotFoundException while reading from Kafka > -- > > Key: SPARK-22752 > URL: https://issues.apache.org/jira/browse/SPARK-22752 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Marco Gaido > > We are running a stateful structured streaming job which reads from Kafka and > writes to HDFS. And we are hitting this exception: > {noformat} > 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 > (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): > java.lang.IllegalStateException: Error reading delta file > /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, > part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta > does not exist > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) > at scala.Option.getOrElse(Option.scala:121) > at > 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) > at scala.Option.getOrElse(Option.scala:121) > {noformat} > Of course, the file doesn't exist in HDFS. And in the {{state/0/0}} directory > there is no file at all. While we have some files in the commits and offsets > folders. I am not sure about the reason of this behavior. It seems to happen > on the second time the job is started, after the first one failed. So it > looks like task failures can generate it. Or it might be related to > watermarks, since there are some problems related to the incoming data for > which the watermark was filtering all the incoming data. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22784) Increase reading buffer size in Spark History Server
Mikhail Erofeev created SPARK-22784: --- Summary: Increase reading buffer size in Spark History Server Key: SPARK-22784 URL: https://issues.apache.org/jira/browse/SPARK-22784 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.0.0 Reporter: Mikhail Erofeev Priority: Minor Motivation: Our Spark History Server spends most of its warm-up time inside BufferedReader and StringBuffer. It happens because the average line size of our events is ~1,500,000 characters (due to a lot of partitions and iterations), whereas the default buffer size is 2048 bytes. See the attached flame graph. Implementation: I've added logging of spent time and line size for each job, parametrised ReplayListenerBus with a new buffer-size parameter, and measured the best buffer size: 20x the average line size (30 MB) gives a 32% speedup in a local test. Result: Warm-up of the Spark History Server and reading into the cache will be up to 30% faster after tuning. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications
[ https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290620#comment-16290620 ] omkar kankalapati edited comment on SPARK-22783 at 12/14/17 9:39 AM: - EventLoggingListener (org.apache.spark.scheduler.EventLoggingListener) opens the the log file named .inprogress file for writing in start() method as a Writer object. On all the events, logEvent() is invoked, which simply appends to the writer, without checking the size/performing any rotation. The writer is closed and file is renamed to remove the ".inprogress" suffix only in stop() method. Thus, the *.inprogress file will be held open throughout and keeps growing. It would be very helpful if EventLoggingListener can be enhanced to support rotating the file .inprogress file when it reaches a (configured) threshold size/interval. was (Author: omkar.kankalapati): EventLoggingListener (org.apache.spark.scheduler.EventLoggingListener) opens the the log file named .inprogress file for writing in start() method as a Writer object. On all the events, logEvent() is invoked, which simply appends to the writer, without checking the size/performing any rotation. The writer is closed and file is renamed to remove the ".inprogress" suffix only in stop() method. Thus, the *.inprogress file will be held open throughout and keeps growing. It would be very helpful if EventLoggingListener can be enhanced to support rotating the file .inprogress file when it reaches a (configured) threshhold size/interval. 
> event log directory(spark-history) filled by large .inprogress files for > spark streaming applications > - > > Key: SPARK-22783 > URL: https://issues.apache.org/jira/browse/SPARK-22783 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.1.0 > Environment: Linux(Generic) >Reporter: omkar kankalapati >Priority: Critical > > When running long running streaming applications, the HDFS storage gets > filled up with large *.inprogress files in hdfs://spark-history/ directory > For example: > hadoop fs -du -h /spark-history > 234 /spark-history/.inprogress > 46.6 G /spark-history/.inprogress > Instead of continuing to write to a very large (multi GB) .inprogress file, > Spark should instead rotate the current log file when it reaches a size (for > example: 100 MB) or interval > and perhaps expose a configuration parameter for the size/interval. > This is also mentioned in SPARK-12140 as a concern. > It is very important and useful to support rotating the log files because > users may have limited HDFS quota and these large files consume the available > limited quota. 
> Also the users do not have a viable workaround > 1) Can not move the files to an another location because the moving the file > causes the event logging to stop > 2) Trying to copy the .inprogress file to another location and truncate the > .inprogress file fails because the file is still opened by > EventLoggingListener for writing > hdfs dfs -truncate -w 0 /spark-history/.inprogress > truncate: Failed to TRUNCATE_FILE /spark-history/.inprogress > for DFSClient_NONMAPREDUCE_<#ID>on because this file lease is currently > owned by DFSClient_NONMAPREDUCE_<#ID> on > The only workaround available is to disable the event logging for streaming > applications by setting "spark.eventLog.enabled" to false -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
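The rotation the ticket asks for can be sketched in plain Java: keep appending events to an open ".inprogress" file, but once the written size crosses a threshold, close it, rename it to a numbered segment, and open a fresh file. This is a minimal illustration of the proposed enhancement, not Spark's EventLoggingListener (which, per the comment above, only renames the file in stop()); all class and file names here are hypothetical.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Minimal sketch of size-based rotation for an always-open event log.
// Analogous roles: the constructor ~ start(), logEvent() ~ logEvent(),
// close() ~ stop(). Illustrative only -- not Spark's actual implementation.
public class RotatingEventLog implements AutoCloseable {
    private final Path dir;
    private final long maxBytes;
    private Writer writer;
    private long written = 0;
    private int segment = 0;

    public RotatingEventLog(Path dir, long maxBytes) throws IOException {
        this.dir = dir;
        this.maxBytes = maxBytes;
        open();
    }

    private void open() throws IOException {
        writer = Files.newBufferedWriter(dir.resolve("events.inprogress"),
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING);
    }

    // Append one event line, rotating first if the threshold would be crossed.
    public void logEvent(String json) throws IOException {
        byte[] bytes = (json + "\n").getBytes(StandardCharsets.UTF_8);
        if (written > 0 && written + bytes.length > maxBytes) {
            rotate();
        }
        writer.write(json);
        writer.write("\n");
        written += bytes.length;
    }

    // Close the current file and promote it to a finished numbered segment.
    private void rotate() throws IOException {
        writer.close();
        Files.move(dir.resolve("events.inprogress"),
                dir.resolve("events." + (segment++) + ".log"));
        written = 0;
        open();
    }

    // Drop the ".inprogress" suffix on shutdown, as EventLoggingListener does.
    @Override
    public void close() throws IOException {
        writer.close();
        Files.move(dir.resolve("events.inprogress"),
                dir.resolve("events." + segment + ".log"));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("eventlog-demo");
        try (RotatingEventLog log = new RotatingEventLog(dir, 20)) {
            for (int i = 0; i < 3; i++) {
                log.logEvent("{\"event\":" + i + "}");  // 12 bytes each
            }
        }
        // With a 20-byte threshold, each 12-byte event after the first
        // triggers a rotation, so three segment files remain.
    }
}
```

A real implementation would also need the History Server to read a sequence of segments per application, which is part of why this is non-trivial inside Spark.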
[jira] [Commented] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications
[ https://issues.apache.org/jira/browse/SPARK-22783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290620#comment-16290620 ] omkar kankalapati commented on SPARK-22783: --- EventLoggingListener (org.apache.spark.scheduler.EventLoggingListener) opens the log file (named with a ".inprogress" suffix) for writing in the start() method as a Writer object. On every event, logEvent() is invoked, which simply appends to the writer without checking the size or performing any rotation. The writer is closed and the file is renamed to remove the ".inprogress" suffix only in the stop() method. Thus, the *.inprogress file is held open throughout and keeps growing. It would be very helpful if EventLoggingListener could be enhanced to support rotating the .inprogress file when it reaches a (configured) threshold size/interval. > event log directory(spark-history) filled by large .inprogress files for > spark streaming applications > - > > Key: SPARK-22783 > URL: https://issues.apache.org/jira/browse/SPARK-22783 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.1.0 > Environment: Linux(Generic) >Reporter: omkar kankalapati >Priority: Critical > > When running long running streaming applications, the HDFS storage gets > filled up with large *.inprogress files in hdfs://spark-history/ directory > For example: > hadoop fs -du -h /spark-history > 234 /spark-history/.inprogress > 46.6 G /spark-history/.inprogress > Instead of continuing to write to a very large (multi GB) .inprogress file, > Spark should instead rotate the current log file when it reaches a size (for > example: 100 MB) or interval > and perhaps expose a configuration parameter for the size/interval. > This is also mentioned in SPARK-12140 as a concern. > It is very important and useful to support rotating the log files because > users may have limited HDFS quota and these large files consume the available > limited quota. 
> Also the users do not have a viable workaround > 1) Can not move the files to an another location because the moving the file > causes the event logging to stop > 2) Trying to copy the .inprogress file to another location and truncate the > .inprogress file fails because the file is still opened by > EventLoggingListener for writing > hdfs dfs -truncate -w 0 /spark-history/.inprogress > truncate: Failed to TRUNCATE_FILE /spark-history/.inprogress > for DFSClient_NONMAPREDUCE_<#ID>on because this file lease is currently > owned by DFSClient_NONMAPREDUCE_<#ID> on > The only workaround available is to disable the event logging for streaming > applications by setting "spark.eventLog.enabled" to false -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22783) event log directory(spark-history) filled by large .inprogress files for spark streaming applications
omkar kankalapati created SPARK-22783: - Summary: event log directory(spark-history) filled by large .inprogress files for spark streaming applications Key: SPARK-22783 URL: https://issues.apache.org/jira/browse/SPARK-22783 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0, 1.6.0 Environment: Linux(Generic) Reporter: omkar kankalapati Priority: Critical When running long running streaming applications, the HDFS storage gets filled up with large *.inprogress files in hdfs://spark-history/ directory For example: hadoop fs -du -h /spark-history 234 /spark-history/.inprogress 46.6 G /spark-history/.inprogress Instead of continuing to write to a very large (multi GB) .inprogress file, Spark should instead rotate the current log file when it reaches a size (for example: 100 MB) or interval and perhaps expose a configuration parameter for the size/interval. This is also mentioned in SPARK-12140 as a concern. It is very important and useful to support rotating the log files because users may have limited HDFS quota and these large files consume the available limited quota. Also the users do not have a viable workaround 1) Can not move the files to an another location because the moving the file causes the event logging to stop 2) Trying to copy the .inprogress file to another location and truncate the .inprogress file fails because the file is still opened by EventLoggingListener for writing hdfs dfs -truncate -w 0 /spark-history/.inprogress truncate: Failed to TRUNCATE_FILE /spark-history/.inprogress for DFSClient_NONMAPREDUCE_<#ID>on because this file lease is currently owned by DFSClient_NONMAPREDUCE_<#ID> on The only workaround available is to disable the event logging for streaming applications by setting "spark.eventLog.enabled" to false -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22782) Boost speed, use kafka010 consumer kafka
[ https://issues.apache.org/jira/browse/SPARK-22782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] licun updated SPARK-22782: -- Description: We use spark structured streaming to consumer kafka, but we find the consumer speed is too slow compare spark streaming . we set kafka "maxOffsetsPerTrigger": 1. By adding debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we found the following situation: The debug code: {code:scala} private def poll(pollTimeoutMs: Long): Unit = { val startTime = System.currentTimeMillis() val p = consumer.poll(pollTimeoutMs) val r = p.records(topicPartition) val endTime = System.currentTimeMillis() val delta = endTime - startTime logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: ${delta} ms") fetchedData = r.iterator } {code} The log: was: We use spark structured streaming to consumer kafka, but we find the consumer speed is too slow compare spark streaming . we set kafka "maxOffsetsPerTrigger": 1. By adding debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we found the following situation: The debug code: {code:scala} private def poll(pollTimeoutMs: Long): Unit = { val startTime = System.currentTimeMillis() val p = consumer.poll(pollTimeoutMs) val r = p.records(topicPartition) val endTime = System.currentTimeMillis() val delta = endTime - startTime logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: ${delta} ms") fetchedData = r.iterator } {code} > Boost speed, use kafka010 consumer kafka > > > Key: SPARK-22782 > URL: https://issues.apache.org/jira/browse/SPARK-22782 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0 > Environment: kafka-version: 0.10 > spark-version: 2.2.0 > schedule spark on yarn >Reporter: licun >Priority: Critical > > We use spark structured streaming to consumer kafka, but we find the > consumer speed is too slow compare spark streaming . we set kafka > "maxOffsetsPerTrigger": 1. 
>By adding debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we > found the following situation: > >The debug code: > > {code:scala} > private def poll(pollTimeoutMs: Long): Unit = { > val startTime = System.currentTimeMillis() > val p = consumer.poll(pollTimeoutMs) > val r = p.records(topicPartition) > val endTime = System.currentTimeMillis() > val delta = endTime - startTime > logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: > ${delta} ms") > fetchedData = r.iterator > } > {code} >The log: > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
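The debug code above brackets consumer.poll() with System.currentTimeMillis() calls to log per-poll latency. The same pattern can be factored into a reusable timing helper; the names below are illustrative and not part of Spark or the Kafka client.

```java
import java.util.function.Supplier;

// Generic wall-clock timing wrapper mirroring the debug pattern in the
// CachedKafkaConsumer snippet: record start, run the body, record end,
// and report the elapsed milliseconds alongside the result.
public class Timed {
    public static class Result<T> {
        public final T value;
        public final long millis;
        Result(T value, long millis) { this.value = value; this.millis = millis; }
    }

    // Run the supplier, returning its value plus wall-clock duration in ms.
    public static <T> Result<T> time(Supplier<T> body) {
        long start = System.currentTimeMillis();
        T value = body.get();
        long delta = System.currentTimeMillis() - start;
        return new Result<>(value, delta);
    }

    public static void main(String[] args) {
        // In the ticket's context, the body would be consumer.poll(timeout);
        // here a trivial computation stands in for it.
        Result<Integer> r = time(() -> {
            int sum = 0;
            for (int i = 0; i < 1000; i++) sum += i;
            return sum;
        });
        System.out.println("value=" + r.value + " spent " + r.millis + " ms");
    }
}
```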
[jira] [Updated] (SPARK-22782) Boost speed, use kafka010 consumer kafka
[ https://issues.apache.org/jira/browse/SPARK-22782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] licun updated SPARK-22782: -- Description: We use spark structured streaming to consumer kafka, but we find the consumer speed is too slow compare spark streaming . we set kafka "maxOffsetsPerTrigger": 1. By adding debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we found the following situation: The debug code: {code:scala} private def poll(pollTimeoutMs: Long): Unit = { val startTime = System.currentTimeMillis() val p = consumer.poll(pollTimeoutMs) val r = p.records(topicPartition) val endTime = System.currentTimeMillis() val delta = endTime - startTime logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: ${delta} ms") fetchedData = r.iterator } {code} was: We use spark structured streaming to consumer kafka, but we find the consumer speed is too slow compare spark streaming . we set kafka "maxOffsetsPerTrigger": 1. By adding debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we found the following situation: The debug code: {color:red}private def poll(pollTimeoutMs: Long): Unit = { val startTime = System.currentTimeMillis() val p = consumer.poll(pollTimeoutMs) val r = p.records(topicPartition) val endTime = System.currentTimeMillis() val delta = endTime - startTime logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: ${delta} ms") fetchedData = r.iterator }{color} > Boost speed, use kafka010 consumer kafka > > > Key: SPARK-22782 > URL: https://issues.apache.org/jira/browse/SPARK-22782 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.1.0, 2.2.0 > Environment: kafka-version: 0.10 > spark-version: 2.2.0 > schedule spark on yarn >Reporter: licun >Priority: Critical > > We use spark structured streaming to consumer kafka, but we find the > consumer speed is too slow compare spark streaming . we set kafka > "maxOffsetsPerTrigger": 1. 
>By adding debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we > found the following situation: > >The debug code: > > {code:scala} > private def poll(pollTimeoutMs: Long): Unit = { > val startTime = System.currentTimeMillis() > val p = consumer.poll(pollTimeoutMs) > val r = p.records(topicPartition) > val endTime = System.currentTimeMillis() > val delta = endTime - startTime > logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: > ${delta} ms") > fetchedData = r.iterator > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22782) Boost speed, use kafka010 consumer kafka
licun created SPARK-22782: - Summary: Boost speed, use kafka010 consumer kafka Key: SPARK-22782 URL: https://issues.apache.org/jira/browse/SPARK-22782 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.2.0, 2.1.0 Environment: kafka-version: 0.10 spark-version: 2.2.0 schedule spark on yarn Reporter: licun Priority: Critical We use Spark Structured Streaming to consume Kafka, but we find the consumer speed is too slow compared to Spark Streaming; we set the Kafka option "maxOffsetsPerTrigger": 1. By adding a debug log in CachedKafkaConsumer.scala (kafka-0-10-sql), we found the following situation. The debug code: {code:scala} private def poll(pollTimeoutMs: Long): Unit = { val startTime = System.currentTimeMillis() val p = consumer.poll(pollTimeoutMs) val r = p.records(topicPartition) val endTime = System.currentTimeMillis() val delta = endTime - startTime logError(s"Polled $groupId ${p.partitions()}, Size:${r.size}, spend: ${delta} ms") fetchedData = r.iterator } {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22771) SQL concat for binary
[ https://issues.apache.org/jira/browse/SPARK-22771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22771: Assignee: Apache Spark > SQL concat for binary > -- > > Key: SPARK-22771 > URL: https://issues.apache.org/jira/browse/SPARK-22771 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Assignee: Apache Spark >Priority: Minor > > spark.sql {{concat}} function automatically casts arguments to StringType > and returns a String. > This might be the behavior of traditional databases, however in Spark there's > Binary as a standard type, and concat'ing binary seems reasonable if it > returns another binary sequence. > Taking the example of, e.g. Python where both {{bytes}} and {{unicode}} > represent text, by concat'ing both we end up with the same type as the > arguments, and in case they are intermixed (str + unicode) the most generic > type is returned (unicode). > Following the same principle, I believe that when concat'ing binary it would > make sense to return a binary. > In terms of Spark behavior, it would affect only the case when all arguments > are binary. All other cases should remain unchanged. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
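The semantics the ticket proposes can be illustrated with plain byte arrays: when every argument is binary, concatenation yields binary rather than forcing a cast to string. The sketch below stands in for Spark SQL's concat and is not its implementation.

```java
import java.util.Arrays;

// Illustration of the proposed SPARK-22771 behavior: concatenating
// all-binary inputs preserves the binary type instead of producing a
// string, analogous to how mixed str/unicode concat in Python 2 keeps
// the most general of the argument types.
public class BinaryConcat {

    // Concatenate any number of byte arrays into one byte array.
    public static byte[] concat(byte[]... parts) {
        int total = 0;
        for (byte[] p : parts) total += p.length;
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] p : parts) {
            System.arraycopy(p, 0, out, pos, p.length);
            pos += p.length;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] a = {0x01, 0x02};
        byte[] b = {0x03};
        // Binary in, binary out -- no lossy round-trip through a string.
        System.out.println(Arrays.toString(concat(a, b)));  // [1, 2, 3]
    }
}
```

As the ticket notes, only the all-binary case would change in Spark; any string argument would keep the existing cast-to-string behavior.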
[jira] [Commented] (SPARK-22771) SQL concat for binary
[ https://issues.apache.org/jira/browse/SPARK-22771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290551#comment-16290551 ] Apache Spark commented on SPARK-22771: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/19977 > SQL concat for binary > -- > > Key: SPARK-22771 > URL: https://issues.apache.org/jira/browse/SPARK-22771 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.1 >Reporter: Fernando Pereira >Priority: Minor > > spark.sql {{concat}} function automatically casts arguments to StringType > and returns a String. > This might be the behavior of traditional databases, however in Spark there's > Binary as a standard type, and concat'ing binary seems reasonable if it > returns another binary sequence. > Taking the example of, e.g. Python where both {{bytes}} and {{unicode}} > represent text, by concat'ing both we end up with the same type as the > arguments, and in case they are intermixed (str + unicode) the most generic > type is returned (unicode). > Following the same principle, I believe that when concat'ing binary it would > make sense to return a binary. > In terms of Spark behavior, it would affect only the case when all arguments > are binary. All other cases should remain unchanged. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12
[ https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290518#comment-16290518 ] liyunzhang commented on SPARK-22660:

[~srowen]: there is one more modification involving limit() in TaskSetManager.scala. Sorry for not including it in the last commit.
{code}
        abort(s"$msg Exception during serialization: $e")
        throw new TaskNotSerializableException(e)
      }
    }
-    if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
+    if (serializedTask.limit() > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
      !emittedTaskSizeWarning) {
      emittedTaskSizeWarning = true
      logWarning(s"Stage ${task.stageId} contains a task of very large size " +
-        s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
+        s"(${serializedTask.limit() / 1024} KB). The maximum recommended task size is " +
        s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
    }
    addRunningTask(taskId)
@@ -502,7 +502,7 @@ private[spark] class TaskSetManager(
    // val timeTaken = clock.getTime() - startTime
    val taskName = s"task ${info.id} in stage ${taskSet.id}"
    logInfo(s"Starting $taskName (TID $taskId, $host, executor ${info.executorId}, " +
-      s"partition ${task.partitionId}, $taskLocality, ${serializedTask.limit} bytes)")
+      s"partition ${task.partitionId}, $taskLocality, ${serializedTask.limit()} bytes)")
    sched.dagScheduler.taskStarted(task, info)
    new TaskDescription(
{code}
Can you help review?

> Use position() and limit() to fix ambiguity issue in scala-2.12
> ---------------------------------------------------------------
>
>                 Key: SPARK-22660
>                 URL: https://issues.apache.org/jira/browse/SPARK-22660
>             Project: Spark
>          Issue Type: Improvement
>          Components: Build
>    Affects Versions: 2.2.0
>            Reporter: liyunzhang
>            Assignee: liyunzhang
>            Priority: Minor
>             Fix For: 2.3.0
>
> Build with scala-2.12 with the following steps:
> 1.
change the pom.xml with scala-2.12
> ./dev/change-scala-version.sh 2.12
> 2. build with -Pscala-2.12
> for hive on spark:
> {code}
> ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Pparquet-provided -Dhadoop.version=2.7.3
> {code}
> for spark sql:
> {code}
> ./dev/make-distribution.sh --tgz -Pscala-2.12 -Phadoop-2.7 -Pyarn -Phive -Dhadoop.version=2.7.3 > log.sparksql 2>&1
> {code}
> This produces the following errors.
>
> #Error1
> {code}
> /common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java:172: error: cannot find symbol
> Cleaner cleaner = Cleaner.create(buffer, () -> freeMemory(memory));
> {code}
> This is because sun.misc.Cleaner has been moved to a new location in JDK9. HADOOP-12760 will be the long-term fix.
>
> #Error2
> {code}
> spark_source/core/src/main/scala/org/apache/spark/executor/Executor.scala:455: ambiguous reference to overloaded definition,
> method limit in class ByteBuffer of type (x$1: Int)java.nio.ByteBuffer
> method limit in class Buffer of type ()Int
> match expected type ?
> val resultSize = serializedDirectResult.limit
> error
> {code}
> The limit method was moved from ByteBuffer to the superclass Buffer and can no longer be called without (). The same applies to the position method.
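A small Java illustration of the two overloads behind #Error2 (a sketch, not the Spark fix itself): `Buffer.limit()` with no arguments reads the limit, while `limit(int)` sets it. Java disambiguates by arity, but Scala 2.12 allows calling the no-arg method without parentheses, so `buf.limit` can refer to either overload; writing `limit()` explicitly resolves the ambiguity.

```java
import java.nio.ByteBuffer;

public class LimitCall {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(64);
        // No-arg limit() reads the current limit (initially the capacity);
        // limit(int) would set it. The explicit () mirrors the Spark patch.
        int size = buf.limit();
        System.out.println(size); // 64
        // position() has the same read/set overload pair.
        System.out.println(buf.position()); // 0
    }
}
```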
> #Error3
> {code}
> home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:415: ambiguous reference to overloaded definition,
> [error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
> [error] and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
> [error] match argument types (java.util.Map[String,String])
> [error] properties.putAll(propsMap.asJava)
> [error]                   ^
> [error] /home/zly/prj/oss/jdk9_HOS_SOURCE/spark_source/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala:427: ambiguous reference to overloaded definition,
> [error] both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
> [error] and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
> [error] match argument types (java.util.Map[String,String])
> [error] props.putAll(outputSerdeProps.toMap.asJava)
> [error]              ^
> {code}
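As a sketch of the unambiguous alternative behind #Error3 (illustrative names, not the Spark code): instead of `Properties.putAll`, which JDK 9 makes ambiguous against the inherited `Hashtable.putAll` in Scala 2.12, copy entries one at a time through `setProperty`, which accepts only String keys and values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class PropsCopy {
    // Copy a String-to-String map into Properties entry by entry.
    // setProperty is type-safe (String-only), avoiding the putAll
    // overload ambiguity entirely.
    static Properties toProperties(Map<String, String> map) {
        Properties props = new Properties();
        for (Map.Entry<String, String> e : map.entrySet()) {
            props.setProperty(e.getKey(), e.getValue());
        }
        return props;
    }

    public static void main(String[] args) {
        Map<String, String> m = new HashMap<>();
        m.put("serialization.format", "1");
        Properties p = toProperties(m);
        System.out.println(p.getProperty("serialization.format")); // 1
    }
}
```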
[jira] [Commented] (SPARK-22660) Use position() and limit() to fix ambiguity issue in scala-2.12
[ https://issues.apache.org/jira/browse/SPARK-22660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290514#comment-16290514 ] Apache Spark commented on SPARK-22660:

User 'kellyzly' has created a pull request for this issue:
https://github.com/apache/spark/pull/19976

> This is because the key type is Object instead of String, which is unsafe.
> After solving these 3 errors, the build compiles successfully.