[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...

2017-12-01 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19845


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...

2017-11-30 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/19845#discussion_r154266335
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1837,6 +1837,29 @@ def test_read_images(self):
         self.assertEqual(ImageSchema.undefinedImageType, "Undefined")
 
 
+class ImageReaderTest2(PySparkTestCase):
+
+    @classmethod
+    def setUpClass(cls):
+        PySparkTestCase.setUpClass()
+        try:
+            cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf()
+        except py4j.protocol.Py4JError:
+            cls.tearDownClass()
+            raise unittest.SkipTest("Hive is not available")
+        except TypeError:
+            cls.tearDownClass()
+            raise unittest.SkipTest("Hive is not available")
+        cls.spark = HiveContext._createForTesting(cls.sc)
+
--- End diff --

Sure, that should be safer.





[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...

2017-11-30 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19845#discussion_r154262931
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1837,6 +1837,29 @@ def test_read_images(self):
         self.assertEqual(ImageSchema.undefinedImageType, "Undefined")
 
 
+class ImageReaderTest2(PySparkTestCase):
+
+    @classmethod
+    def setUpClass(cls):
+        PySparkTestCase.setUpClass()
+        try:
+            cls.sc._jvm.org.apache.hadoop.hive.conf.HiveConf()
+        except py4j.protocol.Py4JError:
+            cls.tearDownClass()
+            raise unittest.SkipTest("Hive is not available")
+        except TypeError:
+            cls.tearDownClass()
+            raise unittest.SkipTest("Hive is not available")
+        cls.spark = HiveContext._createForTesting(cls.sc)
+
--- End diff --

Should we add a `tearDownClass` classmethod to stop `cls.spark`? I don't see 
`HiveContextSQLTests` stopping it either, so maybe we can fix that too.
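
As a rough illustration of the suggestion above, here is a minimal, standard-library-only sketch of pairing `setUpClass` with `tearDownClass`. `FakeSession`, `SharedSessionTest`, and `run_suite` are hypothetical names standing in for the real session and test classes, not PySpark code. Note that `unittest` does not call `tearDownClass` when `setUpClass` itself raises, which is presumably why the diff calls `cls.tearDownClass()` explicitly before raising `SkipTest`.

```python
import unittest


class FakeSession:
    """Hypothetical stand-in for a session object that must be stopped."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


class SharedSessionTest(unittest.TestCase):
    """Creates one session per test class and releases it afterwards."""

    session = None

    @classmethod
    def setUpClass(cls):
        cls.session = FakeSession()

    @classmethod
    def tearDownClass(cls):
        # Runs once after the tests in this class, provided setUpClass succeeded.
        if cls.session is not None:
            cls.session.stop()

    def test_session_is_live(self):
        self.assertFalse(self.session.stopped)


def run_suite():
    """Run the test class and return (result, session) for inspection."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(SharedSessionTest)
    result = unittest.TestResult()
    suite.run(result)
    return result, SharedSessionTest.session
```

After `run_suite()` returns, the shared session has been stopped by `tearDownClass`, so a later test class starts from a clean state.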





[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...

2017-11-29 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/19845#discussion_r153966424
  
--- Diff: python/pyspark/ml/image.py ---
@@ -180,9 +180,8 @@ def readImages(self, path, recursive=False, numPartitions=-1,
         .. versionadded:: 2.3.0
         """
 
-        ctx = SparkContext._active_spark_context
-        spark = SparkSession(ctx)
-        image_schema = ctx._jvm.org.apache.spark.ml.image.ImageSchema
+        spark = SparkSession.builder.getOrCreate()
--- End diff --

Yeah, I think this should be the best practice for initializing `SparkSession`.





[GitHub] spark pull request #19845: [SPARK-22651][PYTHON][ML] Prevent initiating mult...

2017-11-29 Thread HyukjinKwon
GitHub user HyukjinKwon opened a pull request:

https://github.com/apache/spark/pull/19845

[SPARK-22651][PYTHON][ML] Prevent initiating multiple Hive clients for ImageSchema.readImages

## What changes were proposed in this pull request?

Calling `ImageSchema.readImages` multiple times as below:

```python
from pyspark.ml.image import ImageSchema
data_path = 'data/mllib/images/kittens'
_ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
_ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
```

throws an error as below:

```
...
org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: --
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see the next exception for details.
...
    at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source)
...
    at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
...
    at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:180)
...
    at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:348)
    at org.apache.spark.ml.image.ImageSchema$$anonfun$readImages$2$$anonfun$apply$1.apply(ImageSchema.scala:253)
...
Caused by: ERROR XJ040: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@742f639f, see the next exception for details.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.wrapArgsForTransportAcrossDRDA(Unknown Source)
    ... 121 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /.../spark/metastore_db.
...
Traceback (most recent call last):
  File "", line 1, in 
  File "/.../spark/python/pyspark/ml/image.py", line 190, in readImages
    dropImageFailures, float(sampleRatio), seed)
  File "/.../spark/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 69, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u'java.lang.RuntimeException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient;'
```

It seems we had better stick to `SparkSession.builder.getOrCreate()`, as is already done in:


https://github.com/apache/spark/blob/51620e288b5e0a7fffc3899c9deadabace28e6d7/python/pyspark/sql/streaming.py#L329


https://github.com/apache/spark/blob/dc5d34d8dcd6526d1dfdac8606661561c7576a62/python/pyspark/sql/column.py#L541


https://github.com/apache/spark/blob/33d43bf1b6f55594187066f0e38ba3985fa2542b/python/pyspark/sql/readwriter.py#L105
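
To illustrate why reusing the active session avoids the Derby error, here is a simplified, pure-Python model of the get-or-create pattern. It is a sketch under assumptions, not Spark's implementation: `ExclusiveStore`, `Session`, and `read_twice` are hypothetical names, and the store only mimics the "one booted instance at a time" behaviour reported in the `XSDB6` message above.

```python
class ExclusiveStore:
    """Hypothetical store that only a single client may boot at a time."""

    _booted = False

    @classmethod
    def boot(cls):
        if cls._booted:
            raise RuntimeError("another instance may have already booted the store")
        cls._booted = True


class Session:
    """Hypothetical session; constructing one directly boots a new client."""

    _active = None

    def __init__(self):
        ExclusiveStore.boot()

    @classmethod
    def get_or_create(cls):
        # Reuse the active session when there is one; construct only once.
        if cls._active is None:
            cls._active = cls()
        return cls._active


def read_twice(make_session):
    """Simulate calling a reader twice, each call obtaining a session."""
    return [make_session() for _ in range(2)]
```

In this model, `read_twice(Session.get_or_create)` returns the same object twice, whereas `read_twice(Session)` raises `RuntimeError` as soon as a second client tries to boot the store.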


## How was this patch tested?

This was tested as below:

```python
from pyspark.ml.image import ImageSchema
data_path = 'data/mllib/images/kittens'
_ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
_ = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True).collect()
```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/HyukjinKwon/spark SPARK-22651

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19845.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19845


commit c0c3c487474629df00ba50c5b6904552010f1aab
Author: hyukjinkwon 
Date:   2017-11-29T13:17:24Z

Prevent initiating multiple Hive clients for ImageSchema.readImages



