[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18742





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131782033
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,33 +66,86 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
 """
-Utility class that can save ML instances.
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
 
-.. versionadded:: 2.0.0
+.. versionadded:: 2.3.0
 """
 
-def save(self, path):
-"""Save the ML instance to the input path."""
-raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
-
-def overwrite(self):
-"""Overwrites if the output path already exists."""
-raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+def __init__(self):
+self._sparkSession = None
 
 def context(self, sqlContext):
 """
-Sets the SQL context to use for saving.
+Sets the Spark SQLContext to use for saving/loading.
 
 .. note:: Deprecated in 2.1 and will be removed in 3.0, use 
session instead.
 """
-raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+raise NotImplementedError("Read/Write is not yet implemented for 
type: %s" % type(self))
 
 def session(self, sparkSession):
-"""Sets the Spark Session to use for saving."""
+"""
+Sets the Spark Session to use for saving/loading.
+"""
+self._sparkSession = sparkSession
+return self
+
+@property
+def sparkSession(self):
+"""
+Returns the user-specified Spark Session or the default.
+"""
+if self._sparkSession is None:
+self._sparkSession = SparkSession.builder.getOrCreate()
+return self._sparkSession
+
+@property
+def sc(self):
+"""
+Returns the underlying `SparkContext`.
+"""
+return self.sparkSession.sparkContext
+
+
+@inherit_doc
+class MLWriter(BaseReadWrite):
+"""
+Utility class that can save ML instances.
+
+.. versionadded:: 2.0.0
+"""
+
+def __init__(self):
+super(MLWriter, self).__init__()
+self.shouldOverwrite = False
+
+def _handleOverwrite(self, path):
+from pyspark.ml.wrapper import JavaWrapper
+
+_java_obj = JavaWrapper._new_java_obj("org.apache.spark.ml.util.FileSystemOverwrite")
+wrapper = JavaWrapper(_java_obj)
+wrapper._call_java("handleOverwrite", path, True, self.sc._jsc.sc())
--- End diff --

Let's do that when needed.





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-07 Thread ajaysaini725
Github user ajaysaini725 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131781393
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1158,6 +1165,33 @@ def test_decisiontree_regressor(self):
 except OSError:
 pass
 
+def test_default_read_write(self):
+temp_path = tempfile.mkdtemp()
+
+lr = LogisticRegression()
+lr.setMaxIter(50)
+lr.setThreshold(.75)
+writer = DefaultParamsWriter(lr)
+
+savePath = temp_path + "/lr"
+writer.saveImpl(savePath)
--- End diff --

Pushed a fix





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131780734
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1158,6 +1165,33 @@ def test_decisiontree_regressor(self):
 except OSError:
 pass
 
+def test_default_read_write(self):
+temp_path = tempfile.mkdtemp()
+
+lr = LogisticRegression()
+lr.setMaxIter(50)
+lr.setThreshold(.75)
+writer = DefaultParamsWriter(lr)
+
+savePath = temp_path + "/lr"
+writer.saveImpl(savePath)
--- End diff --

Good point; +1 for calling save
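A minimal sketch of the suggested change, reusing the `lr` and `savePath` variables from the quoted test; `save` goes through the overwrite handling that `saveImpl` skips:

```python
writer = DefaultParamsWriter(lr)
writer.save(savePath)   # public entry point: handles overwrite, then calls saveImpl

# or, when the target path may already exist:
DefaultParamsWriter(lr).overwrite().save(savePath)
```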





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131772274
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,33 +66,86 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
 """
-Utility class that can save ML instances.
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
 
-.. versionadded:: 2.0.0
+.. versionadded:: 2.3.0
 """
 
-def save(self, path):
-"""Save the ML instance to the input path."""
-raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
-
-def overwrite(self):
-"""Overwrites if the output path already exists."""
-raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+def __init__(self):
+self._sparkSession = None
 
 def context(self, sqlContext):
 """
-Sets the SQL context to use for saving.
+Sets the Spark SQLContext to use for saving/loading.
 
 .. note:: Deprecated in 2.1 and will be removed in 3.0, use 
session instead.
 """
-raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+raise NotImplementedError("Read/Write is not yet implemented for 
type: %s" % type(self))
 
 def session(self, sparkSession):
-"""Sets the Spark Session to use for saving."""
+"""
+Sets the Spark Session to use for saving/loading.
+"""
+self._sparkSession = sparkSession
+return self
+
+@property
+def sparkSession(self):
+"""
+Returns the user-specified Spark Session or the default.
+"""
+if self._sparkSession is None:
+self._sparkSession = SparkSession.builder.getOrCreate()
+return self._sparkSession
+
+@property
+def sc(self):
+"""
+Returns the underlying `SparkContext`.
+"""
+return self.sparkSession.sparkContext
+
+
+@inherit_doc
+class MLWriter(BaseReadWrite):
+"""
+Utility class that can save ML instances.
+
+.. versionadded:: 2.0.0
+"""
+
+def __init__(self):
+super(MLWriter, self).__init__()
+self.shouldOverwrite = False
+
+def _handleOverwrite(self, path):
+from pyspark.ml.wrapper import JavaWrapper
+
+_java_obj = JavaWrapper._new_java_obj("org.apache.spark.ml.util.FileSystemOverwrite")
+wrapper = JavaWrapper(_java_obj)
+wrapper._call_java("handleOverwrite", path, True, self.sc._jsc.sc())
--- End diff --

Regarding the `FileSystemOverwrite` wrapper: could we abstract a more generic 
interface, such as `FileSystem`? It could then be reused in other places in 
pyspark, not only for "file overwrite" but for other file system operations as well.
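A rough sketch of what such a wrapper could look like on the Python side, assuming a hypothetical JVM helper `org.apache.spark.ml.util.FileSystemHelper` (it does not exist in this PR); the `JavaWrapper` calls mirror the ones already used in `_handleOverwrite` above:

```python
from pyspark.ml.wrapper import JavaWrapper


class _JavaFileSystem(object):
    """Hypothetical generic wrapper around JVM-side file system utilities."""

    def __init__(self, sc):
        self._sc = sc
        # Assumes a matching Scala helper class; only FileSystemOverwrite exists in this PR.
        java_obj = JavaWrapper._new_java_obj("org.apache.spark.ml.util.FileSystemHelper")
        self._wrapper = JavaWrapper(java_obj)

    def handleOverwrite(self, path, shouldOverwrite):
        # Same call pattern as MLWriter._handleOverwrite in the diff above.
        self._wrapper._call_java("handleOverwrite", path, shouldOverwrite, self._sc._jsc.sc())

    # exists(), delete(), etc. could be added the same way, provided the
    # corresponding methods were exposed on the JVM helper.
```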





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-07 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131771002
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1158,6 +1165,33 @@ def test_decisiontree_regressor(self):
 except OSError:
 pass
 
+def test_default_read_write(self):
+temp_path = tempfile.mkdtemp()
+
+lr = LogisticRegression()
+lr.setMaxIter(50)
+lr.setThreshold(.75)
+writer = DefaultParamsWriter(lr)
+
+savePath = temp_path + "/lr"
+writer.saveImpl(savePath)
--- End diff --

Why call `saveImpl` directly here?





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131749278
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +333,201 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+"""
+.. note:: DeveloperApi
+
+Helper trait for making simple :py:class`Params` types writable.  If a 
:py:class`Params`
+class stores all data as :py:class:'Param' values, then extending this 
trait will provide
--- End diff --

Does a normal single quote ' work, or do you have to use backticks ` ?





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-07 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131749037
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +333,201 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+"""
+.. note:: DeveloperApi
+
+Helper trait for making simple :py:class`Params` types writable.  If a 
:py:class`Params`
--- End diff --

Does this work with the missing colon?  Please make sure to generate the 
API docs and check them.  See https://github.com/apache/spark/tree/master/docs 
for instructions





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131516889
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +333,204 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+"""
+.. note:: DeveloperApi
+
+Helper trait for making simple `Params` types writable.  If a `Params` 
class stores
+all data as [[pyspark.ml.param.Param]] values, then extending this 
trait will provide
+a default implementation of writing saved instances of the class.
+This only handles simple [[pyspark.ml.param.Param]] types; e.g., it 
will not handle
+[[pyspark.sql.Dataset]].
+
+@see `DefaultParamsReadable`, the counterpart to this trait
+
+.. versionadded:: 2.3.0
+"""
+
+def write(self):
+"""Returns a DefaultParamsWriter instance for this class."""
+from pyspark.ml.param import Params
+
+if isinstance(self, Params):
+return DefaultParamsWriter(self)
+else:
+raise TypeError("Cannot use DefautParamsWritable with type %s 
because it does not " +
+" extend Params.", type(self))
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+"""
+.. note:: DeveloperApi
+
+Specialization of :py:class:`MLWriter` for :py:class:`Params` types
+
+Class for writing Estimators and Transformers whose parameters are 
JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+
+def saveImpl(self, path):
+DefaultParamsWriter.saveMetadata(self.instance, path, self.sc)
+
+@staticmethod
+def saveMetadata(instance, path, sc, extraMetadata=None, 
paramMap=None):
+"""
+Saves metadata + Params to: path + "/metadata"
+- class
+- timestamp
+- sparkVersion
+- uid
+- paramMap
+- (optionally, extra metadata)
+@param extraMetadata  Extra metadata to be saved at same level as 
uid, paramMap, etc.
+@param paramMap  If given, this is saved in the "paramMap" field.
+"""
+metadataPath = os.path.join(path, "metadata")
+metadataJson = DefaultParamsWriter._get_metadata_to_save(instance, sc, extraMetadata, paramMap)
+sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
+
+@staticmethod
+def _get_metadata_to_save(instance, sc, extraMetadata=None, 
paramMap=None):
+"""
+Helper for [[saveMetadata()]] which extracts the JSON to save.
+This is useful for ensemble models which need to save metadata for 
many sub-models.
+
+@see [[saveMetadata()]] for details on what this includes.
+"""
+uid = instance.uid
+cls = instance.__module__ + '.' + instance.__class__.__name__
+params = instance.extractParamMap()
+jsonParams = {}
+if paramMap is not None:
+jsonParams = paramMap
+else:
+for p in params:
+jsonParams[p.name] = params[p]
+basicMetadata = {"class": cls, "timestamp": long(round(time.time() 
* 1000)),
+ "sparkVersion": sc.version, "uid": uid, 
"paramMap": jsonParams}
+if extraMetadata is not None:
+basicMetadata.update(extraMetadata)
+return json.dumps(basicMetadata, separators=[',',  ':'])
+
+
+@inherit_doc
+class DefaultParamsReadable(MLReadable):
+"""
+.. note:: DeveloperApi
+
+Helper trait for making simple `Params` types readable.  If a `Params` 
class stores
+all data as [[pyspark.ml.param.Param]] values, then extending this 
trait will provide
+a default implementation of reading saved instances of the class.
+This only handles simple [[pyspark.ml.param.Param]] types; e.g., it 
will not handle
+[[pyspark.sql.Dataset]].
+
+@see `DefaultParamsWritable`, the counterpart to this trait
+
+.. versionadded:: 2.3.0
+"""
+
+@classmethod
+def read(cls):
+"""Returns a DefaultParamsReader instance for this class."""
+

[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-05 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131516809
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +333,204 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+"""
+.. note:: DeveloperApi
+
+Helper trait for making simple `Params` types writable.  If a `Params` 
class stores
+all data as [[pyspark.ml.param.Param]] values, then extending this 
trait will provide
--- End diff --

There are still some instances of Scaladoc-style class references using 
double brackets like this: ```[[MyClass]]```.  Switch to the python style like 
```:py:class:`MyClass` ```
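
For example, the docstring quoted above would read roughly like this after the switch (illustrative only; the surrounding class definition is omitted):

```python
"""
Helper trait for making simple :py:class:`Params` types writable.  If a
:py:class:`Params` class stores all data as :py:class:`pyspark.ml.param.Param`
values, then extending this trait will provide a default implementation of
writing saved instances of the class.
"""
```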





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-04 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131332345
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +341,198 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+"""
+.. note:: DeveloperApi
+
+Helper trait for making simple `Params` types writable.  If a `Params` 
class stores
+all data as [[pyspark.ml.param.Param]] values, then extending this 
trait will provide
+a default implementation of writing saved instances of the class.
+This only handles simple [[pyspark.ml.param.Param]] types; e.g., it 
will not handle
+[[pyspark.sql.Dataset]].
+
+@see `DefaultParamsReadable`, the counterpart to this trait
+
+.. versionadded:: 2.3.0
+"""
+
+def write(self):
+"""Returns a DefaultParamsWriter instance for this class."""
+if isinstance(self, Params):
+return DefaultParamsWriter(self)
+else:
+raise TypeError("Cannot use DefautParamsWritable with type %s 
because it does not " +
+" extend Params.", type(self))
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+"""
+.. note:: DeveloperApi
+
+Class for writing Estimators and Transformers whose parameters are 
JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+
+def saveImpl(self, path):
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
+@staticmethod
+def save_metadata(instance, path, sc, extraMetadata=None, 
paramMap=None):
--- End diff --

If this is going to be public, let's use camelCase to match style





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-04 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131331807
  
--- Diff: python/pyspark/ml/util.py ---
@@ -156,28 +218,23 @@ def write(self):
 
 
 @inherit_doc
-class MLReader(object):
+class MLReader(BaseReadWrite):
 """
 Utility class that can load ML instances.
 
 .. versionadded:: 2.0.0
 """
 
+def __init__(self):
+super(MLReader, self).__init__()
+
 def load(self, path):
 """Load the ML instance from the input path."""
 raise NotImplementedError("MLReader is not yet implemented for 
type: %s" % type(self))
 
-def context(self, sqlContext):
-"""
-Sets the SQL context to use for loading.
-
-.. note:: Deprecated in 2.1 and will be removed in 3.0, use 
session instead.
-"""
-raise NotImplementedError("MLReader is not yet implemented for 
type: %s" % type(self))
-
 def session(self, sparkSession):
--- End diff --

You can remove this instance of session since it is inherited.





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-04 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131331793
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,32 +66,89 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
+"""
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self):
+self._sparkSession = None
+
+def context(self, sqlContext):
+"""
+Sets the Spark SQLContext to use for saving/loading.
+
+.. note:: Deprecated in 2.1 and will be removed in 3.0, use 
session instead.
+"""
+raise NotImplementedError("Read/Write is not yet implemented for 
type: %s" % type(self))
+
+def session(self, sparkSession):
+"""
+Sets the Spark Session to use for saving/loading.
+"""
+self._sparkSession = sparkSession
+return self
+
+@property
+def sparkSession(self):
+"""
+Returns the user-specified Spark Session or the default.
+"""
+if self._sparkSession is None:
+self._sparkSession = SparkSession.builder.getOrCreate()
+return self._sparkSession
+
+@property
+def sc(self):
+"""
+Returns the underlying `SparkContext`.
+"""
+return self.sparkSession.sparkContext
+
+
+@inherit_doc
+class MLWriter(BaseReadWrite):
 """
 Utility class that can save ML instances.
 
 .. versionadded:: 2.0.0
 """
 
+def __init__(self):
+super(MLWriter, self).__init__()
+self.shouldOverwrite = False
+
+def _handleOverwrite(self, path):
+from pyspark.ml.wrapper import JavaWrapper
+
+_java_obj = JavaWrapper._new_java_obj("org.apache.spark.ml.util.FileSystemOverwrite")
+wrapper = JavaWrapper(_java_obj)
+wrapper._call_java("handleOverwrite", path, True, self.sc._jsc.sc())
+
 def save(self, path):
 """Save the ML instance to the input path."""
-raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
-
-def overwrite(self):
-"""Overwrites if the output path already exists."""
-raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+if self.shouldOverwrite:
+self._handleOverwrite(path)
+self.saveImpl(path)
 
-def context(self, sqlContext):
+def saveImpl(self, path):
 """
-Sets the SQL context to use for saving.
-
-.. note:: Deprecated in 2.1 and will be removed in 3.0, use 
session instead.
+save() handles overwriting and then calls this method.  Subclasses 
should override this
+method to implement the actual saving of the instance.
 """
 raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
 
+def overwrite(self):
+"""Overwrites if the output path already exists."""
+self.shouldOverwrite = True
+return self
+
 def session(self, sparkSession):
--- End diff --

You can remove this instance of session since it is inherited.





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-04 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131332176
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +341,198 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+"""
+.. note:: DeveloperApi
+
+Helper trait for making simple `Params` types writable.  If a `Params` 
class stores
+all data as [[pyspark.ml.param.Param]] values, then extending this 
trait will provide
+a default implementation of writing saved instances of the class.
+This only handles simple [[pyspark.ml.param.Param]] types; e.g., it 
will not handle
+[[pyspark.sql.Dataset]].
+
+@see `DefaultParamsReadable`, the counterpart to this trait
+
+.. versionadded:: 2.3.0
+"""
+
+def write(self):
+"""Returns a DefaultParamsWriter instance for this class."""
+if isinstance(self, Params):
+return DefaultParamsWriter(self)
+else:
+raise TypeError("Cannot use DefautParamsWritable with type %s 
because it does not " +
+" extend Params.", type(self))
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+"""
+.. note:: DeveloperApi
+
+Class for writing Estimators and Transformers whose parameters are 
JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+
+def saveImpl(self, path):
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
+@staticmethod
+def save_metadata(instance, path, sc, extraMetadata=None, 
paramMap=None):
+"""
+Saves metadata + Params to: path + "/metadata"
+- class
+- timestamp
+- sparkVersion
+- uid
+- paramMap
+- (optionally, extra metadata)
+@param extraMetadata  Extra metadata to be saved at same level as 
uid, paramMap, etc.
+@param paramMap  If given, this is saved in the "paramMap" field.
+"""
+metadataPath = os.path.join(path, "metadata")
+metadataJson = DefaultParamsWriter._get_metadata_to_save(instance, sc, extraMetadata, paramMap)
+sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
+
+@staticmethod
+def _get_metadata_to_save(instance, sc, extraMetadata=None, 
paramMap=None):
+"""
+Helper for [[save_metadata()]] which extracts the JSON to save.
+This is useful for ensemble models which need to save metadata for 
many sub-models.
+
+@see [[save_metadata()]] for details on what this includes.
+"""
+uid = instance.uid
+cls = instance.__module__ + '.' + instance.__class__.__name__
+params = instance.extractParamMap()
+jsonParams = {}
+if paramMap is not None:
+jsonParams = paramMap
+else:
+for p in params:
+jsonParams[p.name] = params[p]
+basicMetadata = {"class": cls, "timestamp": long(round(time.time() 
* 1000)),
+ "sparkVersion": sc.version, "uid": uid, 
"paramMap": jsonParams}
+if extraMetadata is not None:
+basicMetadata.update(extraMetadata)
+return json.dumps(basicMetadata, separators=[',',  ':'])
+
+
+@inherit_doc
+class DefaultParamsReadable(MLReadable):
+"""
+.. note:: DeveloperApi
+
+Helper trait for making simple `Params` types readable.  If a `Params` 
class stores
+all data as [[pyspark.ml.param.Param]] values, then extending this 
trait will provide
--- End diff --

For Python docs, follow other examples such as ```:py:class:`MLWriter` ```



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131284503
  
--- Diff: python/pyspark/ml/tests.py ---
@@ -1957,6 +1964,46 @@ def test_chisquaretest(self):
 self.assertTrue(all(field in fieldNames for field in 
expectedFields))
 
 
+class DefaultReadWriteTests(SparkSessionTestCase):
+
+def test_default_read_write(self):
+temp_path = tempfile.mkdtemp()
+
+lr = LogisticRegression()
+lr.setMaxIter(50)
+lr.setThreshold(.75)
+writer = DefaultParamsWriter(lr)
+
+savePath = temp_path + "/lr"
+writer.saveImpl(savePath)
+
+reader = DefaultParamsReadable.read()
+lr2 = reader.load(savePath)
+
+self.assertEqual(lr.uid, lr2.uid)
+self.assertEqual(lr.extractParamMap(), lr2.extractParamMap())
+
+def test_default_read_write_with_overwrite(self):
--- End diff --

Since these tests are almost the same, can you combine them to reduce code 
duplication?
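
One hedged way to fold them into a single code path (the helper name and the reader construction are illustrative; adjust to whatever the final API looks like):

```python
def _check_default_read_write(self, overwrite=False):
    temp_path = tempfile.mkdtemp()
    lr = LogisticRegression()
    lr.setMaxIter(50)
    lr.setThreshold(.75)

    writer = DefaultParamsWriter(lr)
    if overwrite:
        writer.overwrite()
    save_path = temp_path + "/lr"
    writer.save(save_path)
    if overwrite:
        # A second save to the same path only succeeds because overwrite is enabled.
        writer.save(save_path)

    lr2 = DefaultParamsReader(LogisticRegression).load(save_path)
    self.assertEqual(lr.uid, lr2.uid)
    self.assertEqual(lr.extractParamMap(), lr2.extractParamMap())

def test_default_read_write(self):
    self._check_default_read_write(overwrite=False)
    self._check_default_read_write(overwrite=True)
```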





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131285744
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -471,3 +471,24 @@ private[ml] object MetaAlgorithmReadWrite {
 List((instance.uid, instance)) ++ subStageMaps
   }
 }
+
+private[ml] class FileSystemOverwrite extends Logging {
+
+  def handleOverwrite(path: String, shouldOverwrite: Boolean, sc: 
SparkContext): Unit = {
--- End diff --

This is the same now as the code in MLWriter.save, so please use this 
within MLWriter.save to eliminate the duplicated code.





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131286314
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,20 +66,74 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
+"""
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self):
+self._sparkSession = None
+
+def context(self, sqlContext):
--- End diff --

Leaving this is OK if you remove it from the subclasses, for the sake of 
code simplification.





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131287820
  
--- Diff: python/pyspark/ml/util.py ---
@@ -237,6 +300,13 @@ def _load_java_obj(cls, clazz):
 java_obj = getattr(java_obj, name)
 return java_obj
 
+@classmethod
+def _load_given_name(cls, java_class):
--- End diff --

This can be static since you don't use ```cls```





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131288629
  
--- Diff: python/pyspark/ml/util.py ---
@@ -237,6 +300,13 @@ def _load_java_obj(cls, clazz):
 java_obj = getattr(java_obj, name)
 return java_obj
 
+@classmethod
+def _load_given_name(cls, java_class):
--- End diff --

Please also stick with camelCase for methods since that's what pyspark 
does.  (here and elsewhere)
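
Combining the two comments, the helper would end up looking roughly like this (the body is not shown in the quoted diff, so it is elided here as well):

```python
class MLReadable(object):
    # ... existing methods from the diff omitted ...

    @staticmethod
    def _loadGivenName(javaClass):
        # Hypothetical rename: camelCase, leading underscore, and @staticmethod
        # instead of @classmethod since cls is not used.
        ...
```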





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131288896
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +353,143 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+"""
+Class for making simple Params types writable. Assumes that all 
parameters
+are JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def write(self):
+"""Returns a DefaultParamsWriter instance for this class."""
+if isinstance(self, Params):
+return DefaultParamsWriter(self)
+else:
+raise TypeError("Cannot use DefautParamsWritable with type %s 
because it does not " +
+" extend Params.", type(self))
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+"""
+Class for writing Estimators and Transformers whose parameters are 
JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+
+def saveImpl(self, path):
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
+@staticmethod
+def save_metadata(instance, path, sc, extraMetadata=None, 
paramMap=None):
--- End diff --

add leading underscore to make this private





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131288360
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +353,143 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+"""
+Class for making simple Params types writable. Assumes that all 
parameters
+are JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def write(self):
+"""Returns a DefaultParamsWriter instance for this class."""
+if isinstance(self, Params):
+return DefaultParamsWriter(self)
+else:
+raise TypeError("Cannot use DefautParamsWritable with type %s 
because it does not " +
+" extend Params.", type(self))
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+"""
+Class for writing Estimators and Transformers whose parameters are 
JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+
+def saveImpl(self, path):
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
+@staticmethod
+def save_metadata(instance, path, sc, extraMetadata=None, 
paramMap=None):
+metadataPath = os.path.join(path, "metadata")
+metadataJson = DefaultParamsWriter.get_metadata_to_save(instance, metadataPath, sc, extraMetadata, paramMap)
+sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
+
+@staticmethod
+def get_metadata_to_save(instance, path, sc, extraMetadata=None, 
paramMap=None):
+uid = instance.uid
+cls = instance.__module__ + '.' + instance.__class__.__name__
+params = instance.extractParamMap()
+jsonParams = {}
+if paramMap is not None:
+for p in paramMap:
+jsonParams[p.name] = paramMap[p]
+else:
+for p in params:
+jsonParams[p.name] = params[p]
+basicMetadata = {"class": cls, "timestamp": long(round(time.time() 
* 1000)),
+ "sparkVersion": sc.version, "uid": uid, 
"paramMap": jsonParams}
+if extraMetadata is not None:
+basicMetadata.update(extraMetadata)
+return json.dumps(basicMetadata, separators=[',',  ':'])
+
+
+@inherit_doc
+class DefaultParamsReadable(MLReadable):
+"""
+Class for making simple Params types readable. Assumes that all 
parameters
--- End diff --

Could you please merge the Scala doc into this to make this more detailed?





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131287028
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,20 +66,74 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
+"""
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self):
+self._sparkSession = None
+
+def context(self, sqlContext):
+"""
+Sets the Spark SQLContext to use for saving/loading.
+
+.. note:: Deprecated in 2.1 and will be removed in 3.0, use 
session instead.
+"""
+raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+
+def session(self, sparkSession):
+"""
+Sets the Spark Session to use for saving/loading.
+"""
+self._sparkSession = sparkSession
+return self
+
+def sparkSession(self):
+if self._sparkSession is None:
+self._sparkSession = SparkSession.builder.getOrCreate()
+return self._sparkSession
+
+@property
+def sc(self):
+return self.sparkSession().sparkContext
+
+
+@inherit_doc
+class MLWriter(BaseReadWrite):
 """
 Utility class that can save ML instances.
 
 .. versionadded:: 2.0.0
 """
 
+def __init__(self):
+super(MLWriter, self).__init__()
+self.shouldOverwrite = False
+
+def _handleOverwrite(self, path):
+from pyspark.ml.wrapper import JavaWrapper
+
+_java_obj = JavaWrapper._new_java_obj("org.apache.spark.ml.util.FileSystemOverwrite")
+wrapper = JavaWrapper(_java_obj)
+wrapper._call_java("handleOverwrite", path, True, self.sc._jsc.sc())
+
 def save(self, path):
 """Save the ML instance to the input path."""
+if self.shouldOverwrite:
+self._handleOverwrite(path)
+self.saveImpl(path)
+
+def saveImpl(self, path):
--- End diff --

Add doc from Scala





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131288786
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +353,143 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+"""
+Class for making simple Params types writable. Assumes that all 
parameters
+are JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def write(self):
+"""Returns a DefaultParamsWriter instance for this class."""
+if isinstance(self, Params):
+return DefaultParamsWriter(self)
+else:
+raise TypeError("Cannot use DefautParamsWritable with type %s 
because it does not " +
+" extend Params.", type(self))
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+"""
+Class for writing Estimators and Transformers whose parameters are 
JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+
+def saveImpl(self, path):
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
+@staticmethod
+def save_metadata(instance, path, sc, extraMetadata=None, 
paramMap=None):
+metadataPath = os.path.join(path, "metadata")
+metadataJson = DefaultParamsWriter.get_metadata_to_save(instance, metadataPath, sc, extraMetadata, paramMap)
+sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
+
+@staticmethod
+def get_metadata_to_save(instance, path, sc, extraMetadata=None, 
paramMap=None):
--- End diff --

path is not used





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131288351
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +353,143 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+"""
+Class for making simple Params types writable. Assumes that all 
parameters
--- End diff --

Could you please merge the Scala doc into this to make this more detailed?





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131290424
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +353,143 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+"""
+Class for making simple Params types writable. Assumes that all 
parameters
+are JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def write(self):
+"""Returns a DefaultParamsWriter instance for this class."""
+if isinstance(self, Params):
+return DefaultParamsWriter(self)
+else:
+raise TypeError("Cannot use DefautParamsWritable with type %s 
because it does not " +
+" extend Params.", type(self))
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+"""
+Class for writing Estimators and Transformers whose parameters are 
JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+
+def saveImpl(self, path):
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
+@staticmethod
+def save_metadata(instance, path, sc, extraMetadata=None, 
paramMap=None):
+metadataPath = os.path.join(path, "metadata")
+metadataJson = DefaultParamsWriter.get_metadata_to_save(instance, metadataPath, sc, extraMetadata, paramMap)
+sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
+
+@staticmethod
+def get_metadata_to_save(instance, path, sc, extraMetadata=None, 
paramMap=None):
+uid = instance.uid
+cls = instance.__module__ + '.' + instance.__class__.__name__
+params = instance.extractParamMap()
+jsonParams = {}
+if paramMap is not None:
+for p in paramMap:
+jsonParams[p.name] = paramMap[p]
+else:
+for p in params:
+jsonParams[p.name] = params[p]
+basicMetadata = {"class": cls, "timestamp": long(round(time.time() 
* 1000)),
+ "sparkVersion": sc.version, "uid": uid, 
"paramMap": jsonParams}
+if extraMetadata is not None:
+basicMetadata.update(extraMetadata)
+return json.dumps(basicMetadata, separators=[',',  ':'])
+
+
+@inherit_doc
+class DefaultParamsReadable(MLReadable):
+"""
+Class for making simple Params types readable. Assumes that all 
parameters
+are JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+@classmethod
+def read(cls):
+"""Returns a DefaultParamsReader instance for this class."""
+return DefaultParamsReader(cls)
+
+
+@inherit_doc
+class DefaultParamsReader(MLReader):
+"""
+Class for reading Estimators and Transformers whose parameters are 
JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self, cls):
+super(DefaultParamsReader, self).__init__()
+self.cls = cls
+
+@staticmethod
+def __get_class(clazz):
+"""
+Loads Python class from its name.
+"""
+parts = clazz.split('.')
+module = ".".join(parts[:-1])
+m = __import__(module)
+for comp in parts[1:]:
+m = getattr(m, comp)
+return m
+
+def load(self, path):
+metadata = DefaultParamsReader.loadMetadata(path, self.sc)
+py_type = DefaultParamsReader.__get_class(metadata['class'])
+instance = py_type()
+instance._resetUid(metadata['uid'])
+DefaultParamsReader.getAndSetParams(instance, metadata)
+return instance
+
+@staticmethod
+def loadMetadata(path, sc, expectedClassName=""):
--- End diff --

These static methods can be private (add leading underscores)



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131288910
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +353,143 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+"""
+Class for making simple Params types writable. Assumes that all 
parameters
+are JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def write(self):
+"""Returns a DefaultParamsWriter instance for this class."""
+if isinstance(self, Params):
+return DefaultParamsWriter(self)
+else:
+raise TypeError("Cannot use DefautParamsWritable with type %s 
because it does not " +
+" extend Params.", type(self))
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+"""
+Class for writing Estimators and Transformers whose parameters are 
JSON-serializable.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+
+def saveImpl(self, path):
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
+@staticmethod
+def save_metadata(instance, path, sc, extraMetadata=None, 
paramMap=None):
+metadataPath = os.path.join(path, "metadata")
+metadataJson = DefaultParamsWriter.get_metadata_to_save(instance, metadataPath, sc, extraMetadata, paramMap)
+sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
+
+@staticmethod
+def get_metadata_to_save(instance, path, sc, extraMetadata=None, 
paramMap=None):
--- End diff --

add leading underscore to make this private





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131287269
  
--- Diff: python/pyspark/ml/util.py ---
@@ -86,7 +145,7 @@ def context(self, sqlContext):
 
 def session(self, sparkSession):
--- End diff --

You can just remove this now, right?  It will be inherited.





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131286586
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,20 +66,74 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
+"""
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self):
+self._sparkSession = None
+
+def context(self, sqlContext):
+"""
+Sets the Spark SQLContext to use for saving/loading.
+
+.. note:: Deprecated in 2.1 and will be removed in 3.0, use 
session instead.
+"""
+raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+
+def session(self, sparkSession):
+"""
+Sets the Spark Session to use for saving/loading.
+"""
+self._sparkSession = sparkSession
+return self
+
+def sparkSession(self):
--- End diff --

Copy doc from Scala.  Also, this could be a ```@property``` to match Scala





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131265094
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -471,3 +471,26 @@ private[ml] object MetaAlgorithmReadWrite {
 List((instance.uid, instance)) ++ subStageMaps
   }
 }
+
+class FileSystemOverwrite extends BaseReadWrite with Logging {
--- End diff --

Make this package private: ```private[util]```





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131265462
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/util/ReadWrite.scala ---
@@ -471,3 +471,26 @@ private[ml] object MetaAlgorithmReadWrite {
 List((instance.uid, instance)) ++ subStageMaps
   }
 }
+
+class FileSystemOverwrite extends BaseReadWrite with Logging {
--- End diff --

Also, I'd make this a static object, eliminate the BaseReadWrite 
dependency, and just take ```sc``` as an argument.





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131262008
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -375,6 +375,18 @@ def copy(self, extra=None):
 that._defaultParamMap = {}
 return self._copyValues(that, extra)
 
+def set(self, param, value):
+"""
--- End diff --

OK, makes sense, let's leave it.





[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131262239
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -375,6 +375,18 @@ def copy(self, extra=None):
 that._defaultParamMap = {}
 return self._copyValues(that, extra)
 
+def set(self, param, value):
+"""
+Sets the value for a parameter in the parameter map.
+"""
+if not self.hasParam(param.name):
+raise TypeError('Invalid param %s given for instance %r. %s' % 
(p.name, self, e))
--- End diff --

Can you add a test which tries to set this with an invalid Param? This line will 
not work, since `e` is not defined, but the tests pass because this line of 
code is never exercised.
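
A minimal sketch of what such a test could look like, assuming the raise is 
first reworded to reference `param` instead of the undefined names (the 
`_Dummy` class below is purely illustrative):

```
from pyspark.ml.param import Param, Params


class _Dummy(Params):
    """Hypothetical Params subclass used only for this illustration."""
    def __init__(self):
        super(_Dummy, self).__init__()
        self.maxIter = Param(self, "maxIter", "max number of iterations")


def test_set_invalid_param():
    instance = _Dummy()
    # a Param that does not belong to `instance`
    bogus = Param(Params._dummy(), "notAParam", "not a param of instance")
    try:
        instance.set(bogus, 42)
        assert False, "expected TypeError for an invalid Param"
    except TypeError:
        pass
```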


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131261798
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -375,6 +375,18 @@ def copy(self, extra=None):
 that._defaultParamMap = {}
 return self._copyValues(that, extra)
 
+def set(self, param, value):
+"""
+Sets the value for a parameter in the parameter map.
--- End diff --

Also, just copy the doc from Scala.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131260736
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -375,6 +375,18 @@ def copy(self, extra=None):
 that._defaultParamMap = {}
 return self._copyValues(that, extra)
 
+def set(self, param, value):
+"""
+Sets the value for a parameter in the parameter map.
--- End diff --

style: reduce the indent by 4 spaces


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131264975
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +289,124 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+
+# overrides the write() function in MLWriteable
+# users call .save() in MLWriteable which calls this write() function 
and then calls
+# the .save() in DefaultParamsWriter
+# this can be overridden to return a different Writer (ex. 
OneVsRestWriter as seen in Scala)
+def write(self):
+# instance of check for params?
+return DefaultParamsWriter(self)
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+self.sc = SparkContext._active_spark_context
+
+# if a model extends DefaultParamsWriteable this save() function is 
called
+def save(self, path):
+if self.shouldOverwrite:
+# This command removes a file. Is this enough?
+os.remove(path)
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
--- End diff --

I'd strongly prefer to mimic the Scala APIs.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131261087
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -375,6 +375,18 @@ def copy(self, extra=None):
 that._defaultParamMap = {}
 return self._copyValues(that, extra)
 
+def set(self, param, value):
+"""
+Sets the value for a parameter in the parameter map.
+"""
+if not self.hasParam(param.name):
--- End diff --

Reuse _shouldOwn for validation here.
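
For illustration, a hedged sketch of how set() could lean on the existing 
_shouldOwn helper (shown as it might appear inside `Params`; the 
type-conversion step is an assumption about how invalid values would be 
reported):

```
# inside pyspark.ml.param.Params
def set(self, param, value):
    """
    Sets a parameter in the embedded param map.
    """
    self._shouldOwn(param)  # raises ValueError if param does not belong to self
    try:
        value = param.typeConverter(value)
    except ValueError as e:
        raise ValueError('Invalid param value given for param "%s". %s'
                         % (param.name, e))
    self._paramMap[param] = value
```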


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-02 Thread ajaysaini725
Github user ajaysaini725 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131028435
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -375,6 +375,18 @@ def copy(self, extra=None):
 that._defaultParamMap = {}
 return self._copyValues(that, extra)
 
+def set(self, param, value):
+"""
--- End diff --

I included this function because the getAndSetParams function in 
DefaultParamsReader needed a way to set a parameter on an instance given the 
Param object and its value. There wasn't any helper in the Params class that 
did this, so I added the set() function from Scala. Is it okay to leave it for 
this purpose, or is there a better way of handling this case?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-02 Thread ajaysaini725
Github user ajaysaini725 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131017418
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,32 +66,82 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
+"""
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self):
+self.sparkSession = None
+
+def context(self, sqlContext):
--- End diff --

Yeah, it isn't used here; I just included the function definition because 
JavaMLWriter, which inherits from MLWriter (which in turn inherits from 
BaseReadWrite), uses it. I changed the function body to raise a 
NotImplementedError, though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131001072
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,32 +66,82 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
+"""
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self):
+self.sparkSession = None
+
+def context(self, sqlContext):
+"""
+Sets the SQL context to use for saving.
+
+.. note:: Deprecated in 2.1 and will be removed in 3.0, use 
session instead.
+"""
+# raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+self.sparkSession = sqlContext.sparkSession
+return self
+
+def session(self, sparkSession):
+"""Sets the Spark Session to use for saving."""
+# raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+self.sparkSession = sparkSession
+return self
+
+def getOrCreateSparkSession(self):
--- End diff --

Rename to sparkSession to match Scala API


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r130996252
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -375,6 +375,18 @@ def copy(self, extra=None):
 that._defaultParamMap = {}
 return self._copyValues(that, extra)
 
+def set(self, param, value):
+"""
--- End diff --

This does look extraneous.  I'm OK with adding it to match the Scala API, 
but it'd be better to put it in a separate PR and JIRA.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131001862
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,32 +66,82 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
+"""
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self):
+self.sparkSession = None
+
+def context(self, sqlContext):
+"""
+Sets the SQL context to use for saving.
+
+.. note:: Deprecated in 2.1 and will be removed in 3.0, use 
session instead.
+"""
+# raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+self.sparkSession = sqlContext.sparkSession
+return self
+
+def session(self, sparkSession):
+"""Sets the Spark Session to use for saving."""
+# raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+self.sparkSession = sparkSession
+return self
+
+def getOrCreateSparkSession(self):
+if self.sparkSession is None:
+self.sparkSession = SparkSession.builder.getOrCreate()
+return self.sparkSession
+
+@property
+def sc(self):
+return self.getOrCreateSparkSession().sparkContext
+
+
+@inherit_doc
+class MLWriter(BaseReadWrite):
 """
 Utility class that can save ML instances.
 
 .. versionadded:: 2.0.0
 """
 
+def __init__(self):
+super(MLWriter, self).__init__()
+self.shouldOverwrite = False
+
 def save(self, path):
 """Save the ML instance to the input path."""
+if isinstance(self, JavaMLWriter):
+self.saveImpl(path)
+else:
+if self.shouldOverwrite:
+os.remove(path)
+self.saveImpl(path)
+
+def saveImpl(self, path):
--- End diff --

Add a leading underscore to indicate that it is private/protected.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131000706
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,32 +66,82 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
+"""
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self):
+self.sparkSession = None
+
+def context(self, sqlContext):
--- End diff --

No need for this, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131001546
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,32 +66,82 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
+"""
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self):
+self.sparkSession = None
+
+def context(self, sqlContext):
+"""
+Sets the SQL context to use for saving.
+
+.. note:: Deprecated in 2.1 and will be removed in 3.0, use 
session instead.
+"""
+# raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+self.sparkSession = sqlContext.sparkSession
+return self
+
+def session(self, sparkSession):
+"""Sets the Spark Session to use for saving."""
+# raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+self.sparkSession = sparkSession
+return self
+
+def getOrCreateSparkSession(self):
+if self.sparkSession is None:
+self.sparkSession = SparkSession.builder.getOrCreate()
+return self.sparkSession
+
+@property
+def sc(self):
+return self.getOrCreateSparkSession().sparkContext
+
+
+@inherit_doc
+class MLWriter(BaseReadWrite):
 """
 Utility class that can save ML instances.
 
 .. versionadded:: 2.0.0
 """
 
+def __init__(self):
+super(MLWriter, self).__init__()
+self.shouldOverwrite = False
+
 def save(self, path):
 """Save the ML instance to the input path."""
+if isinstance(self, JavaMLWriter):
+self.saveImpl(path)
+else:
+if self.shouldOverwrite:
+os.remove(path)
+self.saveImpl(path)
+
+def saveImpl(self, path):
 raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
 
 def overwrite(self):
 """Overwrites if the output path already exists."""
-raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+self.shouldOverwrite = True
--- End diff --

Return `self` (to allow the builder pattern); always match the Scala API where 
possible and where it makes sense.
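
For illustration, a small sketch of the change and of the Scala-style chained 
call it enables (the path below is just an example):

```
# inside MLWriter
def overwrite(self):
    """Overwrites if the output path already exists."""
    self.shouldOverwrite = True
    return self

# user code can then chain calls, as in Scala:
#   model.write().overwrite().save("/tmp/example-model")
```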


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131000859
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,32 +66,82 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
+"""
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self):
+self.sparkSession = None
+
+def context(self, sqlContext):
+"""
+Sets the SQL context to use for saving.
+
+.. note:: Deprecated in 2.1 and will be removed in 3.0, use 
session instead.
+"""
+# raise NotImplementedError("MLWriter is not yet implemented for 
type: %s" % type(self))
+self.sparkSession = sqlContext.sparkSession
+return self
+
+def session(self, sparkSession):
+"""Sets the Spark Session to use for saving."""
--- End diff --

Copy doc from Scala; no need to rewrite it


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-02 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r131000608
  
--- Diff: python/pyspark/ml/util.py ---
@@ -61,32 +66,82 @@ def _randomUID(cls):
 
 
 @inherit_doc
-class MLWriter(object):
+class BaseReadWrite(object):
+"""
+Base class for MLWriter and MLReader. Stores information about the 
SparkContext
+and SparkSession.
+
+.. versionadded:: 2.3.0
+"""
+
+def __init__(self):
+self.sparkSession = None
--- End diff --

Add leading underscore _sparkSession to be Pythonic


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-01 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r130745275
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +289,124 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+
+# overrides the write() function in MLWriteable
+# users call .save() in MLWriteable which calls this write() function 
and then calls
+# the .save() in DefaultParamsWriter
+# this can be overridden to return a different Writer (ex. 
OneVsRestWriter as seen in Scala)
+def write(self):
+# instance of check for params?
+return DefaultParamsWriter(self)
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+self.sc = SparkContext._active_spark_context
+
+# if a model extends DefaultParamsWriteable this save() function is 
called
+def save(self, path):
+if self.shouldOverwrite:
+# This command removes a file. Is this enough?
+os.remove(path)
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
--- End diff --

Good question. In my opinion, the best approach is: when the Python side uses 
`JavaMLWriter`, just override `save` directly; but when it is a custom 
`MLWriter` implemented in Python, tell users to inherit from 
`DefaultParamsWriter` and override `saveImpl`. That will make the interface 
easier to use.
cc @jkbradley what do you think about this?
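
For illustration, a hedged sketch of the user-facing side of that proposal, 
assuming DefaultParamsWriter.save() handles the params metadata and then calls 
an overridable saveImpl(); MyTransformerWriter and its extra payload are 
purely hypothetical names:

```
import os


class MyTransformerWriter(DefaultParamsWriter):

    def __init__(self, instance):
        super(MyTransformerWriter, self).__init__(instance)

    def saveImpl(self, path):
        # params metadata is written by DefaultParamsWriter.save(); only the
        # transformer-specific payload needs to be persisted here
        extra_path = os.path.join(path, "data")
        self.sc.parallelize([self.instance.someData], 1).saveAsTextFile(extra_path)
```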


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-08-01 Thread ajaysaini725
Github user ajaysaini725 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r130741050
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +289,124 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+
+# overrides the write() function in MLWriteable
+# users call .save() in MLWriteable which calls this write() function 
and then calls
+# the .save() in DefaultParamsWriter
+# this can be overridden to return a different Writer (ex. 
OneVsRestWriter as seen in Scala)
+def write(self):
+# instance of check for params?
+return DefaultParamsWriter(self)
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+self.sc = SparkContext._active_spark_context
+
+# if a model extends DefaultParamsWriteable this save() function is 
called
+def save(self, path):
+if self.shouldOverwrite:
+# This command removes a file. Is this enough?
+os.remove(path)
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
--- End diff --

This implementation exists on the Scala side. However, in Python, MLWriter 
does not even define a saveImpl() function and instead says that the user 
should override the save() function. I'm not entirely sure which is best in 
Python, but as of now everything in pyspark overrides save() instead of 
saveImpl(), so it is probably best to stick with that pattern.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-07-31 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r130523538
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +289,124 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+
+# overrides the write() function in MLWriteable
+# users call .save() in MLWriteable which calls this write() function 
and then calls
+# the .save() in DefaultParamsWriter
+# this can be overridden to return a different Writer (ex. 
OneVsRestWriter as seen in Scala)
+def write(self):
+# instance of check for params?
+return DefaultParamsWriter(self)
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+self.sc = SparkContext._active_spark_context
--- End diff --

Can we move `sc` into the `BaseReadWrite` abstract hierarchy, like the Scala 
side does?
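
For illustration, a sketch of what DefaultParamsWriter could look like if 
BaseReadWrite exposed `sc` (derived from the stored or default SparkSession), 
as on the Scala side; the writer then no longer needs to grab the active 
SparkContext itself:

```
class DefaultParamsWriter(MLWriter):

    def __init__(self, instance):
        super(DefaultParamsWriter, self).__init__()
        self.instance = instance
        # no explicit self.sc here: the inherited `sc` property resolves the
        # SparkContext from the (possibly default) SparkSession on demand

    def save(self, path):
        DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
```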


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-07-31 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r130519716
  
--- Diff: python/pyspark/ml/param/__init__.py ---
@@ -375,6 +375,18 @@ def copy(self, extra=None):
 that._defaultParamMap = {}
 return self._copyValues(that, extra)
 
+def set(self, param, value):
+"""
--- End diff --

We already have `def _set(self, **kwargs)`, so do we really need this method?
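
For context, a small sketch of the difference (LogisticRegression is just an 
arbitrary example, and the set() call assumes the method proposed in this PR): 
the existing helper is keyed by param name, while set() takes the Param object 
itself, which is what a generic reader holding Param instances rather than 
names would need.

```
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()  # a running JVM is needed to build the estimator
lr = LogisticRegression()

# existing protected helper: keyword arguments keyed by param name
lr._set(maxIter=10, regParam=0.01)

# proposed public set(): takes the Param object itself
lr.set(lr.maxIter, 10)
```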


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-07-31 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r130521214
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +289,124 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+
+# overrides the write() function in MLWriteable
+# users call .save() in MLWriteable which calls this write() function 
and then calls
+# the .save() in DefaultParamsWriter
+# this can be overridden to return a different Writer (ex. 
OneVsRestWriter as seen in Scala)
+def write(self):
+# instance of check for params?
+return DefaultParamsWriter(self)
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+self.sc = SparkContext._active_spark_context
+
+# if a model extends DefaultParamsWriteable this save() function is 
called
+def save(self, path):
+if self.shouldOverwrite:
+# This command removes a file. Is this enough?
+os.remove(path)
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
+def overwrite(self):
+self.shouldOverwrite = True
+return self
+
+@staticmethod
+def save_metadata(instance, path, sc, extraMetadata=None, 
paramMap=None):
+metadataPath = os.path.join(path, "metadata")
+metadataJson = DefaultParamsWriter.get_metadata_to_save(instance,
+
metadataPath,
+sc,
+
extraMetadata,
+paramMap)
+sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
+
+@staticmethod
+def get_metadata_to_save(instance, path, sc, extraMetadata=None, 
paramMap=None):
+uid = instance.uid
+cls = instance.__module__ + '.' + instance.__class__.__name__
+params = instance.extractParamMap()
+jsonParams = {}
+if paramMap is not None:
+for p in paramMap:
+jsonParams[p.name] = paramMap[p]
+else:
+for p in params:
+jsonParams[p.name] = params[p]
+basicMetadata = {"class": cls, "timestamp": int(round(time.time() 
* 1000)),
+ "sparkVersion": sc.version, "uid": uid, 
"paramMap": jsonParams}
--- End diff --

Maybe we should use `long(round(time.time() * 1000))`?
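
For what it's worth, a small hedged sketch of a version-tolerant variant, on 
the assumption that both Python 2 and 3 must be supported: Python 3 has no 
`long` builtin, and Python 2 ints auto-promote on overflow, so the alias below 
keeps the expression valid on both.

```
import sys
import time

if sys.version_info[0] >= 3:
    long = int  # Python 3: int is unbounded, long no longer exists

timestamp = long(round(time.time() * 1000))
```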


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-07-31 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r130521964
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +289,124 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+
+# overrides the write() function in MLWriteable
+# users call .save() in MLWriteable which calls this write() function 
and then calls
+# the .save() in DefaultParamsWriter
+# this can be overridden to return a different Writer (ex. 
OneVsRestWriter as seen in Scala)
+def write(self):
+# instance of check for params?
+return DefaultParamsWriter(self)
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+self.sc = SparkContext._active_spark_context
+
+# if a model extends DefaultParamsWriteable this save() function is 
called
+def save(self, path):
+if self.shouldOverwrite:
+# This command removes a file. Is this enough?
+os.remove(path)
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
--- End diff --

We need to add a `saveImpl` method here and let users override `saveImpl`; 
`save` should call `saveImpl` (users should not override the `save` method), 
like:
```
def save(self, path):
    # save the params metadata first
    DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
    # then delegate to the subclass-specific implementation
    self.saveImpl(path)

def saveImpl(self, path):
    raise NotImplementedError()
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-07-31 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r130520335
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +289,124 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+
+# overrides the write() function in MLWriteable
+# users call .save() in MLWriteable which calls this write() function 
and then calls
+# the .save() in DefaultParamsWriter
+# this can be overridden to return a different Writer (ex. 
OneVsRestWriter as seen in Scala)
+def write(self):
+# instance of check for params?
+return DefaultParamsWriter(self)
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+self.sc = SparkContext._active_spark_context
+
+# if a model extends DefaultParamsWriteable this save() function is 
called
+def save(self, path):
+if self.shouldOverwrite:
+# This command removes a file. Is this enough?
+os.remove(path)
+DefaultParamsWriter.save_metadata(self.instance, path, self.sc)
+
+def overwrite(self):
+self.shouldOverwrite = True
+return self
+
+@staticmethod
+def save_metadata(instance, path, sc, extraMetadata=None, 
paramMap=None):
+metadataPath = os.path.join(path, "metadata")
+metadataJson = DefaultParamsWriter.get_metadata_to_save(instance,
+
metadataPath,
+sc,
+
extraMetadata,
+paramMap)
+sc.parallelize([metadataJson], 1).saveAsTextFile(metadataPath)
+
+@staticmethod
+def get_metadata_to_save(instance, path, sc, extraMetadata=None, 
paramMap=None):
+uid = instance.uid
+cls = instance.__module__ + '.' + instance.__class__.__name__
+params = instance.extractParamMap()
+jsonParams = {}
+if paramMap is not None:
+for p in paramMap:
+jsonParams[p.name] = paramMap[p]
+else:
+for p in params:
+jsonParams[p.name] = params[p]
+basicMetadata = {"class": cls, "timestamp": int(round(time.time() 
* 1000)),
+ "sparkVersion": sc.version, "uid": uid, 
"paramMap": jsonParams}
+if extraMetadata is not None:
+basicMetadata.update(extraMetadata)
+return json.dumps(basicMetadata)
--- End diff --

I suggest using
`json.dumps(basicMetadata, separators=(',', ':'))`
so that the output string is compacted without blank characters
(the Scala side also does this compaction).
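
For illustration, a quick look at the difference (the dict below is a toy 
example):

```
import json

meta = {"paramMap": {"threshold": 0.5}}
print(json.dumps(meta))                         # {"paramMap": {"threshold": 0.5}}
print(json.dumps(meta, separators=(',', ':')))  # {"paramMap":{"threshold":0.5}}
```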


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #18742: [Spark-21542][ML][Python]Python persistence helpe...

2017-07-31 Thread WeichenXu123
Github user WeichenXu123 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18742#discussion_r130522066
  
--- Diff: python/pyspark/ml/util.py ---
@@ -283,3 +289,124 @@ def numFeatures(self):
 Returns the number of features the model was trained on. If 
unknown, returns -1
 """
 return self._call_java("numFeatures")
+
+
+@inherit_doc
+class DefaultParamsWritable(MLWritable):
+
+# overrides the write() function in MLWriteable
+# users call .save() in MLWriteable which calls this write() function 
and then calls
+# the .save() in DefaultParamsWriter
+# this can be overridden to return a different Writer (ex. 
OneVsRestWriter as seen in Scala)
+def write(self):
+# instance of check for params?
+return DefaultParamsWriter(self)
+
+
+@inherit_doc
+class DefaultParamsWriter(MLWriter):
+
+def __init__(self, instance):
+super(DefaultParamsWriter, self).__init__()
+self.instance = instance
+self.sc = SparkContext._active_spark_context
+
+# if a model extends DefaultParamsWriteable this save() function is 
called
+def save(self, path):
+if self.shouldOverwrite:
+# This command removes a file. Is this enough?
+os.remove(path)
--- End diff --

We use `saveAsTextFile` to save the metadata, and it is possible to save to 
`HDFS` or another non-local file system, so we cannot use `os.remove` to 
remove the path here. We first need to get the file system for the `path` and 
call the corresponding API.
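
For illustration, a hedged sketch of deleting the output path through the 
Hadoop FileSystem API via the py4j gateway, so HDFS and other non-local file 
systems work as well as the local one; the helper name is hypothetical and 
this is not the PR's actual code:

```
def _remove_path(sc, path):
    """Delete `path` (recursively) on whatever file system it lives on."""
    hadoop_conf = sc._jsc.hadoopConfiguration()
    jvm_path = sc._jvm.org.apache.hadoop.fs.Path(path)
    fs = jvm_path.getFileSystem(hadoop_conf)
    if fs.exists(jvm_path):
        fs.delete(jvm_path, True)  # True => recursive delete
```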


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org