Hi all, I'm not sure whether this belongs here in users or over in dev; I guess it's somewhere in between. We have started implementing some machine learning pipelines, and it seemed from the documentation that Spark had a fairly well-thought-out platform (see: http://spark.apache.org/docs/1.6.1/ml-guide.html )
I liked the design of Transformers, Models, Estimators, Pipelines, etc. However, as soon as we began writing our first ones, we ran into one class or method after another that has been marked private. Some examples:

- SchemaUtils (for validating schemas passed in and out, and for adding output columns to DataFrames)
- Loader / Saveable (traits / helpers for saving and loading models)
- Several classes under the 'collection' namespace, such as OpenHashSet / OpenHashMap
- All of the underlying Breeze linear algebra details
- Other classes specific to particular models; we are writing an alternative LDA optimizer / trainer, and everything under LDAUtils is private

I'd like to ask what the expected approach is here. I see a few options, none of which seems appropriate:

1. Implement everything in the org.apache.spark.* namespaces to match the package-private declarations (see the sketch at the end of this mail).
   - Will this even work in our own modules?
   - We would be open to contributing some of our code back, but we're not sure the project wants it.
2. Implement our own versions of all of these things.
   - Lots of extra work for us, and it invites unseen gotchas and other unforeseen issues in our implementations.
3. Copy the classes into our own namespace for use.
   - Duplicates code, and our copies will diverge as the main code is kept up to date.

Thanks in advance for any recommendations on this frustrating issue.

Thunder
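P.S. For concreteness, here is roughly what we mean by option 1: a minimal sketch that, as far as we can tell, compiles against spark-mllib 1.6.1 only because the class is declared inside the org.apache.spark namespace. MyFeatureTransformer and the "input" / "output" column names are made up for illustration; SchemaUtils is Spark's private[spark] helper from org.apache.spark.ml.util.

// Sketch of option 1: our own Transformer living in Spark's namespace
// so that private[spark] members such as SchemaUtils resolve.
// MyFeatureTransformer and the column names are hypothetical.
package org.apache.spark.ml.feature

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.{Identifiable, SchemaUtils}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DoubleType, StructType}

class MyFeatureTransformer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("myFeature"))

  override def transformSchema(schema: StructType): StructType = {
    // These calls only compile because we are under org.apache.spark:
    // SchemaUtils is private[spark].
    SchemaUtils.checkColumnType(schema, "input", DoubleType)
    SchemaUtils.appendColumn(schema, "output", DoubleType)
  }

  override def transform(dataset: DataFrame): DataFrame = {
    transformSchema(dataset.schema)
    // Placeholder logic: double the input column.
    dataset.withColumn("output", dataset("input") * 2)
  }

  override def copy(extra: ParamMap): MyFeatureTransformer = defaultCopy(extra)
}

This does build in our own module, but it plainly leans on internals that could change under us in any release, which is exactly what makes us uneasy about this approach.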