For what it’s worth, key design goals are the use of immutable objects and 
functional programming. Adding lazy evaluation allows for an optimizer 
underneath the DSL and has other benefits. I wouldn’t call Mahout file-bound, 
since files are really just import and export. In Hadoop MapReduce, files were 
used for every intermediate result, so Mahout _was_ file-bound. Now it is just 
file-centric, and that is only because someone like you hasn’t stepped up to 
add support for DBs.

drmFromHDFS is a package-level helper function, like the coming 
indexedDatasetDFSRead(src, schema).
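
For a sense of how that reads end to end, here is a minimal usage sketch 
assuming the Spark bindings; the exact helper name, imports and the 
mahoutSparkContext signature may differ slightly in your checkout:

  // sketch: load a DRM persisted on (H)DFS and use it in the DSL
  import org.apache.mahout.math.drm._
  import org.apache.mahout.math.drm.RLikeDrmOps._
  import org.apache.mahout.sparkbindings._

  implicit val ctx = mahoutSparkContext(masterUrl = "local[2]", appName = "drm-io")

  val drmA = drmFromHDFS("hdfs://namenode/path/to/drm")  // the package-level helper
  val drmAtA = drmA.t %*% drmA                           // ordinary DSL algebra from here on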

You don’t have to use them. There are reader and writer traits parameterized 
by what you want to read/write. These are meant to be extended with 
store-specific read and write functions, since they only store a schema (a 
HashMap[String, Any]) and a distributed context.

The extending class is a reader factory for the object read in. The extending 
writer is a trait or class adding write functionality to the object read by 
the reader. You extend the writer in your class or use an extending writer 
trait as a mixin to your class. Either way it adds a .dfsWrite or, in your 
case, an .hbaseWrite. I’ve done this with IndexedDatasets using Spark’s 
parallel read/write of text, and you may want to go the same route, only 
dealing with HBase instead. Alternatively you can create a reader for a DRM 
directly if you want.
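
To make the shape of that concrete, here is a rough, self-contained sketch of 
the pattern. The trait and member names are simplified stand-ins, not the 
exact signatures in org.apache.mahout.drivers:

  // Illustrative only: simplified stand-ins for the parameterized reader and
  // writer traits described above.
  import scala.collection.mutable

  object HBaseIOSketch {

    type Schema = mutable.HashMap[String, Any]

    trait Reader[T] {                              // parameterized by what you read
      val readSchema: Schema
      protected def reader(src: String): T         // store-specific read goes here
      def readFrom(src: String): T = reader(src)   // the extending class is a factory for T
    }

    trait Writer[T] {                              // parameterized by what you write
      val writeSchema: Schema
      protected def writer(dest: String, what: T): Unit
    }

    // an extending writer trait used as a mixin: this is what adds .hbaseWrite
    trait HBaseWriter[T] extends Writer[T] {
      def hbaseWrite(what: T, table: String): Unit = writer(table, what)
    }

    // a stand-in for IndexedDataset: external string keys plus row values
    type SimpleDataset = Seq[(String, Seq[Double])]

    // your class: a reader factory for the dataset, with .hbaseWrite mixed in
    class HBaseDatasetStore(val readSchema: Schema, val writeSchema: Schema)
        extends Reader[SimpleDataset] with HBaseWriter[SimpleDataset] {
      protected def reader(src: String): SimpleDataset = Seq.empty       // scan the HBase table here
      protected def writer(dest: String, what: SimpleDataset): Unit = () // put the rows to HBase here
    }
  }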

I’d be interested in supporting this if you go that route, and in providing 
any needed refactoring.

> 
> On Oct 9, 2014, at 10:56 PM, Reinis Vicups <[email protected]> wrote:
> 
> Guys, thank you very much for your feedback.
> 
> I already have my own vanilla Spark-based implementation of row similarity 
> that reads from and writes to NoSQL (in my case HBase).
> 
> My intention is to profit from your effort to abstract the algebraic layer 
> from the physical backend, because I find it a great idea.
> 
> There is no question that the effort to implement I/O with some NoSQL and 
> Spark is very low nowadays.
> 
> My question is aimed more at understanding your design.
> 
> In particular, why does org.apache.mahout.math.drm.DistributedEngine, for 
> instance, have a def drmFromHDFS()?
> 
> I do understand the arguments that "files are the most basic and common" and 
> "we had this already in mahout 0.6 so it's for compatibility purposes", but
> 
> why, for instance, is there no def createDRM() instead of drmFromHDFS(), with 
> some particular implementation of DistributedEngine (or a medium-specific 
> helper) then deciding how the DRM shall be created?
> 
> Admittedly, I do NOT understand your design fully just yet and I am asking 
> these questions not to criticize this design but to help me understand it.
> 
> Another example is the existence of org.apache.mahout.drivers.Schema. It 
> seems there is an effort to make the medium-specific format flexible and to 
> abstract it away, but again the limitation is that it is file-centric.
> 
> Thank you for your hints about drmWrap and IndexedDataset. With this in 
> mind, maybe my error is that I am trying to reuse the classes in 
> org.apache.mahout.drivers; maybe I should just write my own driver from 
> scratch, with a database in mind.
> 
> Thank you again for your hints and ideas
> reinis
> 
> 
On 10.10.2014 01:00, Pat Ferrel wrote:
> There are also the Mahout Reader and Writer traits and classes that 
> currently work with text-delimited file I/O. These were imagined as a general 
> framework to support parallelized read/write to any format and store, using 
> whatever method is expedient, including the ones Dmitriy mentions. I 
> personally would like to do MongoDB since I have an existing app using that.
> 
> These are built to support a sort of extended DRM (IndexedDataset) which 
> maintains external IDs. These IDs can be anything you can put in a string, 
> like Mongo or Cassandra keys, or they can be left as human-readable external 
> keys. From an IndexedDataset you can get a CheckpointedDRM and do anything in 
> the DSL with it.
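> 
> Given an IndexedDataset ids produced by one of the readers, that is roughly 
> (field names from memory, so check master for the exact API):
> 
>   val drm = ids.matrix              // the wrapped CheckpointedDrm
>   val sim = drm.t %*% drm           // full DSL algebra from here
>   // ids.rowIDs and ids.columnIDs map the external string keys to Int indices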
> 
> They are in the spark module, but the base traits have been moved to the 
> core “math-scala” module to make the concepts core, with implementations left 
> in the engine-specific modules. This is work about to be put in a PR, but you 
> can look at it in master to see if it helps; expect some refactoring shortly.
> 
> I’m sure there will be changes needed for DBs, but I haven’t gotten to that 
> yet, so I would love another set of eyes on the code.
> 
> On Oct 9, 2014, at 3:08 PM, Dmitriy Lyubimov <[email protected]> wrote:
> 
> Bottom line, some very smart people decided to do all that work in Spark 
> and give it to us for free. Not sure why, but they did. If the capability is 
> already found in Spark, there's no need for us to replicate it.
> 
> WRT NoSQL specifically, Spark can read HBase trivially. I have also done 
> some slightly more advanced things with a custom RDD implementation in Spark 
> that was able to stream coprocessor outputs into RDD functors. In either case 
> this is actually a fairly small effort. I never looked at it closely, but I 
> know there are Cassandra adapters for Spark as well. Chances are you could 
> probably load data from any thinkable distributed data store into Spark these 
> days via off-the-shelf implementations. If not, Spark actually makes it very 
> easy to come up with one on your own.
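> 
> For reference, the off-the-shelf route looks roughly like this, using the 
> standard Spark and HBase client classes (the table name is just an example):
> 
>   import org.apache.hadoop.hbase.HBaseConfiguration
>   import org.apache.hadoop.hbase.client.Result
>   import org.apache.hadoop.hbase.io.ImmutableBytesWritable
>   import org.apache.hadoop.hbase.mapreduce.TableInputFormat
> 
>   val conf = HBaseConfiguration.create()
>   conf.set(TableInputFormat.INPUT_TABLE, "interactions")  // example table name
> 
>   // sc is your SparkContext; each record is a (row key, Result) pair
>   val hbaseRdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
>     classOf[ImmutableBytesWritable], classOf[Result])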
> 
> On Thu, Oct 9, 2014 at 2:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
> 
>> Matrix defines structure, not necessarily where it can be imported from. 
>> You're right in the sense that the framework itself avoids defining APIs for 
>> custom partition formation. But you're wrong in implying you cannot do it if 
>> you wanted, or that you would have to do anything as complex as you say. As 
>> long as you can form your own RDD of keys and row vectors, you can always 
>> wrap it into a matrix (the drmWrap api). HDFS DRM persistence, on the other 
>> hand, has been around for as long as I remember, not just in 1.0. So 
>> naturally those helpers are provided to be interoperable with Mahout 0.9 and 
>> before, e.g. to be able to load output from stuff like seq2sparse and such.
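>> 
>> A sketch of that wrapping step (hbaseRdd being a (key, Result) RDD like the 
>> one above, decodeKey and decodeRow standing in for your own HBase decoding, 
>> and drmWrap's optional parameters may differ slightly):
>> 
>>   import org.apache.mahout.math.{DenseVector, Vector}
>>   import org.apache.mahout.math.drm.RLikeDrmOps._
>>   import org.apache.mahout.sparkbindings._
>> 
>>   // build a DrmRdd[Int], i.e. an RDD of (key, row vector) pairs, from any source
>>   val rows: DrmRdd[Int] = hbaseRdd.map { case (k, result) =>
>>     decodeKey(k) -> (new DenseVector(decodeRow(result)): Vector)
>>   }
>> 
>>   val drm = drmWrap(rows)      // now a CheckpointedDrm[Int]
>>   val ata = drm.t %*% drm      // and the full DSL applies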
>> 
>> Note that if you instruct your backend to use some sort of data locality 
>> information, it will also be able to capitalize on that automatically.
>> 
>> There are actually far more concerns about interacting with native engine 
>> capabilities than just reading the data. For example, what if we wanted to 
>> wrap the output of a Shark query into a matrix? Instead of addressing all of 
>> those individually, we just chose to delegate them to the actual 
>> capabilities of the backend. Chances are they already have (and, in fact, do 
>> in the case of Spark) all that tooling, far better than we will ever have on 
>> our own.
>> 
>> Sent from my phone.
>> On Oct 9, 2014 12:56 PM, "Reinis Vicups" <[email protected]> wrote:
>> 
>>> Hello,
>>> 
>>> I am currently looking into the new (DRM) mahout framework.
>>> 
>>> I find myself wondering why it is that, on the one hand, a lot of thought, 
>>> effort and design complexity is being invested into abstracting engines, 
>>> contexts and algebraic operations,
>>> 
>>> but on the other hand even the abstract interfaces are defined in a way 
>>> that everything has to be read or written from files (on HDFS).
>>> 
>>> I am considering implementing reading/writing to a NoSQL database, and 
>>> initially I assumed it would be enough just to implement my own 
>>> ReaderWriter, but I am currently realizing that I will have to 
>>> re-implement, or hack around by deriving my own versions of, large(?) 
>>> portions of the framework, including my own variants of CheckpointedDrm, 
>>> DistributedEngine and what not.
>>> 
>>> Is it because abstracting away the storage type would introduce even more 
>>> complexity, or because there are aspects of the design that absolutely 
>>> require reading/writing only to (seq)files?
>>> 
>>> kind regards
>>> reinis
>>> 
>>> 

