For what it’s worth, the key design goals are the use of immutable objects and functional programming. Adding lazy evaluation allows for an optimizer underneath the DSL and has other benefits. I wouldn’t call Mahout file-bound, since files are really just for import and export. In Hadoop MapReduce, files were used for every intermediate result, so Mahout _was_ file-bound. Now it is merely file-centric, and only because someone like you hasn’t stepped up to add support for DBs.
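To make the lazy-evaluation point concrete, here is a tiny, untested sketch against the Spark bindings (the matrix and app name are just placeholders). Nothing runs while you compose the expression; the optimizer sees the whole logical plan at checkpoint time and can, for example, collapse A.t %*% A into a single physical pass.

    import org.apache.mahout.math.scalabindings._
    import org.apache.mahout.math.scalabindings.RLikeOps._
    import org.apache.mahout.math.drm._
    import org.apache.mahout.math.drm.RLikeDrmOps._
    import org.apache.mahout.sparkbindings._

    implicit val ctx = mahoutSparkContext(masterUrl = "local", appName = "lazy-dsl-sketch")

    // A small in-core matrix, parallelized into a distributed row matrix.
    val drmA = drmParallelize(dense((1, 2, 3), (3, 4, 5)))

    // No computation happens here -- the DSL only records a logical operator tree.
    val drmAtA = drmA.t %*% drmA

    // checkpoint() hands the whole tree to the optimizer, which can rewrite
    // A.t %*% A into a single self-join pass before any Spark job is launched.
    val result = drmAtA.checkpoint()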
drmFromHDFS is a package-level helper function, like the coming indexedDatasetDFSRead(src, schema). You don’t have to use them. There are reader and writer traits parameterized by what you want to read/write. These are meant to be extended with store-specific read/write functions, since they only hold a schema (a HashMap[String, Any]) and a device context. The extending class is a reader factory for the object being read in. The extending writer is a trait or class that adds write functionality to the object read by the reader. You extend the writer in your class or use an extending writer trait as a mixin to your class; either way it adds a .dfsWrite or, in your case, an .hbaseWrite (there’s a rough sketch of this pattern below the quoted thread). I’ve done this with IndexedDatasets using Spark’s parallel read/write of text, and you may want to go that route, just dealing with HBase instead. Alternatively, you can create a reader for a DRM directly if you want. I’d be interested in supporting this if you go that route, and in providing any needed refactoring.

> On Oct 9, 2014, at 10:56 PM, Reinis Vicups <[email protected]> wrote:
>
> Guys, thank you very much for your feedback.
>
> I already have my own vanilla Spark-based implementation of row similarity
> that reads from and writes to NoSQL (in my case HBase).
>
> My intention is to profit from your effort to abstract the algebraic layer
> from the physical backend, because I find it a great idea.
>
> There is no question that the effort to implement I/O with some NoSQL store
> and Spark is very low nowadays.
>
> My question is more about understanding your design.
>
> In particular, why does org.apache.mahout.math.drm.DistributedEngine
> have def drmFromHDFS()?
>
> I do understand the arguments that "files are the most basic and common" and
> "we had this already in Mahout 0.6, so it's there for compatibility", but
>
> why, for instance, instead of drmFromHDFS() is there no def createDRM(), with
> some particular implementation of DistributedEngine (or a medium-specific
> helper) then deciding how the DRM shall be created?
>
> Admittedly, I do NOT fully understand your design just yet, and I am asking
> these questions not to criticize the design but to help me understand it.
>
> Another example is the existence of org.apache.mahout.drivers.Schema. It
> seems there is an effort to make the medium-specific format flexible and to
> abstract it away, but again the limitation is that it is file-centric.
>
> Thank you for your hints about drmWrap and IndexedDataset. With this in mind,
> maybe my error is that I am trying to reuse the classes in
> org.apache.mahout.drivers; maybe I should just write my own driver from
> scratch, with a database in mind.
>
> Thank you again for your hints and ideas
> reinis
>
> On 10.10.2014 01:00, Pat Ferrel wrote:
> There are also the Mahout Reader and Writer traits and classes that currently
> work with text-delimited file I/O. These were imagined as a general framework
> to support parallelized read/write to any format and store, using whatever
> method is expedient, including the ones Dmitriy mentions. I personally would
> like to do MongoDB, since I have an existing app using that.
>
> These are built to support a sort of extended DRM (IndexedDataset) which
> maintains external IDs. These IDs can be anything you can put in a string,
> like Mongo or Cassandra keys, or they can be left as human-readable external
> keys. From an IndexedDataset you can get a CheckpointedDRM and do anything in
> the DSL with it.
>
> They are in the spark module, but the base traits have been moved to the core
> "math-scala" module to make the concepts core, with implementations left in
> the engine-specific modules. This work is about to be put in a PR, but you
> can look at it in master to see if it helps -- expect some refactoring
> shortly.
>
> I'm sure there will be changes needed for DBs, but I haven't gotten to that,
> so I would love another set of eyes on the code.
>
> On Oct 9, 2014, at 3:08 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> Bottom line, some very smart people decided to do all that work in Spark and
> give it to us for free. Not sure why, but they did. If the capability is
> already found in Spark, there's no need for us to replicate it.
>
> WRT NoSQL specifically, Spark can read HBase trivially. I also did somewhat
> more advanced things with a custom RDD implementation in Spark that was able
> to stream coprocessor outputs into RDD functors. In either case this is
> actually a fairly small effort. I never looked at it closely, but I know
> there are also Cassandra adapters for Spark. Chances are, you could probably
> load data from any thinkable distributed data store into Spark these days via
> off-the-shelf implementations. If not, Spark actually makes it very easy to
> come up with one on your own.
>
> On Thu, Oct 9, 2014 at 2:47 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
>> A matrix defines structure, not necessarily where it can be imported from.
>> You're right in the sense that the framework itself avoids defining APIs for
>> custom partition formation. But you're wrong in implying that you cannot do
>> it if you wanted to, or that you'd have to do anything as complex as you
>> say. As long as you can form your own RDD of keys and row vectors, you can
>> always wrap it into a matrix (the drmWrap API). HDFS DRM persistence, on the
>> other hand, has been around for as long as I can remember, not just in 1.0.
>> So naturally those methods are provided to be interoperable with Mahout 0.9
>> and before, e.g. to be able to load the output of stuff like seq2sparse.
>>
>> Note that if you instruct your backend to use some sort of data locality
>> information, it will also be able to capitalize on that automatically.
>>
>> There is actually a far greater number of concerns in interacting with
>> native engine capabilities than just reading the data. For example, what if
>> we wanted to wrap the output of a Shark query into a matrix? Instead of
>> addressing all of those individually, we just chose to delegate them to the
>> actual capabilities of the backend. Chances are they already have (and, in
>> the case of Spark, in fact do have) all that tooling, far better than we
>> will ever have on our own.
>>
>> Sent from my phone.
>> On Oct 9, 2014 12:56 PM, "Reinis Vicups" <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I am currently looking into the new (DRM) Mahout framework.
>>>
>>> I find myself wondering why it is that, on the one hand, a lot of thought,
>>> effort and design complexity is being invested into abstracting engines,
>>> contexts and algebraic operations,
>>>
>>> but on the other hand even the abstract interfaces are defined in a way
>>> that requires everything to be read from or written to files (on HDFS).
>>>
>>> I am considering implementing reading/writing to a NoSQL database, and
>>> initially I assumed it would be enough just to implement my own
>>> ReaderWriter, but I am currently realizing that I will have to
>>> re-implement, or hack around by deriving my own versions of, large(?)
>>> portions of the framework, including my own variants of CheckpointedDrm,
>>> DistributedEngine and what not.
>>>
>>> Is it because abstracting away the storage type would introduce even more
>>> complexity, or because there are aspects of the design that absolutely
>>> require reading from and writing to (seq)files only?
>>>
>>> kind regards
>>> reinis
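Here is the rough sketch of the reader/writer pattern I mentioned above. The trait and class shapes below only mimic the idea (a reader/writer that holds a schema and a context, with the store-specific work supplied by the extender); the names -- Reader, HBaseWriter, MyIndexedDataset, HBaseIndexedDatasetReaderWriter, hbaseWrite -- are made up for illustration and are not the actual signatures, so check the traits in the drivers/math-scala code for the real shape.

    import scala.collection.mutable

    object HBaseIoSketch {

      type Schema = mutable.HashMap[String, Any]

      // The extending class is effectively a factory for the thing being read.
      trait Reader[T] {
        def mc: AnyRef                       // engine/device context, e.g. a Spark context
        def readSchema: Schema
        def readFrom(source: String): T      // store-specific read goes here
      }

      // A writer mixin adds a store-specific write method, analogous to .dfsWrite.
      trait HBaseWriter[T] {
        def writeSchema: Schema
        def hbaseWrite(what: T, table: String): Unit
      }

      // Stand-in for an IndexedDataset: external string IDs mapped to sparse rows.
      case class MyIndexedDataset(rows: Seq[(String, Map[String, Double])])

      // Hypothetical HBase-backed extender: a reader factory plus the write mixin.
      class HBaseIndexedDatasetReaderWriter(
          val mc: AnyRef,
          val readSchema: Schema,
          val writeSchema: Schema)
        extends Reader[MyIndexedDataset] with HBaseWriter[MyIndexedDataset] {

        def readFrom(table: String): MyIndexedDataset = {
          // scan the HBase table into (rowKey -> sparse row) pairs here
          MyIndexedDataset(Seq.empty)
        }

        def hbaseWrite(ds: MyIndexedDataset, table: String): Unit = {
          // put each row of ds back into an HBase table here
        }
      }
    }

The point is that only readFrom and hbaseWrite know anything about HBase; the schema map and the rest of the plumbing stay generic.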

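And for Dmitriy's drmWrap point: below is an untested, from-memory sketch of going straight from an HBase table to a DRM using Spark's standard TableInputFormat. The table name, the column cardinality, and the assumption that row keys are 4-byte ints are all made up, and the cell-to-vector mapping is left as a comment; check drmWrap's exact signature in the sparkbindings package object before copying this.

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
    import org.apache.mahout.sparkbindings._
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("hbase-drm-sketch"))

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "interactions")   // hypothetical table name

    // Standard Spark HBase input: an RDD of (rowKey, Result) pairs.
    val hbaseRdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    val ncol = 10000                                          // assumed column cardinality
    val drmRdd = hbaseRdd.map { case (key, result) =>
      val v: Vector = new RandomAccessSparseVector(ncol)
      // fill v from the cells of `result` here, e.g. column qualifier -> column index
      (Bytes.toInt(key.get()), v)                             // assumes 4-byte int row keys
    }

    // Wrap the (key, row vector) RDD into a DRM; from here the algebra DSL applies.
    val drm = drmWrap(drmRdd, ncol = ncol)

From that DRM you can run the algebra as usual and write results back through whatever writer you build.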