Bottom line: some very smart people decided to do all that work in Spark
and give it to us for free. Not sure why, but they did. If the capability is
already found in Spark, there's no need for us to replicate it.

WRT NoSQL specifically, Spark can read HBase trivially. I also did a few
more advanced things with a custom RDD implementation in Spark that was
able to stream coprocessor outputs into RDD functors. In either case this
is actually a fairly small effort. I never looked at it closely, but I know
there are Cassandra adapters for Spark as well. Chances are, you could
load data from any thinkable distributed data store into Spark these days
via off-the-shelf implementations. If not, Spark makes it very easy to come
up with one on your own.
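
Just to illustrate the HBase case, here is a rough sketch of what that
"trivial" read looks like using the stock TableInputFormat; the table name
and the existing SparkContext `sc` are placeholders/assumptions, not
anything Mahout-specific:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Result
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat

    // assumes an existing SparkContext `sc` and an HBase table "my-table"
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my-table")

    // each element is a (row key, Result) pair; from there you map it
    // into whatever row-vector representation you need
    val rows = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])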

On Thu, Oct 9, 2014 at 2:47 PM, Dmitriy Lyubimov <[email protected]> wrote:

> A matrix defines structure, not necessarily where it can be imported from.
> You're right in the sense that the framework itself avoids defining APIs
> for custom partition formation. But you're wrong in implying you cannot do
> it if you wanted, or that you'd have to do anything as complex as you say.
> As long as you can form your own RDD of keys and row vectors, you can
> always wrap it into a matrix (the drmWrap API). HDFS DRM persistence, on
> the other hand, has been around for as long as I remember, not just in
> 1.0. So naturally those are provided to be interoperable with Mahout 0.9
> and before, e.g. to be able to load output from stuff like seq2sparse and
> such.
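>
> For illustration only, a minimal sketch of that wrapping via the Spark
> bindings; the matrix contents here are made up, and the exact drmWrap
> signature may differ slightly in your version:
>
>     import org.apache.mahout.math.{RandomAccessSparseVector, Vector}
>     import org.apache.mahout.sparkbindings._
>     import org.apache.spark.rdd.RDD
>
>     // assumes an existing Spark context `sc` set up for Mahout (vector
>     // kryo serialization registered); the RDD of (key, row vector)
>     // tuples could just as well come from an HBase scan or a
>     // coprocessor stream
>     val rowRdd: RDD[(Int, Vector)] = sc.parallelize(0 until 100).map { i =>
>       val v = new RandomAccessSparseVector(50)
>       v.setQuick(i % 50, 1.0)
>       i -> (v: Vector)
>     }
>
>     // wrap the rdd into a checkpointed DRM; from here it participates
>     // in the algebraic DSL like any other distributed matrix
>     val drmA = drmWrap(rowRdd)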
>
> Note that if you instruct your backend to use some sort of data locality
> information, it will also be able to capitalize on that automatically.
>
> There is actually a far greater number of concerns in interacting with
> native engine capabilities than just reading the data. For example, what
> if we wanted to wrap the output of a Shark query into a matrix? Instead of
> addressing all of those individually, we just chose to delegate them to
> the actual capabilities of the backend. Chances are they already have
> (and, in fact, do in the case of Spark) all that tooling far better than
> we will ever have on our own.
>
> Sent from my phone.
> On Oct 9, 2014 12:56 PM, "Reinis Vicups" <[email protected]> wrote:
>
>> Hello,
>>
>> I am currently looking into the new (DRM) Mahout framework.
>>
>> I find myself wondering why, on the one hand, a lot of thought, effort
>> and design complexity is being invested into abstracting engines,
>> contexts and algebraic operations,
>>
>> but, on the other hand, even the abstract interfaces are defined in a
>> way that everything has to be read from or written to files (on HDFS).
>>
>> I am considering implementing reading/writing to a NoSQL database, and
>> initially I assumed it would be enough just to implement my own
>> ReaderWriter, but I am currently realizing that I will have to
>> re-implement, or hack around by deriving my own versions of, large(?)
>> portions of the framework, including my own variant of CheckpointedDrm,
>> DistributedEngine and what not.
>>
>> Is it because abstracting away the storage type would introduce even
>> more complexity, or because there are aspects of the design that
>> absolutely require reading/writing only to (seq)files?
>>
>> kind regards
>> reinis
>>
>>