I'll create a JIRA ticket for this, as I have a little time to work on it.

On 04/16/2014 08:15 PM, Pat Ferrel wrote:
Bug in the pseudo code; it should use columnIds:

    val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1),
        hashedDrms(0).columnIds(), hashedDrms(1).columnIds())
    RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix,
        "hdfs://some/path/for/output")

On Apr 16, 2014, at 10:00 AM, Pat Ferrel <[email protected]> wrote:

Great, and an excellent example is at hand. In it I will play the user and
contributor role, and Sebastian and Dmitriy the committer/scientist role.

I have a web site that uses a Mahout+Solr recommender (the video recommender
demo site). This creates logfiles of the form:

    timestamp, userId, itemId, action
    timestamp1, userIdString1, itemIdString1, "view"
    timestamp2, userIdString2, itemIdString1, "like"

These are currently processed using the Solr-recommender example code and
Hadoop Mahout. The input is split and accumulated into two matrices, which could
then be input to the new Spark cooccurrence analysis code (see the patch here:
https://issues.apache.org/jira/browse/MAHOUT-1464):

    val indicatorMatrices = cooccurrences(drmB, randomSeed = 0xdeadbeef,
        maxInterestingItemsPerThing = 100, maxNumInteractions = 500,
        Array(drmA))

What I propose to do is replace my Hadoop Mahout impl by creating a new Scala (or
maybe Java) class; call it HashedSparseMatrix for now. There will be a CLI-accessible
job that takes the above logfile input and creates a HashedSparseMatrix. Inside the
HashedSparseMatrix will be a drm SparseMatrix and two hashed dictionaries for row and
column external Id <-> Mahout Id lookup.
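
To make that concrete, here is a minimal sketch of what such a wrapper class might
look like. Everything in it is illustrative (the field names, the generic matrix
type), not an existing Mahout API:

    // Illustrative sketch only; not an existing Mahout class.
    // The wrapper pairs a drm/SparseMatrix with two dictionaries that map
    // external string Ids to internal ordinal integer Ids (and back).
    class HashedSparseMatrix[M](
        val matrix: M,                       // the wrapped drm SparseMatrix
        val rowIds: Map[String, Int],        // external row Id -> internal ordinal
        val columnIds: Map[String, Int]) {   // external column Id -> internal ordinal

      // Reverse dictionaries, used when writing external Ids back out.
      lazy val rowIdsInverse: Map[Int, String] = rowIds.map(_.swap)
      lazy val columnIdsInverse: Map[Int, String] = columnIds.map(_.swap)

      // The dictionaries also carry the dimension lengths.
      def numRows: Int = rowIds.size
      def numCols: Int = columnIds.size
    }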

The 'cooccurrences' call would be identical, and the data it deals with would
also be identical. But the HashedSparseMatrix would be able to deliver two
dictionaries, which store the dimension lengths and are used to look up string
Ids from internal Mahout ordinal integer Ids. These could be created with a
helper function that reads from the logfiles:

    val hashedDrms = readHashedSparseMatrices("hdfs://path/to/input/logfiles",
        "^actions-.*", "\t", 1, 2, "like", "view")

Here hashedDrms(0) is a HashedSparseMatrix corresponding to drmA and hashedDrms(1) to drmB.
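
For illustration only, a minimal sketch of how such a reader might build the
dictionaries while parsing the logfiles in parallel. It assumes a SparkContext
named sc in scope, tab-delimited lines in the column order shown above, and a
simple glob in place of the filename regex; readHashedSparseMatrices itself and
all the names here are hypothetical:

    // Sketch only: build an external-Id -> internal ordinal dictionary from one column.
    import org.apache.spark.rdd.RDD

    def buildDictionary(records: RDD[Array[String]], column: Int): Map[String, Int] =
      records.map(_(column)).distinct().collect().zipWithIndex.toMap

    // Parse tab-delimited "timestamp userId itemId action" lines in parallel
    // and split them by action type.
    val records = sc.textFile("hdfs://path/to/input/logfiles/actions-*").map(_.split("\t"))
    val likes = records.filter(_(3) == "like")
    val views = records.filter(_(3) == "view")

    // For cross-cooccurrence the two matrices presumably need a shared row (user)
    // space, so the user dictionary is built from all records; each matrix gets
    // its own column (item) dictionary.
    val userIds     = buildDictionary(records, 1)
    val likeItemIds = buildDictionary(likes, 2)
    val viewItemIds = buildDictionary(views, 2)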

When the output is written to a text file, a new HashedSparseMatrix will be
created from the cooccurrence indicator matrix and the original itemId
dictionaries:

    val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1),
        hashedDrms(0).rowIds(), hashedDrms(1).rowIds())
    RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix,
        "hdfs://some/path/for/output")

Here the two Id dictionaries are used to create output file(s) with external 
Ids.
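
As a rough sketch of that last step (again with made-up names, not the actual
saveIndicatorMatrix implementation), the inverse dictionaries could be applied
while the entries are written out as text:

    // Sketch only: write (row, column, strength) entries as text with external Ids.
    // entries is assumed to be an RDD of (rowOrdinal, colOrdinal, strength) triples
    // extracted from the indicator matrix; rowDict/colDict are the inverse
    // (ordinal -> external Id) dictionaries carried by the HashedSparseMatrix.
    import org.apache.spark.rdd.RDD

    def saveWithExternalIds(entries: RDD[(Int, Int, Double)],
                            rowDict: Map[Int, String],
                            colDict: Map[Int, String],
                            path: String): Unit =
      entries
        .map { case (r, c, v) => s"${rowDict(r)}\t${colDict(c)}\t$v" }
        .saveAsTextFile(path)   // plain text, readable from any language

For large dictionaries a broadcast variable would be the obvious refinement, but
that detail doesn't change the shape of the output.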

Since I already have to do this for the demo site using Hadoop Mahout, I'll have
to create a Spark impl of the wrapper for the new cross-cooccurrence indicator
matrix. And since my scripting/web app language is not Scala, the format for the
output needs to be text.

I think this meets all the issues raised here. No unnecessary import/export.
Dmitriy doesn't need to write a CLI. Sebastian doesn't need to write a
HashedSparseMatrix. The internal calculations are done on RDDs and the drms are
never written to disk. AND the logfiles can be consumed directly, producing data
that any language can consume directly, with external Ids used and preserved.


BTW: in the MAHOUT-1464 example the drms are read in serially, single-threaded,
but written out using Spark (unless I missed something). In the proposed impl
both the read and the write would be Sparkified.

BTW2: Since this is a CLI interface to Spark Mahout, it can be scheduled using
cron directly, with no additional processing pipeline, by people unfamiliar
with Scala, the Spark shell, or internal Mahout Ids, just as is done now on the
demo site but with a lot of non-Mahout code.

BTW3: This type of thing IMO must be done for any Mahout job we want to be
widely used. Otherwise we leave all of this wrapper code to be duplicated over
and over again by users and expect them to know too much about Spark Mahout
internals.



On Apr 15, 2014, at 6:45 PM, Ted Dunning <[email protected]> wrote:

Well... I think it is an issue that has to do with figuring out how to
*avoid* import and export as much as possible.


On Tue, Apr 15, 2014 at 6:36 PM, Pat Ferrel <[email protected]> wrote:

Which is why it’s an import/export issue.

On Apr 15, 2014, at 5:48 PM, Ted Dunning <[email protected]> wrote:

On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <[email protected]>
wrote:

The statement "There is not, nor do i think there will be a way to
run this stuff with CLI" seems unduly misleading. Really, does anyone
second this?

There will be Scala scripts to drive this stuff, and yes, even from the CLI.
Do you imagine that every Mahout USER will be a Scala + Mahout DSL programmer?
That may be fine for committers, but users will be PHP devs, Ruby devs, Python
or Java devs, maybe even a few C# devs. I think you are confusing Mahout DEVS
with USERS. Few users are R devs moving into production work; they are
production engineers moving into ML who want a blackbox. They will need a
language-agnostic way to drive Mahout. Making statements like this only
confuses potential users and drives them away to no purpose. I'm happy for the
nascent Mahout-Scala shell, but it's not in the typical user's world view.


Yes, ultimately there may need to be command line programs of various
sorts, but the fact is, we need to make sure that we avoid files as the API
for moving large amounts of data. That means that we have to have some way
of controlling the persistence of in-memory objects, and in many cases that
means that processing chains will not typically be integrated at the level
of command line programs.

Dmitriy's comment about R is apropos.  You can put scripts together for
various end-user purposes but you don't have a CLI for every R command.
Nor for every Perl, Python, or PHP command either.

To the extent we have in-memory persistence across the lifetime of
multiple driver programs, then a sort of CLI interface will be possible.  I
know that h2o will do that, but I am not entirely clear on the lifetime of
RDDs in Spark relative to Mahout DSL programs.  Regardless of possibility,
I don't expect a CLI interface to be the primary integration path for these
new capabilities.




