Bug in the pseudocode, it should use columnIds:
val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1),
  hashedDrms(0).columnIds(), hashedDrms(1).columnIds())
RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix,
  "hdfs://some/path/for/output")
On Apr 16, 2014, at 10:00 AM, Pat Ferrel <[email protected]> wrote:
Great, and an excellent example is at hand. In it I will play the user and
contributor role, Sebastian and Dmitriy the committer/scientist role.
I have a web site that uses a Mahout+Solr recommender (the video recommender
demo site). This creates logfiles of the form
timestamp, userId, itemId, action
timestamp1, userIdString1, itemIdString1, "view"
timestamp2, userIdString2, itemIdString1, "like"
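As a sketch only (assuming the tab delimiter that the helper call below passes as
"\t", and with names of my own choosing), parsing one such line might look like:

case class Interaction(timestamp: String, userId: String, itemId: String, action: String)

// Split one logfile line into its four fields; malformed lines are dropped.
def parseLogLine(line: String, delimiter: String = "\t"): Option[Interaction] =
  line.split(delimiter) match {
    case Array(ts, user, item, action) => Some(Interaction(ts, user, item, action))
    case _ => None
  }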
These are currently processed using the Solr-recommender example code and
Hadoop Mahout. The input is split and accumulated into two matrices, which could
then be input to the new Spark cooccurrence analysis code (see the patch here:
https://issues.apache.org/jira/browse/MAHOUT-1464):
val indicatorMatrices = cooccurrences(drmB, randomSeed = 0xdeadbeef,
  maxInterestingItemsPerThing = 100, maxNumInteractions = 500, Array(drmA))
What I propose to do is replace my Hadoop Mahout impl by creating a new Scala (or
maybe Java) class, call it HashedSparseMatrix for now. There will be a CLI-accessible
job that takes the above logfile input and creates a HashedSparseMatrix. Inside the
HashedSparseMatrix will be a drm SparseMatrix and two hashed dictionaries for row and
column external Id <-> Mahout Id lookup.
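A minimal sketch of what such a wrapper might look like (the class does not exist
yet; the field names and the generic matrix type M below are my assumptions, not
Mahout API):

// Hypothetical sketch only: wraps a matrix of type M together with the two
// dictionaries that map Mahout ordinal Ids back to external string Ids.
case class HashedSparseMatrix[M](
    matrix: M,
    rowIds: Map[Int, String],      // Mahout row ordinal -> external row Id
    columnIds: Map[Int, String]) { // Mahout column ordinal -> external column Id

  // Reverse lookups: external string Id -> internal Mahout ordinal.
  lazy val rowOrdinals: Map[String, Int] = rowIds.map(_.swap)
  lazy val columnOrdinals: Map[String, Int] = columnIds.map(_.swap)
}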
The 'cooccurrences' call would be identical and the data it deals with would
also be identical. But the HashedSparseMatrix would be able to deliver two
dictionaries, which store the dimensions' lengths and are used to look up string
Ids from internal Mahout ordinal integer Ids. These could be created with a
helper function that reads from logfiles:
val hashedDrms = readHashedSparseMatrices("hdfs://path/to/input/logfiles",
  "^actions-.*", "\t", 1, 2, "like", "view")
Here hashedDrms(0) is a HashedSparseMatrix corresponding to drmA, and
hashedDrms(1) corresponds to drmB.
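For illustration only, the dictionary for one Id space could be built roughly as
below (assuming external Ids simply get consecutive ordinals in the order they are
first seen; the helper name is made up):

import scala.collection.mutable

// Assign a consecutive Mahout ordinal to each external string Id the first time
// it appears, then invert the mapping so ordinals can be translated back.
def buildDictionary(externalIds: Iterator[String]): Map[Int, String] = {
  val ordinals = mutable.LinkedHashMap.empty[String, Int]
  externalIds.foreach(id => ordinals.getOrElseUpdate(id, ordinals.size))
  ordinals.map { case (extId, ordinal) => ordinal -> extId }.toMap
}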
When the output is written to a text file, a new HashedSparseMatrix will be
created from the cooccurrence indicator matrix and the original itemId
dictionaries:
val hashedCrossIndicatorMatrix = new HashedSparseMatrix(indicatorMatrices(1),
  hashedDrms(0).rowIds(), hashedDrms(1).rowIds())
RecommendationExamplesHelper.saveIndicatorMatrix(hashedCrossIndicatorMatrix,
  "hdfs://some/path/for/output")
Here the two Id dictionaries are used to create output file(s) with external
Ids.
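Roughly, that translation step might look like the sketch below; the
(rowOrdinal, (columnOrdinal, strength)) row representation and the output layout
are assumptions of mine, not what saveIndicatorMatrix actually does:

// Translate internal ordinals back to external string Ids and format one text
// line per item: itemId <tab> space-separated "otherItemId:strength" pairs.
def formatIndicatorRows(
    rows: Iterator[(Int, Seq[(Int, Double)])],
    rowIds: Map[Int, String],
    columnIds: Map[Int, String]): Iterator[String] =
  rows.map { case (rowOrdinal, cols) =>
    val itemId = rowIds(rowOrdinal)
    val indicators = cols.map { case (colOrdinal, strength) =>
      s"${columnIds(colOrdinal)}:$strength"
    }
    s"$itemId\t${indicators.mkString(" ")}"
  }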
Since I already have to do this for the demo site using Hadoop Mahout, I'll have
to create a Spark impl of the wrapper for the new cross-cooccurrence indicator
matrix. And since my scripting/web app language is not Scala, the format for the
output needs to be text.
I think this addresses all issues raised here. No unnecessary import/export.
Dmitriy doesn't need to write a CLI. Sebastian doesn't need to write a
HashedSparseMatrix. The internal calculations are done on RDDs and the drms are
never written to disk. AND the logfiles can be consumed directly, producing data
that any language can read, with external Ids used and preserved.
BTW: in the MAHOUT-1464 example the drms are read in serially, single-threaded,
but written out using Spark (unless I missed something). In the proposed impl
the read and write would both be Sparkified.
BTW2: Since this is a CLI interface to Spark Mahout, it can be scheduled with
cron directly, with no additional processing pipeline, by people unfamiliar with
Scala, the Spark shell, or internal Mahout Ids. Just as is done now on the demo
site, but with a lot of non-Mahout code.
BTW3: This type of thing IMO must be done for any Mahout job we want to be
widely used. Otherwise we leave all of this wrapper code to be duplicated over
and over again by users and expect them to know too much about Spark Mahout
internals.
On Apr 15, 2014, at 6:45 PM, Ted Dunning <[email protected]> wrote:
Well... I think it is an issue that has to do with figuring out how to
*avoid* import and export as much as possible.
On Tue, Apr 15, 2014 at 6:36 PM, Pat Ferrel <[email protected]> wrote:
Which is why it’s an import/export issue.
On Apr 15, 2014, at 5:48 PM, Ted Dunning <[email protected]> wrote:
On Tue, Apr 15, 2014 at 10:58 AM, Pat Ferrel <[email protected]>
wrote:
As to the statement "There is not, nor do I think there will be a way to
run this stuff with CLI": that seems unduly misleading. Really, does anyone
second this?
There will be Scala scripts to drive this stuff, and yes, even from the
CLI.
Do you imagine that every Mahout USER will be a Scala + Mahout DSL
programmer? That may be fine for committers, but users will be PHP devs, Ruby
devs, Python or Java devs, maybe even a few C# devs. I think you are
confusing Mahout DEVS with USERS. Few users are R devs moving into
production work; they are production engineers moving into ML who want a
black box. They will need a language-agnostic way to drive Mahout. Making
statements like this only confuses potential users and drives them away to
no purpose. I'm happy about the nascent Mahout-Scala shell, but it's not in
the typical user's world view.
Yes, ultimately there may need to be command line programs of various
sorts, but the fact is, we need to make sure that we avoid files as the API
for moving large amounts of data. That means that we have to have some way
of controlling the persistence of in-memory objects and in many cases, that
means that processing chains will not typically be integrated at the level
of command line programs.
Dmitriy's comment about R is apropos. You can put scripts together for
various end-user purposes, but you don't have a CLI for every R command.
Nor for every Perl, Python, or PHP command either.
To the extent we have in-memory persistence across the lifetime of
multiple driver programs, a sort of CLI interface will be possible. I
know that h2o will do that, but I am not entirely clear on the lifetime of
RDDs in Spark relative to Mahout DSL programs. Regardless of possibility,
I don't expect a CLI interface to be the primary integration path for these
new capabilities.