Travis, the 0.10.x branch is for Spark 1.2.x and master (0.11.0-snapshot) is for Spark 1.3.x. My understanding is that 0.11.0 should mostly work, with the exception of the Spark shell, which is disabled on HEAD. We are still working on PR https://github.com/apache/mahout/pull/146 to re-enable it.
numNonZeroElementsPerRow is in RLikeDrmOps. Operations is a Scala pattern (not sure of its name -- operation decorator or something?)

On Thu, Jul 9, 2015 at 7:25 AM, Hegner, Travis <[email protected]> wrote:

> Hello list,
>
> I am having some trouble getting a SimilarityAnalysis.rowSimilarityIDS()
> job to run. First some info on my environment:
>
> I'm running hadoop with cloudera 5.4.2 with their built in spark on yarn
> setup. It's pretty much an OOTB setup, but it has been upgraded many times
> since probably CDH4.8 or so. It's running spark 1.3.0 (perhaps some 1.3.1
> commits merged in from what I've read about cloudera's versioning). I have
> my own fork of mahout which is currently just a mirror of
> 'github.com:pferrel/spark-1.3'.
> I'm very comfortable making changes, compiling, and using my version of the
> library should your suggestions lead me in that direction. I am still
> pretty new to scala, so I have a hard time wrapping my head around what
> some of the syntactic sugars actually do, but I'm getting there.
>
> I'm successfully getting my data transformed to an RDD that essentially
> looks like (<document_id>, <tag>), creating an IndexedDataSet with that,
> and feeding that into SimilarityAnalysis.rowSimilarityIDS(). I've been able
> to narrow the issue down to a specific case:
>
> Let's say I have the following records (among others) in my RDD:
>
> ...
> (doc1, tag1)
> (doc2, tag1)
> ...
>
> doc1 and doc2 have no other tags, but tag1 may exist on many other
> documents. The rest of my dataset has many other doc/tag combinations, but
> I've narrowed down the issue to seemingly only occur in this case. I've
> been able to trace down that the java.lang.IllegalArgumentException is
> occurring because k21 is < 0 (i.e. "numInteractionsWithB = 0" and
> "numInteractionsWithAandB = 1") when calling
> LogLikelihood.logLikelihoodRatio() from
> SimilarityAnalysis.logLikelihoodRatio().
>
> Speculating a bit, I see that in SimilarityAnalysis.rowSimilarity() on the
> line (163 in my branch):
>
> val bcastInteractionsPerItemA = drmBroadcast(drmA.numNonZeroElementsPerRow)
>
> ...my IDE (intellij) complains that it cannot resolve
> "drmA.numNonZeroElementsPerRow", however the library compiles successfully.
> Tracing the codepath shows that if that value is not being correctly
> populated, it would have a direct impact on the values used in
> logLikelihoodRatio(). That said, it seems to only fail in this very
> particular case.
>
> I should note that I can run SimilarityAnalysis.cooccurrencesIDSs()
> successfully with a single list of (<user_id>, <item_id>) pairs of my own
> data.
>
> I have 3 questions given this scenario:
>
> First, am I using the proper branch of code for attempting to run on a
> spark 1.3 cluster? I've read about a "joint effort" for spark 1.3, and this
> was the only branch I could find for it.
>
> Second, is anyone able to shed some light on the above error? Is drmA not
> a correct type, or does that method no longer apply to that type?
>
> Third, what would be the mathematical implications if I run
> SimilarityAnalysis.cooccurrencesIDSs() with a list of (<tag>,<document_id>)
> pairs? Would the results be sound, or does that make absolutely no sense?
> Would it be beneficial even as only a troubleshooting step?
>
> Thanks in advance for any help you may be able to provide!
>
> Travis Hegner
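
For reference, the failing case described in the quoted message can be reproduced on paper. The sketch below assumes the usual 2x2 contingency-table layout for a log-likelihood ratio test; the cell formulas are an assumption consistent with the numbers reported above (numInteractionsWithB = 0 and numInteractionsWithAandB = 1 giving k21 < 0), not a copy of the Mahout source. `ContingencySketch`, `Cells`, and `cells` are hypothetical names, and the values for `withA` and `total` are purely illustrative.

```scala
object ContingencySketch {
  // Hypothetical holder for the four cells of a 2x2 contingency table.
  final case class Cells(k11: Long, k12: Long, k21: Long, k22: Long) {
    // A log-likelihood ratio is only defined for non-negative counts.
    def isValid: Boolean = k11 >= 0 && k12 >= 0 && k21 >= 0 && k22 >= 0
  }

  // Assumed derivation of the cells from the interaction counts.
  def cells(withA: Long, withB: Long, withAandB: Long, total: Long): Cells =
    Cells(
      k11 = withAandB,                          // both A and B
      k12 = withA - withAandB,                  // A without B
      k21 = withB - withAandB,                  // B without A
      k22 = total - withA - withB + withAandB)  // neither A nor B

  def main(args: Array[String]): Unit = {
    // withB = 0 and withAandB = 1 come from the message; the rest is made up.
    val c = cells(withA = 1, withB = 0, withAandB = 1, total = 10)
    println(c)         // k21 is -1 here
    println(c.isValid) // false, so an argument check on the cells would fail
  }
}
```

Under these assumptions, a co-occurrence count larger than one of the per-item counts is enough to make k21 negative, which would point at the per-item counts rather than at the likelihood computation itself.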

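
The "operations" pattern mentioned in the reply is often called "enrich my library": a wrapper class adds methods to an existing type through an implicit conversion. The sketch below is a minimal illustration only; `TinyMatrix` and `TinyMatrixOps` are hypothetical stand-ins, not Mahout classes, and the method body is not Mahout's implementation. It shows how a call such as `m.numNonZeroElementsPerRow` can compile even though the type itself declares no such method, which is one plausible reason an IDE may fail to resolve the name while the compiler accepts it.

```scala
object OpsDecoratorSketch {
  // Hypothetical stand-in for a matrix type; it declares no counting method.
  final case class TinyMatrix(rows: Vector[Vector[Double]])

  // Hypothetical "ops" decorator: the implicit class wraps TinyMatrix and
  // supplies the extra method whenever this implicit is in scope.
  implicit class TinyMatrixOps(m: TinyMatrix) {
    def numNonZeroElementsPerRow: Vector[Int] =
      m.rows.map(row => row.count(_ != 0.0))
  }

  def main(args: Array[String]): Unit = {
    val m = TinyMatrix(Vector(Vector(1.0, 0.0, 2.0), Vector(0.0, 0.0, 0.0)))
    // Resolved by the compiler as new TinyMatrixOps(m).numNonZeroElementsPerRow
    println(m.numNonZeroElementsPerRow) // Vector(2, 0)
  }
}
```

Outside `OpsDecoratorSketch`, the call only compiles after `import OpsDecoratorSketch._` brings the implicit class into scope.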