I've had a chance to look at some of these. Using your dataset and your command line arguments:

1) This was a problem for me on a pseudo-distributed Ubuntu machine; however, on a single-node setup on my OS X laptop, I didn't run into it. Can anyone with more experience with the setJarByClass tickets shed some light?
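
For anyone following along, the shape of the fix Dan describes is roughly the sketch below; the driver and mapper names are placeholders of mine, not the actual Mahout classes:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ExampleDriver {

      // Trivial placeholder mapper, standing in for the real job's mapper.
      static class IdentityTextMapper
          extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(key, value);
        }
      }

      public static void run(Configuration conf, Path in, Path out)
          throws Exception {
        Job job = new Job(conf, "example");
        // The crucial line: without it, the jar containing the mapper is
        // never shipped to the cluster, so tasks die with
        // ClassNotFoundException on anything but a purely local runner.
        job.setJarByClass(ExampleDriver.class);
        job.setMapperClass(IdentityTextMapper.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, in);
        FileOutputFormat.setOutputPath(job, out);
        job.waitForCompletion(true);
      }
    }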

2) Newlines and comments would be a nice thing to add, absolutely. Since I'll be changing how the input is done to the spectral algorithms, this is good to keep in mind.

3) Not sure on this one. I'll have to drum up some more creative test cases to figure out what's going on here, but you're probably right that it's a consequence of the aforementioned newlines and comments.

4) I think this is actually an off-by-one error: you specify 37 dimensions, but unfortunately I designed the input m/r job to be 0-indexed (another caveat/artifact that will vanish once raw input is acceptable). So you'll either need to re-index your affinity text file, or simply make the number of dimensions 38 and put up with a first row and column made entirely of 0's.
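
To illustrate (assuming the comma-separated "i,j,value" triples the affinity input expects; the values below are made up):

    1,2,0.9
    2,1,0.9
    ...
    36,37,0.5
    37,36,0.5

Under 0-indexing, the largest index 37 implies a 38x38 matrix, so --dimensions 38 works and row/column 0 simply stay empty; alternatively, subtract 1 from every index in the file and keep --dimensions 37.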

Even so, running this job with --dimensions 38 triggers the Lanczos solver path errors you and I have been seeing a lot lately:

Exception in thread "main" java.lang.IllegalStateException: java.io.FileNotFoundException: File file:/data/output/calculations/laplacian-80/tmp/data does not exist.
        at org.apache.mahout.math.hadoop.DistributedRowMatrix.times(DistributedRowMatrix.java:222)
        at org.apache.mahout.math.decomposer.lanczos.LanczosSolver.solve(LanczosSolver.java:104)
    ...

The "data" suffix after "tmp" is strange to me; my guess is that the voodoo magic I did with the Paths in the SpectralKMeansDriver last summer is indeed rotting at this point and needs to be reworked. I'll start there.

Thanks for helping with this, Dan.

Shannon

On 9/7/11 8:45 AM, Dan Brickley wrote:
Trying to run https://cwiki.apache.org/MAHOUT/spectral-clustering.html
... seems like perhaps some code rot?

Can anyone else report success with Spectral clustering against recent trunk?

Trying bin/mahout spectralkmeans -k 2 -i speccy -o specout --maxIter 10 --dimensions 37

...with the small example affinity file we discussed yesterday, I hit
a series of problems.

data: http://danbri.org/2011/mahout/afftest.txt

1. As I mentioned in comments on
http://spectrallyclustered.wordpress.com/2010/07/14/sprint-3-quick-update/
(both on a local pseudo-cluster and on a real one), I had to patch in
calls to job.setJarByClass before job.waitForCompletion. This problem
occurred for others elsewhere in Mahout, e.g. MAHOUT-428 and
MAHOUT-197, but I presume it can't be hitting everyone. From grepping
around, this might not be the only component missing setJarByClass
calls. Or is this just me, somehow?

2. Newlines in the input data made it fail, and the associated warning
from AffinityMatrixInputMapper was very vague. I'd suggest allowing
blank lines and #-comments, though maybe it's not a good idea to invent
per-component syntax? I'd also suggest printing the problem line when
complaining (see the patch below, and the sketch after this paragraph).
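
The guard I have in mind is roughly this sketch (written against a stand-in for AffinityMatrixInputMapper, not the actual patch):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Stand-in for AffinityMatrixInputMapper: tolerate blank lines and
    // #-comments, and report the offending line when parsing fails.
    public class TolerantAffinityMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

      @Override
      protected void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.isEmpty() || line.startsWith("#")) {
          return; // skip silently instead of warning vaguely or failing
        }
        String[] elements = line.split(",");
        if (elements.length != 3) {
          // include the problem line in the complaint
          throw new IOException("Expected [i,j,value] but found "
              + elements.length + " element(s) in line: '" + line + "'");
        }
        ctx.write(key, value); // the real mapper emits the parsed triple
      }
    }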

3. Failing to load the affinity matrix (surely a prerequisite for
further progress?) does not seem to halt the job; I see exceptions
mixed in with ongoing processing (until a later problem hits us).
Transcript: https://gist.github.com/1200455 ... Actually, it wasn't
clear whether the newline problem was merely a warning and the other
rows from the input data were accepted, in which case reporting them
as java.io.IOException seems a bit draconian. So maybe parts of the
input file were in fact loaded. It would be great to clarify what the
expected behaviour is.


4. After all that, the job still fails. Full transcript here:
https://gist.github.com/1200428

Excerpt: (I've added a bit more reporting output in a few places)

11/09/07 14:25:06 INFO common.VectorCache: Loading vector from:
specout/calculations/diagonal/part-r-00000
Exception in thread "main" java.util.NoSuchElementException
        at com.google.common.collect.AbstractIterator.next(AbstractIterator.java:152)
        at org.apache.mahout.clustering.spectral.common.VectorCache.load(VectorCache.java:121)

However, that file does exist in HDFS, and seqdumper seems to accept
it; it just seems to be empty:

Input Path: specout/calculations/diagonal/part-r-00000
Key class: class org.apache.hadoop.io.NullWritable Value Class: class org.apache.mahout.math.VectorWritable
Count: 0
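
A more defensive load would at least make this failure mode obvious. A sketch (assuming, as VectorCache appears to, that the file holds a single VectorWritable; SafeVectorLoad is just a name I made up):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.common.iterator.sequencefile.SequenceFileValueIterator;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Check hasNext() before next(), so an empty part file produces a
    // clear error instead of an opaque NoSuchElementException.
    public final class SafeVectorLoad {
      private SafeVectorLoad() {}

      public static Vector load(Path input, Configuration conf)
          throws IOException {
        SequenceFileValueIterator<VectorWritable> it =
            new SequenceFileValueIterator<VectorWritable>(input, true, conf);
        try {
          if (!it.hasNext()) {
            throw new IOException("No vector found in " + input
                + "; the upstream job probably wrote an empty file");
          }
          return it.next().get();
        } finally {
          it.close();
        }
      }
    }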

I've posted an informal composite patch at
https://raw.github.com/gist/1200439/4ad433b51e9d963cff5d500d974fa5cb6904b9c3/gistfile1.txt
... If you can confirm the above issues and suggest a breakdown into
JIRAs, I'll attach cleaner patches where appropriate.

Looking forward to getting this running,

cheers,

Dan
