Could someone please explain the "part-m-00000" thing?

robpd Fri, 21 Oct 2011 01:01:28 -0700

Hi

I am very new to Mahout and trying to learn about it.  I managed to get the
MeanShiftClusterer to work and print out my output using SequenceFile.Reader
with a small number of points. I'm using a pseudo-distributed configuration
under Cygwin on my laptop.


Although it worked, I really do not understand why 'part-m-00000' is
required in the path given to the reader.  From what I have read, the 'm'
stands for 'map' and the 00000 means it is the output of the first map task.
Apparently there can also be a 'part-r-00000' for the first reduce task. Is
that correct?  Here's where my confusion starts.
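
To make the naming concrete, here is a minimal sketch that splits such a
file name into its task type and task number. It assumes the convention is
exactly as described above (part-<m|r>-<zero-padded number>); PartFileName
is a hypothetical helper written for illustration, not a Hadoop or Mahout
class.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Hadoop or Mahout) that splits a part file name. */
final class PartFileName {
    // Assumed convention: part-<m|r>-<zero-padded task number>
    private static final Pattern NAME = Pattern.compile("part-([mr])-(\\d+)");

    final char kind;  // 'm' = written by a map task, 'r' = written by a reduce task
    final int index;  // zero-based task number

    private PartFileName(char kind, int index) {
        this.kind = kind;
        this.index = index;
    }

    /** Parses a name such as "part-m-00000"; rejects anything that does not match. */
    static PartFileName parse(String fileName) {
        Matcher m = NAME.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a part file name: " + fileName);
        }
        return new PartFileName(m.group(1).charAt(0), Integer.parseInt(m.group(2)));
    }

    public static void main(String[] args) {
        PartFileName p = parse("part-m-00000");
        System.out.println("kind=" + p.kind + " index=" + p.index);
    }
}
```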

1) Why is there not a 'part-r-00000' present after I run my code? Surely the
finished clusters should have been subject to a reduce after the maps?

2) Does this mean that my clusterer only did the map, but not the reduce (and
so its output is not correct)?

3) If I had a proper distributed hardware setup, would I also find
'part-m-00001', 'part-m-00002', ... 'part-m-0000n'? If so, would I need to
read them all, or would there be one or more 'part-r-0000n' files instead?

4) The bottom line is I WANT TO READ ALL CLUSTERS irrespective of the number
of hardware nodes and I want them to be fully 'reduced'. Given my confusion
over the above, what's the best way of doing this? Is there any sample code
out there to do this?
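
One way to avoid hard-coding 'part-m-00000' is to enumerate whatever part
files exist in the output directory and read each in turn. The sketch below
is only an illustration under stated assumptions: it works on a local copy
of the output directory using plain java.nio, and PartFileLister and
listPartFiles are hypothetical names. On HDFS the same idea would
presumably go through Hadoop's FileSystem API (for example globStatus with
a pattern such as part-*), with a SequenceFile.Reader opened on each
resulting path.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: finds every part-m-* / part-r-* file in a local directory. */
final class PartFileLister {

    /** Returns all regular files matching part-[mr]-* in outputDir, sorted by path. */
    static List<Path> listPartFiles(Path outputDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        // The glob skips markers such as _SUCCESS and hidden .crc checksum files.
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(outputDir, "part-[mr]-*")) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)) {
                    parts.add(p);
                }
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // Directory to scan; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path p : listPartFiles(dir)) {
            // A real reader would open each file here (e.g. with SequenceFile.Reader).
            System.out.println(p.getFileName());
        }
    }
}
```

Because the files are discovered rather than named, the same loop covers one
part file or many, whether they were written by map or reduce tasks.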

Any help would be gratefully received.

Rob

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Could-someone-please-explain-the-part-m-00000-thing-tp3440174p3440174.html
Sent from the Mahout User List mailing list archive at Nabble.com.
