A Mahout vector is not one format; it is a family of data structures, each optimized for different tasks. A Mahout vector file is a Hadoop sequencefile of zero or more (key, value) pairs, where the key is any Writable and the value is a VectorWritable.
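As a rough illustration of the shape of the data, here is a conceptual sketch in plain Python. It does not use Mahout's or Hadoop's API, and the function names (parse_line, to_sparse, assign_row_ids) are invented for this example. It treats a "vector file" as a list of (key, vector) pairs, shows a dense and a sparse view of the same vector, and mimics the idea behind the rowid step described in the next paragraph. The second input row is made up.

```python
# Conceptual sketch only: plain Python, not Mahout's or Hadoop's API.
# All function names here are hypothetical.

def parse_line(line):
    """Split 'name,v1,v2,...' into a (key, dense vector) pair."""
    name, *values = line.strip().split(",")
    return name, [float(v) for v in values]

def to_sparse(dense):
    """Sparse view of a vector: only non-zero entries, keyed by index."""
    return {i: v for i, v in enumerate(dense) if v != 0.0}

def assign_row_ids(pairs):
    """Replace each key with a unique int and keep an int -> key dictionary."""
    matrix, dictionary = [], {}
    for row_id, (key, vector) in enumerate(pairs):
        matrix.append((row_id, vector))
        dictionary[row_id] = key
    return matrix, dictionary

lines = [
    "John Doe,1324,1233,2234,1267,1456,1745,1212",  # example row from the question
    "Jane Roe,1250,0,1301",                          # made-up row with a zero entry
]
pairs = [parse_line(line) for line in lines]
matrix, dictionary = assign_row_ids(pairs)

print(pairs[0])                 # (key, dense vector)
print(to_sparse(pairs[1][1]))   # sparse view of the second vector
print(dictionary)               # int row id -> original key
```

In a real job the pairs would be written to a sequencefile with a Writable key and a VectorWritable value rather than held in a Python list.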

Several programs require the key (the first Writable) to be an int row number. There may be single-process programs which require the row numbers to be in sequence. There should not be any Hadoop-friendly programs which require this. The "rowid" you will see referred to a lot is a pair of programs that replace the Writable key with a unique integer, and save the original Writable keys out to a dictionary sequencefile of int -> Writable.

On Thu, Jul 12, 2012 at 12:30 PM, Robert Hall <[email protected]> wrote:
> Greetings.
>
> I'm trying to jump from the examples in mahout to a practical job of my
> very own. First, I'm very new to mahout but I do have some experience with
> machine learning, clustering, and classifications.
>
> My goal: To get KMeans clusters of time-based use from structured data
>
> Example Input:
> John Doe,1324,1233,2234,1267,1456,1745,1212
>
> There's a name and a variable series of numbers that correspond to time in
> seconds to complete an operation. The times are pre-filtered > 1200 and
> built by date/time (pivoted into nameless columns) of the operation, but
> the date/time is not relevant to my goal.
>
> Can someone point me toward any resources that explain, not how to run an
> example, but how the examples were put together?
>
> If not a resource, how about a high-level description on what mahout is
> looking for and how it does, say a KMeans cluster analysis.
>
> Finally, can someone describe a mahout vector and vector file? A
> description plus the actual format of a vector row/file.
>
>
>
> --
> Robert Hall
> [email protected]

--
Lance Norskog
[email protected]
