Re: From example to job

Lance Norskog Fri, 13 Jul 2012 21:47:16 -0700

A Mahout vector is not one format; it is a family of data structures
optimized for various tasks.
A Mahout vector file is a Hadoop SequenceFile of zero or more entries,
each of which is a (Writable, VectorWritable) pair.
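
To make the (key, vector) shape concrete, here is a minimal plain-Java
sketch that splits a line like the example input quoted below into a
name and an array of doubles. It does not use any Mahout or Hadoop
classes; KeyedVectorSketch and parseLine are hypothetical names made up
for illustration. In a real vector file the key would be some Writable
(for example a Text) and the numbers would sit inside a VectorWritable.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;

// Hypothetical helper, not part of Mahout: turns one CSV line into a
// (key, values) pair, the same shape as one (Writable, VectorWritable)
// entry in a vector file.
final class KeyedVectorSketch {

    // Assumes the first field is the key and every other field is numeric.
    static Map.Entry<String, double[]> parseLine(String line) {
        String[] fields = line.split(",");
        String key = fields[0].trim();
        double[] values = new double[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            values[i - 1] = Double.parseDouble(fields[i].trim());
        }
        return new SimpleEntry<>(key, values);
    }

    public static void main(String[] args) {
        Map.Entry<String, double[]> entry =
            parseLine("John Doe,1324,1233,2234,1267,1456,1745,1212");
        // Print the key and its vector of timings.
        System.out.println(entry.getKey() + " -> "
            + Arrays.toString(entry.getValue()));
    }
}
```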


Several programs require the first Writable (the key) to be an int row
number. There may be single-process programs that require the row
numbers to be in sequence. No Hadoop-friendly program should require
this.

The "rowid" you will see mentioned a lot is a pair of programs that
replace the Writable key with a unique integer and save the original
keys to a separate SequenceFile that acts as an int->Writable dictionary.
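
The idea behind that key replacement can be sketched in plain Java as
follows. This is only an in-memory illustration under my own
assumptions, not the actual Mahout code: RowIdSketch, rows, dictionary
and add are hypothetical names, and the sketch simply hands out ids in
arrival order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not part of Mahout: swaps each original key
// for a unique int row id and remembers the mapping in a dictionary.
final class RowIdSketch {

    // int row id -> vector values (stands in for the re-keyed vector file)
    final Map<Integer, double[]> rows = new LinkedHashMap<>();
    // int row id -> original key (stands in for the dictionary file)
    final Map<Integer, String> dictionary = new LinkedHashMap<>();

    // Assigns the next unused id to this entry and returns it.
    int add(String key, double[] vector) {
        int rowId = rows.size();
        rows.put(rowId, vector);
        dictionary.put(rowId, key);
        return rowId;
    }

    public static void main(String[] args) {
        RowIdSketch sketch = new RowIdSketch();
        int id = sketch.add("John Doe", new double[] {1324, 1233, 2234});
        // Look the original key back up from the int row id.
        System.out.println(id + " -> " + sketch.dictionary.get(id));
    }
}
```

After clustering on the int-keyed rows, the dictionary is what lets you
map a row number back to the original name.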

On Thu, Jul 12, 2012 at 12:30 PM, Robert Hall <[email protected]> wrote:
> Greetings.
>
> I'm trying to jump from the examples in mahout to a practical job of my
> very own. First, I'm very new to mahout but I do have some experience with
> machine learning, clustering, and classifications.
>
> My goal: To get KMeans clusters of time-based use from structured data
>
> Example Input:
> John Doe,1324,1233,2234,1267,1456,1745,1212
>
> There's a name and a variable series of numbers that correspond to time in
> seconds to complete an operation. The times are pre-filtered > 1200 and
> built by date/time (pivoted into nameless columns) of the operation, but
> the date/time is not relevant to my goal.
>
> Can someone point me toward any resources that explain, not how to run an
> example, but how the examples were put together?
>
> If not a resource, how about a high-level description on what mahout is
> looking for and how it does, say a KMeans cluster analysis.
>
> Finally, can someone describe a mahout vector and vector file? A
> description plus the actual format of a vector row/file.
>
>
>
> --
> Robert Hall
> [email protected]



-- 
Lance Norskog
[email protected]
