Re: getting started with mahout and kmeans

Ted Dunning Tue, 27 Nov 2012 10:03:41 -0800

Bash?  Not sure how the speed of bash matters here.

My question was leading to a suggestion that R may be a better choice.  If
you have fewer than millions of data points and need to experiment with
variable creation, then R is much better than Mahout.  If you already know
your data and have many millions of data points, then Mahout is a
reasonable choice.
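
To give a sense of scale: with ~1000 points of about 20 features, both the
column selection and the clustering fit in a few lines on a single machine.
The following is only a minimal sketch in plain Python (not R and not
Mahout), assuming comma-separated input. The helper names, the toy data and
the column indices are all hypothetical, and the k-means here is a plain
Lloyd's iteration with random initial centroids, not Mahout's KMeansDriver.

```python
import csv
import io
import random


def select_columns(rows, keep):
    """Keep only the feature columns at the given indices, as floats."""
    return [[float(row[i]) for i in keep] for row in rows]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Initial centroids: k distinct points drawn at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


# Hypothetical toy data: column 0 is an id to drop, columns 1-2 are features.
raw = "1,0.0,0.1\n2,0.2,0.0\n3,9.9,10.0\n4,10.1,9.8\n"
rows = list(csv.reader(io.StringIO(raw)))
points = select_columns(rows, keep=[1, 2])
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

Changing the `keep` list is all it takes to try a different set of
variables, which is the kind of experimentation that is painful to do
through a Hadoop job.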


On Tue, Nov 27, 2012 at 8:45 AM, Eduard Gamonal <[email protected]> wrote:

> I'll start with a small dataset, ~1000 rows (data points?) of about 20
> features each.
> The thing is that I'd like to use a much bigger dataset later and, in
> that case, bash would be too slow, wouldn't it?
>
> On Tue, Nov 27, 2012 at 12:54 AM, Ted Dunning <[email protected]>
> wrote:
> > How many data points are you clustering?  How many dimensions?
> >
> > On Mon, Nov 26, 2012 at 2:33 PM, Eduard Gamonal <[email protected]> wrote:
> >
> >> Hi,
> >> I'm doing an MSc at Northeastern and I'm working on analyzing some US
> >> election polls with kmeans.
> >> I'm a beginner with both Mahout and Hadoop. I've been reading the docs
> >> but I'd still appreciate some orientation on these questions:
> >>
> >> * I can transform my input data into vectors and run k-means using the
> >> command line [1]. I downloaded hadoop (1.0.4, working in a real
> >> cluster) and I wrote a program for it. Then I downloaded Mahout and I
> >> saw that there is a jar file included (0.20, single node:
> >> M2_REPO/org/apache/hadoop/hadoop-core/0.20.204.0/hadoop-core-0.20.204.0.jar).
> >> If I point HADOOP_HOME to my hadoop installation, will mahout use
> >> it? I set HADOOP_HOME in hadoop/conf/hadoop-env.sh, though.
> >>
> >> * I might need to remove some columns of my data set. With Hadoop I
> >> could write a program to tokenize the input and create the data
> >> structures I needed, and then call kmeansdriver. I can use bash to
> >> remove the columns and run mahout from the command line. Should I write a
> >> program instead?
> >>
> >> * How do I write a program for Mahout 0.7 (and Hadoop 1.x), from
> >> scratch?
> >> I need to transform the dataset: Vectors should be created only with
> >> the features I want k-means to consider to cluster my data. Then I can
> >> call kmeansdriver.  I think I can do both using the explanation of
> >> http://www.odbms.org/download/TamingTextCH06.pdf
> >>
> >> Should the main class extend any other?
> >> How do I deploy it in a cluster with hadoop?
> >>
> >> * It is my understanding that Mahout is a framework. I read the code
> >> example in org.apache.mahout.clustering.syntheticcontrol.kmeans. It
> >> extends AbstractJob. I made a new project in Eclipse and copied the
> >> example. My goal was to run it. I tried "java -jar myjar.jar" and
> >> passing my new jar as a parameter to hadoop. What's the correct way of
> >> running a program for mahout?
> >>
> >> Thanks
> >>
> >> [1] https://cwiki.apache.org/MAHOUT/k-means-commandline.html
> >>
>
