Clustering Spatial Data

2013-12-02 Thread Peter K
Hi there, I have no experience with Mahout, but I know that it will solve my problem :)! I have the following requirements: * No Hadoop setup should be necessary. I want a simple approach and I know this is possible with Mahout! * I have lots of points (~100 million) but also some RAM (32GB)

Write SequenceFile from custom data

2013-12-02 Thread Angelo Immediata
Hi, I'm pretty new to machine learning and above all to Apache Mahout, so pardon my low-level questions. I need to do some cluster analysis on some data. At the beginning this data may not be very large, but after some time it can become really huge (I did some calculation and

theta normalization for naive bayes is commented out

2013-12-02 Thread tuku
Hello, I searched the wiki and the web but couldn't find the reason why theta normalization is commented out for naive Bayes classification. There is a TODO comment on top that states this will be enabled soon. Is there any schedule for this? Does anyone know the reason not to use theta normalization?

Canopy generation out of memory troubleshooting

2013-12-02 Thread Chih-Hsien Wu
Hi All, I posted a troubleshooting question about Mahout canopy generation last week; however, I didn't get the problem solved. The message below is the error I received. I'm trying to run canopy generation on about 900 MB worth of information. There are an estimated 120,000 vectors. I'm currently

Re: Test naivebayes task running really slowly and not in distributed mode

2013-12-02 Thread Fernando Santos
Hello Ted, No, the training also ran on one machine. What happens sometimes is that each box executes one job at a time, but not together. For example, if it runs 3 jobs, it runs the first job on box 1, the next on box 2 and the next on box 1 again. The full dataset is a CSV of around 70MB. I

Recommending already consumed items

2013-12-02 Thread Mario Levitin
Hi all, In some recommender applications the system might recommend already consumed items. For example, a hotel recommendation site might recommend hotel A to a user who already stayed at hotel A before. In order to recommend already consumed items we have to rank all of the items (consumed and

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-12-02 Thread Gokhan Capan
Gokhan On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi vishal.santo...@gmail.com Are we to assume that SGD is still a work in progress and implementations ( Cross Fold, Online, Adaptive ) are too flawed to

Re: Clustering Spatial Data

2013-12-02 Thread Ted Dunning
Peter, What you say is a bit confusing to me. You say you have centers already. But then you talk about algorithms which find the centers. Also, you say you want to assign points based on centers, but you also say that clusters have different shapes, area, size and point count. Do you mean
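If the centers really are known in advance, the assignment step discussed above reduces to a nearest-center lookup. The following is a minimal sketch in plain Python (not Mahout code), assuming 2-D points and Euclidean distance; for latitude/longitude data a great-circle distance may be more appropriate. The function name and the sample coordinates are hypothetical.

```python
import math

def assign_to_nearest_center(points, centers):
    """Return, for each (x, y) point, the index of the closest center.

    Assumes planar Euclidean distance; this is an illustrative choice,
    not something stated in the thread.
    """
    assignments = []
    for px, py in points:
        best = min(
            range(len(centers)),
            key=lambda i: math.hypot(px - centers[i][0], py - centers[i][1]),
        )
        assignments.append(best)
    return assignments

# Hypothetical sample data, for illustration only.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.5), (4.0, 4.0)]
labels = assign_to_nearest_center(points, centers)
```

This brute-force loop costs one distance computation per point per center, so for roughly 100 million points and many centers a spatial index over the centers would likely be needed.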

Pig vector project

2013-12-02 Thread Sameer Tilak
Hi All, We are using Pig to build our data pipeline. I came across the following: https://github.com/tdunning/pig-vector The last commit was 2 years ago. Is there any information on whether there will be further work on this project?

Re: Pig vector project

2013-12-02 Thread Andrew Musselman
You might also look into elephant-bird from Twitter; covers a lot of ground. https://github.com/kevinweil/elephant-bird On Mon, Dec 2, 2013 at 4:10 PM, Sameer Tilak ssti...@live.com wrote: Hi All, We are using Pig to build our data pipeline. I came across the

Mahout for clustering

2013-12-02 Thread Sameer Tilak
Hi All, We are using Apache Pig for building our data pipeline. We have data in the following form: userid, age, items {code 1, code 2, ….}, a few other features... Each item has a unique alphanumeric code. I would like to use Mahout for clustering it. Based on my current reading I see

Re: Pig vector project

2013-12-02 Thread Ted Dunning
Elephant bird is distinctly superior to Pig Vector for many things (it moved forward, Pig Vector did not). I believe there is also a Twitter internal project known as PigML which is much more what Pig Vector wanted to be. There is also https://github.com/hanborq/pigml, but I think it is very

RE: Pig vector project

2013-12-02 Thread Sameer Tilak
Cool! I am using it for sequence file reading so will be happy to look into it. Date: Mon, 2 Dec 2013 16:14:23 -0800 Subject: Re: Pig vector project From: andrew.mussel...@gmail.com To: user@mahout.apache.org You might also look into elephant-bird from Twitter; covers a lot of ground.

RE: Mahout for clustering

2013-12-02 Thread Sameer Tilak
I am looking for some input on how to vectorize my data. From: ssti...@live.com To: user@mahout.apache.org Subject: Mahout for clustering Date: Mon, 2 Dec 2013 16:22:03 -0800 Hi All, We are using Apache Pig for building our data pipeline. We have data in the following form:

Re: Mahout for clustering

2013-12-02 Thread Andrew Musselman
I would probably write a script to parse that out and stream to it from Pig. http://pig.apache.org/docs/r0.11.0/basic.html#stream On Mon, Dec 2, 2013 at 4:30 PM, Sameer Tilak ssti...@live.com wrote: I am looking for some input on how to vectorize my data. From: ssti...@live.com To:

Re: Mahout for clustering

2013-12-02 Thread Ted Dunning
Do you want to cluster users or items? For items, the vectorization that you suggest will work reasonably well, especially if you use TF.IDF weighting and normalize the resulting vectors. You can also use one of the matrix decomposition techniques and cluster the resulting vectors. The spectral
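The TF-IDF weighting and normalization mentioned above can be sketched as follows. This is plain Python rather than Mahout code, and it assumes each record is simply a list of item codes; the exact vectorization proposed in the thread is not shown here, so the record layout, the function name, and the sample codes are all hypothetical. The IDF used is log(N / df), one of several common variants.

```python
import math
from collections import Counter

def tfidf_vectors(records):
    """Turn lists of codes into sparse, L2-normalized TF-IDF vectors.

    records: list of lists of item codes (an assumed input layout).
    Returns one dict per record mapping code -> weight.
    """
    n = len(records)
    # Document frequency: in how many records each code appears.
    df = Counter()
    for codes in records:
        df.update(set(codes))

    vectors = []
    for codes in records:
        tf = Counter(codes)
        vec = {c: tf[c] * math.log(n / df[c]) for c in tf}
        # Scale to unit Euclidean length; all-zero vectors are left as-is.
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {c: w / norm for c, w in vec.items()}
        vectors.append(vec)
    return vectors

# Hypothetical sample data, for illustration only.
records = [["A1", "B2"], ["A1", "C3"], ["A1", "B2", "D4"]]
vectors = tfidf_vectors(records)
```

With this IDF variant, a code that occurs in every record (here "A1") gets weight zero, so it does not influence distances between vectors.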

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-12-02 Thread Ted Dunning
Inline On Mon, Dec 2, 2013 at 8:55 AM, optimusfan optimus...@yahoo.com wrote: ... To accomplish this, we used AdaptiveLogisticRegression and trained 46 binary classification models. Our approach has been to do an 80/20 split on the data, holding the 20% back for cross-validation of the
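The 80/20 split described above can be sketched as a shuffled holdout. This is a minimal illustration in plain Python, not the poster's actual code; the function name, the fixed seed, and the sample rows are assumptions.

```python
import random

def holdout_split(rows, held_out_fraction=0.2, seed=42):
    """Shuffle rows and split them into (training, held-out) lists.

    The seed is fixed only so the split is reproducible; the value
    itself is arbitrary.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - held_out_fraction)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical sample data: 100 row ids.
train, held_out = holdout_split(range(100))
```

The held-out rows should stay untouched during training so that the error measured on them is an honest estimate.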

Re: Write SequenceFile from custom data

2013-12-02 Thread Angelo Immediata
Well, similarity between data points should be calculated by taking the following variables into account: meteo, manifestation, day of the week, month of the year and vacation 2013/12/3 Ted Dunning ted.dunn...@gmail.com The key first question is how you plan to compute similarity between data points. It