Re: clustering with kmeans, java app

2012-08-07 Thread Yuval Feinstein
I spent a week trying to get Hadoop to work on Windows 7, and then gave up. Have you managed to run Hadoop on Windows? Do the Hadoop tests (e.g. wordcount) work? http://en.wikisource.org/wiki/User:Fkorning/Code/Hadoop-on-Cygwin has lots of details about this. Some of the possible problems are cygwin

RE: clustering with kmeans, java app

2012-08-07 Thread Videnova, Svetlana
Hi, yes, I'm using the Mahout and Hadoop libraries on Windows. My cluster output is written to the local filesystem, not to HDFS. Thanks to Cygwin I can run Unix commands, which lets me run Mahout on Windows. I changed the path on Windows as well. I didn't test whether wordcount works, because I am using only

Re: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.

2012-08-07 Thread Yuval Feinstein
This is the relevant issue: https://issues.apache.org/jira/browse/MAHOUT-973 The bug exists in Mahout 0.6 and was fixed in Mahout 0.7. I also used the workaround of setting a high value for --maxDFPercent (I guess the number of documents in the corpus is enough). Maybe it would be good to fix it in 0.6 as
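For readers unfamiliar with the option, the general idea behind a maximum document-frequency percentage can be sketched in plain Java. This is only an illustration of the concept, not Mahout's implementation and not a description of the bug in MAHOUT-973; the class name `MaxDfFilter`, the method `prune`, and the string-keyed map are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout.
final class MaxDfFilter {

    /**
     * Keeps only the terms whose document frequency, expressed as a
     * percentage of numDocs, does not exceed maxDfPercent.
     */
    static Map<String, Integer> prune(Map<String, Integer> docFreq,
                                      long numDocs,
                                      double maxDfPercent) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            double percent = 100.0 * e.getValue() / numDocs;
            if (percent <= maxDfPercent) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

Under this reading, a threshold of 100 keeps every term, which is why a very high value effectively switches the pruning off.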

RE: ClusterDumper eclipse human readable output kmeans

2012-08-07 Thread Videnova, Svetlana
I already generated the points directory when I ran the clustering (kmeans in my case). But for the moment I can't generate the cluster dump because of an error on this line: ClusterDumper.readPoints(new Path("output/kmeans/clusters-0"), 2, conf); The second parameter is a double, but it wants an int, yet it does not accept an int

Re: ClusterDumper eclipse human readable output kmeans

2012-08-07 Thread Paritosh Ranjan
I don't know why ClusterDumper is not working, but I can give an alternative solution. Use ClusterOutputPostProcessor (clusterpp) on the clusters-*-final directory. https://cwiki.apache.org/MAHOUT/top-down-clustering.html It will arrange the vectors in their respective directories. However, it will

RE: ClusterDumper eclipse human readable output kmeans

2012-08-07 Thread Videnova, Svetlana
I just managed to get my app working. I had to use ClusterDumperWriter.gettopfeatures(arg1,arg2,arg3), and that gave me the top words in human-readable format :D -Original Message- From: Paritosh Ranjan [mailto:pran...@xebia.com] Sent: Tuesday, 7 August 2012 10:32 To: user@mahout.apache.org

Re: Tags generation?

2012-08-07 Thread SAMIK CHAKRABORTY
Hi All, We have developed an auto-tagging system for our micro-blogging platform. Here is what we have done: The purpose of the system was to look for tags in an article automatically when someone posts a link on our micro-blogging site. The goal was to allow us to follow a tag instead (in

Re: Tags generation?

2012-08-07 Thread Ted Dunning
Nice stuff. And glad that Mahout was able to help! On Tue, Aug 7, 2012 at 7:37 AM, SAMIK CHAKRABORTY sam...@gmail.com wrote: Hi All, We have developed an auto tagging system for our micro-blogging platform. Here is what we have done: The purpose of the system was to look for tags in an

how to deal with multiple preference values for same (user, item)-pair

2012-08-07 Thread Dominik Lahmann
Hi, I would like to know how I can deal with multiple preference values for the same (user, item)-pair from a machine learning perspective? That means, I have got more than one rating from a user u for an item i available. Of course using any kind of average (maybe also taking date information

Re: how to deal with multiple preference values for same (user, item)-pair

2012-08-07 Thread Julian Ortega
As far as I remember, Mahout overrides older preference values with the newest one. On Tue, Aug 7, 2012 at 2:14 PM, Dominik Lahmann dominik.lahm...@fu-berlin.de wrote: Hi, I would like to know how I can deal with multiple preference values for the same (user, item)-pair from a machine

Re: how to deal with multiple preference values for same (user, item)-pair

2012-08-07 Thread Sean Owen
It depends on what the values really mean. If they are something like ratings, using the most recent value makes the most sense. (This is what the implementations do now.) If they are some kind of sampled reading, it might make sense to take an average. If the input is based on observed activity, it
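The first two options mentioned here (keep the most recent value, or take an average) can be sketched as a preprocessing step in plain Java. This is a minimal sketch, not Mahout code: the `Rating` class, its fields, the `"user:item"` string key, and both method names are assumptions made for illustration only.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical preprocessing helper, not part of Mahout.
final class PreferenceCollapser {

    /** One raw observation; field names are illustrative. */
    static final class Rating {
        final long userId;
        final long itemId;
        final double value;
        final long timestamp;

        Rating(long userId, long itemId, double value, long timestamp) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private static String key(Rating r) {
        return r.userId + ":" + r.itemId;
    }

    /** Keeps the value with the largest timestamp per (user, item) pair. */
    static Map<String, Double> latest(List<Rating> ratings) {
        Map<String, Rating> newest = new LinkedHashMap<>();
        for (Rating r : ratings) {
            Rating seen = newest.get(key(r));
            if (seen == null || r.timestamp >= seen.timestamp) {
                newest.put(key(r), r);
            }
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Rating> e : newest.entrySet()) {
            result.put(e.getKey(), e.getValue().value);
        }
        return result;
    }

    /** Averages all values seen per (user, item) pair. */
    static Map<String, Double> average(List<Rating> ratings) {
        Map<String, double[]> sums = new LinkedHashMap<>(); // {sum, count}
        for (Rating r : ratings) {
            double[] acc = sums.computeIfAbsent(key(r), k -> new double[2]);
            acc[0] += r.value;
            acc[1] += 1;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, double[]> e : sums.entrySet()) {
            result.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return result;
    }
}
```

Either method yields one value per pair, so the collapsed data can then be written out in whatever input format the recommender expects.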

Re: Question about recommender database drivers

2012-08-07 Thread kiran kumar
I have used the same steps to create the dictionary and vector output from Solr using the *lucene.vector* command. Is there any way to pull only the latest changes from Solr and create vectors? And how do we later run clustering algorithms on these incrementally updated vector files? Can you shed some light on this?

Re: LDA Questions

2012-08-07 Thread Gokhan Capan
Hi Jake, Today I submitted the diff. It is available at https://issues.apache.org/jira/browse/MAHOUT-1051 Thanks for the advice. On Tue, Aug 7, 2012 at 1:06 AM, Jake Mannix jake.man...@gmail.com wrote: Sounds great Gokhan! On Mon, Aug 6, 2012 at 2:53 PM, Gokhan Capan gkhn...@gmail.com

RE: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.

2012-08-07 Thread Abramov Pavel
Hello Yuval, Thanks for the link, but I am sure I am using version 0.7. I will double-check it. Pavel From: Yuval Feinstein [yuv...@citypath.com] Sent: 7 August 2012, 11:08 To: user@mahout.apache.org Subject: Re: Seq2sparse example produces bad TFIDF

KMeans job fails during 2nd iteration. Java Heap space

2012-08-07 Thread Abramov Pavel
Hello, I am trying to run the KMeans example on 15 000 000 documents (seq2sparse output). There are 1 000 clusters, a dictionary of 200 000 terms, and documents of 3-10 terms each (titles). seq2sparse produces 200 files of 80 MB each. My job failed with a Java heap space error. The 1st iteration passes while the 2nd
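A rough back-of-the-envelope estimate may help when reading this report. The message as shown here does not state the cause of the failure; the sketch below only assumes, hypothetically, that after the first iteration every centroid is held in memory as a dense array of 8-byte doubles, and it ignores all object overhead. The class and method names are illustrative and not part of Mahout.

```java
// Hypothetical estimate, not part of Mahout.
final class CentroidMemoryEstimate {

    /**
     * Bytes needed to hold numClusters centroids if each one stores
     * one 8-byte double per dictionary term (dense representation).
     */
    static long denseCentroidBytes(long numClusters, long dimensions) {
        return numClusters * dimensions * Double.BYTES;
    }
}
```

For the figures quoted above, `denseCentroidBytes(1_000, 200_000)` returns 1 600 000 000, i.e. about 1.6 GB. If the dense-centroid assumption is even roughly right, that is an order-of-magnitude figure to compare against the heap size configured for the job.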