Re: How to approach this? Classification vs Recommendation

2012-05-18 Thread Ted Dunning
Not so trivially, these classifiers can help each other. What you have is a form of transduction or example-based learning. On Fri, May 18, 2012 at 5:24 PM, Sean Owen sro...@gmail.com wrote: Trivially it's four classifiers. You have just one input here, and it's binary. That seems like too

Re: question on VectorWritable convertor in elephant-bird.

2012-05-15 Thread Ted Dunning
Sounds like a class path issue. Sent from my iPhone On May 15, 2012, at 2:43 AM, Yohan Chin yohan@gmail.com wrote: Hi, Recently, I've tried to utilize elephant-bird for loading mahout result into pig. I could install elephant-bird and got .jar file. and followed instructions as

Re: Exception running 20newsgroups example

2012-05-14 Thread Ted Dunning
What you are missing is a Linux compatible environment. Running programs under Cygwin can be pretty difficult because of the path name insanity that often ensues. Sent from my iPhone On May 13, 2012, at 6:33 PM, mahout-newbie raman.sriniva...@gmail.com wrote: When I try to run the 20

Re: Question about storage in Pig-vector (Pig + Mahout)

2012-05-14 Thread Ted Dunning
Tim, Sorry for the confusion and lack of help. Pig-vector is half-done and not even quite half-baked. Your help in updating the readme is very much appreciated. On Mon, May 14, 2012 at 10:17 AM, Timothy Potter thelabd...@gmail.comwrote: Hi Ted, Re: In the readme, there is an example of

Re: large scale kmeans

2012-05-14 Thread Ted Dunning
I have tried it. And an unnamed large customer of ours has tried it with good results. That isn't much of a track record yet, but it is encouraging. All of this use so far is as part of k-nearest neighbor work so there isn't a comparison for pure clustering. Also, this work is all at 10-50

Re: Canopy estimator

2012-05-12 Thread Ted Dunning
One thing that may be happening here is that the scale of your data varies from place to place. Have you tried the upcoming k-means stuff? On Sat, May 12, 2012 at 8:53 AM, Pat Ferrel p...@farfetchers.com wrote: One problem I have is that virtually any value for T gives me a very large number

Re: Canopy estimator

2012-05-12 Thread Ted Dunning
Roughly. But it also gives you a small-ish surrogate for your data that would let you use all kinds of different clustering methods since the surrogate fits in memory. On Sat, May 12, 2012 at 9:51 AM, Pat Ferrel p...@occamsmachete.com wrote: This why canopy has been frustrating because by

Re: [Announcement] Giraph talk in Berlin on May 29th

2012-05-12 Thread Ted Dunning
Wish I could be there. Can you send slides when they are available? On Sat, May 12, 2012 at 2:58 AM, Sebastian Schelter s...@apache.org wrote: Hi, I will give a talk titled Large Scale Graph Processing with Apache Giraph in Berlin on May 29th. Details are available at:

Re: Canopy estimator

2012-05-12 Thread Ted Dunning
Yes. It may help with variable scale. The classic technique for dealing with that is to cluster with a small number of clusters at a gross level and then cluster each set of documents that belong to a single large cluster. This automatically adapts to different scales. The new stuff would

Re: Some guidance for this noob - Metadata Matching Engine

2012-05-11 Thread Ted Dunning
Regarding whether this is classification or clustering, it is clustering but you have some initial conditions that should be used to prime the algorithm. Manuel's links are excellent. The LSH hash based clustering in the new clustering codes could be competitive with these other methods in the

Re: Question about storage in Pig-vector (Pig + Mahout)

2012-05-11 Thread Ted Dunning
PigModelStorage stores SGD models. The elephant bird stuff stores data in the form of vectors. On Fri, May 11, 2012 at 11:38 AM, Timothy Potter thelabd...@gmail.comwrote: So my main question is what does the elephant-bird model storage stuff do that PigModelStorage doesn't?

Re: kmeans not returning k clusters

2012-05-07 Thread Ted Dunning
On Mon, May 7, 2012 at 12:01 AM, Dawid Weiss dawid.we...@cs.put.poznan.plwrote: - it doesn't have the final pass of in-memory clustering so it really just gives you an indifferent quality clustering with a huge number of weighted clusters. With the final pass, it will give you a high

Re: Recommendation scores from LogLikelihood Similarity recommender

2012-05-06 Thread Ted Dunning
As Sean points out, cosine should pick up on this. You will have the usual problems with small counts that any rating based system has. And in spite of your last comment, I would strongly recommend that you test a boolean approach where in *any* action is considered positive and another where

Re: kmeans not returning k clusters

2012-05-06 Thread Ted Dunning
Pat, You may be interested in the code at https://github.com/tdunning/knn This includes some high speed clustering code that could help you with your issues. To wit, - there aren't as many knobs to tweak on the algorithm (you still have data scaling tricks to do) - the speed should be 10-100x

Re: SGD cold start and model persistence questions

2012-05-05 Thread Ted Dunning
On Sat, May 5, 2012 at 12:06 AM, hao wang wang...@huofar.com wrote: 1) is there anyway we can dump the weights of the features from a trained-model? Yes. Use the model dissector or just grab the weights out of the model. You can also access the weights matrix directly using getBeta()
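A minimal sketch of the second option above (reading weights straight off a trained model). getBeta() is the accessor named in the message; the surrounding class and the choice to print only non-zero slots are illustrative assumptions, not a polished tool.

```java
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Matrix;

public class DumpWeights {
  // Print every non-zero weight of a trained SGD model, one line per (category, hashed slot).
  public static void print(OnlineLogisticRegression model) {
    Matrix beta = model.getBeta();   // rows = categories - 1, columns = hashed feature slots
    for (int row = 0; row < beta.numRows(); row++) {
      for (int col = 0; col < beta.numCols(); col++) {
        double w = beta.get(row, col);
        if (w != 0) {
          System.out.printf("category %d, slot %d -> %.4f%n", row, col, w);
        }
      }
    }
  }
}
```

The ModelDissector mentioned in the same message is meant to attribute those hashed slots back to original feature names, which is usually more readable than raw slot indices.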

Re: Mahout + BigDataR Linux

2012-05-03 Thread Ted Dunning
Gently here: You misspelled woWpal wabbit. I look forward to seeing you at the graphlab workshop and hearing more about this. On Thu, May 3, 2012 at 7:06 AM, Nicholas Kolegraff nickkolegr...@gmail.comwrote: Hi Everyone, I'm working on a Linux Distro with a focus around Machine Learning and

Re: Mahout + BigDataR Linux

2012-05-03 Thread Ted Dunning
Thanks for including Mahout. As a point of strategy, wouldn't it have been better to just build a debian package repository and a script for installing packages? That would allow people to use their own debian or ubuntu based distros for their own special needs such as hardware virtualization or special

Re: Mahout + BigDataR Linux

2012-05-03 Thread Ted Dunning
Yes. It is impossible for me to correctly spell when correcting somebody else's spelling. I think that this follows from the general karmic principle. On Thu, May 3, 2012 at 9:36 AM, Sean Owen sro...@gmail.com wrote: *V*owpal Wabbit ? :) On Thu, May 3, 2012 at 5:32 PM, Ted Dunning ted.dunn

Re: Re: Mahout + BigDataR Linux

2012-05-03 Thread Ted Dunning
On Thu, May 3, 2012 at 10:06 AM, Nicholas Kolegraff nickkolegr...@gmail.com wrote: ... I have this crazy notion that nothing should ever be installed and bootstrapping is really annoying. This opinion is more and more in the minority. Yum and apt have made this much less painful. And

Re: Mahout + BigDataR Linux

2012-05-03 Thread Ted Dunning
Don't take any of our suggestions as discouragement. At most treat them as an excuse to reexamine your decisions. Sent from my iPhone On May 3, 2012, at 6:58 PM, Nicholas Kolegraff nickkolegr...@gmail.com wrote: Agree, this could prove insane. If that is the case, it wouldn't be *too*

Re: Mahout - Pig Hackday

2012-05-02 Thread Ted Dunning
On Wed, May 2, 2012 at 11:06 AM, Timothy Potter thelabd...@gmail.comwrote: We're really keen on Ted's pig-vector project (https://github.com/tdunning/pig-vector) as we're building a number of classifiers on Mahout's SGD framework, with the bulk of our data being in Cassandra processed almost

Re: Mahout - Pig Hackday

2012-05-02 Thread Ted Dunning
Making a pig module for mahout is a fine idea. The twitter guys may have something better, though, so we should explore that as well. Andy's comments make that possibility very interesting. On Wed, May 2, 2012 at 5:20 PM, Timothy Potter thelabd...@gmail.com wrote: Thanks Ted! Removing the

Re: Mahout - Pig Hackday

2012-05-02 Thread Ted Dunning
On Wed, May 2, 2012 at 9:05 PM, Jake Mannix jake.man...@gmail.com wrote: On Wed, May 2, 2012 at 8:07 PM, Ted Dunning ted.dunn...@gmail.com wrote: Making a pig module for mahout is a fine idea. The twitter guys may have something better, though, so we should explore that as well. Andy's

Re: integrating databases

2012-04-29 Thread Ted Dunning
On Mon, Apr 30, 2012 at 1:36 AM, Amrhal Lelasm arm...@hotmail.com wrote: I'm wondering how I can combine these two to get the input data for my recommender engine. Do, I start by implementing the the JDBCDataModel or ? Yes. I appreciate any insight you might have for this? Sounds like

Re: [mahout] labels in clustering algorythms

2012-04-28 Thread Ted Dunning
Yuriy, Take a look at https://github.com/tdunning/knn to see some upcoming k-means stuff that may help you out with respect to speed. On Sat, Apr 28, 2012 at 11:19 AM, Юрий Басов basov.yo1...@gmail.com wrote: Good day. My name is Yuriy. I'm working as engineer in Rambler Internet Holding.

Re: --features in trainlogistic what is this for?

2012-04-27 Thread Ted Dunning
Putting a smaller value here will degrade prediction quality because more and more features will collide in the hashed feature space. Increasing this beyond a certain point, however, will not significantly increase prediction quality and it will increase memory usage. On Fri, Apr 27, 2012 at
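To make the collision effect concrete, here is a hedged sketch using Mahout's hashed encoders (class names as recalled from the org.apache.mahout.vectorizer.encoders package; treat them as assumptions rather than a reference):

```java
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class HashedFeaturesDemo {
  public static void main(String[] args) {
    int features = 1000;                          // plays the role of --features
    Vector v = new RandomAccessSparseVector(features);
    StaticWordValueEncoder words = new StaticWordValueEncoder("words");
    for (String w : new String[] {"alpha", "beta", "gamma"}) {
      words.addToVector(w, v);                    // each word is hashed into one of the 1000 slots
    }
    // Shrinking `features` forces distinct words to share slots (collisions), which is the
    // prediction-quality degradation described above; growing it mostly costs memory.
    System.out.println(v.getNumNondefaultElements() + " slots used out of " + features);
  }
}
```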

Re: regularization in logistic training?

2012-04-27 Thread Ted Dunning
It is determined automagically by an evolutionary process. From what I hear, it has a tendency to do a good job on regularization and a bad job on learning rate optimization. On Fri, Apr 27, 2012 at 11:41 PM, Yang tedd...@gmail.com wrote: when I run mahout trainlogistic is there an

Re: Genetic Algorithm

2012-04-24 Thread Ted Dunning
The GA is old code and unused and unmaintained for the most part. I would expect that unless somebody steps up, it is a candidate for removal. The EP code is an implementation of recorded step meta-mutation as described here: http://arxiv.org/abs/0803.3838 The EP code is unrelated to genetic

Re: shortest-path maintenance

2012-04-20 Thread Ted Dunning
I think that map-reduce has broader applicability than just places where you need the sort, but I completely agree that other models are far better for most graph theoretic programs unless you have a problem that is susceptible to spectral methods. This last proviso applies because map-reduce can

Re: Mahout and PCA for a music analysis tool

2012-04-17 Thread Ted Dunning
Nicolas, Are you replying to this? Or asking these questions? On Tue, Apr 17, 2012 at 11:03 AM, Nicolas Pied nicolas.p...@gmail.comwrote: Hello, I would like to implement an application like Like.fm / Pandora (but more simple) that suggests musics close to a given one. I think

Re: Mahout and PCA for a music analysis tool

2012-04-17 Thread Ted Dunning
If you really want to recommend music that people will like, you have to start from the realization that most of musical appreciation is social, not auditory. This has been substantiated in controlled tests where as much as 60% of appreciation was driven by very weak social cues in a test. In my

Re: Mahout and PCA for a music analysis tool

2012-04-17 Thread Ted Dunning
Now that I have been all negative, if you want to go developing auditory features, look up music information retrieval. The ISMIR conferences have a wealth of information. http://www.ismir.net/ On Tue, Apr 17, 2012 at 11:03 AM, Nicolas Pied nicolas.p...@gmail.comwrote: Hello, I would

Re: Mahout Logisitc Regression Do Not Work Properly for Me

2012-04-12 Thread Ted Dunning
So, the first thought that I have is that it sounds like you have dense variables rather than sparse. This may affect behavior of the Mahout system. If you have some text-like features of the ad, then you may get cleaner results. Secondly, I don't see any interaction features. With as much

Re: citing mahout

2012-04-09 Thread Ted Dunning
Well, this shorter reference does avoid the problem of having a typo in the abstract. On Mon, Apr 9, 2012 at 2:35 AM, Sebastian Schelter s...@apache.org wrote: I use a (not so beautiful) very short reference: @Unpublished{Mahout, key = {Apache Mahout}, title = {Apache {Mahout},

Re: citing mahout

2012-04-08 Thread Ted Dunning
Beautiful, I was just writing up some clustering work and needed exactly this. Thanks! On Sun, Apr 8, 2012 at 4:54 PM, Manuel Blechschmidt manuel.blechschm...@gmx.de wrote: Hi Ahmed, I used the following BibTex entry in my Master Thesis: @webpage{mahout, Abstract = {Apache Mahout's

Re: recommend ads using mahout?

2012-04-04 Thread Ted Dunning
The current state of the art in ad recognition is contextual bandits backed up by logistic or probit regression. The mahout logistic regression is a decent first step on this but probably doesn't provide the necessary accuracy. I have some early work on the bandit algorithms on github but

Re: Available Recommenders' Implementations

2012-04-04 Thread Ted Dunning
There is also the stochastic projection code. Search for ssvd in the mailing list archives. Sent from my iPhone On Apr 4, 2012, at 8:36 AM, Sebastian Schelter s...@apache.org wrote: There is a distributed recommender that uses matrix factorization via Alternating Least Squares. Due to

Re: Commercializing Mahout: the Myrrix recommender platform

2012-04-04 Thread Ted Dunning
With this announcement, this group has a fork in the road facing us. We can choose the Hadoop path of forcibly excluding anybody with a slightly wrong commercial taint from discussions (I call this the more GNU than GNU philosophy). Or we can choose a real community based approach that includes

Re: TrainNewsGroups source code

2012-04-04 Thread Ted Dunning
I am sorry, but I don't understand the question. All of the code in Mahout compiles. This is verified several times a day by the continuous integration testing. Can you say more specifically what you mean? Line 95 of what? On Wed, Apr 4, 2012 at 12:18 PM, Ahmed Abdeen Hamed

Re: Commercializing Mahout: the Myrrix recommender platform

2012-04-04 Thread Ted Dunning
works, but figured it's as good a time as any to ask, I figure. On Wed, Apr 4, 2012 at 5:35 PM, Ted Dunning ted.dunn...@gmail.com wrote: With this announcement, this group has a fork in the road facing us. We can choose the Hadoop path of forcibly excluding anybody with a slightly wrong

Re: Any way Mahout overcome the data sparsity problem ?

2012-04-02 Thread Ted Dunning
set of items. That makes the computation of similarity between users imprecise and consequently reduces the accuracy of CF algorithms. http://www.jucs.org/jucs_17_4/a_clustering_approach_for On Sun, Apr 1, 2012 at 1:20 PM, Ted Dunning ted.dunn...@gmail.com wrote: Could you say a bit more

Re: Any way Mahout overcome the data sparsity problem ?

2012-04-02 Thread Ted Dunning
preferences ? What about semi-anonymous model ? very good answer. Thanks Ted On Mon, Apr 2, 2012 at 7:40 PM, Ted Dunning ted.dunn...@gmail.com wrote: This problem is much more commonly referred to as the cold start problem and is far smaller than many authors assume. Typically a dozen good

Re: Any way Mahout overcome the data sparsity problem ?

2012-04-01 Thread Ted Dunning
Could you say a bit more about what you mean? Which data sparsity problem? Sent from my iPhone On Apr 1, 2012, at 6:35 AM, ziad kamel ziad.kame...@gmail.com wrote: Hi, Is there any way that Mahout CF can overcome the data sparsity problem? Thanks

Re: User Similarity and neighborhoods

2012-04-01 Thread Ted Dunning
It depends. The large scale systems for item based recommendations definitely do not do this. Sent from my iPhone On Apr 1, 2012, at 7:13 AM, ziad kamel ziad.kame...@gmail.com wrote: Does Mahout compute the similarity between every pair of users to determine their neighborhoods ?

Re: CityBlockSimilarity details

2012-03-29 Thread Ted Dunning
It is very common that preferences or ratings DECREASE recommendation performance. The basic reason is that there is little or no real signal in the ratings after you account for the fact that the rating exists at all. In practice, there is the additional reason that if you don't need a rating,

Re: Getting InMemBuilder to use more mappers

2012-03-29 Thread Ted Dunning
Split your training data into lots of little files. Depending on the wind, that may cause more mappers to be invoked. On Thu, Mar 29, 2012 at 3:05 PM, Jason L Shaw jls...@uw.edu wrote: Suggestion, indeed. I passed that option, but still only 2 mappers were created. On Thu, Mar 29, 2012 at

Re: CityBlockSimilarity details

2012-03-29 Thread Ted Dunning
? On Thu, Mar 29, 2012 at 5:04 PM, Ted Dunning ted.dunn...@gmail.com wrote: It is very common that preferences or ratings DECREASE recommendation performance. The basic reason is that there is little or no real signal in the ratings after you account for the fact that the rating exists

Re: I don't get a reply to my email

2012-03-28 Thread Ted Dunning
Have you subscribed? Most readers of the email list will assume that you have subscribed to the list and they will answer to the list. If you haven't subscribed, you won't see these answers. On the other hand, some questions may not be answered if the questions are difficult to understand or

Re: options for finding smallest eigenvectors

2012-03-27 Thread Ted Dunning
The smallest eigenvalues are always problematic in large matrices. Any trick to expose them (such as the diagonal subtraction that you mention) should work with any of our stuff as well. On Tue, Mar 27, 2012 at 2:01 AM, Dan Brickley dan...@danbri.org wrote: If one wanted the *smallest*
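For completeness, the diagonal-subtraction trick is the usual spectral shift; a sketch of the algebra, assuming a symmetric matrix:

```latex
% If A v_i = \lambda_i v_i for symmetric A, then for any shift \sigma \ge \lambda_{\max}(A):
(\sigma I - A)\, v_i = (\sigma - \lambda_i)\, v_i ,
% so the smallest eigenvalues of A become the largest eigenvalues of \sigma I - A,
% which a large-eigenvalue solver (e.g. Lanczos or a stochastic SVD) can then expose.
```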

Re: Mahout beginner questions...

2012-03-26 Thread Ted Dunning
recommendations and models per click of the user (because you need to rebuild the data in the HDFS run you batch job, and return an answer) -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Monday, March 26, 2012 00:56 To: user@mahout.apache.org Subject: Re

Re: Mahout beginner questions...

2012-03-25 Thread Ted Dunning
It sounds like the original poster isn't clear about the division between off-line and on-line work. Almost all production recommendation systems have a large off-line component which analyzes logs of behavior and produces a recommendation model. This model typically consists of item-item

Re: Mahout beginner questions...

2012-03-25 Thread Ted Dunning
Not really. See my previous posting. The best way to get fast recommendations is to use an item-based recommender. Pre-computing recommendations for all users is not usually a win because you wind up doing a lot of wasted work and you still don't have anything for new users who appear between

Re: Mahout beginner questions...

2012-03-25 Thread Ted Dunning
On Sun, Mar 25, 2012 at 3:36 PM, Razon, Oren oren.ra...@intel.com wrote: ... The system I need should of course give the recommendation itself in no time. ... But because I'm talking about very large scales, I guess that I want to push much of my model computation to offline mode (which

Re: Mahout beginner questions...

2012-03-25 Thread Ted Dunning
On Sun, Mar 25, 2012 at 4:02 PM, Razon, Oren oren.ra...@intel.com wrote: So let's continue with your example... I will do I 2 I similarity matrix on Hadoop and then will do online recommendation based on it and the user ranked items. Yes. So where does the online part will sit at? Is it

Re: Significant - serendipity in recommending

2012-03-24 Thread Ted Dunning
I don't know what you mean by significant any more than Sean. But serendipity in a recommender comes from two sources. Both must be present. One source is having enough people who interact with the recommender. The second source is a judicious injection of exploration which can come from

Re: Significant - serendipity in recommending

2012-03-24 Thread Ted Dunning
they want. good luck :-) On 24 March 2012 17:00, Ted Dunning ted.dunn...@gmail.com wrote: I don't know what you mean by significant any more than Sean. But serendipity in a recommender comes from two sources. Both must be present. One source is having enough people who interact

Re: Merging similarities from two different approaches

2012-03-23 Thread Ted Dunning
My own recommendation is to reduce both scores to binary form using whatever sound statistical method you care to adopt and then use OR. A viable alternative that is relatively good is to convert both scores to percentiles with the same polarity (i.e. 99-th %-ile is very close or very similar).
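Both options reduce to a few lines; the sketch below (plain Java, with illustrative thresholds and an assumed pre-sorted score array) shows the thresholded OR and the percentile conversion side by side.

```java
import java.util.Arrays;

public class ScoreMerging {
  // Option 1: binarize each score against its own (statistically chosen) threshold, then OR.
  static boolean eitherSimilar(double a, double thresholdA, double b, double thresholdB) {
    return a >= thresholdA || b >= thresholdB;
  }

  // Option 2: map a score onto its percentile rank (0..100) within its own distribution,
  // so both kinds of scores share a common polarity and scale.
  static double percentile(double x, double[] sortedScores) {
    int pos = Arrays.binarySearch(sortedScores, x);
    if (pos < 0) {
      pos = -pos - 1;                 // insertion point for values not present
    }
    return 100.0 * pos / sortedScores.length;
  }
}
```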

Re: MongoDBDataModel in memory ?

2012-03-19 Thread Ted Dunning
Session data never needs to be in memory. It can be processed sequentially or using map reduce. The item item data is all you need in memory. Sent from my iPhone On Mar 18, 2012, at 10:19 PM, Mridul Kapoor mridulkap...@gmail.com wrote: On 19 March 2012 02:24, Ted Dunning ted.dunn

Re: Edit Distance

2012-03-19 Thread Ted Dunning
While I didn't do as nice a job as your friend, TFIDF of n-grams has consistently done very well for me. The soft TFIDF that they examine is something that I haven't previously looked at, but everything else seems just in order based on what I have seen. On Mon, Mar 19, 2012 at 1:06 PM, Dawid Weiss

Re: MongoDBDataModel in memory ?

2012-03-19 Thread Ted Dunning
On Mon, Mar 19, 2012 at 10:06 PM, Mridul Kapoor mridulkap...@gmail.comwrote: Is there a way that I run the ItemSimilarityJob on a single machine ? Yes. There is a sequential invocation as well.

Re: How to do logistic regression by mahout?

2012-03-18 Thread Ted Dunning
The last third of the Mahout in Action book covers this pretty extensively. On Sun, Mar 18, 2012 at 5:25 AM, Felix.徐 ygnhz...@gmail.com wrote: Hi,all. I'm new to mahout, it seems that logistic regression is already integrated into

Re: MongoDBDataModel in memory ?

2012-03-18 Thread Ted Dunning
Mridul, What is the humongous amount of data in Mongo? Is it really item-item links? Or is it session information? With a recommender, it is unusual to have more than a few hundred links to other items for any given item. This means that even for 10 million items, you only have about a

Re: Injecting content into item-item CF

2012-03-13 Thread Ted Dunning
This is search, not recommendation. For search, you need to build an index (which can be built off-line). In the process of building that index, you can propagate content terms across highly similar (behaviorally) items and you can include references to and from similar items. Content-based

Re: Item Recommendations - Time based

2012-03-12 Thread Ted Dunning
In order to get time similarity that you want, you can have virtual users for each session as well as real users for longer time periods. The longer periods will have weaker statistics so you probably won't have to weight things. This will let you use the standard Mahout framework for everything

Re: Item Recommendations - Time based

2012-03-12 Thread Ted Dunning
Sean's comment is dead-on and your design inclinations are just fine. Hadoop can (eventually) help with the offline item similarity computation. The existing Mahout recommendation engine can do the actual item recommendation work at very high speed with an appropriate data store. On Mon, Mar

Re: Cluster-based recommenders

2012-03-12 Thread Ted Dunning
Be aware that cluster based recommenders almost never perform as well as user/item based recommenders. On Mon, Mar 12, 2012 at 10:03 AM, Ahmed Abdeen Hamed ahmed.elma...@gmail.com wrote: This is really great. Thanks so much! -Ahmed On Mon, Mar 12, 2012 at 12:13 PM, Sean Owen

Re: Item Recommendations - Time based

2012-03-12 Thread Ted Dunning
Actually I don't think that you will need to implement your own item similarity. Just preprocess your input by grouping by user and sorting by time. Then break user sessions into separate users and emit the standard user,item,pref format for the mahout processing. The pref will always be 1 in
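A sketch of that preprocessing step: events for one user, already sorted by time, are broken into sessions (the 30-minute gap and the composite id scheme are assumptions for illustration) and written out in Mahout's user,item,pref format with pref fixed at 1.

```java
import java.util.List;

public class SessionSplitter {
  private static final long SESSION_GAP_MS = 30 * 60 * 1000L;   // assumed session gap

  static class Event {
    final long userId;
    final long itemId;
    final long timestampMs;
    Event(long userId, long itemId, long timestampMs) {
      this.userId = userId;
      this.itemId = itemId;
      this.timestampMs = timestampMs;
    }
  }

  // Emits one line per event: virtualUserId,itemId,1  (pref is always 1)
  static void emit(List<Event> eventsForOneUserSortedByTime) {
    int session = 0;
    long lastTime = 0;
    boolean first = true;
    for (Event e : eventsForOneUserSortedByTime) {
      if (first || e.timestampMs - lastTime > SESSION_GAP_MS) {
        session++;                                    // a gap starts a new virtual user
        first = false;
      }
      lastTime = e.timestampMs;
      long virtualUserId = e.userId * 1000 + session; // crude composite id; assumes < 1000 sessions per user
      System.out.println(virtualUserId + "," + e.itemId + ",1");
    }
  }
}
```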

Re: Item Recommendations - Time based

2012-03-12 Thread Ted Dunning
I would generally recommend using the LLR similarity. But if you have an itch, scratch it. I do think we have a tanimoto similarity already, possibly under a slightly different name. Sent from my iPhone On Mar 12, 2012, at 2:00 PM, Mridul Kapoor mridulkap...@gmail.com wrote: Ah, right.

Re: Trouble with deriving popular items from mahout

2012-03-11 Thread Ted Dunning
It is probably worth trying the LLR item-item off-line build. This is more like what the guy needs than raw counts. On Sun, Mar 11, 2012 at 5:42 AM, Sean Owen sro...@gmail.com wrote: No, it's so easy you can do it in about 20 lines of code so I don't think it really warrants a software

Re: packaging a recommender as a war file

2012-03-07 Thread Ted Dunning
A separate project like this is a better way to package this in any case. It is bad practice to have developers modifying Mahout itself in order to build their applications. Nice work, Manuel! On Wed, Mar 7, 2012 at 10:00 AM, Manuel Blechschmidt manuel.blechschm...@gmx.de wrote: Hi Ben, I

Re: experimenting with mahout taste and ontologies

2012-03-07 Thread Ted Dunning
Business logic like this can be built into the IDRescorer capabilities. There is a lot of information in the mailing list archive on this kind of thing. See http://www.lucidimagination.com/search/p:mahout?q=IDRescorersearchProvider=lucid and
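A minimal sketch of what that looks like in code. IDRescorer is the Taste interface referenced above; the OntologyService, its methods, and the 1.2 boost factor are hypothetical stand-ins for whatever business logic applies.

```java
import org.apache.mahout.cf.taste.recommender.IDRescorer;

public class OntologyRescorer implements IDRescorer {

  // Hypothetical lookup for the business rules / ontology.
  public interface OntologyService {
    boolean isPreferredCategory(long itemId);
    boolean isExcluded(long itemId);
  }

  private final OntologyService ontology;

  public OntologyRescorer(OntologyService ontology) {
    this.ontology = ontology;
  }

  @Override
  public double rescore(long itemId, double originalScore) {
    // Boost items the ontology marks as preferred; leave the rest untouched.
    return ontology.isPreferredCategory(itemId) ? originalScore * 1.2 : originalScore;
  }

  @Override
  public boolean isFiltered(long itemId) {
    // Drop items that the business rules forbid outright.
    return ontology.isExcluded(itemId);
  }
}
```

The rescorer is then handed to the recommend call (e.g. recommender.recommend(userID, howMany, rescorer)), so the business logic never has to live inside the recommender itself.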

Re: Washing machines - Mahout algorithm advice

2012-03-03 Thread Ted Dunning
And further linear Markov chains can be expressed as matrix products which can be computed efficiently using SVD's. Zoltan, is this literally the problem that you are working on? Or is this a shadow of the problem that you are interested in? On Sat, Mar 3, 2012 at 9:55 AM, Jack Tanner

Re: Naive-Bayes work flow

2012-02-28 Thread Ted Dunning
I think that you have an invocation or format bug and you are effectively giving NB different data than you think. Note that this is what is called a stopped clock model. That means it is only getting correct results by putting out a constant value. Sent from my iPhone On Feb 28, 2012, at 2:58

Re: Naive-Bayes work flow

2012-02-27 Thread Ted Dunning
This is a tiny dataset. Have you considered just trying R? In fact, in terms of just diagnosing the problem, it would be good to run a regression in R first. Sent from my iPhone On Feb 27, 2012, at 3:57 AM, Naveenchandra naveenchandr...@gmail.com wrote: Hi guys, Thanks a lot for your regular

Re: Naive-Bayes work flow

2012-02-24 Thread Ted Dunning
If your synthetic data comes from the same distribution for yellow and purple then clearly no classifier will help. Also, naive bayes wants words, not numbers. Sent from my iPhone On Feb 24, 2012, at 5:08 AM, Naveenchandra naveenchandr...@gmail.com wrote: The python code which used is :

Re: Goals for Mahout 0.7

2012-02-23 Thread Ted Dunning
No problem. And thank you for being kind when I used language less moderate than appropriate. On Thu, Feb 23, 2012 at 8:13 PM, Ioan Eugen Stan stan.ieu...@gmail.comwrote: 2012/2/23 Ted Dunning ted.dunn...@gmail.com: Is this a joke? new String[] {-t, INPUT_TABLE, -m, MAIL_ACCOUNT_ID

Re: 0.7 Priorities

2012-02-22 Thread Ted Dunning
Aye say I. Sent from my iPhone On Feb 22, 2012, at 4:24 AM, Jake Mannix jake.man...@gmail.com wrote: If we're able to wrap this release up cleanly and get quickly moving on to new features again, maybe we can try this on a more regular basis, with even releases being feature-work, and odd

Re: Regression Algorithm

2012-02-21 Thread Ted Dunning
Bigger is always better. But you may be happier if you downsample the negative cases since they will be providing very little value in this model. Can you say what you mean by threshold? There is no threshold in Mahout's logistic regression. On Tue, Feb 21, 2012 at 5:44 PM, Sagar Sharma
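A tiny sketch of the downsampling idea: keep every positive example and only a random fraction of the negatives (the 10% rate and fixed seed are assumptions; tune to taste).

```java
import java.util.Random;

public class NegativeDownsampler {
  private static final double NEGATIVE_KEEP_RATE = 0.1;   // assumed keep rate for negatives
  private final Random random = new Random(42);           // fixed seed for reproducibility

  // Returns true if this training example should be kept.
  boolean keep(boolean isPositive) {
    return isPositive || random.nextDouble() < NEGATIVE_KEEP_RATE;
  }
}
```

Note that downsampling shifts the base rate, so if the model's outputs are used as probabilities rather than rankings, they need a corresponding correction.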

Re: Latent Semantic Analysis

2012-02-19 Thread Ted Dunning
Mahout 0.4 is ancient. Upgrade! Nobody can help with such an old version, really. On Sun, Feb 19, 2012 at 6:34 PM, Peyman Mohajerian mohaj...@gmail.comwrote: Hi Dmitriy Others, Dmitriy thanks for your previous response. I have a follow up question to my LSA project. I have managed to

Re: Naive-Bayes work flow

2012-02-15 Thread Ted Dunning
Efficiency is not normally a term used with classifiers. Can you define it? From your confusion matrix, it looks like nearly all of your documents are being classified into one class. That usually indicates that there is some fundamental formatting difference between your original training data

Re: Goals for Mahout 0.7

2012-02-13 Thread Ted Dunning
John, This is well said and is a critical need. There are some beginnings to this. The recommender side of the house already works the way you say. The classifier and hashed encoding API's are beginning to work that way. The naive Bayes classifiers pretty much do not and the classifier API's

Re: Hash-coded Vectorization and bogus information

2012-02-13 Thread Ted Dunning
On Tue, Feb 14, 2012 at 2:25 AM, Lance Norskog goks...@gmail.com wrote: ... OnlineLogisticRegression allocates DenseVector/DenseMatrix objects- if it used RandomSparse Vector/Matrix could it operate on million-term sparse arrays? Not likely. The feature vectors that come in are sparse and

Re: Hash-coded Vectorization and bogus information

2012-02-12 Thread Ted Dunning
Hash coded vectorization *is* a random projection. It is just one that preserves some degree of sparsity. It definitely loses information when you use it to decrease dimension of the input. It does not add bogus information. SGD doesn't like dense vectors, actually. In fact, one of the nice

Re: Hash-coded Vectorization and bogus information

2012-02-12 Thread Ted Dunning
. On Sun, Feb 12, 2012 at 7:00 AM, Ted Dunning ted.dunn...@gmail.com wrote: Hash coded vectorization *is* a random projection. It is just one that preserves some degree of sparsity. It definitely loses information when you use it to decrease dimension of the input. It does not add bogus information

Re: Need help in loading model for classification

2012-02-02 Thread Ted Dunning
Trim the model by setting a minimum term frequency. On Thu, Feb 2, 2012 at 9:39 PM, SAMIK CHAKRABORTY sam...@gmail.com wrote: Hi, I am new to mahout and hadoop. I have created a model (following the train classifier command) which has a size of 500MB. Now when I am loading the model for

Re: understanding naive bayes + ngrams

2012-02-01 Thread Ted Dunning
I think your analysis is correct, but you are also probably correct that having multiple levels at the same time would be preferable. On Wed, Feb 1, 2012 at 1:05 PM, Stuart Smith stu24m...@yahoo.com wrote: Hello, I was curious about how bayes handles the ngram argument, and how it could be

Re: Parallel ALS-WR on very large matrix -- crashing (I think)

2012-02-01 Thread Ted Dunning
So the total size of the data is modest at about 560 M non-zero elements. Total data should be small compared to your node sizes. But the distribution of your data can be important as well. Can you say if you have any rows or columns are extremely dense? On Wed, Feb 1, 2012 at 4:58 PM, Kate

Re: mahout matrix package

2012-01-25 Thread Ted Dunning
Matrix inverse is almost never a good idea. The same effect can usually be had using a decomposition at far less cost. For instance, for solving a linear system, QR decomposition provides two sub-matrices that can easily have an inverse multiply operation applied to them avoiding the need for
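A sketch of the QR-instead-of-inverse point using Mahout's math package (QRDecomposition and DenseMatrix as recalled from org.apache.mahout.math; treat the exact names as assumptions):

```java
import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.QRDecomposition;

public class SolveWithQR {
  public static void main(String[] args) {
    Matrix a = new DenseMatrix(new double[][] {{4, 1}, {1, 3}});
    Matrix b = new DenseMatrix(new double[][] {{1}, {2}});
    // Solves a x = b by applying Q^T and back-substituting against R;
    // no explicit inverse of a is ever formed.
    Matrix x = new QRDecomposition(a).solve(b);
    System.out.println(x);
  }
}
```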

Re: Suggestions Needed : Developing application using Mahout

2012-01-24 Thread Ted Dunning
There are a bunch of papers on this. Search named entity recognizer CRF on google. The basic idea is that an HMM or CRF has internal state that can be used to mark named entities. We don't have to define what the hidden states mean, just help the HMM or CRF find an internal representation that

Re: SGD: mismatch in percentCorrect vs classify() on training data?

2012-01-24 Thread Ted Dunning
From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org; Stuart Smith stu24m...@yahoo.com Cc: Mahout List mahout-u...@lucene.apache.org Sent: Monday, January 23, 2012 5:52 PM Subject: Re: SGD: mismatch in percentCorrect vs classify() on training data? Hmm... I am surprised

Re: Clustering or classification?

2012-01-24 Thread Ted Dunning
If you have supervised training data (and it sounds that way), then classification is likely to be more effective. On Tue, Jan 24, 2012 at 7:44 PM, Vikas Pandya vika...@yahoo.com wrote: Thanks. creating vectors for these three columns and clustering them doesn't bring desired results. here is

Re: Suggestions Needed : Developing application using Mahout

2012-01-23 Thread Ted Dunning
The HMM implementations might be of help, but I think that a small CRF implementation that is oriented around string transduction would be more helpful. The Stanford Named Entity Recognizer (NER) has such an implementation. I think NLTK has one. I think GATE has one as well. The basic

Re: Mahout Taste Deployment On Hadoop

2012-01-20 Thread Ted Dunning
I doubt if it will work on Hadoop 0.19. Mahout requires 0.20 and pretty much always has. Changing that will be difficult to check even if it isn't difficult to do. In any case, you should probably get off of 0.19 as soon as possible as well since there are known stability problems with that

Re: Why Mahout bayes implementation is tightly coupled with Hadoop?

2012-01-19 Thread Ted Dunning
Yes. The use of Hadoop here makes things silly slow. On Thu, Jan 19, 2012 at 8:07 AM, Daniel Korzekwa daniel.korze...@gmail.comwrote: ./mahout trainclassifier -i /mnt/hgfs/C/daniel/my_fav_data/test -o model -type bayes -ng 1 -source hdfs, it takes 40 seconds to train a model for a file with

Re: Why Mahout bayes implementation is tightly coupled with Hadoop?

2012-01-19 Thread Ted Dunning
Mike, I think that where you are going is that Mahout might be well served by non-Hadoop implementations of map-reduce or by non-map-reduce frameworks, especially where smaller data and experimental use is concerned. You are right. Or, at least I agree with what I think you are saying. Sean is

Re: About QRDecomposition

2012-01-18 Thread Ted Dunning
There are lots of QR decomposition algorithms and the results are not necessarily unique, especially for rank deficient inputs. If you post your exact results, I could comment more specifically. Without more details, I really can't answer your question in any specific way. On Wed, Jan 18, 2012

Re: MR Vectorization

2012-01-17 Thread Ted Dunning
Time since the last packet from the same source or to the same destination is another interesting feature. On Tue, Jan 17, 2012 at 11:10 AM, Harry Potter harry123gr...@yahoo.comwrote: thanks sir... that was really helpful.. From: Ioan Eugen Stan

Re: Clustering user profiles

2012-01-15 Thread Ted Dunning
On Sun, Jan 15, 2012 at 2:13 PM, Raviv Pavel ra...@gigya-inc.com wrote: If I understand correctly, in normalization option #2 you mean that each interest is encoded to value so that the sum of all interests is 1? Yes. Also, What do you mean by normalize the interests to have unit vector

Re: Help in running from command line

2012-01-15 Thread Ted Dunning
It isn't that bad. Maven is opinionated (that is a feature, not a defect). But it isn't that hard to deal with. The first concept to deal with is that maven has pre-defined life cycle goals. The most important for most programmers are compile, test, package and install. These pretty much mean

Re: Clustering user profiles

2012-01-13 Thread Ted Dunning
I usually prefer to represent location as an xyz triple on a unit sphere. That allows Euclidean distance to be useful. On the 1-of-n encoded values, Euclidean works as well. For gender, it also works fine. The only issue is how to combine these with reasonable weightings. An easy way to do
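The xyz-on-a-unit-sphere encoding is a one-liner per coordinate; a short sketch, assuming latitude and longitude in degrees:

```java
public class GeoFeatures {
  // Maps a (lat, lon) pair onto a point on the unit sphere so that
  // Euclidean distance between two locations behaves sensibly.
  static double[] toUnitSphere(double latDegrees, double lonDegrees) {
    double lat = Math.toRadians(latDegrees);
    double lon = Math.toRadians(lonDegrees);
    return new double[] {
        Math.cos(lat) * Math.cos(lon),   // x
        Math.cos(lat) * Math.sin(lon),   // y
        Math.sin(lat)                    // z
    };
  }
}
```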
