Re: Clustering users

2010-05-11 Thread Ted Dunning
Generally, you want to do a bit of projection on these data before clustering. One option is random projection. This maps each item to a sparse binary vector based on a few independent hashes of the original item id. This gives you a moderate dimensional vector to do clustering in (say
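
A minimal sketch of the hashing idea described above. The target dimension (64), the number of hashes per item (3), the use of SHA-256, and the function names are all illustrative assumptions; none of them come from the thread or from Mahout.

```python
import hashlib

def hashed_item_vector(item_id, dim=64, num_hashes=3):
    """Map an item id to a sparse binary vector, stored as the set of active indices."""
    active = set()
    for seed in range(num_hashes):
        # Salting with the seed gives a few (approximately) independent hashes.
        digest = hashlib.sha256(f"{seed}:{item_id}".encode("utf-8")).digest()
        active.add(int.from_bytes(digest[:8], "big") % dim)
    return active

def project_user(item_ids, dim=64, num_hashes=3):
    """Sum the hashed vectors of a user's items into one vector of length dim."""
    vec = [0] * dim
    for item_id in item_ids:
        for idx in hashed_item_vector(item_id, dim, num_hashes):
            vec[idx] += 1
    return vec
```

Each user then becomes a fixed-length vector regardless of how many distinct items exist, which is what makes ordinary clustering practical.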

Re: Clustering help

2010-05-15 Thread Ted Dunning
You won't necessarily see any distinct clumps, depending on your data. With some text, you might get such, but with resumes, especially if you don't do IDF weighting, you are likely to have a pretty nasty distribution that doesn't clump very well at all. Even with IDF weighting on terms and the

Re: Mahout LDA Parameter: maxIter

2010-05-23 Thread Ted Dunning
What happens if the number is too large? Is this a dense matrix we are talking about? Would it work to make it a random access sparse matrix with very, very large bounds? On Sun, May 23, 2010 at 10:29 AM, Jeff Eastman j...@windwardsolutions.com wrote: I agree it is not very friendly.

Re: Collocation and Seq2Sparse Questions

2010-05-27 Thread Ted Dunning
Just to forestall some effort on this, LLR is very good for threshold, but the value is bad as a score so substituting TF or TFIDF is entirely appropriate. There may be use cases for keeping LLR if only for diagnostic purposes. On Thu, May 27, 2010 at 8:52 AM, Drew Farris drew.far...@gmail.com
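
As a sketch of "LLR for the threshold, TF for the score": the code below computes the standard log-likelihood ratio (G²) statistic for a 2x2 contingency table and uses it only as a filter. The function names, the candidate tuple layout, and the threshold value are assumptions made for illustration, not Mahout's API.

```python
import math

def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)

def entropy(*counts):
    """Unnormalized entropy: N*log(N) - sum(k*log(k))."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of cooccurrence counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row + col - mat))

def select_collocations(candidates, threshold):
    """candidates: list of (ngram, (k11, k12, k21, k22), tf).

    LLR decides whether an n-gram is kept; the reported score is its tf.
    """
    return [(ngram, tf) for ngram, counts, tf in candidates
            if llr(*counts) >= threshold]
```

For example, `select_collocations([("new york", (10, 0, 0, 10), 5), ("of the", (10, 10, 10, 10), 50)], 5.0)` keeps only the first candidate, because the second table shows no association at all.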

Re: Collocation and Seq2Sparse Questions

2010-05-27 Thread Ted Dunning
A bit off topic, but what you really want is collocations that bring different information to the party than the constituent words. That is, you need to detect cases where the meaning of the collocation is not compositionally predicted by the meanings of the words in the collocation. Simple

Re: M/R Job for Log file to FPG

2010-05-27 Thread Ted Dunning
That should be a small change (and helpful for a lot of mining tasks). But once you jump on that slippery slope, why not allow a tiny Groovy closure to be injected? Or to pass in an object that will extract a map of values from each line? On Thu, May 27, 2010 at 2:59 PM, Grant Ingersoll

Re: Understanding the SVD recommender

2010-06-03 Thread Ted Dunning
understand Nu x VTk, but then P is defined by an additional product with Uk In short... what? On Thu, Jun 3, 2010 at 4:15 PM, Ted Dunning ted.dunn...@gmail.com wrote: Fire away. On Thu, Jun 3, 2010 at 3:52 AM, Sean Owen sro...@gmail.com wrote: Is anyone out there familiar enough

Re: Understanding the SVD recommender

2010-06-04 Thread Ted Dunning
better approach with SVD++ and their time dynamics trick. That is much the same as mean removal. On Fri, Jun 4, 2010 at 6:48 AM, Ted Dunning ted.dunn...@gmail.com wrote: You are correct. The paper has an appalling treatment of the folding in approach. In fact, the procedure is dead

Re: Generating a Document Similarity Matrix

2010-06-15 Thread Ted Dunning
Thresholds are generally dangerous. It is usually preferable to specify the sparseness you want (1%, 0.2%, whatever), sort the results in descending score order using Hadoop's built-in capabilities and just drop the rest. On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack mrkrisj...@gmail.com wrote: I
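
A small in-memory sketch of "specify the sparseness, sort, drop the rest". On a cluster the sort would be done by Hadoop; here a plain Python sort stands in for it, and the function name and input layout are assumptions.

```python
def keep_top_fraction(scored_pairs, fraction):
    """Keep the highest-scoring `fraction` of (pair, score) entries.

    The result is sorted by descending score, so the cut-off adapts to the
    score distribution instead of relying on a fixed threshold.
    """
    ranked = sorted(scored_pairs, key=lambda entry: entry[1], reverse=True)
    if not ranked:
        return []
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]
```

With 100 scored pairs and `fraction=0.02`, exactly the two best pairs survive, whatever their absolute scores happen to be.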

Re: Predicting Successor Item

2010-06-16 Thread Ted Dunning
items? On Tue, Jun 15, 2010 at 8:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: You have most of the workings available to do a reasonable job of this in Mahout. The simplest method in my mind is to grovel the logs and emit pairs of items with the key being the last item and previous

Re: Predicting Successor Item

2010-06-16 Thread Ted Dunning
I would follow Sean's suggestion and try simpler methods first. My guess is that the important structure of the HMM may be much easier to learn by sparsification techniques. Sequence-aware methods also have potential for harm in that they may just reverse-engineer your current link structure.

Re: PFPGrowth on cluster does not distribute work load equally on nodes

2010-06-16 Thread Ted Dunning
How large is your input and how is it arranged in files? Is your input oddly distributed? Are there big skews in item frequency? 2010/6/16 Björn Jacobs jac...@gmx.de Is this a bug or do I have to configure something to get this working?

Re: GenericDataModel Serializable

2010-06-21 Thread Ted Dunning
Tamas, In what context is this serialization occurring? Would it be better to use an alternative serialization framework such as Gson or Hadoop or Avro? I tend to try to avoid native serialization because of the problems that come up so easily. On Sun, Jun 20, 2010 at 5:54 PM, Tamas Jambor

Re: Content-based Recommender Implementation

2010-06-22 Thread Ted Dunning
You can also recommend attributes to users by reducing the user, item history file to a user, attribute history file. Once you have recommended attributes, you can use a search engine or an attribute to item recommendation engine to get the items to recommend. On Tue, Jun 22, 2010 at 5:43 AM,

Re: Rule-based classifier

2010-06-24 Thread Ted Dunning
The SGD and SVM implementations (neither released yet) both have sequential versions. I expect that for pretty large corpora that they will be faster than the MR learners due to lower overhead and faster convergence. See http://leon.bottou.org/projects/sgd for why. On Wed, Jun 23, 2010 at

Re: Rule-based classifier

2010-06-24 Thread Ted Dunning
On Wed, Jun 23, 2010 at 11:13 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: * Do any classifiers offer the option of basing classification on linguistic rules? It is common in advanced text classifiers to include human guided features such as you suggest here. This is one of the

Re: Recommendations on binary ratings

2010-06-26 Thread Ted Dunning
Pranay, Sean's comments are dead-on. You may be able to get a feel for how good (or not) these results are by marking all unrated items either as good or bad. That will likely tell you that the real precision is between 0.22 and 0.9. This same problem is exhibited by essentially all other
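
The bounding trick above can be sketched as follows: count every unrated recommendation as bad for the lower bound and as good for the upper bound. The function name and the toy data in the note below are assumptions for illustration.

```python
def precision_bounds(recommended, known_good, known_bad):
    """Return (lower, upper) bounds on precision for a recommendation list.

    Lower bound: all unrated recommended items are treated as bad.
    Upper bound: all unrated recommended items are treated as good.
    """
    n = len(recommended)
    if n == 0:
        return (0.0, 0.0)
    good = sum(1 for item in recommended if item in known_good)
    bad = sum(1 for item in recommended if item in known_bad)
    unrated = n - good - bad
    return (good / n, (good + unrated) / n)
```

A wide gap between the two numbers means that most recommended items were never rated, so the measured precision says little about the real one.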

Re: suggestion for SVD

2010-06-28 Thread Ted Dunning
How much speedup do you observe? On Mon, Jun 28, 2010 at 2:29 PM, Tamas Jambor jambo...@gmail.com wrote: Hi, I was looking at the SVD code, I am sure you are aware of this modification, but it would really make things faster. The idea is that you set up a minimum RMSE improvement so it

Re: suggestion for SVD

2010-06-28 Thread Ted Dunning
or later). Tamas On 28/06/2010 22:31, Ted Dunning wrote: How much speedup do you observe? On Mon, Jun 28, 2010 at 2:29 PM, Tamas Jambor jambo...@gmail.com wrote: Hi, I was looking at the SVD code, I am sure you are aware of this modification, but it would really make things faster

Re: ICML / COLT and Mahout

2010-06-30 Thread Ted Dunning
Indeed. Did you mention it? On Wed, Jun 30, 2010 at 12:32 AM, Danny Leshem dles...@gmail.com wrote: Just came back from ICML / COLT. The two conferences held a joint workshop day, with one of the tracks concentrating on open-source software for machine-learning (see

Re: page rank algorithm?

2010-06-30 Thread Ted Dunning
Also note that there *is* a pretty large scale SVD solver in Mahout. That can give you a short-cut to PageRank. On Wed, Jun 30, 2010 at 12:11 PM, Grant Ingersoll gsing...@apache.org wrote: If not, I'd like to implement it. Any advice appreciated, Have a look at the matrix/vector libraries.

Re: page rank algorithm?

2010-07-01 Thread Ted Dunning
Jimmy Lin's presentation (first link on this page: http://www.umiacs.umd.edu/~jimmylin/) had to do with data structure improvements for link distance computations. After his talk, there was an interesting discussion with Arun Murthy of the map-reduce team at Yahoo. Arun's contention was that it

Re: Mahout running on Hadoop

2010-07-02 Thread Ted Dunning
By this, do you mean migrate from using the Mahout recommendation framework without hadoop to using the Mahout recommendation framework with Hadoop? On Fri, Jul 2, 2010 at 8:26 AM, matboeh...@googlemail.com wrote: However, I am currently looking for an easy way of how to migrate to Hadoop.

Re: SVD and Clustering

2010-07-05 Thread Ted Dunning
Practically speaking, term weighting is important, but you also have to watch out for eigen-spoke behavior. https://research.sprintlabs.com/publications/uploads/icdm-09-ldmta-camera-ready.pdf This can arise when you have strong clique-phenomenon in your data (not likely in your case) or where

Re: [OT] Mahout expertise

2010-07-07 Thread Ted Dunning
Pity. I am in the San Francisco Bay area. Would love to help. Robin Anil is in India, but I think he is totally over-committed. On Wed, Jul 7, 2010 at 9:17 AM, tog guillaume.all...@gmail.com wrote: Hi, I am looking for a Mahout (and related technologies) expert in Bangalore for a few

Re: Beginner questions on clustering M/R

2010-07-15 Thread Ted Dunning
Clustering of time series data is usually better done in an abstract relatively low dimensional coordinate space based on some transform like a locality sensitive frequency transform. Gabor transforms might be appropriate. You might be able to get away with something like an SVD of your daily

Re: Newbie questions about Mahout 228: Logistic Regression LR (SGD)

2010-07-19 Thread Ted Dunning
On Mon, Jul 19, 2010 at 1:29 AM, ihadanny ido.hada...@gmail.com wrote: I've been trying out mahout-228: Sequential LR (using SGD). Thanks! Few things I haven't been able to figure out: 1. Is there a parallel version? Can it integrate with hadoop and do each pass in parallel? Not

Re: Cloudera HUE Opensourced

2010-07-19 Thread Ted Dunning
That would be great! On Mon, Jul 19, 2010 at 7:38 PM, Josh Patterson j...@cloudera.com wrote: From just a personal time perspective, I may try and mock up some demos for something like this.

Re: Adding weighting to boolean data

2010-07-21 Thread Ted Dunning
This is, roughly, a reasonable thing to do. If you want to maintain the fiction of counts a little bit more closely, you might consider just having counts decay over time and having short visits only give partial credit. On Wed, Jul 21, 2010 at 3:54 PM, Dave Williford
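
One way to read "counts decay over time and short visits only give partial credit" is sketched below. The 30-day half-life, the 30-second full-credit duration, and both function names are assumptions chosen only to make the example concrete.

```python
import math

def visit_credit(duration_seconds, full_credit_seconds=30.0):
    """Partial credit for short visits, capped at 1.0 for long ones."""
    return min(1.0, duration_seconds / full_credit_seconds)

def decayed_count(visits, now, half_life_days=30.0):
    """Sum of visit credits, each halved for every half-life of age.

    visits: iterable of (timestamp_in_days, duration_in_seconds).
    """
    rate = math.log(2.0) / half_life_days
    return sum(visit_credit(duration) * math.exp(-rate * (now - timestamp))
               for timestamp, duration in visits)
```

A long visit made one half-life ago and a half-length visit made today both contribute about 0.5, so the result still behaves like a count, just a fractional one.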

Re: How to combine boolean datamodel with datamodel

2010-07-21 Thread Ted Dunning
This is a ubiquitous problem with cooccurrence algorithms since they scale in the square of the number of occurrences of the most popular item. The good news is that you learn everything there is to learn about that item if you look at just a sampling of the occurrences so sampling is your friend. If
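
A sketch of the sampling idea: cap the number of occurrences kept per item so that no single popular item dominates the quadratic cooccurrence cost. The cap of 500, the fixed seed, and the function name are assumptions.

```python
import random

def cap_occurrences(user_items, max_per_item=500, seed=42):
    """Downsample (user, item) pairs to at most max_per_item pairs per item.

    Items below the cap are left untouched; popular items are sampled
    uniformly without replacement.
    """
    rng = random.Random(seed)
    by_item = {}
    for user, item in user_items:
        by_item.setdefault(item, []).append(user)
    sampled = []
    for item, users in by_item.items():
        if len(users) > max_per_item:
            users = rng.sample(users, max_per_item)
        sampled.extend((user, item) for user in users)
    return sampled
```

With a cap of c, the pair count contributed by any one item is bounded by roughly c squared instead of growing with the square of its popularity.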

Re: Adding weighting to boolean data

2010-07-21 Thread Ted Dunning
make me even less likely to consider it as an early design option. On Wed, Jul 21, 2010 at 5:02 PM, Ted Dunning ted.dunn...@gmail.com wrote: This is, roughly, a reasonable thing to do. If you want to maintain the fiction of counts a little bit more closely, you might consider just having

Re: finding new users

2010-07-27 Thread Ted Dunning
Sean, Are you back yet? I have a friend in London who is apparently in somewhat dire straits (mugged, everything taken except passport). I am looking for resources in London to help him out. On Tue, Jul 27, 2010 at 6:26 AM, Sean Owen sro...@gmail.com wrote: There's no direct way to do this,

Re: Using Mahout with Lucene 4.0

2010-08-06 Thread Ted Dunning
Lucene 4.0? 3.0 just came out. http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-trunk/lastSuccessfulBuild/artifact/lucene/build/docs/changes/Changes.html#older On Fri, Aug 6, 2010 at 8:59 AM, smcgi...@seas.upenn.edu wrote: Hello, I am trying to import an index from Solr 1.5,

Re: Using Mahout with Lucene 4.0

2010-08-06 Thread Ted Dunning
importing vectors from this Solr-trunk/Lucene-trunk combination. Thanks! Steve Quoting Ted Dunning ted.dunn...@gmail.com: Lucene 4.0? 3.0 just came out. http://hudson.zones.apache.org/hudson/view/Lucene/job/Lucene-trunk/lastSuccessfulBuild/artifact/lucene/build/docs/changes/Changes.html

Re: Google Prediction API

2010-08-06 Thread Ted Dunning
I was at this talk and it was appallingly bad. The most serious confusion is that the algorithms behind the prediction API are NOT the same as the algorithms described in the talk. The talk was really two talks glued together without a transition. The first part was essentially just a rehash of

Re: A question regarding GenericUserBasedRecommender

2010-08-12 Thread Ted Dunning
Focussing on rating error is also problematic in that it causes us to worry about being correct about the estimated ratings for items that will *never* be shown to a user. In my mind, the only thing that matters in a practical system is the ordering of the top few items and the rough composition

Re: ItemSimilarityJob

2010-08-12 Thread Ted Dunning
or tomorrow. On Thu, Aug 12, 2010 at 10:30 PM, Ted Dunning ted.dunn...@gmail.com wrote: Jimmy Lin's stripes work was presented at the last Summit and there was heated (well, warm and cordial at least) discussion with the Map-reduce committers about whether good use of a combiner wouldn't do

Re: problem when using k-means on sythetic contral data

2010-08-19 Thread Ted Dunning
Ahh thanks for being brave enough to ask. A JIRA is a bug ticket. See http://issues.apache.org/jira/browse/MAHOUT Filing a complete statement of the problem there will really help with documenting the problem. Also, if you can develop a patch that helps fix the problem, you can attach it

Re: Least Square Regression in Mahout?

2010-08-20 Thread Ted Dunning
We don't have mega scale ols but we do have mega scale svd which should be close to what you want if you have sparse data. Sent from my iPhone On Aug 20, 2010, at 1:37 PM, Chris Bates christopher.andrew.ba...@gmail.com wrote: Hi all, I'm new to the list. I have a bunch of algorithms

Re: adding feature:skip user's non-interested items when generate recommendation for user.

2010-08-23 Thread Ted Dunning
Sorry to chime in late, but removing items after recommendation isn't such a crazy thing to do. In particular, it is common to remove previously viewed items (for a period of time). Likewise, if the user says don't show this again, it makes sense to backstop the actual recommendation system with

Re: SequentialAccessSparseVector iterator() and iterateNonZero() odd behaviour

2010-08-25 Thread Ted Dunning
Can you file a bug report at http://issues.apache.org/jira/browse/MAHOUT ? Please attach your test case. On Wed, Aug 25, 2010 at 7:25 AM, Laszlo Dosa laszlo.d...@fredhopper.com wrote: Hi, I tried to iterate over the elements of a SequentialAccessSparseVector. I run the following test and

Re: SequentialAccessSparseVector iterator() and iterateNonZero() odd behaviour

2010-08-27 Thread Ted Dunning
I formatted your tests as a patch and attached them to the bug itself. On Fri, Aug 27, 2010 at 8:38 AM, Laszlo Dosa laszlo.d...@fredhopper.com wrote: It is filed as MAHOUT-489. Regards, Laszlo -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: 25 August

Re: mahout guide or tutorial or how to for test and run kmean on hadoop

2010-08-27 Thread Ted Dunning
I don't know much about weka lately, but I don't know about any support for calling Mahout clustering algorithms from weka. Typically people run Mahout clustering from the command line. On Fri, Aug 27, 2010 at 1:06 PM, Valerio valerio.cera...@gmail.com wrote: hi all, I need some guides that

Re: recommendations

2010-08-29 Thread Ted Dunning
These are examples of what I call cross-recommendation where you have user x item1 and user x item2 data and you want item1 = item2 recommendations. All of the standard techniques apply (user-based, LLR cooccurrence, SVD, latent factor models), but you have to rejigger things here and there.

Re: SVD Expectations

2010-08-29 Thread Ted Dunning
Like Jake said. On Sun, Aug 29, 2010 at 4:48 PM, Ted Dunning ted.dunn...@gmail.com wrote: In particular, since our sparse representation requires an int (4 bytes) and a double (8 bytes) to store one non-zero entry while a dense row requires only 8 bytes per entry then your original data
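
The storage arithmetic above can be checked with a few lines. This sketch counts only the bytes named in the message (4-byte int index plus 8-byte double per non-zero, 8 bytes per dense entry) and ignores object and array overhead; the function names are assumptions.

```python
SPARSE_BYTES_PER_NONZERO = 4 + 8  # int index + double value
DENSE_BYTES_PER_ENTRY = 8         # double value only

def row_bytes(num_columns, num_nonzeros):
    """Return (sparse_bytes, dense_bytes) needed to store one row."""
    return (SPARSE_BYTES_PER_NONZERO * num_nonzeros,
            DENSE_BYTES_PER_ENTRY * num_columns)

def sparse_is_smaller(num_columns, num_nonzeros):
    """True when the sparse layout uses fewer bytes than the dense one."""
    sparse, dense = row_bytes(num_columns, num_nonzeros)
    return sparse < dense
```

Under these assumptions the break-even point is a density of 8/12, about 67%: a row with more than two thirds of its entries non-zero is cheaper to store densely.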

Re: recommendations

2010-08-30 Thread Ted Dunning
Metaphorically speaking if user x search term is A and user x item is B, then transpose(B) * B is item x item, transpose(A) * A is search term x search term and transpose(B) * A is item x search-term. Depending on what kind of recommendation system you are using, the actual mechanics will be
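
To make the shapes concrete, here is a toy check with 3 users, 2 search terms and 4 items, taking A as user x search term and B as user x item. The data and helper names are invented for illustration; a real system would use a sparse matrix library rather than nested lists.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Plain nested-list matrix product; fine for toy sizes only."""
    y_t = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in y_t] for row in x]

# Rows are users in both matrices.
A = [[1, 0], [1, 1], [0, 1]]                      # user x search term
B = [[1, 0, 1, 0], [0, 1, 1, 0], [0, 0, 1, 1]]    # user x item

item_item = matmul(transpose(B), B)   # 4 x 4: item cooccurrence
term_term = matmul(transpose(A), A)   # 2 x 2: search term cooccurrence
item_term = matmul(transpose(B), A)   # 4 x 2: item vs. search term
```

The user dimension is the one that gets summed away in every product, which is why the two matrices only need to share their row (user) index.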

Re: User/Item symmetry?

2010-08-31 Thread Ted Dunning
Lance, As Sean said, there is definitely a performance and API-intelligibility motivated difference between From-things and To-things, but you are right that there is a conceptual symmetry between the two objects just as there is symmetry or duality between the rows and columns of a matrix. On

Re: About the SVDRecommender

2010-08-31 Thread Ted Dunning
A 20% spread in what? Speed? Results? Iterations? On Mon, Aug 30, 2010 at 11:26 PM, Lance Norskog goks...@gmail.com wrote: SVDRecommender is really sensitive to the random number seed. AADRE gives about a 20% spread in its evaluations. (I have only tried

Re: Question about data warehousing and mining through Mahout

2010-08-31 Thread Ted Dunning
Yes. Mahout can support this. On Tue, Aug 31, 2010 at 2:55 PM, hdev ml hde...@gmail.com wrote: But we also want to mine this data to get some predictive capabilities like what is the likelihood that the user will use the same device again or if we get sales/marketing data (on the roadmap

Re: Question about data warehousing and mining through Mahout

2010-08-31 Thread Ted Dunning
For categorization, there are several different answers to the integration problem, but text export of a sampled and curated data file is pretty typical as a data path. The on-line sequential classifiers are a bit more flexible and would allow different input formats at the cost of coding on your

Re: Question about data warehousing and mining through Mahout

2010-08-31 Thread Ted Dunning
I think that Chris was actually recommending stuff that is too simple to call data-mining. Basically this stuff is simpler than any machine learning algorithm so there isn't anything really to write. An example for recommendations is to simply recommend the most popular items to everybody,
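
The "recommend the most popular items to everybody" baseline fits in a few lines. In this sketch popularity is the number of distinct users per item, which is an assumption; the function name and input layout are also illustrative.

```python
from collections import Counter

def most_popular(interactions, n=10):
    """Return the n items with the most distinct users.

    interactions: iterable of (user, item) pairs; repeated pairs count once.
    """
    counts = Counter(item for _user, item in set(interactions))
    return [item for item, _count in counts.most_common(n)]
```

Every user receives the same list, which is exactly why this works as a sanity baseline before trying anything that deserves the name data mining.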

Re: Meetup in the Bay Area in Sept?

2010-09-02 Thread Ted Dunning
+1 I'm in. On Thu, Sep 2, 2010 at 6:50 AM, Ken Krugler kkrugler_li...@transpac.com wrote: On Sun, Aug 29, 2010 at 6:33 AM, Grant Ingersoll gsing...@apache.org wrote: Anyone in the Bay Area interested in getting together to talk Mahout on Sept. 16th or 17th? Nothing formal required. If

Re: Error:.pros file can't be loaded

2010-09-03 Thread Ted Dunning
What version of Mahout? (I will assume the trunk) What platform? I see that you are using hadoop 0.21. So far, we only officially support 0.20.2, although that is clearly not your problem. It may become a problem in your next step. This looks like a problem in the Mahout compilation. The

Re: What's the best method or strategy to train a bayes classifier on a multi labeled training set ?

2010-09-04 Thread Ted Dunning
Multiple classification is a classic problem and raises many problems. Currently Mahout has classifiers that do 1 of n classification which is a useful basis for multiple classification, but it isn't the final answer by any means. As a simple start, you can build multiple binary classifiers, one

Re: Map/Reduce algorithm discussion goups?

2010-09-05 Thread Ted Dunning
Not much that I know of. There are bound to be some off-line academic talks, and possibly some academic areas. On Sun, Sep 5, 2010 at 8:32 PM, Lance Norskog goks...@gmail.com wrote: The Hadoop lists seem to be all about the sysad aspects of Hadoop, while Mahout users talk about algorithms a

Re: Regarding the scalability of SVD code in Mahout

2010-09-07 Thread Ted Dunning
Just to cross-check, is it true that your data has 35 x 100 million non-zeros in it? On Tue, Sep 7, 2010 at 6:16 PM, Akshay Bhat akshayub...@gmail.com wrote: - the total number of non-zero elements. This drives the scan time and, to some extent the cost of the multiplies. The total

Re: Classpath question

2010-09-10 Thread Ted Dunning
Should? or Is? The answer to the should question is possibly. The answer to the is question is no. This behavior is the reason for the jar-with-dependencies maven assembly that is built in. Very handy for this problem. On Fri, Sep 10, 2010 at 6:44 PM, Mark static.void@gmail.com wrote:

Re: Using SVD with Canopy/KMeans

2010-09-11 Thread Ted Dunning
Should be close. The matrixMult step may be redundant if you want to cluster the same data that you decomposed. That would make the second transpose unnecessary as well. On Sat, Sep 11, 2010 at 2:43 PM, Grant Ingersoll gsing...@apache.org wrote: To put this in bin/mahout speak, this would look

Re: Using SVD with Canopy/KMeans

2010-09-11 Thread Ted Dunning
I think you were translating. But the last multiply is still redundant, I think. On Sat, Sep 11, 2010 at 4:55 PM, Grant Ingersoll gsing...@apache.org wrote: On Sep 11, 2010, at 5:50 PM, Ted Dunning wrote: Should be close. The matrixMult step may be redundant if you want to cluster

Re: WEKA vs Mahout

2010-09-15 Thread Ted Dunning
Steven's comments are correct. Weka has a larger collection of algorithms. Mahout is specialized around scalable algorithms and scalable implementations. Both packages support supervised and unsupervised algorithms. Due to scalability concerns, Mahout does not have much in the way of

Re: Evaluator for RecommenderJob (hadoop implementation)?

2010-09-18 Thread Ted Dunning
I don't know the answer to this, but previously this kind of problem was caused by highly skewed statistics in the input data. If there are things that cooccur with everything, then you will have problems with the speed of the algorithm. Can you say something about the distribution of your data?

Re: PFP Growth

2010-09-18 Thread Ted Dunning
Good advice relative to Mahout as well. Trying it on a smaller sample will tell you if it is due to bad scaling or really a hangup. On Sat, Sep 18, 2010 at 12:03 PM, Mark static.void@gmail.com wrote: Thanks. Ill give this a try and see how it performs On 9/18/10 12:01 PM, Neal Richter

Re: PlusAnonymousUserDataModel usage?

2010-09-20 Thread Ted Dunning
Anonymous can mean many things. It can mean a) here is a user with no history or b) here is a user with history but possibly no formal login. It is normally true that the history that a user has when recommendations need to be done is not the history that that or any user necessarily had when the

Re: Build Error in Twenty Newsgroups Classification Example

2010-09-21 Thread Ted Dunning
Did you do [mvn -DskipTests install] at the top level before trying this? On Tue, Sep 21, 2010 at 9:15 AM, Neil Ghosh neil.gh...@gmail.com wrote: Hi , I am trying to run the example using Mahout 0.3 at https://cwiki.apache.org/confluence/display/MAHOUT/Twenty+Newsgroups I have carried out

Re: PFP Growth

2010-09-21 Thread Ted Dunning
then it will hang and never finish. Is this a possible hadoop configuration bug? On 9/18/10 12:08 PM, Ted Dunning wrote: Good advice relative to Mahout as well. Trying it on a smaller sample will tell you if it is due to bad scaling or really a hangup. On Sat, Sep 18, 2010 at 12:03 PM

Re: Logistic Regression: java.lang.NullPointerException

2010-09-22 Thread Ted Dunning
Isabel noted the same thing. I will get to it shortly. Most likely I have broken these older API's in some subtle (or not) fashion. On Wed, Sep 22, 2010 at 2:57 AM, Frank Wang wangfan...@gmail.com wrote: I was running the donut example for logistic regression. It has always worked until

Mahout talk from lurkers

2010-09-22 Thread Ted Dunning
This is cool: http://lca2011.linux.org.au/programme/schedule/view_talk/213?day=None That is the first Mahout talk I have seen announced by somebody whose name I don't recognize. It looks like a reasonable topic and I will be interested to hear how their results turned out. Are Aneesha Bakharia

Re: Possible multi thread issue in AbstractDifferenceRecommenderEvaluator

2010-09-23 Thread Ted Dunning
I don't think that the future.get() will ever be done. Testing for !future.isDone() will always return false after invokeAll because invokeAll waits for all tasks to complete. On Thu, Sep 23, 2010 at 7:57 PM, Stanley Ipkiss saurabhnan...@gmail.com wrote: According to me, the first line is

Re: Example Application using Mahout

2010-09-23 Thread Ted Dunning
This looks like a great series. Could you do us a favor and point to http://mahout.apache.org instead? The URL you have is old and we haven't yet redirected from there to the current web site. On Thu, Sep 23, 2010 at 9:38 PM, Timothy Potter thelabd...@gmail.com wrote: I've just put the

Re: Text Classification using Mahout

2010-09-24 Thread Ted Dunning
There isn't a lot more documentation than that. There is a forthcoming book by Grant called Taming Text that might help you and the currently being written classification sections of the forthcoming Mahout in Action book might be helpful. On 9/24/10, Neil Ghosh neil.gh...@gmail.com wrote: Is

Re: Searching more Mahout content

2010-09-24 Thread Ted Dunning
That would be fabulous. On Fri, Sep 24, 2010 at 6:07 AM, Alex Baranau alex.barano...@gmail.com wrote: I'd suggest to use the approach discussed (and accepted) at https://issues.apache.org/jira/browse/TIKA-488, which is about using multiple search engines. Will create a patch (to include both

Re: Possible multi thread issue in AbstractDifferenceRecommenderEvaluator

2010-09-24 Thread Ted Dunning
Is that the complete stack trace? Threaded code like this usually has two or three levels of Caused by sections. The last is the critical one. On Fri, Sep 24, 2010 at 1:07 PM, Stanley Ipkiss saurabhnan...@gmail.com wrote: I did that change yesterday in my code, but forgot to post the update

Re: Text Classification using Mahout

2010-09-25 Thread Ted Dunning
Either Naive Bayes or the SGD classifiers will do a nice job for most text classification problems. On Sat, Sep 25, 2010 at 11:48 AM, Neil Ghosh neil.gh...@gmail.com wrote: Actually I want to know how can I use mahout for text classification. Will naive bayes be enough?

Re: What are the ways to train and run classifiers on text?

2010-09-26 Thread Ted Dunning
Drew, You do recall correctly. This is a good example to follow for the Naive Bayes side of the house. On Sun, Sep 26, 2010 at 1:05 PM, Drew Farris d...@apache.org wrote: The PrepareTwentyNewsgroups example converts a bunch of files organized into directories into the Bayes input format,

Re: Loading and run classification/regression on a model.

2010-09-28 Thread Ted Dunning
The test that you are reading is testing an entire command line interface. If you look inside that code, you can probably see something simpler. Also, you can take a look at the SGD models which are much easier to use on a small scale. There the pertinent classes are

Re: Usage of TF-IDF weights in cbayes Mahout

2010-09-30 Thread Ted Dunning
That is exactly what it does. On Thu, Sep 30, 2010 at 8:37 AM, Neal Richter nrich...@gmail.com wrote: On Thu, Sep 30, 2010 at 8:37 AM, Neil Ghosh neil.gh...@gmail.com wrote: Does anybody have examples/reference how to use TF-IDF weights in mahout cbayes for particular words and phrases

Re: unknown test data twenty-newsgroups example

2010-09-30 Thread Ted Dunning
A very good practice is to use a data set like this: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz Segregating by date avoids problems with duplicate documents appearing in both training and test. It also gives you a standard split so that you can compare to other

Re: recommendation mechanism

2010-09-30 Thread Ted Dunning
And if you want to see more about recommendation using side data as well as interaction data, the best reference I know of is Menon and Elkan's recent paper: http://arxiv.org/abs/1006.2156 On Thu, Sep 30, 2010 at 4:45 PM, Sebastian Schelter s...@apache.org wrote: If you just wanna know more

Re: Mahout usage

2010-10-01 Thread Ted Dunning
The best argument I have seen (with one powered-by sticker still pending) is that it helps with recruiting. On Fri, Oct 1, 2010 at 1:34 AM, Isabel Drost isa...@apache.org wrote: On Thu, 30 Sep 2010 Grant Ingersoll gsing...@apache.org wrote: Now, if we could just get people to add to the

Re: kmeans vectors

2010-10-01 Thread Ted Dunning
No there isn't. Your other option is to use kmeans directly and set k (as you seem to do now). t1 and t2 can also be quite delicate parameters. My own tendency is to try to use a good initialization scheme such as kmeans++ (which we don't yet have) and just specify the number of clusters. If
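
For reference, the kmeans++ seeding mentioned above works roughly as follows: pick the first center uniformly at random, then pick each further center with probability proportional to its squared distance from the nearest center chosen so far. This is a generic sketch, not Mahout code (the message notes Mahout did not have it yet); names and the fixed seed are assumptions.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centers from `points` using kmeans++ seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by its squared distance to the closest center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
            continue
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

Because already chosen points have weight zero, the seeding spreads centers out and only needs the number of clusters as input, unlike the t1 and t2 distance parameters of canopy clustering.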

Re: unknown test data twenty-newsgroups example

2010-10-01 Thread Ted Dunning
Yes. Instance = training example. Your method of duplicating lines is just what Robin meant. On Fri, Oct 1, 2010 at 3:55 AM, Robin Anil robin.a...@gmail.com wrote: Let me list what I understood. Pl confirm if I got it correct? Add duplicate extra lines many times in an extra file

possible alternative to very large scale SVD's

2010-10-01 Thread Ted Dunning
Jake, You asked a bit ago about strategies for very large SVD's. I wonder if interpolative decompositions might be an avenue toward that. See, for instance, Less is More: Compact Matrix Decomposition for Large Sparse Graphs http://www.cs.cmu.edu/~jimeng/papers/SunSDM07.pdf The idea is that if

Re: Local compiles and testing

2010-10-01 Thread Ted Dunning
Can you provide a transcript of the commands you use to do this? You might even try computing an md5sum on all of the source files in the src directory and the class files in the target directory to verify that you know exactly what is changing. In general, when I have these kinds of problems,
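
The checksum suggestion can be scripted so that two snapshots are easy to compare. This is a sketch only: the file suffixes, function names, and the idea of passing a src or target directory as `root` are assumptions about the local layout.

```python
import hashlib
import os

def md5_tree(root, suffixes=(".java", ".class")):
    """Return {relative_path: md5 hex digest} for matching files under root."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith(suffixes):
                path = os.path.join(dirpath, name)
                with open(path, "rb") as handle:
                    digest = hashlib.md5(handle.read()).hexdigest()
                digests[os.path.relpath(path, root)] = digest
    return digests

def changed_files(before, after):
    """Paths whose digest differs or that exist in only one snapshot."""
    return sorted(p for p in set(before) | set(after)
                  if before.get(p) != after.get(p))
```

Taking one snapshot before a rebuild and one after shows whether the class files really changed when the sources did.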

Re: Local compiles and testing

2010-10-01 Thread Ted Dunning
Matt, This is good detail. On Fri, Oct 1, 2010 at 3:44 PM, Matt Tanquary matt.tanqu...@gmail.com wrote: I forced rebuild of the projects after changing org.apache.mahout.clustering.kmeans.KMeansDriver I noticed that the

Re: Training/Classification techniques in mahout

2010-10-02 Thread Ted Dunning
-type bayes is the other option. If time allows cbayes will probably be better for most purposes. See http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.13.8572 for details on the algorithm and comparisons. On Fri, Oct 1, 2010 at 11:13 PM, Neil Ghosh neil.gh...@gmail.com wrote: Hello,

Re: Mahout Hadoop

2010-10-02 Thread Ted Dunning
If that 50GB represents 20million training examples for a classifier, then you are fine without hadoop. If it is data to cluster or do SVD on, the answer is probably the same. This might be near the edge. If it is data for recommendations, that is a moderate amount and with or without hadoop is

Re: How to get multi-language support for training/classifying text into classes through Mahout?

2010-10-02 Thread Ted Dunning
You will need to make sure that the tokenization is done reasonably. There is an example program for a sequential classifier in org.apache.mahout.classifiers.sgd.TrainNewsGroups. It assumes data in the 20 news groups format and uses a Lucene tokenizer. The NaiveBayes code also uses a Lucene

Re: Local compiles and testing

2010-10-02 Thread Ted Dunning
To rebuild the job jar use maven's command [mvn -DskipTests install] (but make sure you run the tests occasionally) You can't trust Eclipse to understand the entire build. It will be ok if you are running unit tests, but if you try to submit a Hadoop job, you need to package everything up. On

Re: Mahout Hadoop

2010-10-02 Thread Ted Dunning
The SGD classifier software will use all the cores for training even without Hadoop. Hadoop can definitely run on a multi-core machine, but the overhead introduced will mean that your net gain will be distinctly less than 8x. On Sat, Oct 2, 2010 at 6:43 PM, Latency Buster

Re: Query

2010-10-03 Thread Ted Dunning
This paper had some interesting references. The problem they worked on was different from yours, but if you know something about the training images, this might work out. The something might be the original web-site nearby text or almost anything.

Re: How to get multi-language support for training/classifying text into classes through Mahout?

2010-10-03 Thread Ted Dunning
verify with Hindi text as string ? Thanks Neil On Sun, Oct 3, 2010 at 10:20 AM, Ted Dunning ted.dunn...@gmail.com wrote: Hindi should be pretty good to go with the default Lucene analyzer. You should look at the tokens to be sure they are reasonable. Punctuation and some other work

Re: Query

2010-10-03 Thread Ted Dunning
a large library such as Mahout. On Sun, Oct 3, 2010 at 11:41 AM, gagan chhabra gagan.13031...@gmail.com wrote: I was proposed to use MATLAB for this project but I had no idea so I somehow ended up here. Is it possible to implement in MATLAB? On Sun, Oct 3, 2010 at 11:48 PM, Ted Dunning

Re: Mahout Hadoop

2010-10-03 Thread Ted Dunning
In that case, another Faloutsos paper would be of interest: 2002 Performance - best student paper award: Mengzhi Wang, Anastassia Ailamaki and Christos Faloutsos, *Capturing the spatio-temporal behavior of real traffic data* http://www.cs.cmu.edu/~christos/PUBLICATIONS/performance02.pdf

Re: Query

2010-10-04 Thread Ted Dunning
mention. On Mon, Oct 4, 2010 at 1:34 AM, Ted Dunning ted.dunn...@gmail.com wrote: Try this: http://www.public.asu.edu/~huanliu/sbp09/Presentations/paper%20presentations/SBP09_3-31(Baoxin%20Li%20-4).pdf On Sun, Oct 3, 2010 at 12:57 PM, Federico Castanedo fcast...@inf.uc3m.es wrote

Re: Query

2010-10-04 Thread Ted Dunning
Texture models like Gabor transforms. On Mon, Oct 4, 2010 at 9:10 AM, gagan chhabra gagan.13031...@gmail.com wrote: So what about the images of animals and humans? Any particulars for them like histogram is for snow and sunsets etc.

Re: Context-aware recommendations

2010-10-12 Thread Ted Dunning
My own best candidate for using side information, of which context is just one source, is the latent factor log-linear approach described in Menon and Elkan's paper. I am part-way into an implementation of this, but it will not be integrated into the recommendation framework at first. As soon as

Re: Modelling typed vectors?

2010-10-12 Thread Ted Dunning
There is currently no provision for a payload in the VectorWritable. It is plausible that such a capability could be added. Perhaps you could suggest an implementation? On Tue, Oct 12, 2010 at 2:28 PM, Lance Norskog goks...@gmail.com wrote: Ok. Now, how would one save payloads with the Vector

Re: Modelling typed vectors?

2010-10-13 Thread Ted Dunning
On Tue, Oct 12, 2010 at 5:30 PM, Lance Norskog goks...@gmail.com wrote: This use case is doing Random Projection with paired vectors. Look up 'semantic vectors' for an explanation. Even so, I think that there is another way to do this by just keeping an id on each vector. In random

Re: CBayesClassifier Problem

2010-10-14 Thread Ted Dunning
Can you attach your test docs to a jira report? On Thu, Oct 14, 2010 at 2:51 AM, Sreejith S srssreej...@gmail.com wrote: Hi all... I used Mahout CBayes Classifier (and Bayes) to train a sample data set. The data set consists of 500 positive and 500 negative documents. After training I passed

Re: relative order in recommendations

2010-10-16 Thread Ted Dunning
If you are comparing ranking systems against a gold standard of relevance, the accepted standard measure is AUC. You can define AUC most conveniently as the probability that the score of a randomly chosen known good example is higher than the score of a randomly chosen known bad example. This is
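
The probabilistic definition of AUC given above translates directly into code. This brute-force sketch compares every good score with every bad score; counting ties as half a win is a common convention and an assumption here, as is the function name.

```python
def auc(good_scores, bad_scores):
    """Probability that a random known-good score beats a random known-bad one.

    Ties count as half a win. Runs in O(len(good) * len(bad)) time, which is
    fine for small evaluation sets.
    """
    if not good_scores or not bad_scores:
        raise ValueError("need at least one good and one bad example")
    wins = 0.0
    for g in good_scores:
        for b in bad_scores:
            if g > b:
                wins += 1.0
            elif g == b:
                wins += 0.5
    return wins / (len(good_scores) * len(bad_scores))
```

For example, `auc([0.9, 0.3], [0.5, 0.1])` is 0.75: three of the four good/bad pairs are ordered correctly. Only the ordering of scores matters, not their scale.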
