Re: How large should windowSize be when setting parameters for AdaptiveLogisticRegression?

2012-12-14 Thread Ted Dunning
windowSize). Would you mind telling me more about that? Thanks! On Sat, Dec 15, 2012 at 2:44 AM, Ted Dunning ted.dunn...@gmail.com wrote: I would recommend testing with OnlineLogisticRegression first. The AdaptiveLogisticRegression has a tendency to freeze on sub-optimal parameter

Re: Streaming KMeans Text Clustering Concurrency and Advice

2012-12-13 Thread Ted Dunning
On Thu, Dec 13, 2012 at 2:29 PM, Brandon Root brandonr...@gmail.com wrote: This is a question regarding the new KNN library that Ted Dunning and Dan Filimon are working on (as I understand it'll be in Mahout 0.8) so I hope this is the appropriate list for this question instead of mahout-dev

Re: Streaming KMeans Text Clustering Concurrency and Advice

2012-12-13 Thread Ted Dunning
What Dan says here is correct. The lack of dependence on k in the current code is definitely a problem. The work-around is to set the maxClusters to the point that the log factor should have grown to. That sucks so we should fix the heuristic sizing along the lines that Dan says. There should

Re: Creating vectors from lucene index on EMR via the CLI

2012-12-13 Thread Ted Dunning
If your input files are in S3 then the map-reduce steps that mahout spawns can access them without problems. In order to run Mahout programs, you will need to install mahout. There are command line programs in $MAHOUT_HOME/bin that will do what you need. On Thu, Dec 13, 2012 at 10:58 AM, hellen

Re: Creating vectors from lucene index on EMR via the CLI

2012-12-12 Thread Ted Dunning
You are trying to run this job as a single step in an EMR flow. Mahout's command line programs assume that you are running against a live cluster that will hang around (since many mahout steps involve more than one map-reduce). It would probably be best to separate the creation of the cluster

Re: Creating vectors from lucene index on EMR via the CLI

2012-12-12 Thread Ted Dunning
? From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org; hellen maziku nahe...@yahoo.com Sent: Wednesday, December 12, 2012 9:48 AM Subject: Re: Creating vectors from lucene index on EMR via the CLI You are trying to run this job as a single step in an EMR

Re: Creating vectors from lucene index on EMR via the CLI

2012-12-12 Thread Ted Dunning
-mapreduce --create --alive --log-uri s3n://mahout-output/logs/ --name dict_vectorize doesn't that mean that the keep alive is set? From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org; hellen maziku nahe...@yahoo.com Sent: Wednesday, December

Re: Creating vectors from lucene index on EMR via the CLI

2012-12-12 Thread Ted Dunning
? From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org; hellen maziku nahe...@yahoo.com Sent: Wednesday, December 12, 2012 10:56 AM Subject: Re: Creating vectors from lucene index on EMR via the CLI I would still recommend that you switch to using the mahout programs

Re: Decision Forest - Partial implementation

2012-12-10 Thread Ted Dunning
Yep. On Sun, Dec 9, 2012 at 11:33 PM, Marty Kube martyk...@beavercreekconsulting.com wrote: Because it uses Java pointers instead of offsets. The mmap'ed structure could be mapped into memory at any address and thus must be position independent. Okay, I think I get the point here.

Re: Decision Forest - Partial implementation

2012-12-09 Thread Ted Dunning
in the cluster as a normal file which can then be mapped. On 12/08/2012 03:43 AM, Ted Dunning wrote: There are several approaches that might help: 1) use shared memory via mmap to store the forest. This allows multiple mapper threads to access the same forest. The current Mahout in-memory

Re: Decision Forest - Partial implementation

2012-12-09 Thread Ted Dunning
Yeah... right now you have the full cross product, but one side only has one element so the product is trivial. It isn't that much worse if that side has a few elements. On Sat, Dec 8, 2012 at 9:49 PM, Marty Kube martyk...@beavercreekconsulting.com wrote: #2 Might be a nice general approach.

Re: Cluster: find medoid its n nearest elements

2012-12-07 Thread Ted Dunning
There isn't a clever way to find the medoid in Mahout. Finding the n nearest elements can be done using a Searcher. The Brute implementation should suffice. On Thu, Dec 6, 2012 at 10:16 AM, Stefan Kreuzer stefankreuze...@aol.de wrote: Hello, when inspecting a cluster of sparse vectors, what

Re: Clustering points in a unit hypercube

2012-12-06 Thread Ted Dunning
the link [1]. [1] https://github.com/dfilimon/knn/wiki/skm-visualization On Thu, Dec 6, 2012 at 2:01 AM, Ted Dunning ted.dunn...@gmail.com wrote: Still not that odd if several clusters are getting squashed. This can happen if the threshold increases too high or if the searcher is unable

Re: Remove unused recommenders?

2012-12-06 Thread Ted Dunning
Deprecating is a nice first step to let people know where things are headed. On Thu, Dec 6, 2012 at 4:21 PM, Sebastian Schelter s...@apache.org wrote: The other three recommenders seem to be used almost never, so I'd like to remove them, however I wouldn't have a problem with keeping them for

Re: Seeking advice on a classification problem (needle-in-haystack situation)

2012-12-05 Thread Ted Dunning
Try the cascaded model. Train the downstream model on data without the don't-care docs or train it on documents that actually get through the upstream model. On Wed, Dec 5, 2012 at 4:50 PM, Raman Srinivasan raman.sriniva...@gmail.com wrote: I can exclude the don't care cases from the training

Re: Clustering points in a unit hypercube

2012-12-05 Thread Ted Dunning
How many clusters are you talking about? If you pick a modest number then streaming k-means should work well if it has several times more surrogate points than there are clusters. Also, typically a hyper-cube test works with very small cluster radius. Try 0.1 or 0.01. Otherwise, your clusters

Re: Seeking advice on a classification problem (needle-in-haystack situation)

2012-12-05 Thread Ted Dunning
even before I can sub-class them. What's usually a good approach when less than 5% of the data is meaningful. On Wed, Dec 5, 2012 at 10:26 AM, Ted Dunning ted.dunn...@gmail.com wrote: Try the cascaded model. Train the downstream model on data without the don't-care docs or train

Re: Clustering points in a unit hypercube

2012-12-05 Thread Ted Dunning
/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104 [3] https://github.com/dfilimon/knn/issues/1 On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning ted.dunn...@gmail.com wrote: How many clusters are you talking about? If you pick

Re: Clustering points in a unit hypercube

2012-12-05 Thread Ted Dunning
Ahh... this may also be a problem. You should get better results with a Brute searcher here, but a ProjectionSearcher with lots of projections may work well. On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon dangeorge.fili...@gmail.com wrote: So, yes, it's probably a bug of some kind since I end up

Re: Clustering points in a unit hypercube

2012-12-05 Thread Ted Dunning
dangeorge.fili...@gmail.com wrote: But the weight referred to is the distance between a centroid and the mean of a distribution (a cube vertex). This should still be very small (also BallKMeans gets it right). On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning ted.dunn...@gmail.com wrote: In order to succeed

Re: Mahout Amazon EMR usage cost

2012-12-05 Thread Ted Dunning
On Wed, Dec 5, 2012 at 5:29 PM, Koobas koo...@gmail.com wrote: ... Now yet another naive question. Ted is probably going to go ballistic ;) I hope not. Assuming that simple overlap methods suck, is there still a metric that works better than others (i.e. Tanimoto vs. Jaccard vs

Re: Clustering algorithms

2012-12-04 Thread Ted Dunning
The minhash algorithm itself should work as well with non-English text. It is likely that the input phases where the text is analyzed would not work correctly, however. On Tue, Dec 4, 2012 at 6:05 PM, Varun Thacker varunthacker1...@gmail.com wrote: I'd tried out the MinHash algorithm in mahout

Re: Very high average absolute difference score

2012-12-04 Thread Ted Dunning
Bernát I am guessing from the fact that you have accents in your name that you may be in Europe. If so, it is possible that there is a confusion about the decimal point that Mahout uses and the one that you use. Is it possible that you have decimal numbers like 3,1 instead of 3.1? On Tue, Dec
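The decimal-separator confusion described in the snippet above is easy to reproduce outside Mahout. The following is a minimal Python sketch, not Mahout code: `parse_rating` is a hypothetical helper, and it assumes the ratings contain no thousands separators, so every comma can be treated as a decimal mark.

```python
def parse_rating(text):
    """Parse a rating written with either '.' or ',' as the decimal separator.

    Assumption: the input never uses ',' as a thousands separator.
    """
    return float(text.strip().replace(",", "."))


def plain_float_accepts(text):
    """Return True if Python's float() parses the text unchanged."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# A rating exported with a decimal comma is rejected by a plain parse,
# while the normalizing helper reads it as the intended value.
print(plain_float_accepts("3,1"))
print(parse_rating("3,1"))
```

Normalizing the separator before the data reaches the recommender avoids ratings being rejected or misread, which is one possible cause of an implausibly large average absolute difference.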

Re: Seeking advice on a classification problem (needle-in-haystack situation)

2012-12-04 Thread Ted Dunning
What Kate says is good advice. You can have considerable amounts of bias, but you may be telling the model something about the relative cost of errors and that can result in things happening that you don't like. As you noted, your model could have gotten 95% correct by simply saying DON'T CARE

Re: Mahout Amazon EMR usage cost

2012-12-04 Thread Ted Dunning
Also, you have to separate UI considerations from algorithm considerations. What algorithm populates the recommendations is the recommender algorithm. It has two responsibilities... first, find items that the users will like and second, pick out a variety of less certain items to learn about.

Re: Mahout Amazon EMR usage cost

2012-12-04 Thread Ted Dunning
On Wed, Dec 5, 2012 at 6:57 AM, Paulo Villegas paulo.vl...@gmail.com wrote: On 05/12/12 00:53, Ted Dunning wrote: Also, you have to separate UI considerations from algorithm considerations. What algorithm populates the recommendations is the recommender algorithm. It has two

Re: Mahout Amazon EMR usage cost

2012-12-03 Thread Ted Dunning
On Mon, Dec 3, 2012 at 3:06 AM, Koobas koo...@gmail.com wrote: Thank you very much. The pointer to Myrrix is a very useful piece of information. Myrrix, however, relies on an iterative sparse matrix factorization to do PCA. I want to produce Amazon-like recommendations. I.e., 70% of users

Re: Recommender Evaluator

2012-12-03 Thread Ted Dunning
Also, don't make algorithm choices based on small data samples. Bigger data will change the ordering of which algorithms work well. On Mon, Dec 3, 2012 at 10:04 PM, Sean Owen sro...@gmail.com wrote: You may do better with a latent feature approach -- working in lower dimensional space won't

Re: How to concatenate Vectors?

2012-11-29 Thread Ted Dunning
don't have the same cardinality, so vector1.plus(vector2) does not work. Is there a way to resize a given vector? Sorry, I am a complete Mahout noob. -Original Message- From: Ted Dunning ted.dunn...@gmail.com To: user user@mahout.apache.org Sent: Thu, 29 Nov 2012 11:45

Re: Mahout SGD - is it really descent?

2012-11-28 Thread Ted Dunning
Robert's analysis is correct. This would be worthy of a comment at the least. On Wed, Nov 28, 2012 at 11:53 AM, Lancaster, Robert (Orbitz) robert.lancas...@orbitz.com wrote: gradientBase is coming from: double gradientBase = gradient.get(i); Prior to that: Vector gradient =

Re: Mahout SGD - is it really descent?

2012-11-28 Thread Ted Dunning
+1 On Wed, Nov 28, 2012 at 12:56 PM, Jake Mannix jake.man...@gmail.com wrote: or maybe call the variable negativeGradient, instead?

Re: getting started with mahout and kmeans

2012-11-27 Thread Ted Dunning
and, in that case, bash would be too slow, wouldn't it? On Tue, Nov 27, 2012 at 12:54 AM, Ted Dunning ted.dunn...@gmail.com wrote: How many data points are you clustering? How many dimensions? On Mon, Nov 26, 2012 at 2:33 PM, Eduard Gamonal eduard.gamo...@gmail.com wrote: Hi, I'm

Re: getting started with mahout and kmeans

2012-11-26 Thread Ted Dunning
How many data points are you clustering? How many dimensions? On Mon, Nov 26, 2012 at 2:33 PM, Eduard Gamonal eduard.gamo...@gmail.com wrote: Hi, I'm doing a MSc at Northeastern and I'm working on analyzing some US election polls with kmeans. I'm a beginner with both Mahout and Hadoop. I've

Re: Mahout svd command question

2012-11-22 Thread Ted Dunning
That implementation is deprecated. The SSVD implementation should be used instead. On Thu, Nov 22, 2012 at 9:58 AM, Abramov Pavel p.abra...@rambler-co.ru wrote: Hi, Here is step by step manual for Lanczos implementation: https://cwiki.apache.org/MAHOUT/dimensional-reduction.html Pavel

Re: How to interpret recommendation strength

2012-11-15 Thread Ted Dunning
...@gmail.com wrote: That's kind of what it does now... though it weights everything as 1. Not so smart, but for sparse-ish data is not far off from a smarter answer. On Thu, Nov 15, 2012 at 6:47 PM, Ted Dunning ted.dunn...@gmail.com wrote: My own preference (pun intended) is to use log

Re: Mahout dependency problem with asm-1.3

2012-11-10 Thread Ted Dunning
Why do you have maven.glassfish.org in your repo path? On Fri, Nov 9, 2012 at 7:17 PM, Lance Norskog goks...@gmail.com wrote: I'm getting this from the current git checkout. There are 301 (redirections) but there is nothing at the target either. Downloading:

Re: Jobs Hadoop-Mahout: Full Capacity

2012-11-10 Thread Ted Dunning
If you want k-means speed see the new k-means code: https://github.com/tdunning/knn Can you describe your data a bit? On Sat, Nov 10, 2012 at 11:22 AM, pricila rr pricila...@gmail.com wrote: I am running kmeans algorithm. Increasing the number of tasktrackers and datanodes, increase the

Re: need help on mahout

2012-11-09 Thread Ted Dunning
There is additional confusion typically because supervised and unsupervised methods are commonly used together. For instance, clustering (unsupervised) can be used to generate cluster proximity features that are then used as features for classification (supervised). Another example might be

Re: Mix of Content Based and Collaborative Filtering

2012-11-06 Thread Ted Dunning
On Mon, Nov 5, 2012 at 9:16 PM, Johannes Schulte johannes.schu...@gmail.com wrote: is it possible you are mixing up payloads and stored fields? The latter ones are not indexed and can only be used for the top n results. Maybe we're talking about different things.. I think I did mix these

Re: Mix of Content Based and Collaborative Filtering

2012-11-05 Thread Ted Dunning
perform best. Maybe this blog article by netflix is a good start http://techblog.netflix.com/2012/06/netflix-recommendations-beyond-5-stars.html Cheers, Johannes On Fri, Nov 2, 2012 at 6:21 AM, Ted Dunning ted.dunn...@gmail.com wrote: Speaking with no principles in hand

Re: Mix of Content Based and Collaborative Filtering

2012-11-05 Thread Ted Dunning
On Mon, Nov 5, 2012 at 12:06 PM, Johannes Schulte johannes.schu...@gmail.com wrote: do you really mean payloads? Because i consider them part of the index as they are stored per position and can be accessed during scoring. I had the impression that they were not indexed. They are

Re: one vector or many vectors?

2012-11-01 Thread Ted Dunning
Your mileage will vary. It is often helpful to classify small parts of large articles and then somehow deal with these multiple classifications at the full document level. Sometimes it is not helpful, especially if the small parts get too small. Try it both ways. My tendency is to prefer to

Re: SGD: Logistic regression package in Mahout

2012-11-01 Thread Ted Dunning
for your reply. Thanks Rajesh On Wed, Oct 31, 2012 at 11:00 AM, Rajesh Nikam rajeshni...@gmail.com wrote: Hi Ted, Thanks for reply. I will wait for JIRA and hope to get rid of any encoding issue. Thanks, Rajesh On Oct 31, 2012 5:24 AM, Ted Dunning

Re: one vector or many vectors?

2012-11-01 Thread Ted Dunning
wrong. You are right. It does make things harder. It can also make them better. On Thu, Nov 1, 2012 at 9:39 PM, Ted Dunning ted.dunn...@gmail.com wrote: Your mileage will vary. It is often helpful to classify small parts of large articles and then somehow deal with these multiple

Re: SGD: Logistic regression package in Mahout

2012-10-30 Thread Ted Dunning
as class_1. AUC = 0.50 confusion: [[*26563.0, 23006.0*], [0.0, 0.0]] entropy: [[-0.0, -0.0], [-46.1, -21.4]] I am not sure why this is failing all the time. Looking forward for your reply. Thanks Rajesh On Tue, Oct 16, 2012 at 3:57 AM, Ted Dunning ted.dunn

Re: Evaluation of Mahout recommenders

2012-10-27 Thread Ted Dunning
It is a nice writeup, but the Mahout comparison was a bit of a strawman. I wish I could go to their talk, but I was in office hours right then. Coincidentally, I was advising somebody that an excellent way to deploy a recommendation system is on top of Solr. As the regulars here will know, I

Re: If you're at Hadoop World this year

2012-10-21 Thread Ted Dunning
If we have descended to personal advertising, then I should mention that I am speaking as well. http://strataconf.com/stratany2012/public/schedule/speaker/126559 I will also have office hours afterwards during which the topic is unlimited. Drop by! On Sun, Oct 21, 2012 at 11:20 AM, Josh

Re: If you're at Hadoop World this year

2012-10-21 Thread Ted Dunning
Sounds good! On Sun, Oct 21, 2012 at 12:59 PM, Grant Ingersoll gsing...@apache.org wrote: I'll be at Strata, too, but not speaking... sounds like we have the makings of an informal Mahout gathering? On Oct 21, 2012, at 2:42 PM, Ted Dunning wrote: If we have descended to personal

Re: ** Problem using SGD and iris arff as test set **

2012-10-11 Thread Ted Dunning
, 2012 at 8:08 PM, Ted Dunning ted.dunn...@gmail.com wrote: Sgd is more suitable for large data. I will take a look later today. Sent from my iPhone On Oct 9, 2012, at 11:29 PM, Rajesh Nikam rajeshni...@gmail.com wrote: Hi Ted, Putting specific question with data for getting

Re: Create vector from text

2012-10-11 Thread Ted Dunning
You have to tokenize your text and then use some form of vector encoding. If you have a known dictionary of all interesting words, you can simply make a vector as long as the number of words in your dictionary and put a 1 in the right place. If you don't want to do that either because you don't
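The known-dictionary case described in the snippet above can be sketched in a few lines. This is an illustrative Python sketch rather than Mahout's own encoder classes: the `tokenize` and `encode` helpers, the regular-expression tokenizer, and the four-word dictionary are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def encode(text, dictionary):
    """Return a vector as long as the dictionary, with a 1 at the slot of each word present."""
    index = {word: i for i, word in enumerate(dictionary)}
    vector = [0] * len(dictionary)
    for token in tokenize(text):
        if token in index:
            vector[index[token]] = 1
    return vector


# Hypothetical dictionary of "interesting" words.
dictionary = ["mahout", "cluster", "vector", "hadoop"]
print(encode("Mahout turns text into a vector.", dictionary))
```

Words outside the dictionary are simply dropped here; handling an open vocabulary needs a different encoding, which is the case the truncated sentence goes on to address.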

Re: ** Problem using SGD and iris arff as test set **

2012-10-11 Thread Ted Dunning
, Ted Dunning ted.dunn...@gmail.com wrote: My first thought was that we needed several passes, but that is clearly wrong. I think that the problem is in the data formatting and conversion somehow. Haven't had time to dope this out just yet. The iris data should converge trivially

Re: ** Problem using SGD and iris arff as test set **

2012-10-10 Thread Ted Dunning
Sgd is more suitable for large data. I will take a look later today. Sent from my iPhone On Oct 9, 2012, at 11:29 PM, Rajesh Nikam rajeshni...@gmail.com wrote: Hi Ted, Putting specific question with data for getting problem with SGD. I am using Iris Plants Database from Michael

Re: mahout-error in virtual machine

2012-10-10 Thread Ted Dunning
This might work, but the messages indicate that the environment is seriously messed up. Just getting the code isn't going to help. The tests are indicating that there is a real problem (and it isn't likely Mahout). That problem needs fixing and once fixed running the tests isn't a bad thing.

Re: Tuning OnlineLogisticRegression Algo

2012-10-09 Thread Ted Dunning
See this page: http://leon.bottou.org/research/stochastic Google is your friend. This API is, however, not particularly friendly. Therefore, you will have to read about the basics and be able to figure these things out from first principles. There is some documentation in the code. You can

Re: Evolution of ratings over time

2012-09-30 Thread Ted Dunning
Other experiments have shown that 60-80% of perception of music likes is due to social factors. Factoring this out may or may not be a good thing. My feeling is that if you are trying to make people happy with what you recommend then you need to go with whatever makes them happy whether it is

Re: Combiner applied on multiple map task outputs (like in Mahout SVD)

2012-09-27 Thread Ted Dunning
Combiners can be called zero or more times. That can happen on the map side or on the reduce side. On Thu, Sep 27, 2012 at 4:56 AM, Sigurd Spieckermann sigurd.spieckerm...@gmail.com wrote: @Jake: Could you please elaborate on how exactly the combiner can be called before the reducer gets the
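Because a combiner may run zero or more times, a job is only correct if the final result does not depend on how often it runs. The sketch below simulates that property in plain Python for a count-summing job; it is not Hadoop code, and `combine` and `reduce_counts` are hypothetical stand-ins for a combiner and a reducer.

```python
from collections import defaultdict


def combine(pairs):
    """Sum the values per key; usable as a combiner because addition is associative and commutative."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_counts(pairs):
    """Final reduce step: the same per-key summation, returned as a dict."""
    return dict(combine(pairs))


map_output = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)]

# Combiner never runs.
zero_times = reduce_counts(map_output)
# Combiner runs once over the whole map output.
once = reduce_counts(combine(map_output))
# Combiner runs on two spills separately, then again on the merged result.
twice = reduce_counts(combine(combine(map_output[:2]) + combine(map_output[2:])))

print(zero_times == once == twice)
```

A combiner whose output type or meaning differs from its input (for example, one that emits averages instead of sums) would break this invariant.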

Re: SGD AdaptiveLogisticRegression vs OnlineLogisticRegression

2012-09-23 Thread Ted Dunning
I think that there is an excessive stability issue, actually. What seems to happen is that the adaptive part locks down the learning rate too quickly. This is related to several other issues: - the cross fold learning paradigm is kind of dangerous since it depends on the user not having

Re: hadoop-0.19 and mahout 0.7: throwing incompatible errors, how can I fix it?

2012-09-21 Thread Ted Dunning
On the other hand, the only way that I have been able to do a major version upgrade of Hadoop is to start a new company. It is really hard to change code and platform at the same time. If you don't have enough hardware to have two clusters temporarily, things will be really hard moving off of

Re: rate option of trainLogistic command

2012-09-21 Thread Ted Dunning
This changes the initial learning rate. Changing this can definitely change convergence properties. On Fri, Sep 21, 2012 at 9:33 AM, Watson Watson watso...@gmail.com wrote: Hi, My question is why changing the rate parameter we always change the coefficients (results of RunLogistic)? I

Re: The default category of a binary classifier

2012-09-19 Thread Ted Dunning
If a classifier is presented text with no words in common with the training data, it will give you back the most common category in the training data. That said, it is likely to be quite rare when a new document consists *entirely* of new words. Any overlap with trained vocabulary is likely to

Re: The default category of a binary classifier

2012-09-19 Thread Ted Dunning
goks...@gmail.com wrote: Shouldn't this be 'unclassified'? I think I have seen data in the unclassified buckets with both Bayes and SGD. - Original Message - | From: Ted Dunning ted.dunn...@gmail.com | To: user@mahout.apache.org | Sent: Wednesday, September 19, 2012 2:54:25 PM

Re: Using SVD-conditioned matrix

2012-09-16 Thread Ted Dunning
And if you want the reduced rank representation of A, you have it already with A_k = U_k S_k V_k' Assume that A is n x m in size. This means that U_k is n x k and V_k is m x k The rank reduced projection of an n x 1 column vector is u_k = U_k U_k' u Beware that v_k is probably not

Re: Using SVD-conditioned matrix

2012-09-16 Thread Ted Dunning
? On Sun, Sep 16, 2012 at 5:33 PM, Ted Dunning ted.dunn...@gmail.com wrote: And if you want the reduced rank representation of A, you have it already with A_k = U_k S_k V_k' Assume that A is n x m in size. This means that U_k is n x k and V_k is m x k The rank reduced

Re: Using SVD-conditioned matrix

2012-09-16 Thread Ted Dunning
by basically saying that the projection is Uk' times the new vector, so, I never understood this expression. On Sun, Sep 16, 2012 at 7:13 PM, Ted Dunning ted.dunn...@gmail.com wrote: A is in there implicitly. U_k provides a basis of the row space and V_k provides a basis of the column space

Re: Using SVD-conditioned matrix

2012-09-16 Thread Ted Dunning
/A If you shove u through U_k U_k' you get this: U_k U_k' u = U_k U_k' (u_A + u_/A) = U_k U_k' (u_A) + 0 = u_A This is another way of showing that U_k U_k' projects a vector into span A. On Sun, Sep 16, 2012 at 12:55 PM, Ted Dunning ted.dunn...@gmail.com wrote: U_k ' U_k = I U_k U_k ' != I
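The two identities quoted above (U_k' U_k = I, while U_k U_k' is only a projector) can be checked numerically on a tiny example. The following pure-Python sketch uses a hand-picked 3 x 2 matrix with orthonormal columns as a stand-in for U_k; the matrix, the vector u, and the helper functions are all assumptions chosen for illustration.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Assumed U_k with n = 3, k = 2: its columns are orthonormal.
U = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
Ut = transpose(U)

gram = matmul(Ut, U)       # U_k' U_k: the k x k identity
projector = matmul(U, Ut)  # U_k U_k': n x n, not the identity

u = [[1.0], [2.0], [3.0]]  # column vector with a component outside the span
u_A = matmul(projector, u) # the part of u that lies in the span of U_k

print(gram)
print(projector)
print(u_A)
```

Applying the projector a second time leaves u_A unchanged, which is the defining property of a projection.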

Re: Using SVD-conditioned matrix

2012-09-16 Thread Ted Dunning
in terms of the latent variables. On Sun, Sep 16, 2012 at 8:55 PM, Ted Dunning ted.dunn...@gmail.com wrote: U_k ' U_k = I U_k U_k ' != I

Re: Using SVD-conditioned matrix

2012-09-16 Thread Ted Dunning
projecting back into span A and you are talking about expressing things in terms of the latent variables. On Sun, Sep 16, 2012 at 8:55 PM, Ted Dunning ted.dunn...@gmail.com wrote: U_k ' U_k = I U_k U_k ' != I -- Lance Norskog goks...@gmail.com

Re: how to work with ARFF files using Mahout clustering

2012-09-15 Thread Ted Dunning
Rajesh On Thu, Sep 13, 2012 at 8:53 PM, Ted Dunning tdunn...@maprtech.com wrote: Send this to the mailing list. On Thu, Sep 13, 2012 at 7:35 AM, Rajesh Nikam rajeshni...@gmail.com wrote: Hi Ted, I have data in WEKA ARFF format. What to how to use this ARFF formatted

Re: Building Mahout

2012-09-13 Thread Ted Dunning
Yes. It is a grave embarrassment to us, but not a functional requirement. On Thu, Sep 13, 2012 at 6:42 AM, I-Scarlatti, David david.scarla...@boeing.com wrote: Ok. So tests are just tests... not needed for having mahout running Thanks! -Original Message- From: Paritosh

Re: Is mahout kmeans slow ?

2012-09-12 Thread Ted Dunning
Yes. I have been working (slowly) on moving some very fast single pass clustering into Mahout. My work in progress currently does very fast clustering of small dense vectors and it should scale to sparse vectors fairly well with some small changes. See https://github.com/tdunning/knn for more

Re: Is mahout kmeans slow ?

2012-09-12 Thread Ted Dunning
Also, with 500MB of data, this is likely to only take a few minutes on a single machine with the new clustering stuff. It is hard to estimate precisely, however, due to the difference between dense and sparse cases. On Wed, Sep 12, 2012 at 8:42 PM, Pat Ferrel pat.fer...@gmail.com wrote: 200

Re: ArrayIndexOutOfBoundsException SparseMatrix

2012-09-09 Thread Ted Dunning
You are using lots of threads but the sparse matrix structure is not thread safe. Setting a value in the SparseMatrix causes mutation to internal data structures. If you can have each thread do all the updates for a single thread, that would be much better. Another option is to synchronize on
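The synchronization option mentioned above can be illustrated with a lock around every update. This is a Python analogy only, not Mahout's `SparseMatrix` API: `LockedSparseMatrix` is a hypothetical dict-backed class, and the thread and row counts are arbitrary.

```python
import threading


class LockedSparseMatrix:
    """Dict-backed sparse matrix whose reads and writes are serialized by one lock."""

    def __init__(self):
        self._cells = {}
        self._lock = threading.Lock()

    def add(self, row, col, delta):
        # The read-modify-write must happen atomically, hence the lock.
        with self._lock:
            self._cells[(row, col)] = self._cells.get((row, col), 0.0) + delta

    def get(self, row, col):
        with self._lock:
            return self._cells.get((row, col), 0.0)


def worker(matrix, row, n):
    """Update one private row and also bump a cell shared by all threads."""
    for col in range(n):
        matrix.add(row, col, 1.0)
        matrix.add(0, 0, 1.0)


matrix = LockedSparseMatrix()
threads = [threading.Thread(target=worker, args=(matrix, r, 100)) for r in range(1, 5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(matrix.get(0, 0))
```

A single lock is the simplest choice but serializes all writers; partitioning the work so that each thread owns its own part of the matrix avoids the contention altogether.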

Re: SGD Based Recommender Contribution Proposal

2012-09-09 Thread Ted Dunning
Great. If the update has a huge impact on existing code, can you break it into manageable pieces? If it is just an addition, having a big blob of stuff is probably fine. On Sun, Sep 9, 2012 at 7:01 AM, Gokhan Capan gkhn...@gmail.com wrote: On Fri, Sep 7, 2012 at 12:48 AM, Ted Dunning ted.dunn

Re: Should I be using OnlineLogisticRegression?

2012-09-07 Thread Ted Dunning
how it turns out. Mike On Thu, Sep 6, 2012 at 8:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: Try transforming them as well, likely with a log if they are positive and have heavily skewed values. Can you suck the data into R and paste in the results of summary(x)? (assuming you put

Re: SGD Based Recommender Contribution Proposal

2012-09-06 Thread Ted Dunning
This sounds pretty exciting. Beyond that, it is hard to say much. Can you say a bit more about how you would see introducing the code into Mahout? On Thu, Sep 6, 2012 at 9:14 AM, Gokhan Capan gkhn...@gmail.com wrote: By the way, I want to mention that my thesis is advised by Ozgur Yilmazel,

Re: Should I be using OnlineLogisticRegression?

2012-09-06 Thread Ted Dunning
Try transforming them as well, likely with a log if they are positive and have heavily skewed values. Can you suck the data into R and paste in the results of summary(x)? (assuming you put the data into the variable x). This should look something like: summary(x) v1 v2

Re: PCA doc question for devs:

2012-09-05 Thread Ted Dunning
Yes. (A-M)V is U \Sigma. You may actually want something like U \sqrt \Sigma instead, though. On Wed, Sep 5, 2012 at 4:10 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Hello, I have a question w.r.t what to advise people in the SSVD manual for PCA. So we have (A-M) \approx U \Sigma V^t

Re: SSVD Wrong Singular Vectors

2012-09-04 Thread Ted Dunning
A quick t-test on these differences gives the same result: no significant difference. On Mon, Sep 3, 2012 at 11:34 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: Then I subtracted error means between two methods (+ sign means smaller error for MR version, -sign means smaller error for R

Re: SSVD Wrong Singular Vectors

2012-09-03 Thread Ted Dunning
results and errors so it doesn't make sense to make any error comparison just between single runs of the variations. Instead, it makes sense to compare error mean and variations on a better number of runs. -d On Sun, Sep 2, 2012 at 12:00 AM, Ted Dunning ted.dunn...@gmail.com wrote

Re: SSVD Wrong Singular Vectors

2012-09-02 Thread Ted Dunning
Did Ahmed even use a power iteration? On Sun, Sep 2, 2012 at 1:35 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: but there is still a concern in a sense that power iterations should've helped more than they did. I'll take a closer look but it will take me a while to figure if there's something

Re: SSVD Wrong Singular Vectors

2012-09-01 Thread Ted Dunning
with similar parameters. One significant difference between MR and sequential version is that sequential version is using ternary random matrix (instead of uniform one), perhaps that may affect accuracy a little bit. On Fri, Aug 31, 2012 at 10:55 PM, Ted Dunning ted.dunn...@gmail.com wrote

Re: SSVD Wrong Singular Vectors

2012-09-01 Thread Ted Dunning
is that sequential version is using ternary random matrix (instead of uniform one), perhaps that may affect accuracy a little bit. On Fri, Aug 31, 2012 at 10:55 PM, Ted Dunning ted.dunn...@gmail.com wrote: Can you provide your test code? What difference did you observe? Did you account

Re: SSVD error

2012-09-01 Thread Ted Dunning
With 57 crawled docs, you can't reasonably set p 57. That is your second error. On Sat, Sep 1, 2012 at 10:32 AM, Pat Ferrel pat.fer...@gmail.com wrote: I have a small data set that I am using in local mode for debugging purposes. The data is 57 crawled docs with something like 2200 terms. I

Re: SSVD error

2012-09-01 Thread Ted Dunning
, at 7:53 AM, Ted Dunning ted.dunn...@gmail.com wrote: With 57 crawled docs, you can't reasonably set p 57. That is your second error. On Sat, Sep 1, 2012 at 10:32 AM, Pat Ferrel pat.fer...@gmail.com wrote: I have a small data set that I am using in local mode for debugging purposes

Re: SSVD Wrong Singular Vectors

2012-09-01 Thread Ted Dunning
On Sun, Sep 2, 2012 at 12:26 AM, Ahmed Elgohary aagoh...@gmail.com wrote: - I am using k = 30 and p = 2 so (k+p) < 99 (Rank(A)) - I am attaching the csv file of the matrix A Brilliant. And the attachment actually made it through. - yes, the difference is significant. Here is the output of

Re: Voronoi

2012-08-31 Thread Ted Dunning
Yes. Essentially this means construct the Voronoi tessellation for all points and for each post code, use the union of the regions for each point in that post code. You will not necessarily have convex hulls for each post-code, but you will have hulls and will almost certainly have a single hull

Re: SGD different confusion matrix for each run

2012-08-31 Thread Ted Dunning
First, this is a tiny training set. You are well outside the intended application range so you are likely to find less experience in the community in that range. That said, the algorithm should still produce reasonably stable results. Here are a few questions: a) which class are you using to

Re: SGD different confusion matrix for each run

2012-08-31 Thread Ted Dunning
) and didn't find that the data was passed more than once. Yes I randomize the order every time. a) I am using AdaptiveLearningRegression (just like 20newsgroup). Thanks! On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote: First, this is a tiny training set. You are well outside the intended

Re: SGD different confusion matrix for each run

2012-08-31 Thread Ted Dunning
. And randomize the order each time? On Fri, Aug 31, 2012 at 9:04 AM, Salman Mahmood sal...@influestor.com wrote: Cheers ted. Appreciate the input! Sent from my iPhone On 31 Aug 2012, at 17:53, Ted Dunning ted.dunn...@gmail.com wrote: OK. Try passing through the data 100 times

Re: SGD different confusion matrix for each run

2012-08-31 Thread Ted Dunning
] http://en.wikipedia.org/wiki/Bootstrapping_(statistics) On Fri, Aug 31, 2012 at 11:24 PM, Ted Dunning ted.dunn...@gmail.com wrote: That would be best, but practically speaking, randomizing once is usually OK. With a tiny data set like this that is in memory anyway, I wouldn't take any chances

Re: SSVD Wrong Singular Vectors

2012-08-31 Thread Ted Dunning
Can you provide your test code? What difference did you observe? Did you account for the fact that your matrix is small enough that it probably wasn't divided correctly? On Sat, Sep 1, 2012 at 1:27 AM, Ahmed Elgohary aagoh...@gmail.com wrote: Hi, I used mahout's stochastic svd

Re: Mahout-279/kmeans++

2012-08-30 Thread Ted Dunning
:53 PM, Whitmore, Mattie mwhit...@harris.com wrote: I need to be using the matrices for BallKmeans. Can matrices be named? By this I mean can I assign a column of my matrix to be the name of each row? Thanks! -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent

Re: Mahout-279/kmeans++

2012-08-30 Thread Ted Dunning
But columns aren't what I would expect you to want labeled. I think that row labels might be nicer. Happily, each named vector has a name for the entire vector as well. On Thu, Aug 30, 2012 at 2:48 PM, Ted Dunning ted.dunn...@gmail.com wrote: The input to the BallKmeans is actually

Re: Mahout-279/kmeans++

2012-08-30 Thread Ted Dunning
Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Thursday, August 30, 2012 2:52 PM To: user@mahout.apache.org Subject: Re: Mahout-279/kmeans++ But columns aren't what I would expect you to want labeled. I think that row labels might be nicer. Happily, each named vector

Re: Mahout-279/kmeans++

2012-08-30 Thread Ted Dunning
No. The algorithm works either way. The algorithm doesn't need the full capabilities of a matrix since it just makes a few sequential passes through the data. On Thu, Aug 30, 2012 at 3:25 PM, Whitmore, Mattie mwhit...@harris.com wrote: Would the algorithm implement better as if given a matrix?

Re: Deploying a classification model using zookeeper

2012-08-29 Thread Ted Dunning
It isn't a big deal to increase the Znode size, but it is bad practice. ZK isn't a file store. It is a coordination server. The size limit is intended to prevent large operations slowing down other operations. If you aren't sharing your ZK or your neighbors don't have response time

Re: Voronoi

2012-08-29 Thread Ted Dunning
Karl, I don't think that I understand your request. What I think I hear is that you want an implementation (with unknown inputs and outputs) that encodes a Voronoi tessellation using boundary vertices instead of centroids. Is that correct? If so, it is relatively easy to go from centroid form

Re: great sgd datasets

2012-08-28 Thread Ted Dunning
These are fairly straightforward to generate from random data. Not particularly realistic, but highly parametrizable. RCV1 should be almost in that range. I think that the recent KDD music classification exercise would be in that range if viewed as a classification exercise. See

Re: Malicious users on recommender system

2012-08-28 Thread Ted Dunning
The single most effective thing you can do with malicious users like this is to let them think that they have won. In the ideal case, you can detect simple click frauds and maintain a per user play adjustment so that they see the fraudulent stats and everybody else sees the corrected stats. If

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

2012-08-27 Thread Ted Dunning
Obviously, you need to refer to scores of other items as well. One handy stat is AUC, which you can compute by averaging to get the probability that a relevant (viewed) item has a higher recommendation score than a non-relevant (not viewed) item. On Sun, Aug 26, 2012 at 5:55 PM, Sean Owen
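The pairwise definition of AUC given in the snippet above translates directly into code. The following is a minimal Python sketch under stated assumptions: the `auc` helper is hypothetical, the scores are made up, and counting a tie as half a win is a common convention chosen here rather than something stated in the message.

```python
def auc(relevant_scores, other_scores):
    """Estimate the probability that a relevant item outscores a non-relevant one.

    Every (relevant, non-relevant) pair is compared; ties count as half (assumed convention).
    """
    wins = 0.0
    for r in relevant_scores:
        for o in other_scores:
            if r > o:
                wins += 1.0
            elif r == o:
                wins += 0.5
    return wins / (len(relevant_scores) * len(other_scores))


# Hypothetical recommendation scores for one user.
viewed = [0.9, 0.7, 0.4]
not_viewed = [0.8, 0.3, 0.2, 0.1]
print(auc(viewed, not_viewed))
```

Comparing all pairs is quadratic in the number of items; for large catalogs the same quantity is usually estimated by sampling pairs or computed from ranks.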
