Re: mahout on GPU

2012-07-13 Thread Ted Dunning
stuff. I am new to both one and I > want to choose one to focus on. > > > > On Tue, Jul 10, 2012 at 4:35 PM, Ted Dunning > wrote: > > > Note that on page 6 they explicitly say that if they had to actually read > > their input, this wouldn't help. Since they

Re: irregular kmeans clusters on binary data

2012-07-13 Thread Ted Dunning
On Fri, Jul 13, 2012 at 12:09 PM, Masoud Moshref Javadi wrote: > First of all thank you for your response with pictures. > That's true. Some features are 1 in many points and some are not. That's > the nature of my problem. But I did not scale features. > Should I do scaling? may be using a dimens

Re: RowSimilarity

2012-07-13 Thread Ted Dunning
For document-like objects, search using text retrieval-like techniques (but in batch) is good. For reduced-dimension document-like things, you need to go to alternative methods to do full-scale nearest neighbor computations. With a strong metric like L_2, you can do all-points nearest n

Re: RowSimilarity

2012-07-14 Thread Ted Dunning
Solr would do this well. The upcoming knn package would do it differently and for different purposes, but also would do it well. On Sat, Jul 14, 2012 at 8:17 AM, Pat Ferrel wrote: > Intersting. > > I have another requirement, which is to do something like real time vector > based queries. Imagi

Re: RowSimilarity

2012-07-14 Thread Ted Dunning
I would call it kinda-cosine distance. There are some intricate normalization factors. On Sat, Jul 14, 2012 at 5:22 PM, Lance Norskog wrote: > Lucene's MoreLikeThis feature does cosine distance (I think) directly > against term vectors. > > On Sat, Jul 14, 2012 at 11:1

Re: Adaptive logistic regression: inconsistent results

2012-07-15 Thread Ted Dunning
It is possibly sparseness, but more likely this is the known pathology of the adaptive logistic regression in which it gets over-confident and locks down training rate too early. I have a few suggestions: 1) try the OnlineLogisticRegression. I think that you can find decent training parameters p

Re: RowSimilarity

2012-07-18 Thread Ted Dunning
For picking terms from a document that stand apart from those in a large corpus, this tf*idf trick is nearly identical to using the latent log likelihood test. It produces pretty darned good results. On Tue, Jul 17, 2012 at 8:22 PM, Ken Krugler wrote: > The simplistic approach I used was to extr

Re: eigendecomposition of very large matrices

2012-07-19 Thread Ted Dunning
Folks have done SVD on very large matrices with Mahout, but not necessarily for spectral clustering. Are you sure that you actually need 4000 vectors? As sparse as your data is, I would expect that no more than a few hundred are anything but statistical noise. On Thu, Jul 19, 2012 at 6:32 PM, An

Re: eigendecomposition of very large matrices

2012-07-19 Thread Ted Dunning
> Hi Ted, > Thanks for your reply. > I am doing clustering of 10^6 objects (thus affinity matrix of that size) > and expect 4000-10,000 clusters. That's why I need those many eigenvectors. > > Will SVD be faster in this case ? > > Aniruddha > > > > On Jul 19, 2012, a

Re: performance study

2012-07-28 Thread Ted Dunning
I am unaware of such comparisons. I also don't know of any practical implementations for doing really huge decompositions in parallel. On Sat, Jul 28, 2012 at 10:27 AM, mohsen jadidi wrote: > Thank you for your replies. What I am interested to know is that if I want > to compute the SVD for hug

Re: eigendecomposition of very large matrices

2012-07-28 Thread Ted Dunning
The algorithm used doesn't change this. If U S V' = A is the SVD of A, then A' A = (U S V')' U S V' = V S U' U S V' = V S^2 V' On Thu, Jul 26, 2012 at 4:31 PM, John Stewart wrote: > With Lanczos, the eigenvectors of A'A give you the orthogonal matrix V of > SVD, and th

Re: RowSimilarity, Solr, or truncated clustering?

2012-07-30 Thread Ted Dunning
Pat, Seed selection is a big deal. See this paper for some ideas: http://www.math.uwaterloo.ca/~cswamy/papers/kmeansfnl.pdf On Mon, Jul 30, 2012 at 11:33 AM, Pat Ferrel wrote: > I need to create groups of items that are similar to a seed item. This > seed item may be a synthetic vector or may

Re: Mahout LanczosSolver explanation

2012-07-31 Thread Ted Dunning
Why are you using Lanczos? Why not use something more recent? On Tue, Jul 31, 2012 at 7:00 PM, Aniruddha Basak wrote: > Hi, > I am working on Spectral Kmeans which involves an eigen-decomposition step > using Lanczos. As I did not get exact similar results as expected, I tried > to understand th

Re: Mahout LanczosSolver explanation

2012-07-31 Thread Ted Dunning
ternative of Lanczos. > > Thanks, > Aniruddha > > > -Original Message- > From: Ted Dunning [mailto:ted.dunn...@gmail.com] > Sent: Tuesday, July 31, 2012 6:24 PM > To: user@mahout.apache.org > Cc: Jake Mannix > Subject: Re: Mahout LanczosSolver explanation > >

Re: performance study

2012-08-01 Thread Ted Dunning
I would like to endorse this point. If your sparse data fits in memory on a single machine, it is very unlikely that you will be able to improve on the cost of doing a stochastic projection on that one machine using any Hadoop based solution. Even with MPI and crazy RDMA networking, I doubt that

Re: Tags generation?

2012-08-03 Thread Ted Dunning
tf-idf is a good approximation of the LLR score for many applications and often gives useful signatures, although not always super pretty. It helps to have an overall minimum document frequency for terms to be considered for being tags. This is the same as an IDF maximum. On Fri, Aug 3, 2012
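
A minimal Python sketch of the idea in this post: score a document's terms by tf*idf, but only consider terms whose corpus-wide document frequency reaches a minimum (equivalently, whose IDF stays under a maximum). The function name `tag_candidates`, the `min_df` and `top_n` parameters, and the toy corpus are all assumptions made for illustration; none of this is Mahout code.

```python
import math
from collections import Counter


def tag_candidates(doc_tokens, corpus, min_df=2, top_n=2):
    """Rank the terms of one document by tf*idf, skipping rare terms.

    `corpus` is a list of token lists; `min_df` is the (assumed) minimum
    number of documents a term must appear in to be considered as a tag.
    """
    n_docs = len(corpus)
    # Document frequency: how many corpus documents contain each term.
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))

    tf = Counter(doc_tokens)
    scores = {}
    for term, count in tf.items():
        if df[term] < min_df:
            continue  # too rare: same effect as capping the IDF
        scores[term] = count * math.log(n_docs / df[term])
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


corpus = [
    ["mahout", "kmeans", "cluster", "the"],
    ["mahout", "svd", "matrix", "the"],
    ["kmeans", "cluster", "seed", "the"],
    ["the", "typo0"],
]
doc = ["kmeans", "kmeans", "cluster", "the", "typo0"]
print(tag_candidates(doc, corpus))
```

The one-off token `typo0` would otherwise score well because of its high IDF; the document-frequency floor is what keeps it out of the tag list.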

Re: Tags generation?

2012-08-03 Thread Ted Dunning
Unstemming is pretty simple. Just build an unstemming dictionary based on seeing what word forms have led to a stemmed form. Include frequencies. When unstemming in the context of a document, pick the most popular (corpus-wide) version that actually appears in the document. On Fri, Aug 3, 2012
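
The recipe above can be sketched in a few lines of Python. The stemmer here (`toy_stem`) is a deliberately crude stand-in invented for this example, and the function names are assumptions; a real setup would plug in whatever stemmer produced the stems in the first place.

```python
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_unstem_dictionary(corpus_tokens):
    """Map each stem to a Counter of the surface forms that led to it."""
    forms = defaultdict(Counter)
    for word in corpus_tokens:
        forms[toy_stem(word)][word] += 1
    return forms


def unstem(stem, doc_tokens, forms):
    """Pick the most popular corpus-wide form that appears in this document."""
    present = set(doc_tokens)
    candidates = [
        (count, word)
        for word, count in forms.get(stem, {}).items()
        if word in present
    ]
    if not candidates:
        return stem  # no known form occurs in the document; keep the stem
    return max(candidates)[1]


corpus_tokens = (
    ["recommending"] * 3 + ["recommends"] * 2 + ["recommended", "recommend"]
)
forms = build_unstem_dictionary(corpus_tokens)
doc = ["she", "recommends", "what", "was", "recommended"]
print(unstem("recommend", doc, forms))
```

Although `recommending` is the most frequent form in the corpus, it does not occur in the document, so the most frequent form that does occur is chosen.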

Re: Tags generation?

2012-08-03 Thread Ted Dunning
This is definitely just the first step. Similar goofs happen with inappropriate stemming. For instance, AIDS should not stem to aid. A reasonable way to find and classify exceptional cases is to look at cooccurrence statistics. The contexts of original forms can be examined to find cases where

Re: MIA graphs

2012-08-03 Thread Ted Dunning
Later diagrams in the classifier section were created using Omnigraffle. Again, nothing too fancy. On Fri, Aug 3, 2012 at 2:53 PM, Sean Owen wrote: > (You can ask in the book forum if it is specific to the book rather than > the project. Maybe I can follow up with you directly off list.) > > Wh

Re: Maven build unpacks jars- would jar of jars work?

2012-08-04 Thread Ted Dunning
I didn't think that Java supports jars inside jars. On Sat, Aug 4, 2012 at 5:04 PM, Lance Norskog wrote: > The Maven build does a grand project unpacking multiple jars into one > big one. Java apparently supports packing jars inside other jars- the > outer jar needs a classpath property for the

Re: Maven build unpacks jars- would jar of jars work?

2012-08-05 Thread Ted Dunning
upId}:${artifact.artifactId} > org.apache.hadoop:hadoop-core > > > > true > > ${artifact.groupId}:${artifact.artifactId} > > > > > > > On Sun, Aug 5, 2012 at 1:10 AM, Ted Dunning wrote: > > I didn't

Re: Tags generation?

2012-08-07 Thread Ted Dunning
On Fri, Aug 3, 2012 at 1:08 PM, Dawid Weiss > > > wrote: > > >> I know, I know. :) Just wanted to mention that it could lead to funny > > >> results, that's all. There are lots of way of doing proper form > > >> disambiguation, includ

Re: KMeans job fails during 2nd iteration. Java Heap space

2012-08-09 Thread Ted Dunning
The upcoming knn package has a file based matrix implementation that uses memory mapping to allow sharing a copy of a large matrix between processes and threads. Sent from my iPhone On Aug 9, 2012, at 1:48 AM, Abramov Pavel wrote: > Hello, > > If think Zipf's law is relevant for my data.

Re: KMeans job fails during 2nd iteration. Java Heap space

2012-08-09 Thread Ted Dunning
/knn Any help in testing these new capabilities or plumbing them into the standard Mahout capabilities would be very much appreciated. On Thu, Aug 9, 2012 at 7:05 AM, Ted Dunning wrote: > The upcoming knn package has a file based matrix implementation that uses > memory mapping to allow sha

Re: How good recommendations and precision works

2012-08-09 Thread Ted Dunning
Recommenders and classifiers are very similar animals in general except for the training data. You can view a recommender as an engine that invents a classifier for each user but it does this by using other user histories as training data. This means that there can be a lot of confusion when look

Re: genetic algorithms / watchmaker removal

2012-08-11 Thread Ted Dunning
The Watchmaker implementation was not very scalable and there was no perceptible user demand for it. There was also nobody who was maintaining it. So we nuked it. There is still a limited evolutionary algorithm that is part of the AdaptiveLogisticRegression. It is likely to be pretty good on pr

Re: genetic algorithms / watchmaker removal

2012-08-11 Thread Ted Dunning
'll take a look at the old Watchmaker code and maybe try to improve on it. > Thanks for the help. > > -Jason > > > On Sat, Aug 11, 2012 at 6:20 PM, Ted Dunning > wrote: > > > The Watchmaker implementation was not very scalable and there was no > > perceptible

Re: genetic algorithms / watchmaker removal

2012-08-12 Thread Ted Dunning
2012 at 1:36 AM, Ted Dunning > wrote: > > > That sounds like a continuous optimization problem. > > > > Look at the org.apache.mahout.ep.EvolutionaryProcess > > > > It is an implementation of recorded step meta-mutation and does quite > well > > on many pr

Re: Mahout-279/kmeans++

2012-08-15 Thread Ted Dunning
Mattie, Would this help? https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java and https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf On Wed, Aug 15, 2012 at 10:45 AM, Whitmore, Mattie wrote: > Hi! > > I have

Re: Encoding and vectorizing

2012-08-16 Thread Ted Dunning
If your data is dense and numerical, then you don't need anything but trivial encoding. Just copy the values from your CSV file into the vector, converting to numbers as you go. If some of your data are categorical or textual, you will need fancier footwork. On Thu, Aug 16, 2012 at 3:28 AM, Chan

Re: Apache Mahout without Hadoop

2012-08-17 Thread Ted Dunning
Most algorithms have non-hadoop versions. On Thu, Aug 16, 2012 at 9:22 AM, Chandra Mohan, Ananda Vel Murugan < ananda.muru...@honeywell.com> wrote: > I think Mahout can be used as a library too. Some algorithms are > implemented in map-reduce fashion and they may need Hadoop, but rewriting > them

Re: Encoding and vectorizing

2012-08-17 Thread Ted Dunning
footwork? Should I convert categories into some numbers > and store in vector? Thanks!! > > -Original Message- > From: Ted Dunning [mailto:ted.dunn...@gmail.com] > Sent: Thursday, August 16, 2012 8:08 PM > To: user@mahout.apache.org > Cc: mahout-u...@apache.org > Subject

Re: Mahout-279/kmeans++

2012-08-22 Thread Ted Dunning
>= clusterClassificationThreshold; > > } > > > > On 17-08-2012 20:06, Whitmore, Mattie wrote: > > > >> Hi Ted, > >> > >> Yes this is great! I hope to start working with this algorithm in the > next couple weeks. > >> > >>

Re: Mahout-279/kmeans++

2012-08-22 Thread Ted Dunning
algorithm from dropping non-distinct vectors/data > points (which is what I THINK but have yet to verify is what is going on)? > > Thanks, > > Mattie > > -----Original Message- > From: Ted Dunning [mailto:ted.dunn...@gmail.com] > Sent: Wednesday, August 22, 2012 1:18 PM >

Re: A Single DistributedRowMatrix with Multiple SequenceFiles

2012-08-22 Thread Ted Dunning
Not yet, but it makes a lot of sense to allow an InputProvider from the guava library in addition to a single file. Not a lot of sense in things in between. On Wed, Aug 22, 2012 at 8:55 PM, Ahmed Elgohary wrote: > Hi, > > I was wondering why the constructor of DistributedRowMatrix restricts the

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

2012-08-27 Thread Ted Dunning
Obviously, you need to refer to scores of other items as well. One handy stat is AUC, which you can compute by averaging to get the probability that a relevant (viewed) item has a higher recommendation score than a non-relevant (not viewed) item. On Sun, Aug 26, 2012 at 5:55 PM, Sean Owen wr
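
As a rough illustration of the AUC description in this post, the sketch below averages over all (relevant, non-relevant) pairs and counts how often the relevant item scores higher. The function name and the sample scores are made up for the example, and counting ties as half a win is an assumption (a common convention, not something the post states).

```python
def auc(relevant_scores, other_scores):
    """Estimate P(score of a relevant item > score of a non-relevant item).

    Brute force over all pairs; ties are counted as 0.5 by assumption.
    """
    wins = 0.0
    for r in relevant_scores:
        for o in other_scores:
            if r > o:
                wins += 1.0
            elif r == o:
                wins += 0.5
    return wins / (len(relevant_scores) * len(other_scores))


viewed = [0.9, 0.7, 0.4]           # recommendation scores of viewed items
not_viewed = [0.8, 0.3, 0.2, 0.1]  # scores of items that were not viewed
print(auc(viewed, not_viewed))
```

Of the 12 pairs here, the viewed item wins 10, so the estimate is 10/12. The pairwise loop is quadratic; for large test sets a rank-based computation gives the same number more cheaply.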

Re: Visualization of word clusters

2012-08-27 Thread Ted Dunning
Here is some pretty old work that did the same sort of thing. The self organizing map (SOM) is an interesting alternative to MDS since it allows mapping a low dimensional approximate manifold to a linear space. The basic idea is that it preserves close distances and doesn't much care about distan

Re: Can someone suggest an approach for calculating precision and recall for distributed recommendations?

2012-08-27 Thread Ted Dunning
In another forum, I responded to this question this way: One short answer is that you only need enough test data to drive the > accuracy of your PR estimates to the point you need them. That isn't all > that much data so the sequential version should do rather well. > The gold standard, of course,

Re: great sgd datasets

2012-08-28 Thread Ted Dunning
These are fairly straightforward to generate from random data. Not particularly realistic, but highly parametrizable. RCV1 should be almost in that range. I think that the recent KDD music classification exercise would be in that range if viewed as a classification exercise. See http://jmlr.csa

Re: Malicious users on recommender system

2012-08-28 Thread Ted Dunning
The single most effective thing you can do with malicious users like this is to let them think that they have won. In the ideal case, you can detect simple click frauds and maintain a per user play adjustment so that they see the fraudulent stats and everybody else sees the corrected stats. If yo

Re: Malicious users on recommender system

2012-08-28 Thread Ted Dunning
can solve this case that happened to > Amazon > http://news.cnet.com/2100-1023-976435.html > > Thanks > > > > > On Tue, Aug 28, 2012 at 8:23 PM, Ted Dunning > wrote: > > The single most effective thing you can do with malicious users like this > > is to

Re: Deploying a classification model using zookeeper

2012-08-29 Thread Ted Dunning
It isn't a big deal to increase the Znode size, but it is bad practice. ZK isn't a file store. It is a coordination server. The size limit is intended to prevent large operations slowing down other operations. If you aren't sharing your ZK or your neighbors don't have response time expectations

Re: Mahout-279/kmeans++

2012-08-29 Thread Ted Dunning
distinct (albeit the data point is the same as other points > in the set) will this keep the algorithm from dropping non-distinct > vectors/data points (which is what I THINK but have yet to verify is what > is going on)? > >> > >> Thanks, > >> > >> Mattie >

Re: Voronoi

2012-08-29 Thread Ted Dunning
Karl, I don't think that I understand your request. What I think I hear is that you want an implementation (with unknown inputs and outputs) that encodes a Voronoi tessellation using boundary vertices instead of centroids. Is that correct? If so, it is relatively easy to go from centroid form to

Re: Mahout-279/kmeans++

2012-08-30 Thread Ted Dunning
, Whitmore, Mattie wrote: > I need to be using the matrices for BallKmeans. Can matrices be named? By > this I mean can I assign a column of my matrix to be the "name" of each row? > > Thanks! > > -Original Message- > From: Ted Dunning [mailto:ted.dunn...@gmail.co

Re: Mahout-279/kmeans++

2012-08-30 Thread Ted Dunning
But columns aren't what I would expect you to want labeled. I think that row labels might be nicer. Happily, each named vector has a name for the entire vector as well. On Thu, Aug 30, 2012 at 2:48 PM, Ted Dunning wrote: > The input to the BallKmeans is actually not a matrix.

Re: Mahout-279/kmeans++

2012-08-30 Thread Ted Dunning
s for the guidance! > > > -Original Message- > From: Ted Dunning [mailto:ted.dunn...@gmail.com] > Sent: Thursday, August 30, 2012 2:52 PM > To: user@mahout.apache.org > Subject: Re: Mahout-279/kmeans++ > > But columns aren't what I would expect you to want labeled.

Re: Mahout-279/kmeans++

2012-08-30 Thread Ted Dunning
No. The algorithm works either way. The algorithm doesn't need the full capabilities of a matrix since it just makes a few sequential passes through the data. On Thu, Aug 30, 2012 at 3:25 PM, Whitmore, Mattie wrote: > Would the algorithm implement better as if given a matrix? I'm thinking of >

Re: Voronoi

2012-08-31 Thread Ted Dunning
Yes. Essentially this means construct the Voronoi tessellation for all points and for each post code, use the union of the regions for each point in that post code. You will not necessarily have convex hulls for each post-code, but you will have hulls and will almost certainly have a single hull f

Re: SGD diferent confusion matrix for each run

2012-08-31 Thread Ted Dunning
First, this is a tiny training set. You are well outside the intended application range so you are likely to find less experience in the community in that range. That said, the algorithm should still produce reasonably stable results. Here are a few questions: a) which class are you using to tr

Re: SGD diferent confusion matrix for each run

2012-08-31 Thread Ted Dunning
longs to. > > b) I am passing through the data once (at least this is what I think). I > folowed the 20newsgroup example code(in java) and dint find that the data > was passed more than once. > Yes I randomize the order every time. > > a) I am using AdaptiveLearningRegression (j

Re: SGD diferent confusion matrix for each run

2012-08-31 Thread Ted Dunning
> > And randomize the order each time? > > On Fri, Aug 31, 2012 at 9:04 AM, Salman Mahmood > wrote: > > Cheers ted. Appreciate the input! > > > > Sent from my iPhone > > > > On 31 Aug 2012, at 17:53, Ted Dunning wrote: > > > >> OK. > >

Re: SGD diferent confusion matrix for each run

2012-08-31 Thread Ted Dunning
] http://en.wikipedia.org/wiki/Bootstrapping_(statistics) On Fri, Aug 31, 2012 at 11:24 PM, Ted Dunning wrote: > That would be best, but practically speaking, randomizing once is usually > OK. With a tiny data set like this that is in memory anyway, I wouldn't > take any chances. > >

Re: SSVD Wrong Singular Vectors

2012-08-31 Thread Ted Dunning
Can you provide your test code? What difference did you observe? Did you account for the fact that your matrix is small enough that it probably wasn't divided correctly? On Sat, Sep 1, 2012 at 1:27 AM, Ahmed Elgohary wrote: > Hi, > > I used mahout's stochastic svd implementation to find the si

Re: SSVD Wrong Singular Vectors

2012-09-01 Thread Ted Dunning
t singular vectors. The only thing is that they > > seem to change the sign between R and Mahout's version but otherwise > > they fit more or less exactly. > > > > So yeah i am seeing some stochastic effects in these for k and p being > > so low -- so are you saying

Re: SSVD Wrong Singular Vectors

2012-09-01 Thread Ted Dunning
's version but otherwise > > they fit more or less exactly. > > > > So yeah i am seeing some stochastic effects in these for k and p being > > so low -- so are you saying your errors are greater than those? I did > > not test sequential version with similar paramet

Re: SSVD error

2012-09-01 Thread Ted Dunning
With 57 crawled docs, you can't reasonably set p > 57. That is your second error. On Sat, Sep 1, 2012 at 10:32 AM, Pat Ferrel wrote: > I have a small data set that I am using in local mode for debugging > purposes. The data is 57 crawled docs with something like 2200 terms. I run > this through

Re: SSVD error

2012-09-01 Thread Ted Dunning
> > On Sep 1, 2012, at 7:53 AM, Ted Dunning wrote: > > With 57 crawled docs, you can't reasonably set p > 57. That is your second > error. > > On Sat, Sep 1, 2012 at 10:32 AM, Pat Ferrel wrote: > > > I have a small data set that I am using in local mode for

Re: SSVD Wrong Singular Vectors

2012-09-01 Thread Ted Dunning
On Sun, Sep 2, 2012 at 12:26 AM, Ahmed Elgohary wrote: > - I am using k = 30 and p = 2 so (k+p)<99 (Rank(A)) > - I am attaching the csv file of the matrix A > Brilliant. And the attachment actually made it through. > - yes, the difference is significant. Here is the output of the sequential >

Re: SSVD Wrong Singular Vectors

2012-09-02 Thread Ted Dunning
Did Ahmed even use a power iteration? On Sun, Sep 2, 2012 at 1:35 AM, Dmitriy Lyubimov wrote: > but there is still a concern in a sense that power iterations > should've helped more than they did. I'll take a closer look but it > will take me a while to figure if there's something we can improve

Re: SSVD Wrong Singular Vectors

2012-09-03 Thread Ted Dunning
spectrum. Flat spectrum just means you don't have > > trends in those directions, i.e. essentially a random noise. If you > > have random noise, direction of that noise is usually of little > > interest, but because spectrum (i.e. singular values) is measured > > b

Re: SSVD Wrong Singular Vectors

2012-09-04 Thread Ted Dunning
A quick t-test on these differences gives the same result: no significant difference. On Mon, Sep 3, 2012 at 11:34 PM, Dmitriy Lyubimov wrote: > Then i subtracted error means between two methods (+ sign means > smaller error for MR version, -sign means smaller error for R > sequential versio

Re: SGD model sizes

2012-09-04 Thread Ted Dunning
The model size is very simple. If you have k categories and m features, the model size will be (k-1) * m * s1 + m * s2 + s3 where s1 is roughly 8 bytes and s2 is about 4 bytes and s3 is probably around 100 bytes. These are approximate numbers and could be off by 2 if I forgot something. The firs
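
The size estimate in this post is easy to turn into a quick calculator. The sketch below just plugs numbers into the formula as stated, using the approximate constants from the post as defaults (s1 = 8, s2 = 4, s3 = 100 bytes); the function name and the example values of k and m are arbitrary choices for illustration.

```python
def sgd_model_bytes(k, m, s1=8, s2=4, s3=100):
    """Approximate model size in bytes for k categories and m features.

    Implements (k - 1) * m * s1 + m * s2 + s3 with the rough per-item
    sizes quoted in the post as defaults.
    """
    return (k - 1) * m * s1 + m * s2 + s3


# 20 categories, 100,000 features:
# 19 * 100000 * 8 + 100000 * 4 + 100 = 15,600,100 bytes (about 15 MB)
print(sgd_model_bytes(20, 100_000))

# Binary classifier with 1,000 features: 8000 + 4000 + 100 bytes
print(sgd_model_bytes(2, 1_000))
```

Since the constants are approximate, the result is best read as an order-of-magnitude estimate.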

Re: PCA doc question for devs:

2012-09-05 Thread Ted Dunning
Yes. (A-M)V is U \Sigma. You may actually want something like U \sqrt \Sigma instead, though. On Wed, Sep 5, 2012 at 4:10 PM, Dmitriy Lyubimov wrote: > Hello, > > I have a question w.r.t what to advise people in the SSVD manual for PCA. > > So we have > > (A-M) \approx U \Sigma V^t > > and st

Re: SGD Based Recommender Contribution Proposal

2012-09-06 Thread Ted Dunning
This sounds pretty exciting. Beyond that, it is hard to say much. Can you say a bit more about how you would see introducing the code into Mahout? On Thu, Sep 6, 2012 at 9:14 AM, Gokhan Capan wrote: > By the way, I want to mention that my thesis is advised by Ozgur Yilmazel, > who is a foundin

Re: Should I be using OnlineLogisticRegression?

2012-09-06 Thread Ted Dunning
Try transforming them as well, likely with a log if they are positive and have heavily skewed values. Can you suck the data into R and paste in the results of summary(x)? (assuming you put the data into the variable x). This should look something like: > summary(x) >v1 v2

Re: Should I be using OnlineLogisticRegression?

2012-09-07 Thread Ted Dunning
t above are welcome...to help me > validate my thought process. > > Thanks for the hints, I will let you know how it turns out. > > Mike > > On Thu, Sep 6, 2012 at 8:14 PM, Ted Dunning wrote: > > > > Try transforming them as well, likely with a log if they are positi

Re: ArrayIndexOutOfBoundsException SparseMatrix

2012-09-09 Thread Ted Dunning
You are using lots of threads but the sparse matrix structure is not thread safe. Setting a value in the SparseMatrix causes mutation to internal data structures. If you can have each thread do all the updates for a single row, that would be much better. Another option is to synchronize on th

Re: SGD Based Recommender Contribution Proposal

2012-09-09 Thread Ted Dunning
Great. If the update has a huge impact on existing code, can you break it into manageable pieces? If it is just an addition, having a big blob of stuff is probably fine. On Sun, Sep 9, 2012 at 7:01 AM, Gokhan Capan wrote: > On Fri, Sep 7, 2012 at 12:48 AM, Ted Dunning > wrote: >

Re: ArrayIndexOutOfBoundsException SparseMatrix

2012-09-10 Thread Ted Dunning
Multi-threading at the cell level will not likely help. Multi-threading at the row level might help. I would recommend that you use a threaded pool executor and feed the rows into the pool. You won't need locks this way and you will maximize your use of your cores. The basic code would look rou
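
The code from the original message is cut off in this listing. As a stand-in, here is a rough Python sketch of the row-per-task pattern the post describes: a thread pool is fed row indices, each task builds one complete row on its own, and only the submitting thread assembles the result, so no locks are needed. This is not Mahout's API; `fill_row`, `build_rows`, and the cell formula are hypothetical, and a Java version would use an `ExecutorService` in the same shape.

```python
from concurrent.futures import ThreadPoolExecutor


def fill_row(row_index, n_cols):
    """Compute every non-zero cell of one row; no other task touches it."""
    row = {
        col: float(row_index * col)
        for col in range(n_cols)
        if (row_index + col) % 3 == 0  # arbitrary sparsity pattern
    }
    return row_index, row


def build_rows(n_rows, n_cols, workers=4):
    """Feed rows into a pool; only the calling thread mutates `rows`."""
    rows = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda i: fill_row(i, n_cols), range(n_rows))
        for row_index, row in results:
            rows[row_index] = row
    return rows


rows = build_rows(6, 5)
print(rows[3])
```

Each worker returns a finished row rather than writing cells into a shared structure, which is what removes the need for synchronization.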

Re: Is mahout kmeans slow ?

2012-09-12 Thread Ted Dunning
Yes. I have been working (slowly) on moving some very fast single pass clustering into Mahout. My work in progress currently does very fast clustering of small dense vectors and it should scale to sparse vectors fairly well with some small changes. See https://github.com/tdunning/knn for more in

Re: Is mahout kmeans slow ?

2012-09-12 Thread Ted Dunning
Also, with 500MB of data, this is likely to only take a few minutes on a single machine with the new clustering stuff. It is hard to estimate precisely, however, due to the difference between dense and sparse cases. On Wed, Sep 12, 2012 at 8:42 PM, Pat Ferrel wrote: > 200 iterations? > > What i

Re: Building Mahout

2012-09-13 Thread Ted Dunning
Yes. It is a grave embarrassment to us, but not a functional requirement. On Thu, Sep 13, 2012 at 6:42 AM, I-Scarlatti, David < david.scarla...@boeing.com> wrote: > Ok. So tests are just tests... not needed for having mahout running > > Thanks! > > > -Original Message- > From: Parito

Re: how to work with ARFF files using Mahout clustering

2012-09-15 Thread Ted Dunning
Hi Ted, > >> > >> Sorry to bother you again. > >> > >> One quick question: Does Mahout support SVM, what is the Java class > name ? > >> Any inputs on its stability and performance ? > >> > >> > >> Thanks > >> Ra

Re: Using SVD-conditioned matrix

2012-09-16 Thread Ted Dunning
And if you want the reduced rank representation of A, you have it already with A_k = U_k S_k V_k' Assume that A is n x m in size. This means that U_k is n x k and V_k is m x k The rank reduced projection of an n x 1 column vector is u_k = U_k U_k' u Beware that v_k is probably not spa
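
A small numeric check of the projection u_k = U_k U_k' u may help. The sketch below uses n = 3 and k = 2 with a hand-picked orthonormal U_k (not one computed from a real SVD, which is an assumption made to keep the example dependency-free), and applies U_k' first and U_k second.

```python
import math


def project(U_k, u):
    """Return U_k U_k' u, with U_k given as a list of k orthonormal columns."""
    # U_k' u: coordinates of u in the k-dimensional latent space.
    coords = [sum(c * x for c, x in zip(col, u)) for col in U_k]
    # U_k (U_k' u): map the latent coordinates back to n dimensions.
    return [
        sum(coef * col[i] for coef, col in zip(coords, U_k))
        for i in range(len(u))
    ]


s = 1.0 / math.sqrt(2.0)
U_k = [[s, s, 0.0], [0.0, 0.0, 1.0]]  # two orthonormal columns, n = 3
u = [1.0, 3.0, 5.0]

u_k = project(U_k, u)
print(u_k)                 # approximately [2.0, 2.0, 5.0]
print(project(U_k, u_k))   # projecting again changes nothing
```

The first two components are averaged because the first column of U_k spans only the direction (1, 1, 0); the part of u orthogonal to the span of U_k is discarded, and a second projection leaves the result unchanged.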

Re: Using SVD-conditioned matrix

2012-09-16 Thread Ted Dunning
pends on V? > > On Sun, Sep 16, 2012 at 5:33 PM, Ted Dunning > wrote: > > And if you want the reduced rank representation of A, you have it already > > with > > > > A_k = U_k S_k V_k' > > > > Assume that A is n x m in size. This means that U_

Re: Using SVD-conditioned matrix

2012-09-16 Thread Ted Dunning
. (Try to figure out Figure > 1.) And it proceeds in its analysis by basically saying that the > projection is Uk' times the new vector, so, I never understood this > expression. > > On Sun, Sep 16, 2012 at 7:13 PM, Ted Dunning > wrote: > > A is in there implicitly. > >

Re: Using SVD-conditioned matrix

2012-09-16 Thread Ted Dunning
u_/A If you shove u through U_k U_k' you get this: U_k U_k' u = U_k U_k' (u_A + u_/A) = U_k U_k' (u_A) + 0 = u_A This is another way of showing that U_k U_k' projects a vector into span A. On Sun, Sep 16, 2012 at 12:55 PM, Ted Dunning wrote: > U_k ' U_k =

Re: Using SVD-conditioned matrix

2012-09-16 Thread Ted Dunning
are talking about expressing things in terms of the latent variables. > On Sun, Sep 16, 2012 at 8:55 PM, Ted Dunning > wrote: > > U_k ' U_k = I > > > > U_k U_k ' != I >

Re: Using SVD-conditioned matrix

2012-09-16 Thread Ted Dunning
't even think that your claim that decreasing k increases recall is correct. > On Sun, Sep 16, 2012 at 4:11 PM, Ted Dunning > wrote: > > On Sun, Sep 16, 2012 at 1:49 PM, Sean Owen wrote: > > > >> Oh right. It's the columns that are orthogonal. Cancel that. >

Re: The default category of a binary classifier

2012-09-19 Thread Ted Dunning
If a classifier is presented text with no words in common with the training data, it will give you back the most common category in the training data. That said, it is likely to be quite rare when a new document consists *entirely* of new words. Any overlap with trained vocabulary is likely to ov

Re: The default category of a binary classifier

2012-09-19 Thread Ted Dunning
PM, Lance Norskog wrote: > Shouldn't this be 'unclassified'? I think I have seen data in the > unclassified buckets with both Bayes and SGD. > > ----- Original Message - > | From: "Ted Dunning" > | To: user@mahout.apache.org > | Sent: Wednesday, Se

Re: hadoop-0.19 and mahout 0.7: throwing incompatible errors, how can I fix it?

2012-09-21 Thread Ted Dunning
On the other hand, the only way that I have been able to do a major version upgrade of Hadoop is to start a new company. It is really hard to change code and platform at the same time. If you don't have enough hardware to have two clusters temporarily, things will be really hard moving off of 0.1

Re: rate option of trainLogistic command

2012-09-21 Thread Ted Dunning
This changes the initial learning rate. Changing this can definitely change convergence properties. On Fri, Sep 21, 2012 at 9:33 AM, Watson Watson wrote: > Hi, > My question is why changing the rate parameter we always change the > coefficients (results of RunLogistic)? > > I encounter the enig

Re: SGD AdaptiveLogisticRegression vs OnlineLogisticRegression

2012-09-23 Thread Ted Dunning
I think that there is an excessive stability issue, actually. What seems to happen is that the adaptive part locks down the learning rate too quickly. This is related to several other issues: - the cross fold learning paradigm is kind of dangerous since it depends on the user not having duplicat

Re: Combiner applied on multiple map task outputs (like in Mahout SVD)

2012-09-27 Thread Ted Dunning
Combiners can be called zero or more times. That can happen on the map side or on the reduce side. On Thu, Sep 27, 2012 at 4:56 AM, Sigurd Spieckermann < sigurd.spieckerm...@gmail.com> wrote: > @Jake: Could you please elaborate on how exactly the combiner can be called > before the reducer gets

Re: Evolution of ratings over time

2012-09-30 Thread Ted Dunning
Other experiments have shown that 60-80% of perception of music "likes" is due to social factors. Factoring this out may or may not be a good thing. My feeling is that if you are trying to make people happy with what you recommend then you need to go with whatever makes them happy whether it is i

Re: K-Means as a surrogate for Matrix Factorization

2012-10-05 Thread Ted Dunning
Johannes, Funny you should mention matrix factorization and k-means at the same moment. I am talking this afternoon in Oxford about just this topic. Yes, you can use the proximity to near clusters as a useful modeling feature, but as Sean said, the cost of matrix factorization should not be the

Re: K-Means as a surrogate for Matrix Factorization

2012-10-05 Thread Ted Dunning
On Fri, Oct 5, 2012 at 4:57 PM, Johannes Schulte wrote: > Hi Ted, > > thanks for the hints. I am however wondering what the reverse projection > would be needed for. Do you mean for explaining stuff only? Or validating a > model manually? > Or for converting recommendations back to items. > Al

Re: K-Means as a surrogate for Matrix Factorization

2012-10-07 Thread Ted Dunning
e a more sparse feature > vector or pre clustering. It probably depends :) > > Thanks for the feedback Ted! > > I will continue my quest how to construct a ctr prediction for a > recommendation delivery. Maybe I should have pointed that goal out before. > > On Fri, Oct 5,

Re: Tuning OnlineLogisticRegression Algo

2012-10-09 Thread Ted Dunning
See this page: http://leon.bottou.org/research/stochastic Google is your friend. This API is, however, not particularly friendly. Therefore, you will have to read about the basics and be able to figure these things out from first principles. There is some documentation in the code. You can al

Re: ** Problem using SGD and iris arff as test set **

2012-10-10 Thread Ted Dunning
SGD is more suitable for large data. I will take a look later today. Sent from my iPhone On Oct 9, 2012, at 11:29 PM, Rajesh Nikam wrote: > Hi Ted, > > Putting specific question with data for getting problem with SGD. > > I am using Iris Plants Database from Michael Marshall. PFA iris.arff

Re: mahout-error in virtual machine

2012-10-10 Thread Ted Dunning
This might work, but the messages indicate that the environment is seriously messed up. Just getting the code isn't going to help. The tests are indicating that there is a real problem (and it isn't likely Mahout). That problem needs fixing and once fixed running the tests isn't a bad thing. On

Re: ** Problem using SGD and iris arff as test set **

2012-10-10 Thread Ted Dunning
ks > Rajesh > > > On Wed, Oct 10, 2012 at 8:08 PM, Ted Dunning > wrote: > > > Sgd is more suitable for large data. I will take a look later today. > > > > Sent from my iPhone > > > > On Oct 9, 2012, at 11:29 PM, Rajesh Nikam wrote: > >

Re: Create vector from text

2012-10-11 Thread Ted Dunning
You have to tokenize your text and then use some form of vector encoding. If you have a known dictionary of all interesting words, you can simply make a vector as long as the number of words in your dictionary and put a 1 in the right place. If you don't want to do that either because you don't k
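
For the known-dictionary case described in this post, a minimal Python sketch could look like the following: tokenize, then put a 1 at the position of every dictionary word that occurs. The tokenizer (lowercase, split on non-alphanumerics) and the four-word dictionary are assumptions made for illustration, and this does not use Mahout's encoder classes.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def encode(text, dictionary):
    """Binary vector as long as the dictionary: 1 where a word occurs."""
    index = {word: i for i, word in enumerate(dictionary)}
    vector = [0] * len(dictionary)
    for token in tokenize(text):
        if token in index:  # words outside the dictionary are ignored
            vector[index[token]] = 1
    return vector


dictionary = ["cluster", "matrix", "vector", "svd"]
print(encode("Vector, vector! SVD of a matrix.", dictionary))
```

Repeated words still produce a single 1, and anything outside the dictionary is dropped, which is why this approach only fits when the set of interesting words is known in advance.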

Re: ** Problem using SGD and iris arff as test set **

2012-10-11 Thread Ted Dunning
Not sure just offhand. Need to look in more detail in a debugger. Need to find time to do that. On Thu, Oct 11, 2012 at 1:58 AM, Rajesh Nikam wrote: > what could be the problem with data formatting ? > Could you please update on the same. > > On Thu, Oct 11, 2012 at 11:31 AM,

Re: SGD: Logistic regression package in Mahout

2012-10-15 Thread Ted Dunning
I would love to help and will before long. Just can't do it in the first part of this week. On Mon, Oct 15, 2012 at 6:28 AM, Rajesh Nikam wrote: > Hello, > > I have asked below question on issue with using sgd on mahout forum. > > Similar issue with sgd is reported by > > http://stackoverflow.c

Re: SGD: Logistic regression package in Mahout

2012-10-16 Thread Ted Dunning
ion: [[*26563.0, 23006.0*], [0.0, 0.0]] > entropy: [[-0.0, -0.0], [-46.1, -21.4]] > > I am not sure why this is failing all the time. > > Looking forward for your reply. > > Thanks > Rajesh > > > > On Tue, Oct 16, 2012 at 3:57 AM, Ted Dunning > wrote: > >

Re: Pseudo-Inverse map reduce implementation

2012-10-18 Thread Ted Dunning
Computing the SVD with the stochastic projection is your best bet. Sent from my iPhone On Oct 17, 2012, at 10:42 PM, Ranjith Uthaman wrote: > Hi, > > Does map reduce implementation of Pseudo-Inverse of a matrix exist in the > current Mahout framework? What are the various ways to achieve it

Re: If you're at Hadoop World this year

2012-10-21 Thread Ted Dunning
If we have descended to personal advertising, then I should mention that I am speaking as well. http://strataconf.com/stratany2012/public/schedule/speaker/126559 I will also have office hours afterwards during which the topic is unlimited. Drop by! On Sun, Oct 21, 2012 at 11:20 AM, Josh Patters
