Re: Detecting rank-deficiency, or worse, via QR decomposition

2013-07-18 Thread Ted Dunning
On Thu, Jul 18, 2013 at 3:35 PM, Dmitriy Lyubimov wrote: > > It is hard to frame HH in a row major fashion. I might be able to figure > > out a Givens rotation method that is row oriented. > > > > Mahout SSVD already has the row-wise Givens rotation solver. I guess I can > revive that as a sta

Re: Keeping track of revisions of models?

2013-07-18 Thread Ted Dunning
, why this would be very hard with git (plus there is no added > value). > > This might not be the best solution but it's a cheap way to see model > performance over time and better than no history > > > On Thu, Jul 18, 2013 at 7:15 AM, Ted Dunning > wrote: > >

Re: Keeping track of revisions of models?

2013-07-17 Thread Ted Dunning
Keeping old models is one thing. Keeping track of exactly which data you trained with is another thing. Since you often need access to both old and new models at the same time, it is common to simply burn a serial number into the file containing the model and simply keep all of them. You need to

Re: Using Mahout for low-volume data

2013-07-15 Thread Ted Dunning
I think so, but I cannot say that I know so. On Mon, Jul 15, 2013 at 8:37 AM, Koobas wrote: > Is a factorizing recommender a better idea for low volume data in general? > > > On Mon, Jul 15, 2013 at 11:35 AM, Ted Dunning > wrote: > > > With such small data, this soun

Re: Using Mahout for low-volume data

2013-07-15 Thread Ted Dunning
d 100k users and around 1k items. The data > is sparse (I have a boolean affinity - the user either bought the item or > did not) > > PS: I have been playing around with a sample code, using Loglikelihood > Similarity to get a 24% precision, is this a par score? > > > >

Re: Using Mahout for low-volume data

2013-07-14 Thread Ted Dunning
Mahout will work fine for smaller data sizes. Collaborative filtering can be difficult in general with small data, however. How many users and how many items? How many actions? On Sun, Jul 14, 2013 at 10:22 PM, Jayesh wrote: > Hello, > > I am exploring the collaborative filtering algorithms

Re: Semi-supervised Learning on Mahout

2013-07-14 Thread Ted Dunning
I have experimented with this, but never used it in anger. The method I used was simply to add the target variable while clustering. Results were interesting. A much more interesting approach is detailed here: http://www.cs.cmu.edu/~epxing/papers/Old_papers/xing_nips02_metric.pdf This also loo

Re: Does Mahout decision forest support regression?

2013-07-10 Thread Ted Dunning
How did you do it? Could you post some explanation / description of your method? On Wed, Jul 10, 2013 at 2:56 PM, Som Satpathy wrote: > I am able to get regression work via Mahout 0.7's random forest. > > Thanks, > Som > > On Fri, Apr 5, 2013 at 4:48 PM, S. Zhou wrote: > > > I am using Mahout

Re: Preserve contents of keys after running k-means

2013-07-05 Thread Ted Dunning
with > that > > in the input. > > > > We may do our dumb idea for now until we get a better handle on how the > > output is written. > > > > Thanks! > > > > > > On Fri, Jul 5, 2013 at 1:57 PM, Ted Dunning > wrote: > > > >> A

Re: Preserve contents of keys after running k-means

2013-07-05 Thread Ted Dunning
Andrew, That is a pretty clever solution. I think that you can get by with a simpler solution by noting how the internal id's are assigned (sequentially, I think). On Fri, Jul 5, 2013 at 1:53 PM, Andrew Musselman wrote: > So how are people working around this without patching 0.7? Downgradi

Re: Mahout 0.7 not working with Hadoop 1.1.1

2013-07-03 Thread Ted Dunning
Use the trunk version. 0.7 is quite old and will not have any fixes applied to it. The current version (will soon be released as 0.8) should work. On Wed, Jul 3, 2013 at 9:47 PM, Apurv Khare wrote: > Please can anyone look into this issue. > > I didn't get any respond. > > -Original Messag

Re: Need help on Mahout

2013-07-03 Thread Ted Dunning
Sugato, Some forms of Hadoop support access to the cluster via NFS [1]. That allows programs like trainlogistic to read data from a Hadoop cluster. Sebastian's suggestion works fine for other Hadoop distributions if you don't mind the copy. I work for the company referenced so keep that in mind

Re: significance of FEATURES in SGD

2013-07-03 Thread Ted Dunning
The dimensionality of the feature vector definitely has a large impact on accuracy as well as on the cost of the learning process. I would be very surprised if you get good accuracy with a feature vector with dimension 100. Even 10,000 may be a bit small but with multiple probes it may well work.
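
The idea of a hashed feature vector with multiple probes can be sketched as follows. This is a minimal illustration in Python, not Mahout's encoder API: the function name `hashed_vector`, the use of MD5, and the parameter names are all assumptions made for the example. Each token is hashed to `probes` different slots, so a collision in one slot is less likely to wipe out the token's signal in a small vector.

```python
import hashlib


def hashed_vector(tokens, dim, probes=2):
    """Encode tokens into a fixed-size vector (hypothetical helper).

    Each token increments `probes` slots, chosen by hashing the token
    together with the probe number.
    """
    vec = [0.0] * dim
    for tok in tokens:
        for p in range(probes):
            digest = hashlib.md5(f"{p}:{tok}".encode("utf-8")).digest()
            idx = int.from_bytes(digest[:8], "big") % dim
            vec[idx] += 1.0
    return vec


if __name__ == "__main__":
    v = hashed_vector(["mahout", "sgd", "mahout"], dim=10000, probes=2)
    print(sum(1 for x in v if x != 0.0), "non-zero slots")
```

Raising `dim` reduces collisions at the cost of a larger model; raising `probes` spreads each feature over more slots.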

Re: mahout logistic regression

2013-07-01 Thread Ted Dunning
pdate > rule is used in the code... > > Regards, > > On Mon, Jul 1, 2013 at 3:22 AM, Ted Dunning wrote: > >> Follow into the regression code itself and check the references. >> >> >> On Fri, Jun 28, 2013 at 3:35 PM, qiaoresearcher >> wrote: >

Re: mahout logistic regression

2013-07-01 Thread Ted Dunning
Follow into the regression code itself and check the references. On Fri, Jun 28, 2013 at 3:35 PM, qiaoresearcher wrote: > The logistic regression code is difficult to follow: the trainlogistic and > runlogistic part > > how the likelihood is calculated, how the weights is updated, etc > > does a

Re: Parallel implemented classifiers in Mahout

2013-06-25 Thread Ted Dunning
SGD is sequential, but if you have millions of examples to classify, you can run many copies of the SGD model in parallel. This is a trivial form of parallelism, but it is often encountered in practice. On Tue, Jun 25, 2013 at 5:31 PM, Suneel Marthi wrote: > ... and SGD is sequential. > > > >

Re: How to Analyse K-mean Clustering output

2013-06-25 Thread Ted Dunning
On Tue, Jun 25, 2013 at 4:53 AM, Apurv Khare wrote: > double[] d = new double[c.length]; > for (int i = 0; i < c.length; i++) > d[i] = Double.parseDouble(c[i]); > Vector vec = new > RandomAccessSparseVector(c

Re: database support for clustering

2013-06-25 Thread Ted Dunning
> > > On Mon, Jun 24, 2013 at 12:55 PM, Ted Dunning > wrote: > > > Better would be to build a Hive UDF that vectorizes your data directly > from > > the Hive table and produces a sequence file with vectors ready to > cluster. > > Then use the streaming k-mean

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Ted Dunning
This is an old chestnut that gets trotted out commonly, but I doubt that the effects that the OP was worried about were on the same scale. Non-associativity of FP arithmetic on doubles rarely has a very large effect. On Mon, Jun 24, 2013 at 11:17 PM, Michael Kazekin wrote: > Any algorithm is

Re: Consistent repeatable results for distributed ALS-WR recommender

2013-06-24 Thread Ted Dunning
See org.apache.mahout.common.RandomUtils#useTestSeed It provides the ability to freeze the initial seed. Normally this is only used during testing, but you could use it. On Mon, Jun 24, 2013 at 8:44 PM, Michael Kazekin wrote: > Thanks a lot! > Do you know by any chance what are the underlying

Re: database support for clustering

2013-06-24 Thread Ted Dunning
Better would be to build a Hive UDF that vectorizes your data directly from the Hive table and produces a sequence file with vectors ready to cluster. Then use the streaming k-means stuff. On Mon, Jun 24, 2013 at 4:43 PM, Chirag Lakhani wrote: > What data base interfaces are there for Mahout?

Re: How to Analyse K-mean Clustering output

2013-06-24 Thread Ted Dunning
What code? On Mon, Jun 24, 2013 at 8:00 AM, Apurv Khare wrote: > Hi, > > I am using clustering for one of my POC. > > My data looks like : > > Id > > Gender > > Education > > Occupation > > Income > > Age > > State > > Marital Status

Re: Query in Mahout

2013-06-22 Thread Ted Dunning
Also, if you use the search engine method for deploying your engine, you can simply put categories on your items and use a boolean search to limit the results of recommendations. The same is useful for geo-limited recommendations. On Fri, Jun 21, 2013 at 9:28 PM, Sebastian Schelter wrote: > Hi

Re: Log-likelihood ratio test as a probability

2013-06-21 Thread Ted Dunning
See https://github.com/tdunning/in-memory-cooccurrence for an in-memory implementation. Should just require three or so lines of code. On Fri, Jun 21, 2013 at 11:23 AM, Sebastian Schelter wrote: > > Not that it much matters, I tend to filter out user x item entries based > on > > the item *an

Re: Log-likelihood ratio test as a probability

2013-06-21 Thread Ted Dunning
rs is fine. The reason that it doesn't much matter is that very few elements are filtered out. > > On Fri, Jun 21, 2013 at 12:35 PM, Ted Dunning > wrote: > > > Well, you are still stuck with the problem that pulling more bits out of > > the small count data is a

Re: Log-likelihood ratio test as a probability

2013-06-21 Thread Ted Dunning
mind be a "smoothed" probability. > > Then, I'd get the top k per item from these. > > > > On Fri, Jun 21, 2013 at 11:45 AM, Ted Dunning > wrote: > > > On Fri, Jun 21, 2013 at 8:25 AM, Dan Filimon < > dangeorge.fili...@gmail.com > > >wrote: >

Re: Log-likelihood ratio test as a probability

2013-06-21 Thread Ted Dunning
On Fri, Jun 21, 2013 at 8:25 AM, Dan Filimon wrote: > Thanks for the reference! I'll take a look at chapter 7, but let me first > describe what I'm trying to achieve. > > I'm trying to identify interesting pairs, the anomalous co-occurrences with > the LLR. I'm doing this for a day's data and I wa

Re: Log-likelihood ratio test as a probability

2013-06-20 Thread Ted Dunning
I think that this is a really bad thing to do. The LLR is really good to find interesting things. Once you have done that, directly using the LLR in any form to produce a weight reduces the method to something akin to Naive Bayes. This is bad generally and very, very bad in the cases of small cou
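
A minimal Python sketch of the log-likelihood ratio for a 2x2 table of co-occurrence counts, used the way the message suggests: to flag interesting pairs, not to produce a weight. The function names and the threshold value are assumptions for illustration only; this is not Mahout's code.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy: N*ln(N) - sum(k*ln(k)) over the counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table.

    k11: both events together, k12 and k21: one event without the other,
    k22: neither event.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def is_interesting(k11, k12, k21, k22, threshold=10.0):
    """Binary indicator only; the threshold is a hypothetical choice."""
    return llr(k11, k12, k21, k22) > threshold
```

The score is used only to decide whether a pair is kept; the kept pairs carry no LLR-derived weight.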

Re: K Mean Clustering on Two columns`

2013-06-18 Thread Ted Dunning
For low dimension problems with limited data, you will be much happier with something like R for clustering and visualization. On Tue, Jun 18, 2013 at 11:52 AM, syed kather wrote: > Hi Team >How to do the K Mean Clustering on 2 selected Columns > > > > Line No,age,income,sex,city > 1,22,15

Re: Negative Preferences in a Recommender

2013-06-18 Thread Ted Dunning
I have found that in practice, don't-like is very close to like. That is, things that somebody doesn't like are very closely related to the things that they do like. Things that are quite distant wind up as don't-care, not don't-like. This makes most simple approaches to modeling polar preferenc

Re: Mahout compatibility with Hadoop

2013-06-17 Thread Ted Dunning
Not sure which distributions you mean when you say "other distributions". MapR includes Mahout 0.7. On Mon, Jun 17, 2013 at 3:56 PM, Razon, Oren wrote: > Thanks Sebastian & Sean. I know Cloudera and other distributions until > lately supported only Mahout 0.5 which made me suspect. > I will

Re: Applying clustering techique

2013-06-14 Thread Ted Dunning
g > be feasible. > > > On Fri, Jun 14, 2013 at 1:27 AM, Ted Dunning > wrote:

Re: Applying clustering techique

2013-06-13 Thread Ted Dunning
ser id. Should be pretty straightforward. > > On Jun 13, 2013, at 10:05 AM, Neetha wrote: > > > Thank you, for the reply. How can we group the user. > > > > > > On Thu, Jun 13, 2013 at 3:41 PM, Ted Dunning > wrote:

Re: LDA/CVB Performance

2013-06-13 Thread Ted Dunning
On Thu, Jun 13, 2013 at 6:50 PM, Jake Mannix wrote: > Andy, note that he said he's running with a 1.6M-term dictionary. That's > going > to be 2 * 200 * 1.6M * 8B = 5.1GB for just the term-topic matrices. Still > not hitting > 8GB, but getting closer. > It will likely be even worse unless this
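
The memory estimate quoted above (2 * 200 * 1.6M * 8B) can be spelled out as a small calculation. The function and parameter names are hypothetical; the factor of 2 is taken from the quoted estimate as-is, and 8 bytes corresponds to one double per matrix entry.

```python
def term_topic_bytes(num_topics, num_terms, bytes_per_value=8, copies=2):
    """Bytes needed for `copies` dense topic-by-term matrices of doubles."""
    return copies * num_topics * num_terms * bytes_per_value


if __name__ == "__main__":
    total = term_topic_bytes(num_topics=200, num_terms=1_600_000)
    print(f"{total / 1e9:.2f} GB")
```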

Re: Applying clustering techique

2013-06-13 Thread Ted Dunning
You need to group by user before converting to vector to get sensible clustering. On Wed, Jun 12, 2013 at 1:06 PM, Grant Ingersoll wrote: > The CSVVectorIterator in the Integration package will take in a CSV file > and produce vectors. It assumes that each row is the equivalent of a > DenseVect

Re: Mahout CVB parameters

2013-06-13 Thread Ted Dunning
> > Hi Ted, > > > > My assumption is that there are lot of concepts (keywords/tags for the > > document) usually present in a single document and in 8K documents, you > > might find many unique concepts. We have also done some analysis by > > manually going over

Re: Mahout CVB parameters

2013-06-12 Thread Ted Dunning
Why does document concept require such a large K? On Wed, Jun 12, 2013 at 7:08 PM, Ankur Desai -X (ankurdes - SATYAM COMPUTER SERVICES LIMITED at Cisco) wrote: > Thanks Jake for your response. I am trying to get concepts out of the > documents and for this I want the K to be large around 500

Re: mahout random forest variable importance implementation

2013-06-12 Thread Ted Dunning
A quick thought I had on this was to emit every record twice. Once with a sensible key and once with the hash of that key. Then in the reducer buffer up randomized or straight records as needed. This doesn't resample each variable independently but it should work just about as well. On Wed, Ju

Re: Mahout SSVD is too slow for highly dimensional data

2013-06-12 Thread Ted Dunning
take. it still should be cpu-bound > > though regardless. > > > > A much better trade-off is to have fewer values but more precision in > them > > with a power iteration (-q 1). Power iteration step (ABt) will definitely > > have a hard time to multiply with k=1000 j

Re: Mahout SSVD is too slow for highly dimensional data

2013-06-11 Thread Ted Dunning
Don't do that. Why do you think you need 1000 singular values? Have you tried with k=100, p=15? Quite seriously, I would expect that you would literally get just as good results for almost any real application with 100 singular vectors and 900 orthogonal noise vectors. On Tue, Jun 11, 2013 at 9:

Re: Why are clustering emails not clustering similar stuff?

2013-06-08 Thread Ted Dunning
How are you verifying your vectorization? What do you use for weighting of words? Have you tested the distance between the notifications and other documents? Are closely duplicate documents close to each other? Sent from my iPhone On Jun 6, 2013, at 7:47, Jesvin Jose wrote: > I tried to cl

Re: Processing 50 millions of file for LDA

2013-06-04 Thread Ted Dunning
Nishant, It is hard to advise on detailed trade-offs for your case but I am pretty sure that there are other options than S3 which is, as you say, very slow in terms of latency due to transferring lots of small objects. One alternative, for instance, would be to use a long-lived MapR cluster to s

Re: FP Growth

2013-06-02 Thread Ted Dunning
It is also plausible to create a GitHub repo for these "removed" pieces. Or use Apache extras. Or simply cut them off. If we do go with Stevo's suggestion, I would want to have a large print README pointing out that it is unlikely that any of this code would work on the current version of Mahout.

Re: Blending initial recommendations for cross recommendation

2013-05-31 Thread Ted Dunning
ommendations on a different scale. So how do you combine them? > > > On Fri, May 31, 2013 at 3:07 PM, Dominik Hübner >wrote: > > > Hey, > > I have implemented a cross recommender based on the approach Ted Dunning > > proposed (cannot find the original post, but here i

Re: Multiple document collections using SparseVectorsFromSequenceFiles

2013-05-31 Thread Ted Dunning
n Conwell > To: user > Sent: Wednesday, May 29, 2013 11:56 AM > Subject: Re: Multiple document collections using > SparseVectorsFromSequenceFiles > > > Cool. I'll start setting this up. > > > On Wed, May 29, 2013 at 6:22 AM, Ted Dunning > wrote: > > >

Re: bottom up clustering

2013-05-30 Thread Ted Dunning
Rajesh The streaming k-means implementation is very much like what you are asking for. The first pass is to cluster into many, many clusters and then cluster those clusters. Sent from my iPhone On May 30, 2013, at 11:20, Rajesh Nikam wrote: > Hello Suneel, > > I got it. Next step to cano

Re: FeatureVectorEncoder Framework Signatures

2013-05-29 Thread Ted Dunning
Offset and length is a grand idea. You can already restrict what part of the vector the results go into. Just pass in v.viewpart() instead of v. (I am on a phone. The real method is named something like that) Sent from my iPhone On May 28, 2013, at 21:52, Johannes Schulte wrote: > Hi, > >

Re: Multiple document collections using SparseVectorsFromSequenceFiles

2013-05-29 Thread Ted Dunning
On Tue, May 28, 2013 at 12:55 PM, John Conwell wrote: > So whats the first steps? > Write up what you need and attach the resulting document to a JIRA. You might consider creating a google document that allows others to comment in order to build consensus about the requirements.

Re: Handling unbalanced datasets in Mahout text classsification

2013-05-27 Thread Ted Dunning
Complementary Naive Bayes uses the negative signals to help in cases like this. See Rennie's papers. http://qwone.com/~jason/papers/sm-thesis.pdf http://machinelearning.wustl.edu/mlpapers/paper_files/icml2003_RennieSTK03.pdf On Sun, May 26, 2013 at 10:22 PM, Chandra Mohan, Ananda Vel Murugan <

Re: Combining Userbased and the item based

2013-05-27 Thread Ted Dunning
Yes. It is feasible to combine different forms of recommendation into a single recommendation. On Sun, May 26, 2013 at 10:57 PM, Neetha wrote: > User based recommendations can favor a combined recommendation with > item-item based on the preference value of each item for a user . is this a >

Re: Multiple document collections using SparseVectorsFromSequenceFiles

2013-05-24 Thread Ted Dunning
SVFSF really is designed for a one-shot sort of processing. The issues arise with all of the corpus frequency cutoffs and such. N-gram detection, frequency cutoffs and so on are all going to be problems with piecewise conversion. If all you use it for is tokenizing, then there isn't a problem.

Re: Interpreting Cluster Dump Metrics

2013-05-24 Thread Ted Dunning
ing the average > distance from the closest centroid for these held-out docs? Averaging > together the ones that are closest to the same centroid, then averaging the > averages for all clusters? > > I don't think I've heard of this before. Seems interesting is there a >

Re: convert input for SVD

2013-05-24 Thread Ted Dunning
Or a newer version. Sent from my iPhone On May 24, 2013, at 6:42, Dmitriy Lyubimov wrote: > documentation was > written. I guess you may obtain an earlier snapshot as interim solution if > it is indeed the case.

Re: Hidden Markov Models and time series - 2 questions

2013-05-23 Thread Ted Dunning
On Wed, May 22, 2013 at 5:20 PM, yikes aroni wrote: > thanks for the reply ... I've discretized the continuous time series > observations and assigned them to symbols. Ahh... excellent. But see below as well. > The number of hidden states is > 2: "out of control" and "not out of control -- 0

Re: Which database should I use with Mahout

2013-05-23 Thread Ted Dunning
so far! > > > On Thu, May 23, 2013 at 2:14 AM, Ted Dunning > wrote: > > > Yes what you are describing with diversification is something that I have > > called anti-flood. It comes from the fact that we really are optimizing a > > portfolio of recommendations

Re: SSVD outputs different things vs R version of SVD

2013-05-23 Thread Ted Dunning
The SVD of a matrix is not unique. You can change the sign and rearrange the singular values at will. Customary practice is to order by the square of the singular value, but that doesn't make the SVD unique. Regarding the number of singular values, R's svd routine computes all of the singular va
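
The sign ambiguity can be checked with a small pure-Python sketch: flipping the sign of the same column in both U and V leaves the product U * diag(s) * V^T unchanged, so two correct SVD routines can legitimately report different signs. The helper names and the 2x2 example are assumptions for illustration.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def reconstruct(u, s, v):
    """Rebuild A = U * diag(s) * V^T from its factors."""
    us = [[u[i][j] * s[j] for j in range(len(s))] for i in range(len(u))]
    return matmul(us, transpose(v))


def flip_sign(m, col):
    """Negate one column of a matrix."""
    return [[-x if j == col else x for j, x in enumerate(row)] for row in m]


if __name__ == "__main__":
    u = [[1.0, 0.0], [0.0, 1.0]]
    s = [3.0, 2.0]
    v = [[0.0, 1.0], [1.0, 0.0]]
    a1 = reconstruct(u, s, v)
    a2 = reconstruct(flip_sign(u, 0), s, flip_sign(v, 0))
    print(a1 == a2)
```

When comparing SSVD output with another implementation, compare reconstructions or absolute values of the vectors, not raw signs.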

Re: convert input for SVD

2013-05-23 Thread Ted Dunning
I have tried check ssvd, I have used arff.vector to covert arff/csv > to > > >>> vector file which is then give as input to ssvd and them dumped U, V > > and > > >>> sigma using vectordump. > > >>> > > >>> I see most of the values dumped are near to 0

Re: Implementing the General Bayesian Inference ML Algorithm

2013-05-23 Thread Ted Dunning
It isn't clear what you want to do. You say general Bayesian inference, but then you seem to refer to a very specific, non-general form of inference. It also seems that you are never considering distributions at all, but merely doing something like Laplace correction[1] to compute the probably th

Re: Which database should I use with Mahout

2013-05-22 Thread Ted Dunning
" > from my googling, but that is way above what I am capable of doing. I > just recognized that both Amazon and Netflix present multiple > recommendation lists grouped by category, so in a way it's similar to > search engine result clustering. > > > > > >

Re: Hidden Markov Models and time series - 2 questions

2013-05-22 Thread Ted Dunning
HMM's could be useful, but you have to define things a bit differently. First of all, HMM's want symbolic inputs and want to give you symbolic outputs. You don't get to see the internal state. My first approach would be to use k-means clustering on short sequences of your observed continuous var
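
One way to turn a continuous series into symbols for an HMM, along the lines described above, is to cut it into short overlapping windows and label each window with its nearest cluster centroid. The sketch below assumes the centroids have already been fitted by k-means elsewhere; all names are hypothetical.

```python
def sliding_windows(series, width):
    """All contiguous windows of the given width."""
    return [series[i:i + width] for i in range(len(series) - width + 1)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_symbol(window, centroids):
    """Index of the closest centroid; that index serves as the symbol."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(window, centroids[i]))


def symbolize(series, centroids):
    """Map a continuous series to a sequence of discrete symbols."""
    width = len(centroids[0])
    return [nearest_symbol(w, centroids) for w in sliding_windows(series, width)]


if __name__ == "__main__":
    centroids = [[0.0, 0.0], [5.0, 5.0]]  # assumed output of a k-means run
    print(symbolize([0.0, 0.0, 1.0, 5.0, 5.0, 5.0], centroids))
```

The resulting symbol sequence is what would be fed to the HMM as observations.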

Re: Which database should I use with Mahout

2013-05-21 Thread Ted Dunning
13 5 4 7 6 8 15 12 11 9 10 36 14 24 70 19 16 [1] 1 2 6 3 5 4 11 22 7 9 250 8 10 15 12 17 13 40 16 14 > Have a nice day > > > > > On Wed, May 22, 2013 at 2:30 AM, Ted Dunning > wrote: > > > I have so far just used the weights that So

Re: Interpreting Cluster Dump Metrics

2013-05-21 Thread Ted Dunning
On Tue, May 21, 2013 at 8:47 PM, Pat Ferrel wrote: > For this sample it looks like about 20-40 clusters is "best"? Looking at > the results for k=40 by eyeball they do seem pretty good. It is really hard to tell with these numbers. In spite of their heritage, these scaled average distances are

Re: Which database should I use with Mahout

2013-05-21 Thread Ted Dunning
Inline On Tue, May 21, 2013 at 8:59 AM, Pat Ferrel wrote: > In the interest of getting some empirical data out about various > architectures: > > On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel wrote: > > >> ... > >> You use the user history vector as a query? > > > > The most recent suffix of the

Re: Which database should I use with Mahout

2013-05-21 Thread Ted Dunning
I have so far just used the weights that Solr applies natively. In my experience, what makes a recommendation engine work better is, in order of importance, a) dithering so that you gather wider data b) using multiple sources of input c) returning results quickly and reliably d) the actual alg

Re: mahout colt collections

2013-05-21 Thread Ted Dunning
Dan, I think that she did do the attachment and it got filtered away. Sophie, One easy thing to do is to file a JIRA report using https://issues.apache.org/jira/browse/MAHOUT Then you can attach your program to that bug report. Alternatively, you can attach the program to some other service.

Re: convert input for SVD

2013-05-20 Thread Ted Dunning
Are you using Lanczos instead of SSVD for a reason? On Mon, May 20, 2013 at 4:13 AM, Rajesh Nikam wrote: > Hello, > > I have arff / csv file containing input data that I want to pass to svd : > Lanczos Singular Value Decomposition. > > Which tool to use to convert it to required format ? > >

Re: Which database should I use with Mahout

2013-05-20 Thread Ted Dunning
> 2. Precalculation per user is heavyweight, stale and hard to do if the > context also plays a role (the site the user is on, e.g., because you have to have > the cartesian product of recommendations prepared for every user), while > "real time" approach can handle it > >

Re: Which database should I use with Mahout

2013-05-20 Thread Ted Dunning
Inline answers. On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel wrote: > ... > You use the user history vector as a query? The most recent suffix of the history vector. How much is used varies by the purpose. > This will be a list of item IDs and strength-of-preference values (maybe > 1s for p

Re: Which database should I use with Mahout

2013-05-19 Thread Ted Dunning
rowser persistent storage. Another is to use a memory based user profile database. Yet another is to use M7 tables on MapR or HBase on other Hadoop distributions. > On May 19, 2013, at 8:09 PM, Ted Dunning wrote: > > On Sun, May 19, 2013 at 8:04 PM, Pat Ferrel wrote: > > > Two

Re: Which database should I use with Mahout

2013-05-19 Thread Ted Dunning
On Sun, May 19, 2013 at 8:04 PM, Pat Ferrel wrote: > Two basic solutions to this are: factorize (reduces 100s of thousands of > items to hundreds of 'features') and continue to calculate recs at runtime, > which you have to do with Myrrix since mahout does not have an in-memory > ALS impl, or mov

Re: Which database should I use with Mahout

2013-05-19 Thread Ted Dunning
On Sun, May 19, 2013 at 6:26 PM, Pat Ferrel wrote: > Using a Hadoop version of a Mahout recommender will create some number of > recs for all users as its output. Sean is talking about Myrrix I think > which uses factorization to get much smaller models and so can calculate > the recs at runtime

Re: mahout colt collections

2013-05-19 Thread Ted Dunning
Sophie, Can you say a bit more about what you want to do? On Sun, May 19, 2013 at 2:22 PM, Sophie Sperner wrote: > Dear, > > I'm experiencing difficulties with > hppclibrary that I'm using. My > algorithms work perfectly fine for small inputs, > but when

Re: More Cross-recommender thoughts

2013-05-17 Thread Ted Dunning
Anonymizing the id's is a good start, especially if you have a relatively small subset of the entire social graph and if the graph is publicly visible in any case. If you have a complete crawl of the graph, then many id's will be recoverable by reference back to the public version of the graph. Sinc

Re: Problems with KMeans Clustering - Radius calculation returns incorrect ZERO value in some cases.

2013-05-15 Thread Ted Dunning
Also, if only a single point is assigned to the cluster, the radius of the cluster is, by definition, zero. That isn't a bug. On Wed, May 15, 2013 at 10:15 AM, Jeff Eastman < jeast...@windwardsolutions.com> wrote: > What you have observed is correct. During the final iteration, points are > ob
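
A small sketch of why a one-point cluster has zero radius. Here the radius is taken to be the root-mean-square distance of the points from their centroid; that definition is an assumption for illustration and may differ in detail from what Mahout reports.

```python
import math


def cluster_radius(points):
    """RMS distance of the points from their centroid (assumed definition)."""
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    mean_sq = sum(
        sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in points
    ) / n
    return math.sqrt(mean_sq)


if __name__ == "__main__":
    print(cluster_radius([[3.0, 4.0]]))              # single point
    print(cluster_radius([[0.0, 0.0], [2.0, 0.0]]))  # two points
```

With one point, the centroid is the point itself, so every deviation is zero.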

Re: Removing Training Order Dependency on SGD Classification

2013-05-14 Thread Ted Dunning
On Tue, May 14, 2013 at 12:06 AM, Dan Filimon wrote: > Nice! Thanks for the links Ted! > yeah... I can google! > > > > SGD converges with 1/n [1]. Second order techniques converge > quadratically > > in the number of iterations but each iteration can be quite expensive in > > terms of the numb

Re: Removing Training Order Dependency on SGD Classification

2013-05-13 Thread Ted Dunning
On Mon, May 13, 2013 at 10:48 PM, Dan Filimon wrote: > SGD and batch gradient descent have the same expected errors however. But not the same rate of convergence. SGD converges with 1/n [1]. Second order techniques converge quadratically in the number of iterations but each iteration can be qu

Re: Removing Training Order Dependency on SGD Classification

2013-05-13 Thread Ted Dunning
On Mon, May 13, 2013 at 9:24 PM, Tom Marthaler wrote: > Is there a way to make the feature weights deterministic, no matter the > order of the input training vectors? > Well, not if you use *STOCHASTIC* gradient descent. That is part of the point, actually. But seriously, this is likely to be

Re: Statistical machine learning with Gaussian distributions

2013-05-11 Thread Ted Dunning
On Sat, May 11, 2013 at 9:43 AM, Matthew McClain wrote: > This constraint can be > removed by characterizing each cluster by the mean and covariance of its > samples, and using maximum likelihood in place of the distance measurement > for assigning clusters to samples. > Just a note that ordinary

Re: Statistical machine learning with Gaussian distributions

2013-05-10 Thread Ted Dunning
K-means uses Gaussian errors. The dirichlet clustering can be configured to use Gaussian errors. SVD uses Gaussian errors. QR decomposition can be used to solve problems with Gaussian errors. I think I don't understand what you are asking about. On Fri, May 10, 2013 at 1:10 PM, Matthew McClai

Re: Clustering product views and sales

2013-05-06 Thread Ted Dunning
> necessary or if further information can be obtained. > > > On May 6, 2013, at 11:21 PM, Ted Dunning wrote: > > > Are you looking to build a product recommender based on your own design? > > Or do you want to build one based on existing methods? > > > > If you want t

Re: Clustering product views and sales

2013-05-06 Thread Ted Dunning
On Mon, May 6, 2013 at 12:50 PM, Koobas wrote: > Since Dominik mentioned item-based and ALS, let me throw in a question > here. > I believe that one of the Netflix Prize solutions combined KNN and ALS. > > 1) What is the best way to combine the results of both? > I think that combinations are im

Re: Clustering product views and sales

2013-05-06 Thread Ted Dunning
h will work in > some way. > > On May 6, 2013, at 9:09 PM, Ted Dunning wrote: > > > I don't even think that clustering is all that necessary. > > > > The reduced cooccurrence matrix will give you items related to each item. > > > > You can use somet

Re: Clustering product views and sales

2013-05-06 Thread Ted Dunning
ues/vectors? > > On May 6, 2013, at 8:52 PM, Ted Dunning wrote: > > > On Mon, May 6, 2013 at 11:29 AM, Dominik Hübner >wrote: > > > >> Oh, and I forgot how the views and sales are used to build product > >> vectors. As of now, I implemented binary vectors, vec

Re: Clustering product views and sales

2013-05-06 Thread Ted Dunning
On Mon, May 6, 2013 at 11:29 AM, Dominik Hübner wrote: > Oh, and I forgot how the views and sales are used to build product > vectors. As of now, I implemented binary vectors, vectors counting the > number of views and sales (e.g 1view=1count, 1sale=10counts) and ordinary > vectors ( view => 1, sa

Re: Clustering Categorical Data

2013-05-06 Thread Ted Dunning
and 400 categories. > > > On Mon, May 6, 2013 at 9:17 PM, Ted Dunning wrote: > > > It really depends on your data, but anything that works on text has at > > least a potential for working on categorical data. > > > > It is common to use a 1-of-n encoding fo

Re: Clustering Categorical Data

2013-05-06 Thread Ted Dunning
It really depends on your data, but anything that works on text has at least a potential for working on categorical data. It is common to use a 1-of-n encoding for categorical data and then simply use Euclidean distance with something like k-means. Can you say something about how many variables a
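
The 1-of-n encoding mentioned above can be sketched in a few lines of Python. Each distinct (column, value) pair gets its own vector position, after which ordinary Euclidean distance applies. The helper names and the sample rows are hypothetical.

```python
import math


def one_of_n(rows):
    """Encode rows of categorical values as 0/1 vectors."""
    index = {}
    for row in rows:
        for col, val in enumerate(row):
            # First time a (column, value) pair is seen, give it a slot.
            index.setdefault((col, val), len(index))
    vectors = []
    for row in rows:
        vec = [0.0] * len(index)
        for col, val in enumerate(row):
            vec[index[(col, val)]] = 1.0
        vectors.append(vec)
    return vectors


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    rows = [("red", "small"), ("red", "large"), ("blue", "large")]
    vecs = one_of_n(rows)
    print(euclidean(vecs[0], vecs[1]), euclidean(vecs[0], vecs[2]))
```

Rows that differ in one category end up closer together than rows that differ in two, which is the property k-means relies on here.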

Re: parallelALS and RMSE TEST

2013-05-03 Thread Ted Dunning
Without more information it is impossible to comment. What experiments? On Fri, May 3, 2013 at 8:45 AM, William wrote: > I'm trying to get some recommendations with three Algorithms: > 1.parallelALS > 2.evaluateFactorization > 3.recommendfactorized > > In my experiments, RMSE value monotonic

Re: Problem in PhD Topic

2013-05-03 Thread Ted Dunning
And your PhD advisor should be able to help. On Fri, May 3, 2013 at 9:38 AM, Sebastian Schelter wrote: > Hi Saeed, > > I don't think that this list can help you very much with finding a PhD > topic. I suggest you identify the top conferences in your field of > interest and have a look at the p

Re: confirm subscribe to user@mahout.apache.org

2013-05-02 Thread Ted Dunning
n Thu, May 2, 2013 at 10:20 AM, Niraj Shrestha wrote: > >> Dear Ted >> Java is 1.5.0_22. Is it error with Java version? >> >> Regards, >> Shrestha >> >> >> On Tue, Apr 30, 2013 at 8:13 PM, Ted Dunning wrote: >> >>> On Tue, Apr

Re: mahout lucene.vector from multiple solrcloud "index" directories for kmeans

2013-05-01 Thread Ted Dunning
Well done and well described. SolrCloud is a bit new but the need you expressed is a real one that will appear again. Sent from my iPhone On May 1, 2013, at 15:47, Sebastián Ramírez wrote: > Well, I found a simple (maybe dirty) solution for my problem. > > I write it here just for the

Re: Time Based Recommender System

2013-05-01 Thread Ted Dunning
s one learns from Bayesian Bandits would be very helpful in this respect. > On Tue, Apr 30, 2013 at 11:46 AM, Ted Dunning > wrote: > > > Keep in mind that time dynamics generally have benefit for predicting > > ratings. The point is that the average rating for a person goes u

Re: confirm subscribe to user@mahout.apache.org

2013-04-30 Thread Ted Dunning
On Tue, Apr 30, 2013 at 8:55 AM, Niraj Shrestha wrote: > Bad version number in .class file > Which version of Java are you using?

Re: Time Based Recommender System

2013-04-30 Thread Ted Dunning
Keep in mind that time dynamics generally have benefit for predicting ratings. The point is that the average rating for a person goes up and down over time even if their general taste doesn't change. Likewise for an item. If you use implicit feedback and recommend based on recent behavior most o

Re: Mahout lucene UTFDataFormatException: encoded string too long:

2013-04-24 Thread Ted Dunning
This sounds pretty fishy. What this is saying is that you have a document in your index whose name is longer than 65,535 characters. That doesn't sound very plausible. Don't you have a more appropriate ID column? The problem starts where you say "--idField text". Pick a better field. On Wed

Re: Feature reduction for LibLinear weights

2013-04-24 Thread Ted Dunning
Glad to be able to help. Double hashing would probably allow you to preserve full accuracy at higher compression, but if you are happy, then you might as well be done. On Wed, Apr 24, 2013 at 1:56 PM, Ken Krugler wrote: > Hi Ted, > > On Apr 13, 2013, at 8:46pm, Ted Dunning wrote: >

Re: Is Feature Hashing appropriate for document to document similarity calculations?

2013-04-24 Thread Ted Dunning
Hashed feature vectors are an excellent choice for the unknown vocabulary problem. One problem you will have is that the static weighting won't by default weight rare words more highly than common words. One way to deal with this is to build a dictionary on a small subset of documents and assume

Re: Need help lanczos algorithm

2013-04-24 Thread Ted Dunning
On Wed, Apr 24, 2013 at 2:41 AM, Yogalakshmi J wrote: > 2. I want to make use of the Lanczos algorithm that is available. I want to > create a recommender, that will factorize the matrix using Lanczos > algorithm. > Lanczos is deprecated. Look for the SSVD implementation. Or for recommendation,

Re: DenseRowMatrix?

2013-04-18 Thread Ted Dunning
See MAHOUT-1193. Feel free to grab that code, torture it, and tell us if it is any good. If it is useful, we can commit it. On Thu, Apr 18, 2013 at 3:34 PM, Ted Dunning wrote: > I hadn't planned to since I don't have cycles to properly test it. > > Let me see if I can di

Re: DenseRowMatrix?

2013-04-18 Thread Ted Dunning
> > > On Thu, Apr 18, 2013 at 5:18 PM, Ted Dunning > wrote: > > > Actually, I built a BlockSparseMatrix some time ago that does this. > > > > It was just like our current row sparse matrix except that it allowed for > > blocks instead of rows. > > > >

Re: DenseRowMatrix?

2013-04-18 Thread Ted Dunning
Actually, I built a BlockSparseMatrix some time ago that does this. It was just like our current row sparse matrix except that it allowed for blocks instead of rows. It did assume sparse matrices where the data existed. Worked a charm on the app in question. It doesn't appear that I committed i
