windowSize) . Would you mind telling more about
that? Thanks!
On Sat, Dec 15, 2012 at 2:44 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
I would recommend testing with OnlineLogisticRegression first.
The AdaptiveLogisticRegression has a tendency to freeze on sub-optimal
parameter
On Thu, Dec 13, 2012 at 2:29 PM, Brandon Root brandonr...@gmail.com wrote:
This is a question regarding the new KNN library that Ted Dunning and Dan
Filimon are working on (as I understand it'll be in Mahout 0.8) so I hope
this is the appropriate list for this question instead of mahout-dev
What Dan says here is correct. The lack of dependence on k in the current
code is definitely a problem.
The work-around is to set the maxClusters to the point that the log factor
should have grown to. That sucks so we should fix the heuristic sizing
along the lines that Dan says. There should
If your input files are in S3 then the map-reduce steps that mahout spawns
can access them without problems.
In order to run Mahout programs, you will need to install mahout. There
are command line programs in $MAHOUT_HOME/bin that will do what you need.
On Thu, Dec 13, 2012 at 10:58 AM, hellen
You are trying to run this job as a single step in an EMR flow. Mahout's
command line programs assume that you are running against a live cluster
that will hang around (since many mahout steps involve more than one
map-reduce).
It would probably be best to separate the creation of the cluster
From: Ted Dunning ted.dunn...@gmail.com
To: user@mahout.apache.org; hellen maziku nahe...@yahoo.com
Sent: Wednesday, December 12, 2012 9:48 AM
Subject: Re: Creating vectors from lucene index on EMR via the CLI
You are trying to run this job as a single step in an EMR
-mapreduce --create --alive --log-uri
s3n://mahout-output/logs/ --name dict_vectorize
doesn't that mean that the keep alive is set?
From: Ted Dunning ted.dunn...@gmail.com
To: user@mahout.apache.org; hellen maziku nahe...@yahoo.com
Sent: Wednesday, December
From: Ted Dunning ted.dunn...@gmail.com
To: user@mahout.apache.org; hellen maziku nahe...@yahoo.com
Sent: Wednesday, December 12, 2012 10:56 AM
Subject: Re: Creating vectors from lucene index on EMR via the CLI
I would still recommend that you switch to using the mahout programs
Yep.
On Sun, Dec 9, 2012 at 11:33 PM, Marty Kube
martyk...@beavercreekconsulting.com wrote:
Because it uses Java pointers instead of offsets. The mmap'ed structure
could be mapped into memory at any address and thus must be position
independent.
Okay, I think I get the point here.
in the cluster as a normal file which can then be mapped.
On 12/08/2012 03:43 AM, Ted Dunning wrote:
There are several approaches that might help:
1) use shared memory via mmap to store the forest. This allows multiple
mapper threads to access the same forest. The current Mahout in-memory
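The offsets-not-pointers point above can be sketched in miniature (a toy layout, not Mahout's actual serialization format — the file name and field layout here are made up):

```python
import mmap
import os
import struct
import tempfile

# Serialize data with a fixed little-endian layout addressed by offsets,
# rather than in-memory pointers, so the mapped bytes are position
# independent and any number of readers can share one copy via mmap.
path = os.path.join(tempfile.mkdtemp(), "forest.bin")
values = (1.5, 2.5, 3.5)

with open(path, "wb") as f:
    f.write(struct.pack("<3d", *values))  # position-independent layout

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    shared = struct.unpack_from("<3d", mm, 0)  # readers address by offset
    mm.close()

assert shared == values
```

Because nothing in the file depends on where it gets mapped, every mapper thread (or process) can map the same bytes read-only without copying.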
Yeah... right now you have the full cross product, but one side only has
one element so the product is trivial.
It isn't that much worse if that side has a few elements.
On Sat, Dec 8, 2012 at 9:49 PM, Marty Kube
martyk...@beavercreekconsulting.com wrote:
#2 Might be a nice general approach.
There isn't a clever way to find the medoid in Mahout.
Finding the n nearest elements can be done using a Searcher. The Brute
implementation should suffice.
On Thu, Dec 6, 2012 at 10:16 AM, Stefan Kreuzer stefankreuze...@aol.de wrote:
Hello,
when inspecting a cluster of sparse vectors, what
the link [1].
[1] https://github.com/dfilimon/knn/wiki/skm-visualization
On Thu, Dec 6, 2012 at 2:01 AM, Ted Dunning ted.dunn...@gmail.com wrote:
Still not that odd if several clusters are getting squashed. This can
happen if the threshold increases too high or if the searcher is unable
Deprecating is a nice first step to let people know where things are headed.
On Thu, Dec 6, 2012 at 4:21 PM, Sebastian Schelter s...@apache.org wrote:
The other three recommenders seem to be used almost never, so I'd like
to remove them, however I wouldn't have a problem with keeping them for
Try the cascaded model. Train the downstream model on data without the
don't-care docs or train it on documents that actually get through the
upstream model.
On Wed, Dec 5, 2012 at 4:50 PM, Raman Srinivasan raman.sriniva...@gmail.com
wrote:
I can exclude the don't care cases from the training
How many clusters are you talking about?
If you pick a modest number then streaming k-means should work well if it
has several times more surrogate points than there are clusters.
Also, typically a hyper-cube test works with very small cluster radius.
Try 0.1 or 0.01. Otherwise, your clusters
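A hyper-cube test of the kind described above can be generated like this (names and parameters are my own, not from the thread): one cluster per vertex of a d-dimensional unit cube, with a small radius such as 0.01 so the clusters stay well separated.

```python
import itertools
import random

def hypercube_sample(d=3, radius=0.01, points_per_vertex=5, seed=42):
    """Gaussian blobs of the given radius centered on each cube vertex."""
    rng = random.Random(seed)
    data = []
    for vertex in itertools.product([0.0, 1.0], repeat=d):
        for _ in range(points_per_vertex):
            data.append([x + rng.gauss(0.0, radius) for x in vertex])
    return data

data = hypercube_sample()
assert len(data) == 2 ** 3 * 5  # 8 vertices, 5 points each
```

With radius 0.1 or larger the blobs on adjacent vertices start to overlap, which is exactly the failure mode being warned about.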
even before I can
sub-class them. What's usually a good approach when less than 5% of the
data is meaningful?
On Wed, Dec 5, 2012 at 10:26 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
Try the cascaded model. Train the downstream model on data without the
don't-care docs or train
/d224eb7ca7bd6870eaef2e355012cac3aa59f051/src/test/java/org/apache/mahout/knn/cluster/StreamingKMeansTest.java#L104
[3] https://github.com/dfilimon/knn/issues/1
On Thu, Dec 6, 2012 at 1:03 AM, Ted Dunning ted.dunn...@gmail.com wrote:
How many clusters are you talking about?
If you pick
Ahh... this may also be a problem.
You should get better results with a Brute searcher here, but a
ProjectionSearcher with lots of projections may work well.
On Thu, Dec 6, 2012 at 12:22 AM, Dan Filimon dangeorge.fili...@gmail.com wrote:
So, yes, it's probably a bug of some kind since I end up
dangeorge.fili...@gmail.com wrote:
But the weight referred to is the distance between a centroid and the
mean of a distribution (a cube vertex).
This should still be very small (also BallKMeans gets it right).
On Thu, Dec 6, 2012 at 1:32 AM, Ted Dunning ted.dunn...@gmail.com wrote:
In order to succeed
On Wed, Dec 5, 2012 at 5:29 PM, Koobas koo...@gmail.com wrote:
...
Now yet another naive question.
Ted is probably going to go ballistic ;)
I hope not.
Assuming that simple overlap methods suck,
is there still a metric that works better than others
(i.e. Tanimoto vs. Jaccard vs
The minhash algorithm itself should work as well with non-English text. It
is likely that the input phases where the text is analyzed would not work
correctly, however.
On Tue, Dec 4, 2012 at 6:05 PM, Varun Thacker varunthacker1...@gmail.com wrote:
I'd tried out the MinHash algorithm in mahout
Bernát
I am guessing from the fact that you have accents in your name that you may
be in Europe.
If so, it is possible that there is a confusion about the decimal point
that Mahout uses and the one that you use. Is it possible that you have
decimal numbers like 3,1 instead of 3.1?
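The comma-vs-point confusion is easy to reproduce: a point-decimal parser rejects "3,1", so European-formatted input has to be normalized before Mahout (or any point-decimal reader) sees it. A tiny illustration:

```python
def parse_decimal(s):
    """Return the float value, or None if the string is not a valid decimal."""
    try:
        return float(s)
    except ValueError:
        return None

assert parse_decimal("3.1") == 3.1
assert parse_decimal("3,1") is None                    # comma decimal fails
assert parse_decimal("3,1".replace(",", ".")) == 3.1   # normalize first
```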
On Tue, Dec
What Kate says is good advice. You can have considerable amounts of bias,
but you may be telling the model something about the relative cost of
errors and that can result in things happening that you don't like.
As you noted, your model could have gotten 95% correct by simply saying
DON'T CARE
Also, you have to separate UI considerations from algorithm considerations.
What algorithm populates the recommendations is the recommender algorithm.
It has two responsibilities... first, find items that the users will like
and second, pick out a variety of less certain items to learn about.
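One common way to honor both responsibilities (this blending scheme is my own illustration, not something the post specifies) is to fill most recommendation slots with high-confidence items and reserve a few for less certain items to learn about:

```python
def blend(confident, exploratory, n=5, n_explore=1):
    """Return n recommendations: mostly high-confidence picks, plus a few
    exploratory ones so the system keeps learning about the user."""
    picks = confident[:n - n_explore] + exploratory[:n_explore]
    return picks[:n]

assert blend(["a", "b", "c", "d", "e"], ["x", "y"]) == ["a", "b", "c", "d", "x"]
```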
On Wed, Dec 5, 2012 at 6:57 AM, Paulo Villegas paulo.vl...@gmail.com wrote:
On 05/12/12 00:53, Ted Dunning wrote:
Also, you have to separate UI considerations from algorithm
considerations.
What algorithm populates the recommendations is the recommender
algorithm.
It has two
On Mon, Dec 3, 2012 at 3:06 AM, Koobas koo...@gmail.com wrote:
Thank you very much.
The pointer to Myrrix is a very useful piece of information.
Myrrix, however, relies on an iterative sparse matrix factorization to do
PCA.
I want to produce Amazon-like recommendations.
I.e., 70% of users
Also, don't make algorithm choices based on small data samples. Bigger
data will change the ordering of which algorithms work well.
On Mon, Dec 3, 2012 at 10:04 PM, Sean Owen sro...@gmail.com wrote:
You may do better with a latent feature approach -- working in lower
dimensional space won't
don't have the same cardinality, so vector1.plus(vector2) does
not work.
Is there a way to resize a given vector? Sorry, I am a complete
Mahout noob.
-Original Message-
From: Ted Dunning ted.dunn...@gmail.com
To: user user@mahout.apache.org
Sent: Thu, 29 Nov 2012 11:45
Robert's analysis is correct.
This would be worthy of a comment at the least.
On Wed, Nov 28, 2012 at 11:53 AM, Lancaster, Robert (Orbitz)
robert.lancas...@orbitz.com wrote:
gradientBase is coming from:
double gradientBase = gradient.get(i);
Prior to that:
Vector gradient =
+1
On Wed, Nov 28, 2012 at 12:56 PM, Jake Mannix jake.man...@gmail.com wrote:
or maybe call the variable negativeGradient, instead?
and, in
that case, bash would be too slow, wouldn't it?
On Tue, Nov 27, 2012 at 12:54 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
How many data points are you clustering? How many dimensions?
On Mon, Nov 26, 2012 at 2:33 PM, Eduard Gamonal eduard.gamo...@gmail.com wrote:
Hi,
I'm doing a MSc at Northeastern and I'm working on analyzing some US
election polls with kmeans.
I'm a beginner with both Mahout and Hadoop. I've
That implementation is deprecated. The SSVD implementation should be used
instead.
On Thu, Nov 22, 2012 at 9:58 AM, Abramov Pavel p.abra...@rambler-co.ru wrote:
Hi,
Here is step by step manual for Lanczos implementation:
https://cwiki.apache.org/MAHOUT/dimensional-reduction.html
Pavel
...@gmail.com wrote:
That's kind of what it does now... though it weights everything as 1. Not
so smart, but for sparse-ish data it is not far off from a smarter answer.
On Thu, Nov 15, 2012 at 6:47 PM, Ted Dunning ted.dunn...@gmail.com wrote:
My own preference (pun intended) is to use log
Why do you have maven.glassfish.org in your repo path?
On Fri, Nov 9, 2012 at 7:17 PM, Lance Norskog goks...@gmail.com wrote:
I'm getting this from the current git checkout. There are 301
(redirections) but there is nothing at the target either.
Downloading:
If you want k-means speed see the new k-means code:
https://github.com/tdunning/knn
Can you describe your data a bit?
On Sat, Nov 10, 2012 at 11:22 AM, pricila rr pricila...@gmail.com wrote:
I am running kmeans algorithm.
Increasing the number of tasktrackers and datanodes, increase the
There is additional confusion typically because supervised and unsupervised
methods are commonly used together. For instance, clustering
(unsupervised) can be used to generate cluster proximity features that are
then used as features for classification (supervised).
Another example might be
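The cluster-proximity idea above can be sketched very simply (the centroids here are made up for illustration): distances to centroids found by an unsupervised step become extra features for a supervised classifier.

```python
import math

# Hypothetical centroids produced by an earlier clustering (unsupervised) run.
centroids = [[0.0, 0.0], [5.0, 5.0]]

def proximity_features(point):
    """Distance to each centroid, usable as input features for a classifier."""
    return [math.dist(point, c) for c in centroids]

features = proximity_features([1.0, 1.0])
assert abs(features[0] - math.sqrt(2)) < 1e-12
assert abs(features[1] - math.sqrt(32)) < 1e-12
```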
On Mon, Nov 5, 2012 at 9:16 PM, Johannes Schulte johannes.schu...@gmail.com
wrote:
is it possible you are mixing up payloads and stored fields? The latter
ones are not indexed and can only be used for the top n results. Maybe
we're talking about different things..
I think I did mix these
perform best. Maybe
this
blog article by netflix is a good start
http://techblog.netflix.com/2012/06/netflix-recommendations-beyond-5-stars.html
Cheers,
Johannes
On Fri, Nov 2, 2012 at 6:21 AM, Ted Dunning ted.dunn...@gmail.com
wrote:
Speaking with no principles in hand
On Mon, Nov 5, 2012 at 12:06 PM, Johannes Schulte
johannes.schu...@gmail.com wrote:
do you really mean payloads? Because i consider them part of the index as
they are stored per position and can be accessed during scoring.
I had the impression that they were not indexed. They are
Your mileage will vary.
It is often helpful to classify small parts of large articles and then
somehow deal with these multiple classifications at the full document level.
Sometimes it is not helpful, especially if the small parts get too small.
Try it both ways. My tendency is to prefer to
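The chunk-then-combine approach can be sketched as below. The chunk classifier is a made-up stand-in (not Mahout's API); the point is the document-level aggregation, here a simple majority vote over chunk labels.

```python
from collections import Counter

def classify_chunk(chunk):
    """Toy stand-in for a real per-chunk classifier."""
    return "sports" if "game" in chunk else "other"

def classify_document(text, chunk_size=5):
    """Classify small parts of a large article, then majority-vote the labels."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    votes = Counter(classify_chunk(c) for c in chunks)
    return votes.most_common(1)[0][0]

doc = "the game was close " * 4 + "one unrelated sentence"
assert classify_document(doc) == "sports"
```

As the email warns, if chunk_size gets too small each chunk carries too little signal for the per-chunk classifier to work at all.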
for your reply.
Thanks
Rajesh
On Wed, Oct 31, 2012 at 11:00 AM, Rajesh Nikam rajeshni...@gmail.com
wrote:
Hi Ted,
Thanks for reply. I will wait for JIRA and hope to get rid of any
encoding
issue.
Thanks,
Rajesh
On Oct 31, 2012 5:24 AM, Ted Dunning
wrong.
You are right. It does make things harder. It can also make them better.
On Thu, Nov 1, 2012 at 9:39 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Your mileage will vary.
It is often helpful to classify small parts of large articles and then
somehow deal with these multiple
as class_1.
AUC = 0.50
confusion: [[26563.0, 23006.0], [0.0, 0.0]]
entropy: [[-0.0, -0.0], [-46.1, -21.4]]
I am not sure why this is failing all the time.
Looking forward to your reply.
Thanks
Rajesh
On Tue, Oct 16, 2012 at 3:57 AM, Ted Dunning ted.dunn
It is a nice writeup, but the Mahout comparison was a bit of a strawman. I
wish I could go to their talk, but I was in office hours right then.
Coincidentally, I was advising somebody that an excellent way to deploy a
recommendation system is on top of Solr. As the regulars here will know, I
If we have descended to personal advertising, then I should mention that I
am speaking as well.
http://strataconf.com/stratany2012/public/schedule/speaker/126559
I will also have office hours afterwards during which the topic is
unlimited. Drop by!
On Sun, Oct 21, 2012 at 11:20 AM, Josh
Sounds good!
On Sun, Oct 21, 2012 at 12:59 PM, Grant Ingersoll gsing...@apache.org wrote:
I'll be at Strata, too, but not speaking... sounds like we have the
makings of an informal Mahout gathering?
On Oct 21, 2012, at 2:42 PM, Ted Dunning wrote:
If we have descended to personal
, 2012 at 8:08 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
Sgd is more suitable for large data. I will take a look later today.
Sent from my iPhone
On Oct 9, 2012, at 11:29 PM, Rajesh Nikam rajeshni...@gmail.com wrote:
Hi Ted,
Putting specific question with data for getting
You have to tokenize your text and then use some form of vector encoding.
If you have a known dictionary of all interesting words, you can simply
make a vector as long as the number of words in your dictionary and put a 1
in the right place.
If you don't want to do that either because you don't
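The dictionary approach can be sketched as below. The dictionary is hypothetical; in practice you would build it from your tokenized corpus. Each document becomes a vector as long as the dictionary, with counts in the right places.

```python
# Hypothetical dictionary of all interesting words.
dictionary = ["mahout", "cluster", "vector", "hadoop"]
index = {word: i for i, word in enumerate(dictionary)}

def encode(tokens):
    """Bag-of-words encoding: a count at each known word's position."""
    v = [0.0] * len(dictionary)
    for t in tokens:
        i = index.get(t.lower())
        if i is not None:        # words outside the dictionary are dropped
            v[i] += 1.0
    return v

assert encode(["Mahout", "cluster", "cluster", "unseen"]) == [1.0, 2.0, 0.0, 0.0]
```

For a binary (presence/absence) encoding, set the slot to 1.0 instead of incrementing it.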
, Ted Dunning ted.dunn...@gmail.com
wrote:
My first thought was that we needed several passes, but that is clearly
wrong.
I think that the problem is in the data formatting and conversion
somehow.
Haven't had time to dope this out just yet. The iris data should
converge
trivially
Sgd is more suitable for large data. I will take a look later today.
Sent from my iPhone
On Oct 9, 2012, at 11:29 PM, Rajesh Nikam rajeshni...@gmail.com wrote:
Hi Ted,
Putting specific question with data for getting problem with SGD.
I am using Iris Plants Database from Michael
This might work, but the messages indicate that the environment is
seriously messed up. Just getting the code isn't going to help. The tests
are indicating that there is a real problem (and it isn't likely Mahout).
That problem needs fixing and once fixed running the tests isn't a bad
thing.
See this page: http://leon.bottou.org/research/stochastic
Google is your friend.
This API is, however, not particularly friendly. Therefore, you will have
to read about the basics and be able to figure these things out from first
principles. There is some documentation in the code. You can
Other experiments have shown that 60-80% of perception of music likes is
due to social factors.
Factoring this out may or may not be a good thing. My feeling is that if
you are trying to make people happy with what you recommend then you need
to go with whatever makes them happy whether it is
Combiners can be called zero or more times. That can happen on the map
side or on the reduce side.
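The "zero or more times" rule is why a combiner must share the reducer's associative, commutative logic. A toy demonstration with a sum (my own sketch, not Hadoop API code): the final result is identical no matter how often the combiner runs.

```python
def combine(values):
    """Toy combiner: pre-aggregate a list of counts into a partial sum."""
    return [sum(values)]

map_output = [3, 1, 4, 1, 5]

ran_zero_times = sum(map_output)                    # combiner never invoked
ran_once = sum(combine(map_output))                 # invoked once (map side)
ran_twice = sum(combine(combine(map_output)))       # invoked again (reduce side)
assert ran_zero_times == ran_once == ran_twice == 14
```

A combiner that, say, averaged its inputs would give different answers depending on how many times it ran, which is exactly the bug this rule guards against.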
On Thu, Sep 27, 2012 at 4:56 AM, Sigurd Spieckermann
sigurd.spieckerm...@gmail.com wrote:
@Jake: Could you please elaborate on how exactly the combiner can be called
before the reducer gets the
I think that there is an excessive stability issue, actually.
What seems to happen is that the adaptive part locks down the learning rate
too quickly.
This is related to several other issues:
- the cross fold learning paradigm is kind of dangerous since it depends on
the user not having
On the other hand, the only way that I have been able to do a major version
upgrade of Hadoop is to start a new company.
It is really hard to change code and platform at the same time. If you
don't have enough hardware to have two clusters temporarily, things will be
really hard moving off of
This changes the initial learning rate. Changing this can definitely
change convergence properties.
On Fri, Sep 21, 2012 at 9:33 AM, Watson Watson watso...@gmail.com wrote:
Hi,
My question is why changing the rate parameter we always change the
coefficients (results of RunLogistic)?
I
If a classifier is presented text with no words in common with the training
data, it will give you back the most common category in the training data.
That said, it is likely to be quite rare when a new document consists
*entirely* of new words. Any overlap with trained vocabulary is likely to
goks...@gmail.com wrote:
Shouldn't this be 'unclassified'? I think I have seen data in the
unclassified buckets with both Bayes and SGD.
- Original Message -
| From: Ted Dunning ted.dunn...@gmail.com
| To: user@mahout.apache.org
| Sent: Wednesday, September 19, 2012 2:54:25 PM
And if you want the reduced rank representation of A, you have it already
with
A_k = U_k S_k V_k'
Assume that A is n x m in size. This means that U_k is n x k and V_k is m
x k
The rank reduced projection of an n x 1 column vector is
u_k = U_k U_k' u
Beware that v_k is probably not
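The projection u_A = U_k U_k' u can be checked numerically with a toy orthonormal U_k (plain lists here, nothing from Mahout). It also shows the asymmetry spelled out later in the thread: U_k' U_k = I while U_k U_k' is merely a projector onto span(U_k).

```python
def matmul(A, B):
    """Naive dense matrix product for small examples."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

U = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]           # 3 x 2, orthonormal columns spanning the x-y plane

assert matmul(transpose(U), U) == [[1.0, 0.0], [0.0, 1.0]]   # U_k' U_k = I
UUt = matmul(U, transpose(U))                                # NOT the identity

u = [[2.0], [3.0], [5.0]]          # has a component outside span(U_k)
assert matmul(UUt, u) == [[2.0], [3.0], [0.0]]   # projected into span(U_k)
```

The component of u outside the span (the 5.0 here) is annihilated, which is exactly the "projects a vector into span A" behavior discussed below in the thread.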
On Sun, Sep 16, 2012 at 5:33 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
And if you want the reduced rank representation of A, you have it already
with
A_k = U_k S_k V_k'
Assume that A is n x m in size. This means that U_k is n x k and V_k is
m
x k
The rank reduced
by basically saying that the
projection is Uk' times the new vector, so, I never understood this
expression.
On Sun, Sep 16, 2012 at 7:13 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
A is in there implicitly.
U_k provides a basis of the row space and V_k provides a basis of the
column space
/A
If you shove u through U_k U_k' you get this:
U_k U_k' u = U_k U_k' (u_A + u_/A) = U_k U_k' (u_A) + 0 = u_A
This is another way of showing that U_k U_k' projects a vector into span A.
On Sun, Sep 16, 2012 at 12:55 PM, Ted Dunning ted.dunn...@gmail.com wrote:
U_k ' U_k = I
U_k U_k ' != I
projecting
back
into span A and you are talking about expressing things in terms of the
latent variables.
On Sun, Sep 16, 2012 at 8:55 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
U_k ' U_k = I
U_k U_k ' != I
--
Lance Norskog
goks...@gmail.com
Rajesh
On Thu, Sep 13, 2012 at 8:53 PM, Ted Dunning tdunn...@maprtech.com
wrote:
Send this to the mailing list.
On Thu, Sep 13, 2012 at 7:35 AM, Rajesh Nikam rajeshni...@gmail.com
wrote:
Hi Ted,
I have data in WEKA ARFF format.
What to how to use this ARFF formatted
Yes. It is a grave embarrassment to us, but not a functional requirement.
On Thu, Sep 13, 2012 at 6:42 AM, I-Scarlatti, David
david.scarla...@boeing.com wrote:
Ok. So tests are just tests... not needed for having mahout running
Thanks!
-Original Message-
From: Paritosh
Yes.
I have been working (slowly) on moving some very fast single pass
clustering into Mahout. My work in progress currently does very fast
clustering of small dense vectors and it should scale to sparse vectors
fairly well with some small changes.
See https://github.com/tdunning/knn for more
Also, with 500MB of data, this is likely to only take a few minutes on a
single machine with the new clustering stuff. It is hard to estimate
precisely, however, due to the difference between dense and sparse cases.
On Wed, Sep 12, 2012 at 8:42 PM, Pat Ferrel pat.fer...@gmail.com wrote:
200
You are using lots of threads but the sparse matrix structure is not thread
safe. Setting a value in the SparseMatrix causes mutation to internal data
structures.
If you can have each thread do all of the updates for its own disjoint part
of the matrix, that would be much better. Another option is to synchronize on
Great.
If the update has a huge impact on existing code, can you break it into
manageable pieces?
If it is just an addition, having a big blob of stuff is probably fine.
On Sun, Sep 9, 2012 at 7:01 AM, Gokhan Capan gkhn...@gmail.com wrote:
On Fri, Sep 7, 2012 at 12:48 AM, Ted Dunning ted.dunn
how it turns out.
Mike
On Thu, Sep 6, 2012 at 8:14 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Try transforming them as well, likely with a log if they are positive and
have heavily skewed values.
Can you suck the data into R and paste in the results of summary(x)?
(assuming you put
This sounds pretty exciting. Beyond that, it is hard to say much.
Can you say a bit more about how you would see introducing the code into
Mahout?
On Thu, Sep 6, 2012 at 9:14 AM, Gokhan Capan gkhn...@gmail.com wrote:
By the way, I want to mention that my thesis is advised by Ozgur Yilmazel,
Try transforming them as well, likely with a log if they are positive and
have heavily skewed values.
Can you suck the data into R and paste in the results of summary(x)?
(assuming you put the data into the variable x). This should look
something like:
summary(x)
v1 v2
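A minimal sketch of the suggested transform: log-scale positive, heavily skewed values so their distribution is far less lopsided before they reach the learner.

```python
import math

# Heavily skewed positive values spanning six orders of magnitude.
raw = [1.0, 2.0, 10.0, 1000.0, 1000000.0]
logged = [math.log10(x) for x in raw]   # now roughly evenly spread over [0, 6]

expected = [0.0, math.log10(2.0), 1.0, 3.0, 6.0]
assert all(abs(a - b) < 1e-9 for a, b in zip(logged, expected))
```

If the data can contain zeros, log1p (i.e. log(1 + x)) is the usual variant.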
Yes. (A-M)V is U \Sigma. You may actually want something like U \sqrt
\Sigma instead, though.
On Wed, Sep 5, 2012 at 4:10 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Hello,
I have a question w.r.t what to advise people in the SSVD manual for PCA.
So we have
(A-M) \approx U \Sigma V^t
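The identity quoted here ("(A-M)V is U \Sigma") follows directly from V having orthonormal columns, i.e. V^t V = I:

```latex
(A - M) \approx U \Sigma V^{\top}
\;\Rightarrow\;
(A - M)\,V \approx U \Sigma V^{\top} V = U \Sigma
```

The U \sqrt \Sigma variant simply rescales each latent dimension by \sqrt{\sigma_i} instead of \sigma_i.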
A quick t-test on these differences gives the same results no
significant difference.
On Mon, Sep 3, 2012 at 11:34 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:
Then i subtracted error means between two methods (+ sign means
smaller error for MR version, -sign means smaller error for R
results and errors so it doesn't make sense to make any error
comparison just between single runs of the variations. Instead, it
makes sense to compare error mean and variations on a better number of
runs.
-d
On Sun, Sep 2, 2012 at 12:00 AM, Ted Dunning ted.dunn...@gmail.com
wrote
Did Ahmed even use a power iteration?
On Sun, Sep 2, 2012 at 1:35 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:
but there is still a concern in a sense that power iterations
should've helped more than they did. I'll take a closer look but it
will take me a while to figure if there's something
with similar parameters.
One significant difference between MR and sequential version is that
sequential version is using ternary random matrix (instead of uniform
one), perhaps that may affect accuracy a little bit.
On Fri, Aug 31, 2012 at 10:55 PM, Ted Dunning ted.dunn...@gmail.com
wrote
is that
sequential version is using ternary random matrix (instead of uniform
one), perhaps that may affect accuracy a little bit.
On Fri, Aug 31, 2012 at 10:55 PM, Ted Dunning ted.dunn...@gmail.com
wrote:
Can you provide your test code?
What difference did you observe?
Did you account
With 57 crawled docs, you can't reasonably set p > 57. That is your second
error.
On Sat, Sep 1, 2012 at 10:32 AM, Pat Ferrel pat.fer...@gmail.com wrote:
I have a small data set that I am using in local mode for debugging
purposes. The data is 57 crawled docs with something like 2200 terms. I
, at 7:53 AM, Ted Dunning ted.dunn...@gmail.com wrote:
With 57 crawled docs, you can't reasonably set p > 57. That is your second
error.
On Sat, Sep 1, 2012 at 10:32 AM, Pat Ferrel pat.fer...@gmail.com wrote:
I have a small data set that I am using in local mode for debugging
purposes
On Sun, Sep 2, 2012 at 12:26 AM, Ahmed Elgohary aagoh...@gmail.com wrote:
- I am using k = 30 and p = 2 so (k+p) < 99 (Rank(A))
- I am attaching the csv file of the matrix A
Brilliant. And the attachment actually made it through.
- yes, the difference is significant. Here is the output of
Yes. Essentially this means construct the Voronoi tessellation for all
points and for each post code, use the union of the regions for each point
in that post code. You will not necessarily have convex hulls for each
post-code, but you will have hulls and will almost certainly have a single
hull
First, this is a tiny training set. You are well outside the intended
application range so you are likely to find less experience in the
community in that range. That said, the algorithm should still produce
reasonably stable results.
Here are a few questions:
a) which class are you using to
) and dint find that the data
was passed more than once.
Yes I randomize the order every time.
a) I am using AdaptiveLearningRegression (just like 20newsgroup).
Thanks!
On Aug 31, 2012, at 2:27 PM, Ted Dunning wrote:
First, this is a tiny training set. You are well outside the intended
.
And randomize the order each time?
On Fri, Aug 31, 2012 at 9:04 AM, Salman Mahmood sal...@influestor.com
wrote:
Cheers ted. Appreciate the input!
Sent from my iPhone
On 31 Aug 2012, at 17:53, Ted Dunning ted.dunn...@gmail.com wrote:
OK.
Try passing through the data 100 times
] http://en.wikipedia.org/wiki/Bootstrapping_(statistics)
On Fri, Aug 31, 2012 at 11:24 PM, Ted Dunning ted.dunn...@gmail.com wrote:
That would be best, but practically speaking, randomizing once is usually
OK. With a tiny data set like this that is in memory anyway, I wouldn't
take any chances
Can you provide your test code?
What difference did you observe?
Did you account for the fact that your matrix is small enough that it
probably wasn't divided correctly?
On Sat, Sep 1, 2012 at 1:27 AM, Ahmed Elgohary aagoh...@gmail.com wrote:
Hi,
I used mahout's stochastic svd
:53 PM, Whitmore, Mattie mwhit...@harris.com wrote:
I need to be using the matrices for BallKmeans. Can matrices be named? By
this I mean can I assign a column of my matrix to be the name of each row?
Thanks!
-Original Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent
But columns aren't what I would expect you to want labeled. I think that
row labels might be nicer. Happily, each named vector has a name for the
entire vector as well.
On Thu, Aug 30, 2012 at 2:48 PM, Ted Dunning ted.dunn...@gmail.com wrote:
The input to the BallKmeans is actually
Message-
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Thursday, August 30, 2012 2:52 PM
To: user@mahout.apache.org
Subject: Re: Mahout-279/kmeans++
But columns aren't what I would expect you to want labeled. I think that
row labels might be nicer. Happily, each named vector
No. The algorithm works either way. The algorithm doesn't need the full
capabilities of a matrix since it just makes a few sequential passes
through the data.
On Thu, Aug 30, 2012 at 3:25 PM, Whitmore, Mattie mwhit...@harris.com wrote:
Would the algorithm implement better as if given a matrix?
It isn't a big deal to increase the Znode size, but it is bad practice. ZK
isn't a file store. It is a coordination server. The size limit is
intended to prevent large operations slowing down other operations. If you
aren't sharing your ZK or your neighbors don't have response time
Karl,
I don't think that I understand your request.
What I think I hear is that you want an implementation (with unknown inputs
and outputs) that encodes a Voronoi tessellation using boundary vertices
instead of centroids.
Is that correct?
If so, it is relatively easy to go from centroid form
These are fairly straightforward to generate from random data.
Not particularly realistic, but highly parametrizable.
RCV1 should be almost in that range. I think that the recent KDD music
classification exercise would be in that range if viewed as a
classification exercise. See
The single most effective thing you can do with malicious users like this
is to let them think that they have won. In the ideal case, you can detect
simple click frauds and maintain a per user play adjustment so that they
see the fraudulent stats and everybody else sees the corrected stats. If
Obviously, you need to refer also to scores of other items as well.
One handy stat is AUC which you can compute by averaging to get the
probability that a relevant (viewed) item has a higher recommendation score
than a non-relevant (not viewed) item.
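That averaging can be written out directly (a sketch of the standard pairwise formulation, with ties counted as half per the usual convention): AUC is the fraction of (viewed, not-viewed) pairs where the viewed item gets the higher score.

```python
def pairwise_auc(relevant_scores, nonrelevant_scores):
    """AUC as P(score of a relevant item > score of a non-relevant item)."""
    wins = 0.0
    for r in relevant_scores:
        for n in nonrelevant_scores:
            if r > n:
                wins += 1.0
            elif r == n:
                wins += 0.5        # ties count half
    return wins / (len(relevant_scores) * len(nonrelevant_scores))

auc = pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3])
assert abs(auc - 5.0 / 6.0) < 1e-12   # 5 of 6 pairs rank correctly
```

A score of 0.5 means the recommender ranks no better than chance; 1.0 means every viewed item outscores every non-viewed one.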
On Sun, Aug 26, 2012 at 5:55 PM, Sean Owen